From Data to Insight: Topic Modeling and Automatic Labeling Strategies

F. Najeeb, Rana; N. Dhannoon, Ban; Qais Alkhalidi, Farah

doi:http://dx.doi.org/10.12785/ijcds/XXXXXX



dc.contributor.author	F. Najeeb, Rana
dc.contributor.author	N. Dhannoon, Ban
dc.contributor.author	Qais Alkhalidi, Farah
dc.date.accessioned	2024-04-26T16:21:19Z
dc.date.available	2024-04-26T16:21:19Z
dc.date.issued	2024-04-26
dc.identifier.issn	2210-142X
dc.identifier.uri	https://journal.uob.edu.bh:443/handle/123456789/5627
dc.description.abstract	Researchers usually present and synthesize their findings in scientific publications, so analyzing the content of those publications is essential to understanding a subject. This study proposes an improved topic modeling approach for a collection of Neural Information Processing Systems (NIPS) conference papers published between 1987 and 2017. The study had two goals: producing more coherent topics and labeling topics automatically. The first goal was achieved through five phases: (1) text pre-processing; (2) reduction using a new method called RS-LW (Reduced Sentences Based on Length and Weight), which removes the shorter sentences, calculates a weight for each remaining sentence, and then removes approximately 25% of the lowest-weight sentences; (3) sentence embedding using S-BERT (Sentence-Bidirectional Encoder Representations from Transformers); (4) dimensionality reduction of the sentence embeddings using UMAP (Uniform Manifold Approximation and Projection); and (5) clustering of similar documents using HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). The experimental findings demonstrate that the proposed RS-LW phase produces more cohesive topics, with improvements in topic coherence (0.593) and topic diversity (0.96). Although topic modeling extracts the most salient sentences describing latent topics from text collections, it does not assign an appropriate label to each topic. The second goal was achieved by proposing a new method that generates keywords by accessing the authors' Google Scholar profiles and extracting their listed interests, which are then used to label the topics automatically.	en_US
dc.language.iso	en	en_US
dc.publisher	University of Bahrain	en_US
dc.subject	Deep Learning, Topic Modeling, Automatic Topic Labeling, S-BERT, Pre-trained Language Model	en_US
dc.title	From Data to Insight: Topic Modeling and Automatic Labeling Strategies	en_US
dc.identifier.doi	http://dx.doi.org/10.12785/ijcds/XXXXXX
dc.volume	16	en_US
dc.issue	1	en_US
dc.pagestart	1	en_US
dc.pageend	10	en_US
dc.contributor.authorcountry	Iraq	en_US
dc.contributor.authorcountry	Iraq	en_US
dc.contributor.authorcountry	Iraq	en_US
dc.contributor.authoraffiliation	Computer Science of Mustansiriyah University	en_US
dc.contributor.authoraffiliation	Computer Science of Al-Nahrain University	en_US
dc.contributor.authoraffiliation	Computer Science of Mustansiriyah University	en_US
dc.source.title	International Journal of Computing and Digital Systems	en_US
dc.abbreviatedsourcetitle	IJCDS	en_US
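
The abstract describes the RS-LW reduction phase only in outline: drop the shorter sentences, weight the remaining ones, then remove roughly 25% of the lowest-weight sentences. The Python sketch below illustrates that outline under stated assumptions and is not the authors' implementation. The function name reduce_sentences, the min_words threshold, and the weighting scheme (mean term frequency of a sentence's tokens) are all hypothetical, since the record does not specify them. The later phases (S-BERT embedding, UMAP, HDBSCAN) are not sketched here.

```python
import re
from collections import Counter


def reduce_sentences(text, min_words=5, drop_fraction=0.25):
    """Illustrative RS-LW-style reduction (assumed details, not the paper's).

    1. Split the text into sentences and drop those shorter than min_words.
    2. Weight each remaining sentence by the mean frequency of its tokens
       (an assumed weighting; the abstract does not define the weight).
    3. Remove roughly drop_fraction of the lowest-weight sentences.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Step 1: length filter (min_words is a hypothetical threshold).
    tokenized = [(s, re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    kept = [(s, toks) for s, toks in tokenized if len(toks) >= min_words]
    if not kept:
        return []

    # Step 2: assumed weight = average corpus frequency of the sentence's tokens.
    freq = Counter(tok for _, toks in kept for tok in toks)
    weights = [sum(freq[t] for t in toks) / len(toks) for _, toks in kept]

    # Step 3: drop the lowest-weight sentences, preserving original order.
    n_drop = int(len(kept) * drop_fraction)
    lowest = set(sorted(range(len(kept)), key=lambda i: weights[i])[:n_drop])
    return [s for i, (s, _) in enumerate(kept) if i not in lowest]


if __name__ == "__main__":
    sample = (
        "Topic models group documents by topic words. "
        "Topic models use documents and topic words often. "
        "Short one. "
        "Zebras quietly wander across distant purple mountains. "
        "Topic words describe documents in topic models."
    )
    for sentence in reduce_sentences(sample):
        print(sentence)
```

In this sketch the short sentence is removed by the length filter, and the sentence whose words appear nowhere else in the text receives the lowest weight and is dropped in the final step. A real pipeline would pass the surviving sentences on to the embedding phase.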