University of Bahrain
Scientific Journals

Real-Time Twitter Corpus Labelling Using Automatic Clustering Approach

dc.contributor.author Gupta, Itisha
dc.contributor.author Joshi, Nisheeth
dc.date.accessioned 2020-07-21T14:33:12Z
dc.date.available 2020-07-21T14:33:12Z
dc.date.issued 2020-07-01
dc.identifier.issn 2210-142X
dc.description.abstract In this paper, we present a novel automatic labelling approach for large, unlabelled, real-time Twitter datasets for text-based Twitter sentiment analysis. Tweets are labelled as Positive, Negative, or Neutral. The proposed approach applies an unsupervised clustering technique that generates clusters based on the underlying patterns (similarities between tweets) in the collected Twitter corpus. The Twitter Search API is used to collect real-time English tweets on several topics, such as “#Demonetization”, “#lockdown”, and “#9pm9minutes”, by means of search operators. To analyse the sentiment of real-time tweets, the corpus must first be labelled. Manual annotation of a large Twitter corpus is time-consuming and labour-intensive. Moreover, domain experts are needed to label tweets belonging to a particular domain. Thus, in this work, we propose the use of K-means clustering, an unsupervised way of labelling the corpus, which can then be used to train supervised models such as SVM for sentiment analysis. To prepare the corpus for clustering and to obtain good-quality clusters, we apply a range of basic to advanced cleaning operations, known as tweet normalization. Furthermore, we perform extensive feature engineering to generate different types of features, including POS-based (Part-of-Speech), n-gram, Twitter-specific, and lexicon-based features, from our collected unlabelled Twitter corpus. These features act as input to the K-means clustering algorithm and help it identify patterns in the data for cluster generation. Finally, cluster analysis is done manually to determine the sentiment expressed by the tweets in each cluster. Accordingly, each cluster is assigned one class: Positive, Negative, or Neutral.
The main contribution of this work is the combination of extensive feature engineering with unsupervised clustering for the classification of a large unlabelled Twitter corpus. en_US
dc.language.iso en en_US
dc.publisher University of Bahrain en_US
dc.rights Attribution-NonCommercial-NoDerivatives 4.0 International *
dc.rights.uri *
dc.subject Feature Engineering en_US
dc.subject Cluster Analysis en_US
dc.subject Corpus Labelling en_US
dc.subject Real-Time Tweets en_US
dc.subject Twitter Sentiment Analysis en_US
dc.subject Pre-processing en_US
dc.title Real-Time Twitter Corpus Labelling Using Automatic Clustering Approach en_US
dc.type Article en_US
dc.volume 9 en_US
dc.pagestart 1 en_US
dc.pageend 9 en_US
dc.source.title International Journal of Computing and Digital Systems en_US
dc.abbreviatedsourcetitle IJCDS en_US
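
The pipeline described in the abstract (tweet normalization, feature extraction, K-means clustering, then manual assignment of a class to each cluster) can be illustrated with a minimal sketch. This is not the authors' implementation. The word lists, the two lexicon-count features, the example tweets, the fixed initial centroids, and the cluster-to-class mapping are all assumptions made for illustration; the paper itself uses a much richer feature set (POS-based, n-gram, Twitter-specific, and lexicon-based).

```python
import re

# Tiny illustrative lexicons (hypothetical; the paper's lexicons are not listed in the abstract).
POSITIVE = {"good", "great", "love", "happy", "support"}
NEGATIVE = {"bad", "terrible", "hate", "angry", "fail"}

def normalize(tweet):
    """Basic tweet normalization: lowercase, drop URLs and mentions, keep hashtag words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop the '#'
    return re.findall(r"[a-z0-9']+", text)     # split into simple tokens

def features(tweet):
    """Map a tweet to [positive-word count, negative-word count] (lexicon-based only)."""
    tokens = normalize(tweet)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return [float(pos), float(neg)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm; initial centroids are passed in so the run is deterministic."""
    centroids = [list(c) for c in centroids]
    assign = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assign = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assign == assign:  # converged: assignments stopped changing
            break
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

# Made-up example tweets; real data would come from the Twitter Search API.
tweets = [
    "I love this move, great step! #Demonetization",
    "Happy to support the #lockdown, good decision",
    "Terrible planning, I hate the queues #Demonetization",
    "Bad idea, this will fail @someone https://example.com",
    "Lights off at 9pm for #9pm9minutes",
    "Watching the news about the #lockdown",
]

points = [features(t) for t in tweets]
assign, centroids = kmeans(points, centroids=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

# Manual cluster analysis: a person inspects each cluster and names its class.
cluster_label = {0: "Positive", 1: "Negative", 2: "Neutral"}
labels = [cluster_label[a] for a in assign]
```

Passing the initial centroids explicitly keeps this toy run reproducible; K-means is more commonly started from random or k-means++ seeds. The resulting `labels` list is the kind of automatically labelled corpus that the abstract proposes feeding to a supervised model such as an SVM.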

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International
