Real-Time Twitter Corpus Labelling Using Automatic Clustering Approach

Gupta, Itisha; Joshi, Nisheeth

doi:http://dx.doi.org/10.12785/ijcds/100150


International Journal of Computing and Digital Systems, Volume 10, Issue 01

dc.contributor.author	Gupta, Itisha
dc.contributor.author	Joshi, Nisheeth
dc.date.accessioned	2020-07-21T14:33:12Z
dc.date.available	2020-07-21T14:33:12Z
dc.date.issued	2021-04-21
dc.identifier.issn	2210-142X
dc.identifier.uri	https://journal.uob.edu.bh:443/handle/123456789/4039
dc.description.abstract	In this paper, we present a novel automatic labelling approach for large unlabelled real-time Twitter datasets, intended for text-based Twitter sentiment analysis. The approach labels each tweet as Positive, Negative, or Neutral. It applies an unsupervised clustering technique that generates clusters from the underlying patterns (similarities between tweets) in the collected Twitter corpus. The Twitter search API and its search operators are used to collect real-time English tweets on several topics such as “#Demonetization”, “#lockdown”, and “#9pm9minutes”. Analysing the sentiment of real-time tweets requires a labelled corpus, but manual annotation of a large Twitter corpus is time- and labor-intensive, and domain experts are needed to label tweets belonging to a particular domain. Thus, in this work, we propose the use of K-means clustering as an unsupervised way of labelling the corpus, which can then be used to train supervised models such as SVM for sentiment analysis. To prepare the corpus for clustering and to obtain quality clusters, we apply basic to advanced cleaning operations, known as tweet normalization. Furthermore, we perform extensive feature engineering to generate different types of features, including POS-based (Part-of-Speech), n-gram, Twitter-specific, and lexicon-based features, from the collected unlabelled Twitter corpus. These features act as input to the K-means clustering algorithm and help it identify patterns in the data for cluster generation. Finally, the clusters are analysed manually to determine the sentiment expressed by the tweets in each cluster. Each cluster is then assigned one class: Positive, Negative, or Neutral. 
The main contribution of this work is the combination of extensive feature engineering with an unsupervised clustering approach for classifying a large unlabelled Twitter corpus.	en_US
dc.language.iso	en	en_US
dc.publisher	University of Bahrain	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Feature Engineering	en_US
dc.subject	Cluster Analysis	en_US
dc.subject	Corpus Labelling	en_US
dc.subject	Real-Time Tweets	en_US
dc.subject	Twitter Sentiment Analysis	en_US
dc.subject	Pre-processing	en_US
dc.title	Real-Time Twitter Corpus Labelling Using Automatic Clustering Approach	en_US
dc.type	Article	en_US
dc.identifier.doi	http://dx.doi.org/10.12785/ijcds/100150
dc.volume	10	en_US
dc.pagestart	519	en_US
dc.pageend	532	en_US
dc.source.title	International Journal of Computing and Digital Systems	en_US
dc.abbreviatedsourcetitle	IJCDS	en_US
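
The pipeline described in the abstract (tweet normalization, feature extraction, K-means clustering, then manual assignment of one sentiment class per cluster) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy lexicons, the example tweets, the two lexicon-count features, the farthest-point centroid initialization, and the `cluster_to_class` mapping (which stands in for the manual cluster analysis) are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re

# Hypothetical toy lexicons; the lexicons used in the paper are not given in the abstract.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "worst", "hate", "sad"}

def normalize(tweet):
    """Basic tweet normalization: lowercase, drop URLs and @mentions, keep hashtag words."""
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", " ", tweet)
    tweet = re.sub(r"@\w+", " ", tweet)
    tweet = tweet.replace("#", " ")
    return re.findall(r"[a-z0-9']+", tweet)

def features(tweet):
    """Two lexicon-based features: counts of positive and negative words."""
    tokens = normalize(tweet)
    return [float(sum(t in POSITIVE for t in tokens)),
            float(sum(t in NEGATIVE for t in tokens))]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain K-means with deterministic farthest-point initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

tweets = [
    "I love this lockdown, great move! #lockdown",
    "So happy, good decision @gov https://t.co/x",
    "Worst idea ever, I hate it #Demonetization",
    "Bad plan, sad day",
    "Lights off at 9pm #9pm9minutes",
    "Queue at the bank today",
]

assignment = kmeans([features(t) for t in tweets], k=3)

# Stand-in for the manual cluster analysis: a person inspects each cluster
# and assigns it one class. This mapping is hypothetical.
cluster_to_class = {0: "Positive", 1: "Negative", 2: "Neutral"}
labels = [cluster_to_class[a] for a in assignment]

for tweet, label in zip(tweets, labels):
    print(label, "|", tweet)
```

The resulting `(tweet, label)` pairs are the kind of automatically labelled corpus that the abstract proposes to feed into a supervised model such as an SVM. A fuller version would add the POS-based, n-gram, and Twitter-specific features the abstract mentions.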