Listen to the voice of the public using Twitter data
“Data is the new oil.” This statement neatly sums up the importance of data in today's world. Every major decision, whether in business, politics, or even entertainment, is made after carefully analyzing data.
This post walks you through analyzing tweets, either on a trending topic or from specific Twitter handles. The tools we will be using are Python, the Twitter API, and R. Knowing how to do this can help with academic projects and business decisions, or you can simply use it for fun to see what people think about a movie, an actor, a sportsperson, and so on.
So, let us dive into text mining!
Data Collection:
Data collection is the first step of any data or text analytics process. Twitter provides the Twitter API to let you access tweets as per your requirements. To use the Twitter API, you need a Twitter developer account.
Apply for a Twitter developer account here — https://developer.twitter.com/
I won’t get into the details of account creation as that is easily available anywhere on the internet.
A couple of pro tips –
1. It can take 1–2 weeks for Twitter to approve your developer account, so be patient and plan your project accordingly.
2. You'll be asked to fill in a website URL while registering your app for the Twitter API. It can be any placeholder URL; you don't need a running website to register your app.
Once your Twitter developer account is approved, you get a consumer key, consumer secret, access key, and access secret. All of these tokens are needed to establish a connection to the Twitter API.
Python has the tweepy library for accessing the Twitter API. You can install it with the command pip install tweepy (look up how to install Python libraries if you're unfamiliar with the process).
import tweepy
import csv

# Authenticate with the Twitter API using the tokens from your developer account
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
The above code establishes the connection to the Twitter API. Assign your tokens to the respective variables before running it.
# Open the output file and create a CSV writer for it
csvFile = open('tweets_collected.txt', 'a', encoding='utf-8')
csvWriter = csv.writer(csvFile)
This saves the tweets in a text file — tweets_collected.txt
# Collect every tweet with the hashtag (retweets excluded) in the given date range
for tweet in tweepy.Cursor(api.search, q="#Chhichhore -filter:retweets",
                           count=200, lang="en",
                           since="2019-09-07", until="2019-09-15").items():
    csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8')])
csvFile.close()
This writes each tweet as a row in the text file you created. If the indentation in your Python code is correct, a file containing all the tweets for the specified #hashtag or @handle between the given dates should now appear in the folder where your script lives.
In this case, we are collecting all tweets with #Chhichhore, a Bollywood movie released on 6th September that has been gaining popularity through public reviews and opinions. Hopefully, by the end of this post, we will have an idea of what people are saying about this film.
Data Cleaning
The collected tweets are full of noisy data such as Twitter handles, dates, URLs, and special characters, none of which is useful for us.
So, in the data cleaning step, we remove this noise from the collected tweets. I used the Sublime Text editor and regular expressions to replace unwanted text with whitespace.
The regex patterns applied for doing so:
(\w+:\/\/\w+\.\w+\/\w+) — URLs
(\\x\w+) — special character words
(@\w+ ) — Twitter handles
Sublime Text provides an option to find and replace text based on regex patterns. Also remove words that are repeated frequently but are irrelevant for our analysis, i.e. words that don't express a public opinion or emotion, like Chhichhore in our case.
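If you would rather script this step than do it in Sublime Text, here is a minimal sketch of the same cleanup in R using gsub. The input and output file names are my assumptions; adjust the paths and patterns to your data.
# A rough R equivalent of the Sublime Text find-and-replace step,
# assuming the raw tweets were saved to tweets_collected.txt
raw=readLines("tweets_collected.txt", warn=FALSE)
clean=gsub("\\w+://\\w+\\.\\w+/\\w+", " ", raw)           # URLs
clean=gsub("\\\\x\\w+", " ", clean)                       # special character words like \xe2
clean=gsub("@\\w+ ", " ", clean)                          # Twitter handles
clean=gsub("Chhichhore", " ", clean, ignore.case=TRUE)    # the movie name itself
writeLines(clean, "tweets_cleaned.txt")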
Data Analysis
Now we are ready to analyze our data. This could be done using Python as well, but I preferred R.
I used the tm and wordcloud libraries of R. For further sentiment analysis, you can use the get_sentiment function from the syuzhet library of R; the "nrc" method produces a decent result (see the sketch at the end of this post).
library(tm)
library(wordcloud)

options(header=FALSE, stringsAsFactors=FALSE, fileEncoding="UTF-8")

# Read the cleaned tweets and build a corpus
aa=readLines("Insert full path of file", n=-1)
ab=Corpus(VectorSource(aa))

# tm_map functions to clean the data
ac=tm_map(ab, tolower)
ac=tm_map(ac, removeWords, stopwords("english"))
ac=tm_map(ac, removePunctuation)
ac=tm_map(ac, removeNumbers)
ac=tm_map(ac, stripWhitespace)

# Plot a word cloud of the most frequent words
wordcloud(words = ac, min.freq = 80,
          max.words = 500, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
Here, the min.freq and max.words parameters can be tweaked: min.freq is the minimum number of times a word must appear to be included in the word cloud, and max.words is the maximum number of words to display.
We get the below word cloud displayed after running the R code.
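Finally, for the sentiment analysis mentioned earlier, a minimal sketch with the syuzhet library could look like the one below. It reuses the aa vector of cleaned tweets read in above; the scores you get will depend entirely on your data.
library(syuzhet)

# Overall sentiment score per tweet using the "nrc" method;
# positive values lean positive, negative values lean negative
sentiments=get_sentiment(aa, method="nrc")
summary(sentiments)

# Per-emotion breakdown (anger, joy, trust, etc.) summed over all tweets
emotions=get_nrc_sentiment(aa)
colSums(emotions)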