Monday, 23 January 2017

Interactive Visualizations using googleVis in R

Many people are aware of the powerful visualization features available in ggplot2, but when it comes to interactive visualizations, googleVis is a pretty cool R package that interfaces with the Google Charts API. The only downside is that you need an internet connection and the interactive visualization is displayed in a browser, though Google does not store your data.

I have found this kind of interactive visualization useful while volunteering as a data analyst for the NGO Manthan (Manthan Adhyayan Kendra). I generated several interactive and intuitive visualizations from the power generation and fly ash generation data provided by the organization. The visualization shown below uses the daily planned and actual power generation (metric: megawatts) and capacity (metric: million units) for central, state and private power plants across different states in India. A small piece of R code produces this beautiful visualization:

Capacity vs Actual Power Generation 

I know this visualization shows too much information to digest easily, but I think that's what makes it unique: you can display so much data in just a single visualization! To begin with, you can choose the x- and y-axis variables from any of the variables in your data, and switch the scale between log and linear. You can map other variables to the colour and size of the bubbles in the scatter plot. There is even a play button that animates the change in power generation over time through the motion of the bubbles, with the option of tracing each bubble's trail. All of this can also be viewed as a bar graph or a line graph. To me this is the Big Mac of visualizations!
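As a rough sketch of how such a motion chart could be wired up with googleVis (the data frame and column names below are hypothetical stand-ins, not the actual Manthan dataset):

library(googleVis)
# `power` is an assumed data frame with one row per plant per day, e.g. columns
# Plant, Date, Capacity, ActualGeneration, Sector (illustrative names only)
chart <- gvisMotionChart(power,
                         idvar    = "Plant",             # one bubble per plant
                         timevar  = "Date",              # drives the play/animation axis
                         xvar     = "Capacity",          # default x-axis variable
                         yvar     = "ActualGeneration",  # default y-axis variable
                         colorvar = "Sector",            # bubble colour (central/state/private)
                         sizevar  = "Capacity")          # bubble size
plot(chart)    # opens the interactive chart in your default browser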


Did you know that just two lines of code can do this for you?

library(googleVis)
plot(gvisMotionChart(Fruits, idvar="Fruit", timevar="Year"))   # Fruits is a demo dataset shipped with googleVis

Don't believe me? Check it out right now!

Please post your comments/suggestions about this post below.

Wednesday, 15 June 2016

PRISM Algorithm in R


Hello everyone,

At last, after a long time, I am writing a post.

This post is about the PRISM algorithm, which is used for generating classification rules. PRISM is available in Weka, but I wasn't able to find an R implementation, so I wrote one.

The PRISM algorithm dates back to 1987. This is the original paper on the PRISM algorithm, and PRISM is explained very well here.

The R code for the PRISM algorithm:
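The original snippet is not reproduced here, but a minimal sketch of the PRISM idea in R looks like the following. This is an illustrative implementation, not the exact code from this post; it assumes a data frame of categorical attributes and a named class column.

# Illustrative PRISM rule induction (Cendrowska, 1987): for each class, grow rules
# by repeatedly adding the attribute-value test with the highest accuracy p/t
prism <- function(data, class) {
  rules <- list()
  for (cls in unique(data[[class]])) {
    E <- data                                    # instances still to be covered for this class
    while (any(E[[class]] == cls)) {
      rule <- list()                             # conditions of the form attribute = value
      covered <- E
      attrs <- setdiff(names(E), class)
      # refine the rule until it covers only `cls` instances or attributes run out
      while (any(covered[[class]] != cls) && length(attrs) > 0) {
        best <- NULL; best_acc <- -1; best_p <- -1
        for (a in attrs) {
          for (v in unique(covered[[a]])) {
            t <- sum(covered[[a]] == v)                            # instances covered by the test
            p <- sum(covered[[a]] == v & covered[[class]] == cls)  # ...of which belong to `cls`
            if (t > 0 && (p / t > best_acc || (p / t == best_acc && p > best_p))) {
              best <- c(a, as.character(v)); best_acc <- p / t; best_p <- p
            }
          }
        }
        rule[[best[1]]] <- best[2]
        covered <- covered[covered[[best[1]]] == best[2], , drop = FALSE]
        attrs <- setdiff(attrs, best[1])
      }
      rules[[length(rules) + 1]] <- list(class = cls, conditions = rule)
      # drop the instances covered by the finished rule
      is_covered <- rep(TRUE, nrow(E))
      for (a in names(rule)) is_covered <- is_covered & E[[a]] == rule[[a]]
      E <- E[!is_covered, , drop = FALSE]
    }
  }
  rules
}

# Example usage on the classic weather data (assumed to be loaded as a data frame):
# rules <- prism(weather, "play")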

Sunday, 9 November 2014

Creating a Word Cloud in R


A word cloud is a visual display of the words that occur most frequently in a piece of text. This post will show how to create one. Within a word cloud, the positive, negative and neutral words can be shown separately along with their proportions. The data used for generating the word clouds here consists of thousands of book reviews from a US e-commerce giant, covering the book series Harry Potter, Chronicles of Narnia and The Hobbit (Lord of the Rings). The tweets data from the previous post can also be used as an input for this post. The input file must contain the type of sentiment along with the text (check the output of the sentiment analysis from the last post).
R code:


library(wordcloud)
library(tm)
library(RColorBrewer)

narnia = read.csv("HarryPotter_wordcloud.csv")       # read the input file containing the text and its sentiment
narnia$text = gsub('[[:punct:]]', '', narnia$text)   # clean the data: remove punctuation
narnia$text = gsub('[[:digit:]]', '', narnia$text)   # remove digits
narnia$text = tolower(narnia$text)                   # make it lower case
sents = levels(factor(narnia$sent))                  # 3 levels: positive, neutral, negative
# % proportion of each sentiment level, used as column labels in the cloud
labels = sapply(sents, function(x) paste(x, format(round(length(narnia[narnia$sent == x, ]$text)/length(narnia$sent)*100, 2), nsmall = 2), "%"))
nemo = length(sents)
emo.docs = rep("", nemo)
for (i in 1:nemo)                                    # collapse the text of each sentiment level into one document
{
  tmp = narnia[narnia$sent == sents[i], ]$text
  emo.docs[i] = paste(tmp, collapse = " ")
}
emo.docs = removeWords(emo.docs, stopwords("english"))   # remove stopwords
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = labels
# comparison word cloud
comparison.cloud(tdm, max.words = 100, colors = brewer.pal(nemo, "Set1"),
                 scale = c(3, .5), random.order = FALSE, title.size = 1.5)


The word cloud for Harry Potter reviews:


The word cloud for Chronicles of Narnia reviews:


The word cloud for Hobbit reviews:


Sentiment Analysis of Twitter Data

In my last post, I explained how to scrape data from websites. In this post I will describe how to obtain Twitter data and perform sentiment analysis on it. The method presented here is a standard way of doing sentiment analysis and it extends readily to other text such as news articles, book reviews and movie reviews. Thus, the output obtained from the last post can be used as an input for the R code in this post!

Part 1: Comparing the sentiment of tweets from Arvind Kejriwal and Narendra Modi
- Analysis of tweets from a particular Twitter handle

Here, tweets from two interesting political personalities, Arvind Kejriwal and Narendra Modi, have been considered.

# Libraries
library(twitteR)
library(ROAuth)
library(plyr)

# OAuth credentials and authentication
# Go to dev.twitter.com and create a new application, after which you will get a consumerKey and consumerSecret
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "Enter your consumerKey here"
consumerSecret <- "Enter your consumerSecret here"
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
                   package = "RCurl"))          # the OAuth handshake authorizes this app for the session
registerTwitterOAuth(twitCred)

# Main code
tweets <- userTimeline('narendramodi', n=1500, cainfo="cacert.pem")
# put the Twitter handle here along with the number of tweets you want; the maximum allowed in one request is 1500
tweetsdf1 <- twListToDF(tweets)
write.csv(tweetsdf1, file="kejri.csv")
# final output file containing info such as retweets, favourites, time-stamp etc. (run once per handle, changing the file name)
tweets = read.csv("kejri.csv", header=T)
# sets of positive and negative words to match against
pos_words = scan("positive-words.txt", what="character", comment.char=";")
neg_words = scan("negative-words.txt", what="character", comment.char=";")

# sentiment score calculating function
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(stringr)
  # we got a vector of sentences; plyr will handle a list or a vector,
  # and we want a simple array of scores back
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words; str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms;
    # match() returns the position of the matched term or NA
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
tweets.score = score.sentiment(tweets$text, pos_words, neg_words, .progress='text')
write.csv(tweets.score, "sentiment_kejrii.csv")   # this file shows the sentiment score for each tweet


When we pass an OAuth-authorized request for a particular Twitter handle, we obtain information such as favorited, favoriteCount, replyToSN, created (time and date), truncated, replyToSID, id, replyToUID, statusSource, screenName, retweetCount, isRetweet, retweeted, longitude and latitude along with the tweet text. The score-calculating function gives +1 for a positive word, -1 for a negative word and 0 for a neutral word. The sets of positive and negative words were obtained from this link. So finally it gives a net sentiment score for each tweet, which can be used for further analyses (time/day-wise sentiments, retweet/favourite-count timelines, etc.).
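As a quick sanity check of the scoring function, you can run it on a couple of made-up sentences (the word lists are the ones loaded above):

# hypothetical example sentences, just to illustrate the scoring
sample_sentences <- c("what a great and wonderful rally",
                      "the speech was a terrible bore")
score.sentiment(sample_sentences, pos_words, neg_words)
# the first sentence should get a positive score, the second a negative one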
The following is the sentiment comparison of tweets from @ArvindKejriwal and @narendramodi:

Comparing sentiment of tweets from Modi and Kejriwal
Well, you guessed it right: Modi has a higher proportion of positive tweets, while Kejriwal has a higher proportion of neutral and negative tweets.


Part 2: Analysis of tweets with a particular hashtag
- Analysing #ModiAtMadison

The following lines of code will pick up tweets for a specific hashtag. After this, the sentiment-score function from the previous part can be used to obtain the sentiments, as shown below.

tweets = searchTwitter("#ModiAtMadison", n=1500, cainfo="cacert.pem")
tweets.text = laply(tweets, function(t) t$getText())
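From here, the same scoring step as in Part 1 applies (a short sketch; the output file name below is just illustrative):

hashtag.scores = score.sentiment(tweets.text, pos_words, neg_words, .progress='text')
write.csv(hashtag.scores, "sentiment_ModiAtMadison.csv")   # sentiment score for each #ModiAtMadison tweet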

PM Modi's visit to the United States garnered a lot of attention, and the whole of India was watching him closely. His speech to Indian-Americans at Madison Square Garden was attended by over 30 Congressmen. Some people criticized him, some praised him, but this sentiment graph shows how Twitter was buzzing about it:

Positive sentiment dominating the tweets for #ModiAtMadison 

Saturday, 20 September 2014

Scraping News Articles with R

This is a short post on scraping news articles with R.

I wanted to download news articles automatically into CSV files, with the aim of using them for further analyses such as sentiment analysis. So I went through a number of sources and blogs and combined a few methods to come up with a simple way to do it.

Now suppose I want to download articles from the Times of India, from the 'India Business' sub-section under the Business section. We will be using the XML package in R, which parses XML and HTML web pages. This package provides many approaches for both reading and creating XML (and HTML) documents, both local and accessible via HTTP or FTP.

The main crux of scraping is knowing the HTML tags, so that you can pick up the text from the right part of the HTML page. Even if you view the page source, it is cumbersome to find the right tags because the HTML looks so messed up! The best alternative is to use Firebug along with the FirePath add-on, which lets you select items on a webpage, inspect their underlying tags and test an XPath query to see what it will select. A sub-section page only contains links to the news articles.

Getting the right tags with FireBug

So first, we need to obtain the URL links for these articles. The R code:

library(XML)
url = "http://timesofindia.indiatimes.com/business/india-business"   # link to any sub-section of a section
doc <- htmlParse(url)                                                 # parse the URL
links <- xpathSApply(doc, "//a/@href")                                # XPath for all links, obtained with FirePath
free(doc)
# links[[49]]                          # check where the article URLs actually begin; they will be in order

This will get you all the URLs available on that webpage, so pick out the ones that point to news articles; they are usually in order. Here we download 50 articles, because that is the maximum number of articles available on a webpage under any sub-section, as can be seen from links[[49]] to links[[98]]. The link indices for articles are the same for any sub-section of any section. Now inspect the tags used for the main text and the title of an article and again use xpathSApply() over them. The paste() function makes it easy to build the full URLs inside the loop.

article = character(50)        # pre-allocate vectors for the scraped text and titles
title = character(50)
for (i in 1:50)
{
  baseURL = "http://timesofindia.indiatimes.com"
  url = paste(baseURL, links[[48+i]], sep="")
  toi.parse = htmlTreeParse(url, error=function(...){}, useInternalNodes=TRUE, trim=TRUE)
  article[i] = paste(xpathSApply(toi.parse, '//*[@class="Normal"]', xmlValue), collapse=" ")
  title[i] = paste(xpathSApply(toi.parse, '//*[@class="arttle"]', xmlValue), collapse=" ")
}
df = data.frame(title, article)
write.csv(df, "C:\\Users\\ajit.d\\Desktop\\toi-business-articles.csv")

If you just want the brief description of each article along with its title, playing around with the tags using FirePath in Firebug will give you the idea:

library(RCurl)
theurl = "http://timesofindia.indiatimes.com/business/india-business"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes=TRUE)
title <- xpathSApply(pagetree, "//*[@id='fsts']/h2/a", xmlValue)
brief_description <- xpathSApply(pagetree, "//*[@id='fsts']/span/span[2]", xmlValue)


