Saturday, 20 September 2014

Scraping News Articles with R

This is a short post on scraping news articles with R.

I wanted to download news articles automatically into CSV files, with the aim of using them for further analyses such as sentiment analysis. So I went through a lot of sources and blogs and combined a few methods to come up with a simple way to do it.

Now suppose I want to download articles from the Times of India, from the 'India Business' sub-section under the Business section. We will use the XML package in R, which is used to parse XML and HTML web pages. The package provides many approaches for both reading and creating XML (and HTML) documents, whether local or accessible via HTTP or FTP.
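If the XML package is not already installed, it can be fetched from CRAN first:

install.packages("XML")     # only needed once
library(XML)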

The main crux of scraping is knowing the HTML tags, so that you can pick up the text from the right part of the HTML page. Even if you view the page source, getting the right tags is cumbersome because the raw HTML looks so messed up! The best alternative is to use Firebug along with the FirePath add-on, which lets you select items on a webpage, inspect their underlying tags, and test an XPath query to see what it will select. On a sub-section page, only links to the news articles are given.

Getting the right tags with FireBug
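As a quick, self-contained illustration of how an XPath expression picks out elements (the HTML snippet below is made up for demonstration, not taken from the Times of India page), this is how "//a/@href" pulls out link targets:

library(XML)
snippet <- '<div><a href="/article-1">First</a> <a href="/article-2">Second</a></div>'
doc <- htmlParse(snippet, asText = TRUE)      # parse the HTML string directly
xpathSApply(doc, "//a/@href")                 # returns "/article-1" "/article-2"
free(doc)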

So first, we need to obtain the URLs of these articles. The R code:

library(XML)
url <- "http://timesofindia.indiatimes.com/business/india-business"   # link to any sub-section under a section
doc <- htmlParse(url)                                                  # parse the page
links <- xpathSApply(doc, "//a/@href")                                 # XPath (from FirePath) for all links on the page
free(doc)
# links[[49]]                                # check where the article URLs actually begin; they appear in order

This will get you all the URLs available on that webpage; pick out the ones pointing to news articles, which usually appear in order. We will download 50 articles, because that is the maximum number of articles listed on a sub-section page, running from links[[49]] to links[[98]]. These link indices are the same for any sub-section of any section. Now inspect the tags used for the main text and the title of an article and again use xpathSApply() over them. To build the full article URLs inside the loop in an easy way, you can use the paste() function.

article <- character(50)                      # pre-allocate vectors for the scraped text
title   <- character(50)
baseURL <- "http://timesofindia.indiatimes.com"
for (i in 1:50)
{
  url <- paste(baseURL, links[[48 + i]], sep = "")                     # build the full article URL
  toi.parse <- htmlTreeParse(url, error = function(...){}, useInternalNodes = TRUE, trim = TRUE)
  # the article body may be split over several nodes, so collapse them into one string
  article[i] <- paste(xpathSApply(toi.parse, '//*[@class="Normal"]', xmlValue), collapse = " ")
  title[i]   <- paste(xpathSApply(toi.parse, '//*[@class="arttle"]', xmlValue), collapse = " ")
  free(toi.parse)
}
df <- data.frame(title, article)
write.csv(df, "C:\\Users\\ajit.d\\Desktop\\toi-business-articles.csv", row.names = FALSE)
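To sanity-check the output, the CSV can be read back in and inspected (using the same path as above):

toi <- read.csv("C:\\Users\\ajit.d\\Desktop\\toi-business-articles.csv", stringsAsFactors = FALSE)
head(toi$title)     # first few article titles
nrow(toi)           # should be 50 if every article was parsed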

If you just want to obtain the brief description of each article along with its title, playing around with the tags in FirePath will show you which XPath expressions to use.

library(RCurl)
theurl <- "http://timesofindia.indiatimes.com/business/india-business"
webpage <- getURL(theurl)                                              # fetch the page source
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error = function(...){}, useInternalNodes = TRUE)
title <- xpathSApply(pagetree, "//*[@id='fsts']/h2/a", xmlValue)
brief_description <- xpathSApply(pagetree, "//*[@id='fsts']/span/span[2]", xmlValue)
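These two vectors can then be combined and saved the same way as before; a minimal sketch, reusing the data.frame/write.csv pattern from above (the output path here is just an example):

# assumes both XPath queries returned the same number of items
df_brief <- data.frame(title, brief_description)
write.csv(df_brief, "C:\\Users\\ajit.d\\Desktop\\toi-business-briefs.csv", row.names = FALSE)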



