```{r setup, include=FALSE}
opts_chunk$set(cache=TRUE)
```

Web Scraping part 2: Digging in
========================================================
width: 1200
author: Rolf Fredheim and Aiora Zabala
date: University of Cambridge
font-family: 'Rockwell'

25/02/2014

Today we will:
========================================================
- Become good at writing bad* functions
- Learn how to access information in web-pages (HTML, XML, etc)

Get the docs:

http://fredheir.github.io/WebScraping/Lecture2/p2.html

https://raw.github.com/fredheir/WebScraping/master/Lecture2/p2.Rpres

http://fredheir.github.io/WebScraping/Lecture2/p2.R

<br>
<br>
<br>
<small>
*maybe 'utilitarian', 'expedient', and 'functional' would be more accurate, if more pompous, descriptions of this practice</small>

Digital data collection
=======================
- Devise a means of accessing data
- Retrieve that data
- Tabulate and store the data

Last week step two involved JSON. Today we work with HTML.

Steps 1 and 3 stay much the same.

Revisiting the programming basics
================
type:section

Good functions
========================================================
What makes a function good?

- Clear name
- Instructions to the user
- Short
- Performs a single task
- Is efficient
- Can handle errors
- Is predictable
- Does not use **global variables** (someone explain?)

Bad functions
========================================================
... break the rules/guidelines above. But they can be useful to:

- hide a script behind a function
- get an overview
- move on to the next task
- not worry about methods and error handling: this simplifies the process

If writing functions for your own use, it's OK* to write bad functions.

<small>*But basic notes throughout the code reminding yourself what you did will be invaluable</small>
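What might a 'good' function look like?
========================================================
As a rough illustration (the function and its name are invented for this slide, not part of the course code), something like this ticks most of the boxes above: a clear name, one task, a basic check on its input, and no global variables.

```{r}
meanWordLength <- function(words){
  #expects a character vector; fails early with a clear message otherwise
  if (!is.character(words)) stop("words must be a character vector")
  mean(nchar(words))
}
meanWordLength(c("web","scraping"))
```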
Revision
============
What is a variable?

What are they for?

Variables
=================
type:sq
Two main purposes:

quicker to write

```{r}
uni= "The University of Cambridge"
uni
```

***

quicker to change the code. It is good practice to declare variables near the start of your code.

Paying tax: £9,440 tax free

```{r}
(20000-9440)*20/100

#OR:
wage=20000
taxFree=9440
rate=20
(wage-taxFree)*rate/100
```

Functions without variables
================

```{r}
printName <- function(){
  print("My name is Rolf Fredheim")
}

printName()
```

This is a useless function. But sometimes, if we have many lines of code requiring no particular input, it can be useful to file them away like this.

e.g. for simulations
============

```{r}
sillySimulation <- function(){
  x1 <- runif(500,80,100)
  x2 <- runif(500,0,100)
  v1 <- c(x1,x2)

  x3 <- runif(1000,0,100)
  df <- data.frame(v1,x3)
  require(ggplot2)
  print(ggplot(df, aes(v1,x3))+geom_point()+ggtitle("simulation of some sort"))
}
```

=====
Just as this slide hides the code on the previous slide, so the function hides the underlying code.

```{r}
sillySimulation()
```

Inserting variables
=========
Let's hammer home how to use variables.

What variables could we add to the function below?

```{r}
desperateTimes <- function(){
  print(paste0("Rolf is struggling to finish his PhD on time. Time remaining: 6 months"))
}
```

Name
===========

```{r}
desperateTimes <- function(name){
  print(paste0(name," is struggling to finish his PhD on time. Time remaining: 6 months"))
}
desperateTimes(name="Tom")
```

Gender
===========
type:sq
We specify a default value

```{r}
desperateTimes <- function(name,gender="m"){
  if(gender=="m"){
    pronoun="his"
  }else{
    pronoun="her"
  }
  print(paste0(name," is struggling to finish ",pronoun," PhD on time. Time remaining: 6 months"))
}
desperateTimes(name="Tanya",gender="f")
```

Is this a good function? Why (not)?

degree
==============

```{r}
desperateTimes <- function(name,gender="m",degree){
  if(gender=="m"){
    pronoun="his"
  }else{
    pronoun="her"
  }
  print(paste0(name," is struggling to finish ",pronoun," ",degree," on time. Time remaining: 6 months"))
}
desperateTimes(name="Rolf",gender="m","Mphil")
```

Days til deadline
============
type:sq1

```{r}
require(lubridate)
require(ggplot2)
deadline=as.Date("2014-09-01")
daysLeft <- deadline-Sys.Date()
totDays <- deadline-as.Date("2011-10-01")
print(daysLeft)
print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
```

part2
==========
type:sq

```{r}
print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
```

===========
type:sq1
We could put all this code in a function, and forget about it

```{r}
timeToWorry <- function(){
  require(lubridate)
  deadline=as.Date("2014-09-01")
  daysLeft <- deadline-Sys.Date()
  totDays <- deadline-as.Date("2011-10-01")
  print(daysLeft)
  print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
  print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
  df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
  ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
}
```

File it away until in need of a reminder
======

```{r}
timeToWorry()
```

Finishing up last week's material
=============================
type:section

What does this have to do with webscraping?
============
Bad functions like this will help us to break the task into bitesize chunks.

Rather than working with long, unruly scripts, we write a little script that works, identify any necessary variables, and file it away.

A typical structure might be:

- Load packages, set working directory
- Download one example
- Extract the necessary information
- Store the information
- Repeat -> either by looping, or by completing one step at a time (a minimal skeleton is sketched on the next slide)
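A scraper skeleton
============
type:sq
Roughly, that structure looks like the sketch below. The names here (scrapePage, the working directory, urls) are placeholders for illustration, not code we use later in the course:

```{r eval=F}
#load packages, set the working directory
require(RCurl)
require(XML)
setwd("~/scraping")

scrapePage <- function(url){
  #download one example ...
  SOURCE <- getURL(url)
  #... extract the necessary information here, and return it
}

results <- NULL
for (url in urls){                            #repeat by looping over the urls
  results <- rbind(results, scrapePage(url))  #store the information
}
```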
Last week's code
=================
type:sq
Example and explanation: downloading data. Check that the code is correct:

```{r}
require(rjson)
url <- "http://stats.grok.se/json/en/201201/web_scraping"
raw.data <- readLines(url, warn=FALSE)

rd <- fromJSON(raw.data)
rd.views <- rd$daily_views
rd.views <- unlist(rd.views)
rd <- as.data.frame(rd.views)
rd$date <- rownames(rd)
rownames(rd) <- NULL
rd
```

Turn it into a function
========================
type:sq
"url" is the only thing that changes. Thus we have one variable.

At the end we "return" the data to the user.

```{r}
getData <- function(url){
  require(rjson)
  raw.data <- readLines(url, warn=FALSE)
  rd <- fromJSON(raw.data)
  rd.views <- rd$daily_views
  rd.views <- unlist(rd.views)
  rd <- as.data.frame(rd.views)
  rd$date <- rownames(rd)
  rownames(rd) <- NULL
  rd$date <- as.Date(rd$date)
  return(rd)
}
```

Now we can forget about *how* we download data, after checking the code works:

getData("http://stats.grok.se/json/en/201201/web_scraping")

============
The script and the function achieve exactly the same thing. But: compressing the code to a single function is good to relieve the brain, and to de-clutter your code.

Creating the URLs
=========
type:sq1

```{r}
getUrls <- function(y1,y2,term){
  root="http://stats.grok.se/json/en/"
  urls <- NULL
  for (year in y1:y2){
    for (month in 1:9){
      urls <- c(urls,(paste(root,year,0,month,"/",term,sep="")))
    }
    for (month in 10:12){
      urls <- c(urls,(paste(root,year,month,"/",term,sep="")))
    }
  }
  return(urls)
}
```

Put it together
======
type:sq2

```{r}
#create some URLs
urls <- getUrls(y1=2013,y2=2014,"Euromaidan")

#get data for each of them and store that data
results=NULL
for (url in urls){
  results <- rbind(results,getData(url))
}
head(results)
```

Inspect the data, visualise it

***

```{r}
ggplot(tail(results,100),aes(date,rd.views))+geom_line()
```

OK, let's move on
=============
type:section
- HTML
- XPath
- CSS and attributes

Getting to know HTML structure
==============================
http://en.wikipedia.org/wiki/Euromaidan

Let's look at this webpage:

- Headings
- Images
- Links
- References
- Tables

To look at the code (in Google Chrome), right-click somewhere on the page and select 'inspect element'.

Tree structure (parents, siblings)

Back to Wikipedia
====================
HTML tags. They come in pairs and are surrounded by these guys: <>

e.g. a heading might look like this:

\<h1\>MY HEADING\</h1\>

<h1>MY HEADING</h1>

Which others do you know or can you find?

HTML tags
======================
- \<html> : starts HTML code
- \<head> : contains metadata etc.
- \<script> : e.g. JavaScript to be loaded
- \<style> : CSS code
- \<meta> : denotes document properties, e.g. author, keywords
- \<title> :
- \<body> :

HTML tags2
======================
- \<div>, \<span> : these are used to break up a document into sections and boxes
- \<h1>, \<h2>, \<h3>, \<h4>, \<h5> : different levels of heading
- \<p> : paragraph
- \<br> : line break
- and others: \<a>, \<ul>, \<tbody>, \<th>, \<td>, \<img>

Principles of scraping
=============
- Identify the tag
- Download the web-page
- Extract content matching the tag
- Save the content
- Optional: repeat

Download the web page
===============
type:sq
XML parser:

> Parses an XML or HTML file or string containing XML/HTML content, and generates an R structure representing the XML/HTML tree.

```{r}
require(RCurl)
require(XML)

url <- "http://en.wikipedia.org/wiki/Euromaidan"
SOURCE <- getURL(url,encoding="UTF-8") #Download the page
#this is a very very long line. Let's not print it. Instead:
substring(SOURCE,1,200)
PARSED <- htmlParse(SOURCE) #Format the html code
```

Accessing HTML elements in R
========
type:sq
Reading: http://www.w3schools.com/xpath/xpath_intro.asp

We can use XPath expressions to extract elements from HTML

```{r}
xpathSApply(PARSED, "//h1")
```

Not so pretty. But! Specifying xmlValue strips away the surrounding code and returns only the content of the tag

```{r}
xpathSApply(PARSED, "//h1",xmlValue)
```
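Trying it on a toy page
========
type:sq
To see what xpathSApply is doing, it can help to try it on a tiny, self-contained snippet. This snippet is made up for illustration:

```{r}
require(XML)
toy <- "<html><body><h1>My heading</h1><p>Some text and a <a href='http://example.com'>link</a></p></body></html>"
toyParsed <- htmlParse(toy, asText=TRUE)  #asText: parse the string itself, not a file name
xpathSApply(toyParsed, "//h1", xmlValue)
xpathSApply(toyParsed, "//a/@href")
```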
Digging deeper
=====================
What about other headings?

```{r}
xpathSApply(PARSED, "//h3",xmlValue)
```

=========
and links

```{r}
length(xpathSApply(PARSED, "//a/@href"))
```

There are loads of them. We need to be more selective.

CSS and XPath
===============
type: sq
Web designers use Cascading Style Sheets to determine the way a webpage looks.

Like variables: change the style, rather than every item on a page. I use CSS for these slides; check out the code for this page.

<strong>CSS allows us to make better selections, by latching onto tags</strong>

**XPath allows us to move up and down the HTML tree structure**

CSS can be an HTML **attribute**

CSS example here

Get references
=======================
type:sq
Content of the references

```{r}
head(xpathSApply(PARSED, "//span[@class='citation news']",xmlValue))
```

Links in the references

```{r}
head(xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))
```

***

Test that these work, using the function from last week

```{r}
links <- (xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))
browseURL(links[1])
```

So if you wanted to, you could scrape these links in turn.

The tree structure is navigated a bit like that on your computer (c:/windows/system)

XPath
==================
Is a language for querying XML

Reading and examples: http://www.w3schools.com/xml/xml_xpath.asp

The element in quotes below is an *XPath expression*

```{r eval=F}
head(xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))
```

Fundamental XPath syntax
===============
- / : select from the root
- // : select anywhere in the document
- @ : select attributes. Use in square brackets

In this example, we select all elements of 'span'

...which have an **attribute** "class" of the value "citation news"

...then we select all links

...and return all attributes labelled "href" (the URLs)

```{r eval=F}
head(xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))
```

XPath2
============
type:sq1
Like in R, we use square brackets to make selections.

What does this select?

```{r}
head(xpathSApply(PARSED, "//span[@class='citation news'][17]/a/@href"))
```

Wildcards
======================
type:sq2
We can also use wildcards:

- * selects any node or tag
- @* selects any attribute (used to define nodes)

```{r}
(xpathSApply(PARSED, "//*[@class='citation news'][17]/a/@href"))
(xpathSApply(PARSED, "//span[@class='citation news'][17]/a/@*"))
```

XPath3
==================
You can use functions, e.g. for partial matches -> useful if there are subtle variations within or between pages

```{r eval=F}
head(xpathSApply(PARSED, "//span[starts-with(@class,'citation')][17]/a/@href"))
head(xpathSApply(PARSED, "//span[contains(@class,'citation')][17]/a/@href"))
```

Example syntax: [function(attribute,string)]
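Shortcut for tables
==================
type:sq
Wikipedia pages are full of tables, and you don't need hand-written XPath for those: the XML package's readHTMLTable pulls every \<table> into a list of data frames. A quick sketch (how many tables there are, and what they contain, depends on the page):

```{r eval=F}
tables <- readHTMLTable(PARSED)  #one data frame per <table> on the page
length(tables)                   #how many tables did the page contain?
head(tables[[1]])                #look at the first one
```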
Examples
===============
type:section
BBC

Guardian

Example - BBC article
================
type:sq

```{r}
url <- "http://www.bbc.co.uk/news/world-europe-26333587"
SOURCE <- getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
PARSED <- htmlParse(SOURCE)
(xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue))
(xpathSApply(PARSED, "//span[@class='date']",xmlValue))
#Meta field for better formatting
(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
```

Make a scraper
===============

```{r}
bbcScraper <- function(url){
  SOURCE <- getURL(url,encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  title=(xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue))
  date=as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  return(c(title,date))
}
```

Test it
============

```{r}
bbcScraper("http://www.bbc.co.uk/news/world-middle-east-26333533")
bbcScraper("http://www.bbc.co.uk/sport/0/football/26332893")
```

Ah. A problem. Not all pages have meta fields.

Adding exceptions
==============
type:sq2
As you build your scraper you will find the need to add multiple possible options. Why is this?

```{r}
bbcScraper2 <- function(url){
  title=date=NA #Return empty values in case field not found
  SOURCE <- getURL(url,encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  title=(xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue))
  date=(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  if (is.null(date)){
    date=(xpathSApply(PARSED, "//span[@class='date']",xmlValue))
  }
  return(c(title,as.character(date)))
}
```

```{r}
bbcScraper2("http://www.bbc.co.uk/news/world-middle-east-26333533")
bbcScraper2("http://www.bbc.co.uk/sport/0/football/26332893")
```
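Aside: surviving broken pages
==============
type:sq
Once a scraper runs in a loop over many URLs, a single dead link or oddly formatted page can stop the whole run. A common safeguard is to wrap the call in tryCatch; a rough sketch, not part of the exercises above:

```{r eval=F}
results <- NULL
for (url in urls){
  row <- tryCatch(bbcScraper2(url),
                  error = function(e) c(NA, NA)) #on failure, record NAs and carry on
  results <- rbind(results, row)
}
```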
Guardian
=======================
type:sq2
Start with the BBC scraper as a base, then change the necessary fields

```{r}
url <- "http://www.theguardian.com/commentisfree/2014/feb/25/how-much-cost-growers-bananas-68p-per-kilo"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
xpathSApply(PARSED, "//h1[contains(@itemprop,'headline')]",xmlValue)
xpathSApply(PARSED, "//a[@class='contributor']",xmlValue)
xpathSApply(PARSED, "//time[@itemprop='datePublished']",xmlValue)
```

Guardian continued
===========
type:sq2

```{r}
xpathSApply(PARSED, "//time[@itemprop='datePublished']/@datetime")
xpathSApply(PARSED, "//a[@rel='tag']",xmlValue)
unique(xpathSApply(PARSED, "//a[@rel='tag']",xmlValue))
xpathSApply(PARSED, "//div[@id='article-body-blocks']",xmlValue)
xpathSApply(PARSED, "//div[@id='article-body-blocks']/p",xmlValue)
```

Guardian scraper
======================
type:sq

```{r}
guardianScraper <- function(url){
  SOURCE <- getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
  PARSED <- htmlParse(SOURCE)
  title <- xpathSApply(PARSED, "//h1[contains(@itemprop,'headline')]",xmlValue)
  author <- xpathSApply(PARSED, "//a[@class='contributor']",xmlValue)
  time <- xpathSApply(PARSED, "//time[@itemprop='datePublished']/@datetime")
  tags <- unique(xpathSApply(PARSED, "//a[@rel='tag']",xmlValue))
  text <- xpathSApply(PARSED, "//div[@id='article-body-blocks']/p",xmlValue)
  return(list(title=title,
              author=author,
              time=time,
              tags=paste(tags,collapse="|"),
              text=paste(text,collapse="|")))
}
```

Using the scraper
====================
type:sq1

```{r}
a <- guardianScraper(url)
a["title"]
a[["title"]]
a[["tags"]]
```

Storing the data
=====================
type:sq
Small-scale operations: an R data.frame, using rbind

Problems:
- copying the table
- data frames are not good for storing text

Alternatives:
- XML
- SQLite database

Cons: less intuitive setup and access

Pros: much faster. Archive text and forget about it. Use SOLR to index and get fast retrieval.

BUT: you can learn enough SQL in a few hours. And R has excellent packages for saving and retrieving information from SQL.

Example with dataframe
======================
type:sq1
Get some URLs, e.g. top articles of the day

```{r}
url <- "http://www.theguardian.com/uk"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
urls <- xpathSApply(PARSED, "//div[@class='tabs-container']//*/@href")

#This is a bit tricky, apologies. There may be a better way
d <- lapply(urls,guardianScraper)
data <- data.frame(matrix(unlist(d),ncol=5,byrow=T)) #5 fields per article, one row each
colnames(data) <- c("title","author","time","tags","text")
as.character(data$tags)
```

Accessing this data later on
====================
The data.table package is very useful for this, but it takes a bit of learning.

```{r}
require(stringr)
#return title of texts mentioning Chelsea
data[grep("Chelsea",data$tags),"title"]

#return tags of texts mentioning Chelsea
unlist(str_split(data[grep("Chelsea",data$tags),"tags"],"\\|"))
```
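SQLite: a sketch
====================
type:sq
The 'Storing the data' slide mentioned SQLite. A minimal sketch with the RSQLite package — the file and table names here are made up for the example:

```{r eval=F}
require(RSQLite)
con <- dbConnect(SQLite(), dbname="articles.sqlite")  #creates the file if it doesn't exist
dbWriteTable(con, "articles", data, append=TRUE)      #store the scraped data frame
dbGetQuery(con, "SELECT title FROM articles WHERE tags LIKE '%Chelsea%'")
dbDisconnect(con)
```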
Practice time
================
type: section
Write a scraper for:
- http://www.mirror.co.uk/
- http://www.telegraph.co.uk/
- http://www.independent.co.uk

Your own interests here?

Advanced stuff: comments
=======================
type:sq
Harder - find the iframe

```{r}
url <- "http://discussion.theguardian.com/discussion/p/3n34d"
SOURCE <- getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
PARSED <- htmlParse(SOURCE)
xpathSApply(PARSED, "//div[@class='d2-body'][1]",xmlValue)
xpathSApply(PARSED, "//a[@class='d2-username']",xmlValue)
```

Comments2
================
type:sq2
So, we need a way to find the iframe. But the iframe is loaded by a script.

```{r}
url <- "http://www.theguardian.com/commentisfree/2014/feb/25/how-much-cost-growers-bananas-68p-per-kilo"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
links <- xpathSApply(PARSED, "//@href")
shortUrl <- links[grep("//gu\\.",links)][1]

require(stringr)
temp <- unlist(str_split(shortUrl,"/"))
target <- temp[length(temp)]
discussionURL <- paste0("http://discussion.theguardian.com/discussion/p/",target)

SOURCE <- getURL(discussionURL,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
xpathSApply(PARSED, "//a[@class='d2-username']",xmlValue)
```

Why might this be useful?
===============
Studying reception (e.g. sentiment)

Assessing how controversial a subject is (does anyone care enough to comment?)

Who always comments?
- Detecting spammers or paid commenters (e.g. articles about Russia)

Solutions (1: Mirror)
==================
type:sq1
Can you turn these into scraper functions?

Mirror

```{r}
#MIRROR
url <- "http://www.mirror.co.uk/news/world-news/oscar-pistorius-trial-murder-reeva-3181393"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1",xmlValue)
author <- xpathSApply(PARSED, "//li[@class='author']",xmlValue)
time <- xpathSApply(PARSED, "//time[@itemprop='datePublished']/@datetime")
```

Telegraph
============
type:sq1

```{r}
#Telegraph
url <- "http://www.telegraph.co.uk/news/uknews/terrorism-in-the-uk/10659904/Former-Guantanamo-detainee-Moazzam-Begg-one-of-four-arrested-on-suspicion-of-terrorism.html"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1[@itemprop='headline name']",xmlValue)
author <- xpathSApply(PARSED, "//p[@class='bylineBody']",xmlValue)
time <- xpathSApply(PARSED, "//p[@class='publishedDate']",xmlValue)
```

Independent
==============
type:sq1

```{r eval=F}
#Independent
url <- "http://www.independent.co.uk/news/world/asia/leopard-on-the-loose-puts-indian-city-of-meerut-on-lockdown-9151273.html"
SOURCE <- getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1",xmlValue)
author <- xpathSApply(PARSED, "//span[@class='authorName']",xmlValue)
time <- xpathSApply(PARSED, "//p[@class='dateline']",xmlValue)
```

Finally
============
What shall we do next week?

More work on this?

Work with APIs?

Look at gathering links for scraping, downloading files, etc?

<!-- CSS formatting used in these slides -->
<style>

.s1 .reveal .state-background {
  background: #E0E0FF;
}
.sq1 .reveal section code {
  font-size:145%;
}
.sq1 .reveal section p {
  font-size:100%;
}
.sq .reveal section code {
  font-size:125%;
}
.sq .reveal section p {
  font-size:85%;
}
.sq2 .reveal section code {
  font-size:100%;
}
.sq2 .reveal section p {
  font-size:70%;
}
.reveal blockquote {
  display: block;
  position: relative;
  width: 100%;
  margin: 5px auto;
  padding: 5px;
  font-style: normal;
  background: #C6D7DC;
  border: 1px solid #C6D7DC;
  box-shadow: none;
}
.reveal pre {
  margin-top: 0;
  max-width: 100%;
  width: 100%;
  border: 1px solid #ccc;
  white-space: pre-wrap;
  margin-bottom: 1em;
}
.reveal pre code {
  /* display: block; padding: 0.5em; */
  font-size: 1.6em;
  line-height: 1.1em;
  background-color: white;
  overflow: visible;
  max-height: none;
  word-wrap: normal;
}
.reveal section centered {
  text-align: center;
  border: none;
}
</style>