```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```

Web Scraping part 2: Digging in
========================================================
width: 1200
author: Rolf Fredheim and Aiora Zabala
date: University of Cambridge
font-family: 'Rockwell'
25/02/2014

Today we will:
========================================================

- Become good at writing bad* functions
- Learn how to access information in web-pages (HTML, XML, etc)

Get the docs:
http://fredheir.github.io/WebScraping/Lecture2/p2.html

https://raw.github.com/fredheir/WebScraping/master/Lecture2/p2.Rpres

http://fredheir.github.io/WebScraping/Lecture2/p2.R


<br>
<br>
<br>
<small> *maybe 'utilitarian', 'expedient', and 'functional' would be more accurate, if more pompous, descriptions of this practice</small>

Digital data collection
=======================

- Devise a means of accessing data
- Retrieve that data
- Tabulate and store the data

Last week, step two involved JSON.

Today we work with HTML.

Steps 1 and 3 stay much the same.

Revisiting the programming basics
================
type:section

Good functions
========================================================

What makes a function good?
- Clear name
- Instructions to user
- Short
- Performs a single task
- Is efficient
- Can handle errors
- Is predictable
- Does not use **global variables** (someone explain? see the sketch on the next slide)
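
Global variables: a quick sketch
========================================================
A global variable is one defined outside the function but read inside it. A minimal sketch of the difference (badTax and goodTax are made-up names for illustration):

```{r}
#Depends on a global variable: the result silently changes if 'rate' is edited elsewhere
rate <- 20
badTax <- function(wage){
  wage*rate/100
}

#Self-contained: everything the function needs is passed in as an argument
goodTax <- function(wage, rate=20){
  wage*rate/100
}

goodTax(20000)
goodTax(20000, rate=40)
```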

Bad functions
========================================================

... break the rules/guidelines above. But they can be useful to:
- hide a script behind a function
- get an overview
- move on to the next task
- skip worrying about methods and error handling, which simplifies the process

If writing functions for your own use, it's ok* to write bad functions. 

<small>*But basic notes throughout the code reminding yourself what you did will be invaluable</small>

Revision
============
What is a variable?

What are they for?


Variables
=================
type:sq
Two main purposes:

quicker to write
```{r}
uni= "The University of Cambridge"
uni
```
***
quicker to change the code. It is good practice to declare variables near the start of your code

Paying tax:
£9,440 tax free
```{r}
(20000-9440)*20/100
#OR:
wage=20000
taxFree=9440
rate=20
(wage-taxFree)*rate/100

```



Functions without variables
================

```{r}
printName <- function(){
  print ("My name is Rolf Fredheim")
}

printName()
```

This is a useless function. But sometimes, if we have many lines of code requiring no particular input, it can be useful to file them away like this.

e.g. for simulations
============
```{r}
sillySimulation <- function(){
x1 <- runif(500,80,100)
x2 <- runif(500,0,100)
v1 <- c(x1,x2)

x3 <- runif(1000,0,100)

df <- data.frame(v1,x3)
require(ggplot2)

print(ggplot(df, aes(v1,x3))+geom_point()+ggtitle("simulation of some sort"))
}
```


=====
Just as this slide hides the code on the previous slide, so the function hides the underlying code. 
```{r}
sillySimulation()
```

Inserting variables
=========
Let's hammer home how to use variables

what variables could we add to the function below?
```{r}
desperateTimes <- function(){
  print(paste0("Rolf is struggling to finish his PhD on time. Time remaining: 6 months"))
}
```


Name
===========

```{r}
desperateTimes <- function(name){
  print(paste0(name ," is struggling to finish his PhD on time. Time remaining: 6 months"))
}
desperateTimes(name="Tom")
```

Gender
===========
type:sq
we specify a default value
```{r}
desperateTimes <- function(name,gender="m"){
  if(gender=="m"){
    pronoun="his"
  }else{
    pronoun="her"
  }
  
  print(paste0(name ," is struggling to finish ",pronoun," PhD on time. Time remaining: 6 months"))
}
desperateTimes(name="Tanya",gender="f")
```
Is this a good function? Why (not)?

degree
==============
```{r}
desperateTimes <- function(name,gender="m",degree){
  if(gender=="m"){
    pronoun="his"
  }else{
    pronoun="her"
  }
  
  print(paste0(name ," is struggling to finish ",pronoun," ",degree," on time. Time remaining: 6 months"))
}
desperateTimes(name="Rolf",gender="m","Mphil")
```


Days til deadline
============
type:sq1
```{r}
require(lubridate)
require(ggplot2)
deadline=as.Date("2014-09-01")
daysLeft <- deadline-Sys.Date()
totDays <- deadline-as.Date("2011-10-01")
print(daysLeft)
print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
```
part2
==========
type:sq
```{r}
print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
```


===========
type:sq1

We could put all this code in a function, and forget about it
```{r}
timeToWorry <- function(){
  require(lubridate)
  deadline=as.Date("2014-09-01")
  daysLeft <- deadline-Sys.Date()
  totDays <- deadline-as.Date("2011-10-01")
  print(daysLeft)
  print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
  print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
  df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
  ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
}
```

File it away until in need of a reminder
======
```{r}
timeToWorry()

```


Finishing up last week's material
=============================
type:section


What does this have to do with webscraping?
============
Bad functions like this will help us to break the task into bitesize chunks

Rather than working with long unruly scripts, we write a little script that works, identify any necessary variables, and file it away. A typical structure might be:

- Load packages, set working directory
- Download one example
- Extract the necessary information
- Store the information
- Repeat.

-> either by looping, or by completing one step at a time (a rough skeleton on the next slide)
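
=====
type:sq1
A rough skeleton of that structure (downloadPage() and extractInfo() are placeholder names standing in for whatever your own functions end up doing):

```{r eval=F}
require(RCurl); require(XML)      #load packages
setwd("~/myProject")              #set working directory (path is just an example)

results <- NULL
for (url in urls){                #urls: a vector of pages you built beforehand
  page <- downloadPage(url)       #download one example
  info <- extractInfo(page)       #extract the necessary information
  results <- rbind(results, info) #store the information
}                                 #repeat for each url
```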


Last week's code
=================
type:sq
example and explanation

downloading data
Check the code is correct
```{r}
require(rjson)
url <-  "http://stats.grok.se/json/en/201201/web_scraping"
raw.data <- readLines(url, warn=FALSE) 
rd  <- fromJSON(raw.data)
rd.views <- rd$daily_views 
rd.views <- unlist(rd.views)
rd <- as.data.frame(rd.views)
rd$date <- rownames(rd)
rownames(rd) <- NULL
rd
```


Turn it into a function
========================
type:sq
"url" is the only thing that changes. Thus we have one variable
At the end we "return" the data to the user
```{r}
getData <- function(url){
  require(rjson)
  raw.data <- readLines(url, warn=FALSE) 
  rd  <- fromJSON(raw.data)
  rd.views <- rd$daily_views 
  rd.views <- unlist(rd.views)
  rd <- as.data.frame(rd.views)
  rd$date <- rownames(rd)
  rownames(rd) <- NULL
  rd$date <- as.Date(rd$date)
  return(rd)
}
```
Now we can forget about *how* we download data, once we've checked that the code works.
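For example (head() is used here just to keep the output short):
```{r}
head(getData("http://stats.grok.se/json/en/201201/web_scraping"))
```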

============
The script and the function achieve exactly the same thing.

But: compressing the code into a single function relieves the brain and de-clutters your code.

Creating the URLS
=========
type:sq1
```{r}
getUrls <- function(y1,y2,term){
  root="http://stats.grok.se/json/en/"
  urls <- NULL
  for (year in y1:y2){
    for (month in 1:9){
      urls <- c(urls,(paste(root,year,0,month,"/",term,sep="")))
    }
    for (month in 10:12){
      urls <- c(urls,(paste(root,year,month,"/",term,sep="")))
    }
  }
  return(urls)
}
```


Put it together
======
type:sq2
```{r}
#create some URLS
urls <- getUrls(y1=2013,y2=2014,"Euromaidan")

#get data for each of them and store that data
results=NULL
for (url in urls){
  results <- rbind(results,getData(url))
}
head(results)
```
Inspect the data,
visualise it
***
```{r}
ggplot(tail(results,100),aes(date,rd.views))+geom_line()
```



OK, lets move on
=============
type:section

- HTML
- XPath
- CSS and attributes


Getting to know HTML structure
==============================

http://en.wikipedia.org/wiki/Euromaidan

Let's look at this webpage

- Headings
- Images
- links
- references
- tables

To look at the code (in Google Chrome), right-click somewhere on the page and select 'inspect element'

Tree-structure (parents, siblings)

Back to Wikipedia
====================
HTML tags.

They come in pairs and are surrounded by these guys:
<>

e.g. a heading might look like this:

\<h1\>MY HEADING\</h1\>
<h1>MY HEADING</h1>

Which others do you know or can you find?

HTML tags
======================

- \<html>: starts html code
- \<head> : contains meta data etc
- \<script> : e.g. javascript to be loaded
- \<style> : css code
- \<meta> : denotes document properties, e.g. author, keywords
- \<title> : the page title, shown in the browser tab
- \<body> : the visible content of the page

HTML tags2
======================

- \<div>, \<span> : these are used to break up a document into sections and boxes
- \<h1>, \<h2>, \<h3>, \<h4>, \<h5> : different levels of heading
- \<p> : paragraph
- \<br> : line break
- and others: \<a>, \<ul>, \<li>, \<tbody>, \<th>, \<td>, \<img>


Principles of scraping
=============
- Identify the tag
- Download the web-page
- Extract content matching the tag
- Save the content
- Optional: repeat

Download the web page
===============
type:sq
XML Parser: 
>Parses an XML or HTML file or string containing XML/HTML content, and generates an R structure representing the XML/HTML tree. 

```{r}
require(RCurl)
require(XML)

url <- "http://en.wikipedia.org/wiki/Euromaidan"
SOURCE <-  getURL(url,encoding="UTF-8") #Download the page
#this is a very very long line. Let's not print it. Instead:
substring(SOURCE,1,200)
PARSED <- htmlParse(SOURCE) #Format the html code

```


Accessing HTML elements in R
========
type:sq

Reading: 
http://www.w3schools.com/xpath/xpath_intro.asp

We can use XPath expressions to extract elements from HTML:

```{r}
xpathSApply(PARSED, "//h1")
```

Not so pretty. But! Specifying xmlValue strips away the surrounding code and returns only the content of the tag

```{r}
xpathSApply(PARSED, "//h1",xmlValue)
```

Digging deeper
=====================
What about other headings?

```{r}
xpathSApply(PARSED, "//h3",xmlValue)
```

=========
and links
```{r}
length(xpathSApply(PARSED, "//a/@href"))
```
There are loads of them. We need to be more selective.


CSS and XPath
===============
type: sq
Web designers use Cascading Style Sheets (CSS) to determine the way a webpage looks.

Like variables: change the style definition, rather than every item on a page.

I use CSS for these slides; check out the code for this page.

<strong>CSS allows us to make better selections, by latching onto tags</strong>

**Xpath allows us to move up and down the html tree structure**

CSS can be an html **attribute**

CSS example here
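
A minimal sketch of latching onto a class attribute with an XPath expression (the toy HTML string below is invented for illustration):

```{r}
toy <- '<p class="headline">Big news</p><p class="small">Small print</p>'
toyParsed <- htmlParse(toy, asText=TRUE)   #parse the string rather than a URL
xpathSApply(toyParsed, "//p[@class='headline']", xmlValue)
```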


Get references
=======================
type:sq

Content of the references
```{r}
head(xpathSApply(PARSED, "//span[@class='citation news']",xmlValue))
```

Links (href attributes) of the references
```{r}
head(xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))
```
***
Test that these work, using the function from last week


```{r eval=F}
links <- (xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))
browseURL(links[1])
```

So if you wanted to, you could scrape these links in turn.
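
Scraping them in turn might look something like this (a sketch, not evaluated here; 'page' is just a throwaway name):

```{r eval=F}
for (link in head(links, 3)){
  page <- htmlParse(getURL(link, encoding="UTF-8"))
  print(xpathSApply(page, "//title", xmlValue))   #e.g. grab each page's title
}
```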

The tree structure is navigated a bit like the one on your computer (e.g. c:/windows/system)

XPath
==================

Is a language for querying XML

Reading and examples: http://www.w3schools.com/xml/xml_xpath.asp

The element in quotes below is an *XPath expression*

```{r eval=F}
head(xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))

```


Fundamental XPath Syntax
===============

- `/` : select from the root
- `//` : select anywhere in the document
- `@` : select attributes (use inside square brackets)

In this example, we select all 'span' elements
...which have an **attribute** "class" with the value "citation news"
...then we select all the links inside them
...and return all attributes labeled "href" (the URLs)


```{r eval=F}
head(xpathSApply(PARSED, "//span[@class='citation news']/a/@href"))
```


XPath2
============
type:sq1

Like in R, we use square brackets to make selections

What does this select?
```{r}
head(xpathSApply(PARSED, "//span[@class='citation news'][17]/a/@href"))
```

Wildcards
======================
type:sq2

We can also use wildcards:

- `*` selects any node or tag
- `@*` selects any attribute (used to define nodes)
```{r}
(xpathSApply(PARSED, "//*[@class='citation news'][17]/a/@href"))
(xpathSApply(PARSED, "//span[@class='citation news'][17]/a/@*"))
```


XPath3
==================
You can use functions, e.g. for partial matches
-> useful if there are subtle variations within or between pages
```{r eval=F}
head(xpathSApply(PARSED, "//span[starts-with(@class,'citation')][17]/a/@href"))
head(xpathSApply(PARSED, "//span[contains(@class,'citation')][17]/a/@href"))
```

Example syntax: `[function(attribute, string)]`


Examples
===============
type:section

BBC

Guardian


Example - BBC article
================
type:sq
```{r}
url <- "http://www.bbc.co.uk/news/world-europe-26333587"
SOURCE <-  getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
PARSED <- htmlParse(SOURCE)
(xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue))
(xpathSApply(PARSED, "//span[@class='date']",xmlValue))
#Meta field for better formatting
(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
```


Make a scraper
===============

```{r}
bbcScraper <- function(url){
  SOURCE <-  getURL(url,encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  title=(xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue))
  date=as.character(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  return(c(title,date))
}
```


Test it
============
```{r}
bbcScraper("http://www.bbc.co.uk/news/world-middle-east-26333533")
bbcScraper("http://www.bbc.co.uk/sport/0/football/26332893")
```

Ah, a problem: not all pages have meta fields.

Adding exceptions
==============
type:sq2
As you build your scraper, you will find you need to add multiple fallback options.

Why is this?
```{r}
bbcScraper2 <- function(url){
  title=date=NA #Return empty values in case field not found
  SOURCE <-  getURL(url,encoding="UTF-8") 
  PARSED <- htmlParse(SOURCE)
  title=(xpathSApply(PARSED, "//h1[@class='story-header']",xmlValue))
  date=(xpathSApply(PARSED, "//meta[@name='OriginalPublicationDate']/@content"))
  if (is.null(date)){
    date=(xpathSApply(PARSED, "//span[@class='date']",xmlValue))
  }
  return(c(title,as.character(date)))
}
```

```{r}
bbcScraper2("http://www.bbc.co.uk/news/world-middle-east-26333533")
bbcScraper2("http://www.bbc.co.uk/sport/0/football/26332893")
```


Guardian
=======================
type:sq2
Start with the BBC scraper as a base, then change the necessary fields.

```{r}
url <- "http://www.theguardian.com/commentisfree/2014/feb/25/how-much-cost-growers-bananas-68p-per-kilo"
  SOURCE <-  getURL(url,encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  xpathSApply(PARSED, "//h1[contains(@itemprop,'headline')]",xmlValue)
  xpathSApply(PARSED, "//a[@class='contributor']",xmlValue)
  xpathSApply(PARSED, "//time[@itemprop='datePublished']",xmlValue)
```
Guardian continued
===========
type:sq2

```{r}
  xpathSApply(PARSED, "//time[@itemprop='datePublished']/@datetime")
  xpathSApply(PARSED, "//a[@rel='tag']",xmlValue)
  unique(xpathSApply(PARSED, "//a[@rel='tag']",xmlValue))
  xpathSApply(PARSED, "//div[@id='article-body-blocks']",xmlValue)
  xpathSApply(PARSED, "//div[@id='article-body-blocks']/p",xmlValue)
```


Guardian scraper
======================
type:sq
```{r}
guardianScraper <- function(url){
  SOURCE <-  getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
  PARSED <- htmlParse(SOURCE)
  title <- xpathSApply(PARSED, "//h1[contains(@itemprop,'headline')]",xmlValue)
  author <- xpathSApply(PARSED, "//a[@class='contributor']",xmlValue)
  time  <- xpathSApply(PARSED, "//time[@itemprop='datePublished']/@datetime")
  tags <- unique(xpathSApply(PARSED, "//a[@rel='tag']",xmlValue))
  text <- xpathSApply(PARSED, "//div[@id='article-body-blocks']/p",xmlValue)
  return(list(title=title,
              author=author,
              time=time,
              tags=paste(tags,collapse="|"),
              text=paste(text,collapse="|")))
}
```

Using the scraper
====================
type:sq1
```{r}
a <- guardianScraper(url)
a["title"]
a[["title"]]
a[["tags"]]
```

Storing the data
=====================
type:sq
Small-scale operations: an R data.frame, using rbind

Problems:
- copying the table
- data frames are not good for storing text

Alternatives: an XML file, or an SQLite database

- Cons: less intuitive setup and access
- Pros: much faster. Archive the text and forget about it; use Solr to index it and get fast retrieval.

BUT: you can learn enough SQL in a few hours, and R has excellent packages for saving and retrieving information from SQL databases.
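
=====
type:sq1
For instance, a minimal sketch using the RSQLite package (not used elsewhere in these slides; the file and table names are invented, and 'data' stands for your scraped data frame):

```{r eval=F}
require(DBI); require(RSQLite)
con <- dbConnect(SQLite(), "articles.sqlite")     #a file-based database
dbWriteTable(con, "articles", data, append=TRUE)  #archive the scraped rows
dbGetQuery(con, "SELECT title, time FROM articles LIMIT 5")
dbDisconnect(con)
```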


Example with a data frame
======================
type:sq1
Get some URLs, e.g. top articles of the day
```{r}
url <- "http://www.theguardian.com/uk"
  SOURCE <-  getURL(url,encoding="UTF-8") 
  PARSED <- htmlParse(SOURCE)
  urls <- xpathSApply(PARSED, "//div[@class='tabs-container']//*/@href")

#This is a bit tricky, apologies. There may be a better way
d <- lapply(urls,guardianScraper)
data <- data.frame(matrix(unlist(d),ncol=5,byrow=T)) #five fields per article
colnames(data) <- c("title","author","time","tags","text")
as.character(data$tags)
```

Accessing this data later on
====================
The data.table package is very useful for this, but it takes a bit of learning.
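For instance, the lookup below could be written like this in data.table syntax (a sketch, not evaluated):
```{r eval=F}
require(data.table)
dt <- data.table(data)             #convert the scraped data frame
dt[grepl("Chelsea", tags), title]  #titles of texts whose tags mention Chelsea
```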

```{r}
require(stringr)
#return title of texts mentioning Chelsea
data[grep("Chelsea",data$tags),"title"]
#return tags of texts mentioning Chelsea
unlist(str_split(data[grep("Chelsea",data$tags),"tags"],"\\|"))
```

Practice time
================
type: section

write a scraper for:
- http://www.mirror.co.uk/
- http://www.telegraph.co.uk/
- http://www.independent.co.uk

Your own interests here?

Advanced stuff: comments
=======================
type:sq

Harder - find the iframe
```{r}
url <- "http://discussion.theguardian.com/discussion/p/3n34d"
  SOURCE <-  getURL(url,encoding="UTF-8") # Specify encoding when dealing with non-latin characters
  PARSED <- htmlParse(SOURCE)
  xpathSApply(PARSED, "//div[@class='d2-body'][1]",xmlValue)
  xpathSApply(PARSED, "//a[@class='d2-username']",xmlValue)

```


Comments2
================
type:sq2
So, we need a way to find the iframe. But the iframe is loaded by a script. 
```{r}
url <- "http://www.theguardian.com/commentisfree/2014/feb/25/how-much-cost-growers-bananas-68p-per-kilo"
  SOURCE <-  getURL(url,encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  links <- xpathSApply(PARSED, "//@href")
  shortUrl <- links[grep("//gu\\.",links)][1]
  require(stringr)
  temp <- unlist(str_split(shortUrl,"/"))
  target <- temp[length(temp)]

  discussionURL <- paste0("http://discussion.theguardian.com/discussion/p/",target)
  SOURCE <-  getURL(discussionURL,encoding="UTF-8") 
  PARSED <- htmlParse(SOURCE)
  xpathSApply(PARSED, "//a[@class='d2-username']",xmlValue)
```


Why might this be useful?
===============
- Studying reception (e.g. sentiment)
- Assessing how controversial a subject is (does anyone care enough to comment?)
- Who always comments?
  - Detecting spammers or paid commenters (e.g. on articles about Russia)


Solutions (1: Mirror)
==================
type:sq1

Can you turn these into scraper functions?
Mirror
```{r}
#MIRROR
url <- "http://www.mirror.co.uk/news/world-news/oscar-pistorius-trial-murder-reeva-3181393"
SOURCE <-  getURL(url,encoding="UTF-8") 
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1",xmlValue)
author <- xpathSApply(PARSED, "//li[@class='author']",xmlValue)
time  <- xpathSApply(PARSED, "//time[@itemprop='datePublished']/@datetime")
```
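One possible answer for the Mirror, following the bbcScraper pattern (a sketch, not evaluated here; mirrorScraper is a name introduced for illustration):
```{r eval=F}
mirrorScraper <- function(url){
  SOURCE <- getURL(url, encoding="UTF-8")
  PARSED <- htmlParse(SOURCE)
  title  <- xpathSApply(PARSED, "//h1", xmlValue)
  author <- xpathSApply(PARSED, "//li[@class='author']", xmlValue)
  time   <- as.character(xpathSApply(PARSED, "//time[@itemprop='datePublished']/@datetime"))
  return(c(title, author, time))
}
mirrorScraper("http://www.mirror.co.uk/news/world-news/oscar-pistorius-trial-murder-reeva-3181393")
```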

Telegraph
============
type:sq1

```{r}
#Telegraph
url <- "http://www.telegraph.co.uk/news/uknews/terrorism-in-the-uk/10659904/Former-Guantanamo-detainee-Moazzam-Begg-one-of-four-arrested-on-suspicion-of-terrorism.html"
SOURCE <-  getURL(url,encoding="UTF-8") 
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1[@itemprop='headline name']",xmlValue)
author <- xpathSApply(PARSED, "//p[@class='bylineBody']",xmlValue)
time  <- xpathSApply(PARSED, "//p[@class='publishedDate']",xmlValue)
```

Independent
==============
type:sq1
```{r eval=F}
#Independent
url <- "http://www.independent.co.uk/news/world/asia/leopard-on-the-loose-puts-indian-city-of-meerut-on-lockdown-9151273.html"
SOURCE <-  getURL(url,encoding="UTF-8")
PARSED <- htmlParse(SOURCE)
title <- xpathSApply(PARSED, "//h1",xmlValue)
author <- xpathSApply(PARSED, "//span[@class='authorName']",xmlValue)
time  <- xpathSApply(PARSED, "//p[@class='dateline']",xmlValue)
```


Finally
============
What shall we do next week?

More work on this? 

Work with APIs?

Look at gathering links for scraping, downloading files, etc?




<!-- CSS formatting used in these slides -->

<style>.s1 .reveal .state-background {
  background: #E0E0FF;
} 

.sq1 .reveal section code {
  font-size:145%;
}
.sq1 .reveal section p {
  font-size:100%;
}


.sq .reveal section code {
  font-size:125%;
}
.sq .reveal section p {
  font-size:85%;
}


.sq2 .reveal section code {
	font-size:100%;
}
.sq2 .reveal section p {
  font-size:70%;
}
.reveal blockquote {
  display: block;
  position: relative;
  width: 100%;
  margin: 5px auto;
  padding: 5px;

  font-style: normal;
  background: #C6D7DC;
  border: 1px solid #C6D7DC;
  box-shadow: none;
}

.reveal pre {   
  margin-top: 0;
  max-width: 100%;
  width: 100%;
  border: 1px solid #ccc;
  white-space: pre-wrap;
  margin-bottom: 1em; 
}

.reveal pre code {
/*  display: block; padding: 0.5em;
*/  font-size: 1.6em;
  line-height: 1.1em;
  background-color: white;
  overflow: visible;
  max-height: none;
  word-wrap: normal;
}

.reveal section centered {
	text-align: center;
   border: none;
}
</style>