```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
Web Scraping part 2: Digging in
========================================================
width: 1200
author: Rolf Fredheim and Aiora Zabala
date: University of Cambridge
font-family: 'Rockwell'
25/02/2014
Today we will:
========================================================
- Become good at writing bad* functions
- Learn how to access information in web-pages (HTML, XML, etc)
Get the docs:
http://fredheir.github.io/WebScraping/Lecture2/p2.html
https://raw.github.com/fredheir/WebScraping/master/Lecture2/p2.Rpres
http://fredheir.github.io/WebScraping/Lecture2/p2.R
*maybe 'utilitarian', 'expedient', and 'functional' would be more accurate, if more pompous, descriptions of this practice
Digital data collection
=======================
- Devise a means of accessing data
- Retrieve that data
- Tabulate and store the data
Last week, step 2 involved JSON.
Today we work with HTML.
Steps 1 and 3 stay the same.
Revisiting the programming basics
================
type:section
Good functions
========================================================
What makes a function good?
- Clear name
- Instructions to user
- Short
- Performs a single task
- Is efficient
- Can handle errors
- Is predictable
- Does not use **global variables** (someone explain?)
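Global variables: an illustration
========================================================
One way to see the problem with global variables is the sketch below. The names `taxRate`, `taxGlobal` and `taxArgs`, and the numbers, are made up for illustration; they are not part of the lecture code.

```{r}
# Hypothetical example: names and numbers are illustrative only
taxRate <- 20  # a global variable

# Relies on whatever 'taxRate' happens to be in the workspace
taxGlobal <- function(wage){
  (wage-9400)*taxRate/100
}

# Everything the function needs is passed in as an argument
taxArgs <- function(wage,taxFree=9400,rate=20){
  (wage-taxFree)*rate/100
}

taxGlobal(20000)
taxRate <- 40       # someone changes the global elsewhere in the script
taxGlobal(20000)    # the result silently changes
taxArgs(20000)      # unaffected by the global
```

The second function is predictable because its result depends only on its arguments.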
Bad functions
========================================================
... break the rules/guidelines above. But they can be useful to:
- hide a script behind a function
- get an overview
- move on to the next task
- skip methods and error handling, which simplifies the process
If writing functions for your own use, it's ok* to write bad functions.
*But basic notes throughout the code reminding yourself what you did will be invaluable
Revision
============
What is a variable?
What are they for?
Variables
=================
type:sq
Two main purposes:
quicker to write
```{r}
uni= "The University of Cambridge"
uni
```
***
quicker to change the code. It is good practice to declare variables near the start of your code
Paying tax:
9400 tax free
```{r}
(20000-9400)*20/100
#OR:
wage=20000
taxFree=9400
rate=20
(wage-taxFree)*rate/100
```
Functions without variables
================
```{r}
printName <- function(){
print ("My name is Rolf Fredheim")
}
printName()
```
This is a useless function. But sometimes, if we have many lines of code requiring no particular input, it can be useful to file them away like this.
e.g. for simulations
============
```{r}
sillySimulation <- function(){
x1 <- runif(500,80,100)
x2 <- runif(500,0,100)
v1 <- c(x1,x2)
x3 <- runif(1000,0,100)
df <- data.frame(v1,x3)
require(ggplot2)
print(ggplot(df, aes(v1,x3))+geom_point()+ggtitle("simulation of some sort"))
}
```
=====
Just as this slide hides the code on the previous slide, so the function hides the underlying code.
```{r}
sillySimulation()
```
Inserting variables
=========
Let's hammer home how to use variables.
What variables could we add to the function below?
```{r}
desperateTimes <- function(){
print(paste0("Rolf is struggling to finish his PhD on time. Time remaining: 6 months"))
}
```
Name
===========
```{r}
desperateTimes <- function(name){
print(paste0(name ," is struggling to finish his PhD on time. Time remaining: 6 months"))
}
desperateTimes(name="Tom")
```
Gender
===========
type:sq
we specify a default value
```{r}
desperateTimes <- function(name,gender="m"){
if(gender=="m"){
pronoun="his"
}else{
pronoun="her"
}
print(paste0(name ," is struggling to finish ",pronoun," PhD on time. Time remaining: 6 months"))
}
desperateTimes(name="Tanya",gender="f")
```
Is this a good function? Why (not)?
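Gender: handling unexpected input
===========
One possible answer: the function above treats anything other than "m" as female, so `gender="M"` or `gender="male"` would print "her". A minimal sketch of a stricter version follows. The name `desperateTimesStrict` is made up here; `match.arg()` is base R.

```{r}
# Hypothetical stricter variant (not part of the original lecture code)
desperateTimesStrict <- function(name,gender=c("m","f")){
  # match.arg() stops with an error unless gender is "m" or "f";
  # if gender is not supplied, the first choice ("m") is used
  gender <- match.arg(gender)
  pronoun <- if(gender=="m") "his" else "her"
  msg <- paste0(name," is struggling to finish ",pronoun," PhD on time. Time remaining: 6 months")
  print(msg)
  invisible(msg)
}
desperateTimesStrict(name="Tanya",gender="f")
# An unexpected value now raises an error instead of silently printing "her"
try(desperateTimesStrict(name="Tom",gender="male"))
```

This is closer to "can handle errors" and "is predictable", at the cost of a few extra lines.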
Degree
==============
```{r}
desperateTimes <- function(name,gender="m",degree){
if(gender=="m"){
pronoun="his"
}else{
pronoun="her"
}
print(paste0(name ," is struggling to finish ",pronoun," ",degree," on time. Time remaining: 6 months"))
}
desperateTimes(name="Rolf",gender="m",degree="MPhil")
```
Days til deadline
============
type:sq1
```{r}
require(lubridate)
require(ggplot2)
deadline=as.Date("2014-09-01")
daysLeft <- deadline-Sys.Date()
totDays <- deadline-as.Date("2011-10-01")
print(daysLeft)
print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
```
Part 2
==========
type:sq
```{r}
print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
```
===========
type:sq1
We could put all this code in a function, and forget about it
```{r}
timeToWorry <- function(){
require(lubridate)
deadline=as.Date("2014-09-01")
daysLeft <- deadline-Sys.Date()
totDays <- deadline-as.Date("2011-10-01")
print(daysLeft)
print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
}
```
File it away until in need of a reminder
======
```{r}
timeToWorry()
```
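Adding variables to timeToWorry
======
As with `desperateTimes`, the hard-coded values could become arguments. A sketch under assumptions: the name `timeLeft` and its arguments `deadline`, `start` and `today` are invented here, and it returns numbers rather than printing or plotting.

```{r}
# Hypothetical parameterised variant of timeToWorry()
timeLeft <- function(deadline,start,today=Sys.Date()){
  deadline <- as.Date(deadline)
  daysLeft <- as.numeric(deadline-as.Date(today))
  totDays <- as.numeric(deadline-as.Date(start))
  # return the numbers so the caller can print or plot them
  list(daysLeft=daysLeft,totDays=totDays,percentToGo=round(daysLeft/totDays*100))
}
# Passing 'today' explicitly makes the result reproducible
timeLeft(deadline="2014-09-01",start="2011-10-01",today="2014-02-25")
```

Leaving `today` at its default gives the same moving target as the original function.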
Finishing up last week's material
=============================
type:section
What does this have to do with web scraping?
============
Bad functions like this will help us to break the task into bite-size chunks.
Rather than working with long unruly scripts, we write a little script that works, identify any necessary variables, and file it away. A typical structure might be:
- Load packages, set working directory
- Download one example
- Extract the necessary information
- Store the information
- Repeat.
-> either by looping, or by completing one step at a time
Last week's code
=================
type:sq
example and explanation
downloading data
Check the code is correct
```{r}
require(rjson)
url <- "http://stats.grok.se/json/en/201201/web_scraping"
raw.data <- readLines(url, warn=FALSE)
rd <- fromJSON(raw.data)
rd.views <- rd$daily_views
rd.views <- unlist(rd.views)
rd <- as.data.frame(rd.views)
rd$date <- rownames(rd)
rownames(rd) <- NULL
rd
```
Turn it into a function
========================
type:sq
"url" is the only thing that changes. Thus we have one variable
At the end we "return" the data to the user
```{r}
getData <- function(url){
require(rjson)
raw.data <- readLines(url, warn=FALSE)
rd <- fromJSON(raw.data)
rd.views <- rd$daily_views
rd.views <- unlist(rd.views)
rd <- as.data.frame(rd.views)
rd$date <- rownames(rd)
rownames(rd) <- NULL
rd$date <- as.Date(rd$date)
return(rd)
}
```
Now we can forget about *how* we download data, after checking the code works:
getData("http://stats.grok.se/json/en/201201/web_scraping")
============
The script and the function achieve exactly the same thing.
But: compressing the code to a single function is good to relieve the brain, and to de-clutter your code
Creating the URLs
=========
type:sq1
```{r}
getUrls <- function(y1,y2,term){
root="http://stats.grok.se/json/en/"
urls <- NULL
for (year in y1:y2){
for (month in 1:9){
urls <- c(urls,(paste(root,year,0,month,"/",term,sep="")))
}
for (month in 10:12){
urls <- c(urls,(paste(root,year,month,"/",term,sep="")))
}
}
return(urls)
}
```
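Creating the URLs without loops
=========
The same URLs can also be built without loops. A sketch, assuming the same URL pattern as `getUrls`; the name `getUrlsVec` is made up, and `sprintf("%02d")` pads single-digit months with a zero.

```{r}
# Hypothetical vectorised alternative to getUrls()
getUrlsVec <- function(y1,y2,term){
  root <- "http://stats.grok.se/json/en/"
  # all year/month combinations, months varying fastest
  grid <- expand.grid(month=1:12,year=y1:y2)
  sprintf("%s%d%02d/%s",root,grid$year,grid$month,term)
}
head(getUrlsVec(2013,2014,"Euromaidan"),3)
```

The zero-padding replaces the separate loops for months 1 to 9 and 10 to 12.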
Put it together
======
type:sq2
```{r}
#create some URLS
urls <- getUrls(y1=2013,y2=2014,"Euromaidan")
#get data for each of them and store that data
results=NULL
for (url in urls){
results <- rbind(results,getData(url))
}
head(results)
```
Inspect the data,
visualise it
***
```{r}
ggplot(tail(results,100),aes(date,rd.views))+geom_line()
```
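Coping with failed downloads
======
A download loop like the one above stops at the first URL that fails. The sketch below wraps each call in base R's `tryCatch()` so the loop carries on. To keep it runnable offline, `fakeGetData` is a made-up stand-in for `getData` that fails on purpose for one input; `safeGet` is likewise a hypothetical name.

```{r}
# Made-up stand-in for getData(): fails for one "URL",
# returns a tiny data frame otherwise
fakeGetData <- function(url){
  if(url=="bad") stop("could not download ",url)
  data.frame(rd.views=1:2,source=url,stringsAsFactors=FALSE)
}
# Return NULL instead of stopping when a download fails
safeGet <- function(url,getter){
  tryCatch(getter(url),error=function(e){
    message("Skipping ",url,": ",conditionMessage(e))
    NULL
  })
}
demoResults <- NULL
for (u in c("a","bad","b")){
  # rbind ignores NULL, so failed downloads add no rows
  demoResults <- rbind(demoResults,safeGet(u,fakeGetData))
}
demoResults
```

With the real function, the call would be `safeGet(url,getData)` inside the loop.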
OK, let's move on
=============
type:section
- HTML
- XPath
- CSS and attributes
Getting to know HTML structure
==============================
http://en.wikipedia.org/wiki/Euromaidan
Let's look at this webpage
- Headings
- Images
- links
- references
- tables
To look at the code (in Google Chrome), right-click somewhere on the page and select 'inspect element'
Tree-structure (parents, siblings)
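Tree structure: a sketch in R
==============================
To make parents and siblings concrete, here is a small sketch using the `XML` package. Both the package choice and the toy HTML string are assumptions for illustration; the Wikipedia page itself is not downloaded.

```{r}
require(XML)
# Toy page, invented for illustration
html <- "<html><body><h1>Title</h1><p>Some <a href='http://example.com'>link</a></p></body></html>"
doc <- htmlParse(html,asText=TRUE)
# XPath expression selecting the first heading node
h1 <- xpathApply(doc,"//h1")[[1]]
xmlName(xmlParent(h1))    # the parent of the heading
xmlName(getSibling(h1))   # the next sibling of the heading
# attribute of a node nested inside the paragraph
xpathSApply(doc,"//a",xmlGetAttr,"href")
```

The heading and the paragraph share a parent, which makes them siblings; the link is a child of the paragraph.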
Back to Wikipedia
====================
HTML tags.
They come in pairs and are surrounded by these guys:
<>
e.g. a heading might look like this:
\