```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=TRUE)
```
Web Scraping - getting started
========================================================
width: 1200
author: Rolf Fredheim and Aiora Zabala
date: University of Cambridge
18/02/2014
Logging on
========================================================
type: s1
Before you sit down:
- Do you have your MCS password?
- Do you have your Raven password?
- If you answered **'no'** to either then go to the University Computing Services (just outside the door) NOW!
- Are you registered? If not, see me!
The hard bit: get the slides
========================================================
type:sq2
```{r eval=F}
download.file('https://raw.github.com/fredheir/WebScraping/master/Lecture1/p1.Rpres',
  destfile='p1.Rpres')
# OR:
download.file('http://goo.gl/ulTloi', destfile='p1.Rpres')
```
OLD RSTUDIO (markdown):
```{r eval=F}
download.file('https://raw.github.com/fredheir/WebScraping/master/Lecture1/p1.Rpres',
  destfile='p1.Rmd')
```
Slides in browser. Open:
http://fredheir.github.io/WebScraping/Lecture1/p1.html
Open the file **p1.Rpres** or **p1.Rmd** in RStudio, and press **preview** to view the slides
For more info about editing these slides, go [here](http://www.rstudio.com/ide/docs/presentations/overview)
Who is this course for?
===============
Computer scientists
Anyone with some minimal background in coding and good computer literacy
By the end of the course you will have
==============
Created a system to extract text and numbers from a large number of web pages
Learnt to harvest links
Worked with an API to gather data, e.g. from YouTube
Converted messy data into tabular data
What will we need?
==============
A Windows computer
A modern browser - Chrome or Firefox
~~An up-to-date version of RStudio~~
Getting help
============
- ?[functionName]
- StackOverflow
- Ask each other.
Outline
========================================================
type:section
**Theory**
Practice
What is 'Web Scraping'?
========================================================
From [Wikipedia](http://en.wikipedia.org/wiki/Web_scraping)
> Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites.
When might this be useful? (your examples)
-
-
-
-
Imposing structure on data
=========
Again, from [Wikipedia](http://en.wikipedia.org/wiki/Web_scraping)
> ... Web scraping focuses on the **transformation of unstructured data** on the web, typically in HTML format, into structured data that can be stored and analyzed in **a central local database or spreadsheet**.
What will we do today?
==============
```{r eval=F}
a<-fetchData(1,c("Web_scraping","Data_mining"))
ggplot(a,aes(date,rd.views,colour=target))+geom_line()
```
What will we learn?
====================
1) manipulating URLs
2) downloading stuff from within R
3) automate collection through loops and functions
Example
=======
Approximate number of web pages
Tabulate this data
======
```{r}
require(ggplot2)
clubs <- c("Tottenham","Arsenal","Liverpool",
           "Everton","ManU","ManC","Chelsea")
nPages <- c(23.3,68.4,78.9,35.5,102,90.5,110)
df <- data.frame(clubs,nPages)
df
```
Visualise it
=======
```{r fig.width=30}
ggplot(df,aes(clubs,nPages,fill=clubs))+
geom_bar(stat="identity")+
coord_flip()+theme_bw(base_size=70)
```
Health and Safety
=====================
Programming with Humanists: Reflections on Raising an Army of Hacker-Scholars in the Digital Humanities
http://openbookpublishers.com/htmlreader/DHP/chap09.html#ch09
Bandwidth
=================
***
> the agent machines (slave zombies) begin to send a large volume of packets to the victim, flooding its system with useless load and exhausting its resources.
source: cisco.com
We will not:
- run parallel processes
We will:
- test code on minimal data
Practice
==============
type:section
- **The URL**
- Loops
- Scraping
The URL
=============
http://stats.grok.se/
http://stats.grok.se/en/201401/web_scraping
- en
- 201401
- web_scraping
en.wikipedia.org/wiki/Web_scraping
Changes by hand
=====
http://stats.grok.se/en/201301/web_scraping
http://stats.grok.se/en/201402/web_scraping
http://stats.grok.se/en/201401/data_scraping
Each page links to a JSON version: 'this page is in json format'
The JSON data
==================
http://stats.grok.se/json/en/201401/web_scraping
```
{"daily_views": {"2013-01-12": 542, "2013-01-13": 593, "2013-01-10": 941, "2013-01-11": 798, "2013-01-16": 1119, "2013-01-17": 1124, "2013-01-14": 908, "2013-01-15": 1040, "2013-01-30": 1367, "2013-01-18": 1027, "2013-01-19": 743, "2013-01-31": 1151, "2013-01-29": 1210, "2013-01-28": 1130, "2013-01-23": 1275, "2013-01-22": 1131, "2013-01-21": 1008, "2013-01-20": 707, "2013-01-27": 789, "2013-01-26": 747, "2013-01-25": 1073, "2013-01-24": 1204, "2013-01-01": 379, "2013-01-03": 851, "2013-01-02": 807, "2013-01-05": 511, "2013-01-04": 818, "2013-01-07": 745, "2013-01-06": 469, "2013-01-09": 946, "2013-01-08": 912}, "project": "en", "month": "201301", "rank": -1, "title": "web_scraping"}
```
Outline
========================================================
type:section
Theory
**Practice**
Questions
===================
type:section
1) How do we read the data from this page?
2) How do we generate a list of links, say for the period 2012-2013?
Practice
==============
type:section
- The URL
- **Scraping**
- Loops
Paste
==================
Check out ?paste if you are unsure about this
Bonus: check out ?paste0
```{r}
var=123
paste("url",var,sep="")
paste("url",var,sep=" ")
```
Paste2
==================
```{r}
var=123
paste("url",rep(var,3),sep="_")
paste(rep("url",3),var,sep="_")
var=c(123,421)
paste(var,collapse="_")
```
With a URL
===========================
type:sq
```{r}
var=201401
paste("http://stats.grok.se/json/en/",var,"/web_scraping")
paste("http://stats.grok.se/json/en/",var,"/web_scraping",sep="")
```
Tasks using 'paste'
==============
type:sq
```{r}
a="test"
b="scrape"
c=94
```
Merge variables a, b, c into a string, separated by an underscore ("_")
> "test_scrape_94"

Merge variables a, b, c into a string without any separating character
> "testscrape94"

Print the letter 'a' followed by the numbers 1:10, without a separating character
> "a1" "a2" "a3" "a4" "a5" "a6" "a7" "a8" "a9" "a10"
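One possible set of answers, using paste() and paste0() (a sketch; try the tasks yourself first, then check your output against these):

```r
a="test"
b="scrape"
c=94
paste(a,b,c,sep="_")   # "test_scrape_94"
paste(a,b,c,sep="")    # "testscrape94"
paste0("a",1:10)       # "a1" "a2" ... "a10"
```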
Testing a URL is correct in R
==============
Run this in your R console:
```{r eval=F}
var=201401
url=paste("http://stats.grok.se/json/en/",var,"/web_scraping",sep="")
url
browseURL(url)
```
Fetching data
==================
type:sq1
```{r}
var=201401
url=paste("http://stats.grok.se/json/en/",var,"/web_scraping",sep="")
raw.data <- readLines(url, warn=FALSE)
raw.data
```
Fetching data2
==================
```{r}
#install.packages("rjson")
require(rjson)
rd <- fromJSON(raw.data)
rd
```
Fetching data3
==================
```{r}
rd.views <- rd$daily_views
rd.views
```
Fetching data4
==================
```{r}
rd.views <- unlist(rd.views)
df <- as.data.frame(rd.views)
df
```
Put it together
===================
```{r}
rd <- fromJSON(readLines(url, warn=FALSE))
rd.views <- rd$daily_views
df <- as.data.frame(unlist(rd.views))
```
Plot it
=================
type:sq
```{r}
require(ggplot2)
require(lubridate)
df$date <- as.Date(rownames(df))
colnames(df) <- c("views","date")
ggplot(df,aes(date,views))+
geom_line()+
geom_smooth()+
theme_bw(base_size=20)
```
Tasks
====================
type:section
Plot Wikipedia page views in February 2014 for the Sochi games
How does this compare to the number of views on the Russian-language Wikipedia page?
Moving on
=========================
If you are comfortable with loops and functions, go ahead and write a great application.
The rest of us are going to learn about loops.
Practice
==============
type:section
- The URL
- Scraping
- **Loops**
Briefly about functions
===============
type:sq
```{r}
plusOne <- function(x){
return(x+1)
}
plusOne2 <- function(num){
return(num+1)
}
```
- Curly brackets {} include the code to be executed
- Normal brackets () contain a list of variables
****
```{r}
plusOne(8)
plusOne2(10)
plusOne2(num=5)
#plusOne2(wrongVar=2)
```
Simple loops
=============
```{r}
for (number in 1:5){
print (number)
}
```
Looping over functions
========================
type:sq
```{r}
a <- c(1,2,3,4,5)
for (value in a){
print (
plusOne(value)
)
}
```
***
```{r}
listOfNumbers <- c(1,2,3,4,5)
for (number in listOfNumbers){
print (
number+1
)
}
```
More loops
========================
type:sq
```{r}
a <- c(1,2,3,4,5)
a[1] #The first number in the vector
a[4] #The fourth number in the vector
```
***
```{r}
for (i in 1:length(a)){
print (
plusOne(a[i])
)
}
```
What are we looping over
=================
type:sq
In each case we are performing an operation on the vector in brackets.
See:
```{r}
#for (i in 1:length(a))
print(1:length(a))
#for (i in length(a))
print (length(a))
#for (i in a)
print (a)
```
Vectors
=============
```{r}
a <- c(1,2,3,4,5) #Least flexible, fastest
a+1
plusOne(a) #Quite flexible
sapply(a,plusOne) #Works in all sorts of situations; slow (similar to a loop), better if you are collecting an output
```
Urls again
==========
type:sq1
stats.grok.se/json/en/**201401**/web_scraping
```{r}
for (month in 1:12){
print(paste(2014,month,sep=""))
}
```
Not quite right
================
type:sq
left:20
We need the variable 'month' to have two digits:
201401
***
```{r}
for (month in 1:9){
print(paste(2012,0,month,sep=""))
}
for (month in 10:12){
print(paste(2012,month,sep=""))
}
```
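The zero-padding can also be done in a single loop with sprintf(), a standard R function (shown here as an alternative sketch):

```r
for (month in 1:12){
  print(sprintf("2012%02d", month))  # %02d pads the month to two digits
}
```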
Store the data
=========
type:sq
left:60
```{r}
dates=NULL
for (month in 1:9){
date=(paste(2012,0,month,sep=""))
dates=c(dates,date)
}
for (month in 10:12){
date=(paste(2012,month,sep=""))
dates=c(dates,date)
}
print (as.numeric(dates))
```
***
Here we concatenated the values:
```{r}
dates <- c(c(201201,201202),201203)
print (dates)
```
!! To do this with a **data.frame**, use **rbind()**
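A minimal sketch of the same pattern with data.frames and rbind() (column names here are hypothetical):

```r
monthly <- NULL
for (date in c(201201,201202,201203)){
  row <- data.frame(date=date, views=NA)  # views would come from the scrape
  monthly <- rbind(monthly, row)          # rbind() stacks data.frames row-wise
}
monthly
```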
Add another variable
=========================
type:sq
We might want years to be a variable
Here we nest the month loops inside a year loop:
```{r}
for (year in 2012:2013){
for (month in 1:9){
print(paste(year,0,month,sep=""))
}
for (month in 10:12){
print(paste(year,month,sep=""))
}
}
```
Putting it together
============================
type:sq
```{r}
for (year in 2012:2013){
for (month in 1:9){
print(paste("http://stats.grok.se/json/en/",year,0,month,"/web_scraping",sep=""))
}
for (month in 10:12){
print(paste("http://stats.grok.se/json/en/",year,month,"/web_scraping",sep=""))
}
}
```
Tasks about Loops
==================
type: section
- Write a loop that prints every number between 1 and 1000
- Write a loop that adds up all the numbers between 1 and 1000
- Write a function that takes an input number and returns this number divided by two
- Write a function that returns the value 99 no matter what the input
- Write a function that takes two variables, and returns the sum of these variables
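Possible solutions (sketches; function names such as halve and addBoth are my own choices):

```r
for (i in 1:1000) print(i)            # prints every number from 1 to 1000

total <- 0
for (i in 1:1000) total <- total + i  # adds up 1 to 1000
total                                 # 500500

halve <- function(x) x/2              # divides the input by two
always99 <- function(x) 99            # returns 99 whatever the input
addBoth <- function(x,y) x + y        # sum of two variables
```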
Advanced tasks
=========================
type: section
- Can you make an application that takes a Wikipedia page (e.g. Web_scraping) and returns a plot for the month 201312?
- Can you extend this application to plot data for the entire year 2013 (that is, for pages 201301:201312)?
- Can you expand this further by going across multiple years (201212:201301)?
- Can you write the application so that it takes a custom date range?
- If you have time, keep expanding functionality: multiple pages, multiple languages. You could also make it interactive using [Shiny](http://www.rstudio.com/shiny/)
Getting started with functions
=============================
```{r}
getData <- function(url){
raw.data <- readLines(url, warn=FALSE)
rd <- fromJSON(raw.data)
rd.views <- rd$daily_views
rd.views <- unlist(rd.views)
rd <- as.data.frame(rd.views)
rd$date <- rownames(rd)
rownames(rd) <- NULL
return(rd)
}
```
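A sketch of how getData() might be combined with the date loop from earlier (the fetching lines are commented out so nothing hits the live site until you are ready; the URL pattern is the one used above):

```r
dates <- NULL
for (month in 1:9)   dates <- c(dates, paste(2013,0,month,sep=""))
for (month in 10:12) dates <- c(dates, paste(2013,month,sep=""))
urls <- paste("http://stats.grok.se/json/en/",dates,"/web_scraping",sep="")
urls[1]  # "http://stats.grok.se/json/en/201301/web_scraping"
# results <- lapply(urls, getData)  # one data.frame per month
# df <- do.call(rbind, results)     # stack them into a single data.frame
```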
Reading
=============
http://www.bbc.co.uk/news/technology-23988890
http://blog.hartleybrody.com/web-scraping/
http://openbookpublishers.com/htmlreader/DHP/chap09.html#ch09
http://www.essex.ac.uk/ldev/documents/going_digital/scraping_book.pdf
https://software.rc.fas.harvard.edu/training/scraping2/latest/index.psp#(1)