MACS 30500 LECTURE 14

---

# Agenda

* Web Scraping Intro: 
  * Definition
  * Two ways to scrape data from the web: directly or via API
  * Our goal today: scrape U.S. Presidential Statements

* Two Key Concepts:
  * How the Web Works
  * What is Behind a Website

+ The `rvest` package

<!-- TO IMPROVE FOR NEXT TIME NOTES FALL 2024

Keep starting from final product as R script

Cover the sys.sleep in this lecture especially if they plan to scrape lots of things

Mention more clearly what you can do with the data you scraped from the US Presidential especially the text. In this case, count since I plan to scrape 4000+ letters. Remember to adjust the in-class tutorials with the challenge and change lang-age. Assign challenge as homework?

Cover more clearly how to navigate html tag structure
* add one example of code in the slides (like use rvest syntaxt to scrape github text and link from there)
* take example from tutorial 2 on scraping on tags and introduce it here
* say that scraping can be messy bcs websites are not always nicely buit

Go over the other in-class example with the housing website data so students can have more ideas on what to scrape

Introduce in the slides html_table() as some students might need to use it

Maybe replace the API scraping with text mining and get rid of it altogether? Or briefly introduce it in lecture 2 on scraping?

Push deadline to Friday since the beginning and say upfront that students might need lecture 2 on Monday to fully complete this hw, but can get started
-->

---

## Web Scraping Intro

---

## What is web scraping?

**Web scraping is the process of collecting information from a website.** If you've ever copied and pasted info from the Internet, you've done the same task as a scraper, just manually and on a small scale. Web scraping automates and scales up this process!

**Examples:**

* companies' names, emails, and phone numbers 
* newspaper articles 
* products' prices, descriptions, and characteristics 
* real estate data
* social media (comments, likes, followers, customer reviews, etc.)
* selling or exchange of goods websites (craigslist, etc.)
* forums
* etc.

---

Image Source at [this link](https://medium.com/analytics-vidhya/web-scraping-in-python-for-data-analysis-6bf355e4fdc8)

---

## Two ways to scrape data from the web

<br>

### 1. Directly scraping the website

* Every website is built with code, typically a mix of HTML, CSS, and JavaScript.
* To collect data from a website, we need to learn how to interact with this code.
* Example: in theory, anything! in practice, start simple like Wikipedia or government sites.

---

## Two ways to scrape data from the web

<br>

### 2. Using a web API (Application Programming Interface)

* Interface provided by the website that allows users to collect data from it.
* To collect data from a website using an API, you need to learn how to use that API.
* Example: [OMDb API](https://omdbapi.com/), see chapter 4 from today's readings.

<!--
## Our plan

**Today** we focus on directly scraping websites: 
* we cover some theory and key concepts
* we practice!

**Next lecture** we focus on scraping using APIs.
-->

---

### Our goal today: Scraping U.S. Presidential Statements

Before diving into web scraping concepts, let's first observe today's end goal:

**U.S. Presidential statements <https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration>**

From this page, we will scrape the following info:
* speaker name
* date
* title
* text

---

## Two Key Concepts...

**Concept 1: How the Web Works** (e.g., how computers interact with each other on the web)

**Concept 2: What is Behind a Website** (e.g., understand what makes up a website)

*These theoretical concepts apply to any scraping project, not just R!*

---

## Concept 1: How the Web Works

---

### How do computer interact on the web? Theory

Computers communicate on the web by sending and receiving **data requests** (GET) and **data responses** (POST). Every computer has an address that other computers can refer to.

When you click on a webpage, your **web browser** (e.g., Chrome, Safari) sends a data request to the **web server** of that page (a database where all the info about that page are stored) and receives a response.

Image Source [at this link](https://www.linkedin.com/pulse/what-happens-when-you-enter-url-browser-he-asked-victor-ohachor)

---

### How do computer interact on the web? Example

Navigating the web basically means **sending requests to different servers and receiving back various files**, mainly written in HTML and other languages.

For example, if you type `https://macss.uchicago.edu/current-student-resources` into your web browser and hit enter, these steps occur behind the scenes:

1. your web browser (Chrome, Safari, etc.) translates what you typed into a HTTP request. That request tells the web server that stores all `macss` info that you want to access the specific piece of info stored at `/current-student-resources`

1. the web server that hosts `macss` receives your request and sends back to your web browser a response code (200 for yes, etc.) and the response content (files written in HTML)

1. your browser transforms the response content into a nice visual display that might include texts, graphics, hyperlinks, etc.

**When we scrape data, we use R packages that perform these steps for us (via APIs or without)!**

---

## Concept 2: What is Behind a Website

---

### What is Behind a Website

A website is made of the following elements:

+ **HTML** *HyperText Markup Language*, is the core element of a website. HTML uses tags to organize the webpage (i.e., creates paragraphs, inserts hyperlinks, etc.), but when the page is displayed the markup language is hidden

+ **CSS** *Cascading Style Sheets*: improves the look of the page

+ **Javascript** *JS* code adds interactive elements to the page (you need "dynamic web scraping" techniques to scrape JS code, not covered in this course)

+ **Other stuff** such as images, hyperlinks, videos or multimedia

---

### What is Behind a Website

When we scrape a website directly: knowing how read the HTML (and CSS) language, is fundamental

When we use an API and/or an API wrapper: less important, but still useful.

Our goal: lean how to read HTML and CSS languages

---

### HTML: HyperText Markup Language

* most important thing for web scraping: makes the "skeleton" or structure of a website
* it is made of tags 
* looks messy, but follows a hierarchical [tree-like structure](https://www.researchgate.net/figure/HTML-source-code-represented-as-tree-structure_fig10_266611108) of tags within tags

Standard HTML syntax:
```
   <html>
     <head>
        <title>general info about the page</title>
     </head>
     <body>
       <p>a paragraph with some text about the page</p>
       <p>another paragraph with more text</p>
       <p>ciao</p>
     </body>
   </html>
   
```

---

### HTML Tags

HTML tags are organized in a **tree-like structure** and are nested within each other

HTML tags **go in pairs** one on each end of the content that they display. For example: `<p>ciao</p>` only the word "ciao" shows up on the webpage, the `/` signals the end of the tag

HTML tags can have **attributes** which provide additional info about the tag. For example: `<p>ciao id='first'</p>`
   * `<p>` is the tag
   * `<id>` is the attribute
   * two important attributes for scraping: `id` and `class`; they are used by CSS elements to control the visual appearance of the page
    
--

Let's examine the HMTL tags structure more closely!

---

### HTML Tags

* `html`
    * `head`
        * `title`
        * `a href`
        * `script`
    * `body`
        * `div`
            * `p`
                * `b`
            * `span`
        * `table`
            * `tr`
                * `td`
                * `td`
        * `img`

---

### HTML Tags

* `html`
    * `head`
        * `title` - title
        * `a href` - links
        * `script` - code
    * `body`
        * `div` - generic container for content (block)
            * `p` - paragraph
                * `b` - bold formatting
            * `span` - generic container for content (in line)
        * `table`
            * `tr` - row of cells
                * `td` - actual cell element 
        * `img` - image

---

### HTML Tags: example

.small[
```html
<html>
  <head>
    <title>Title</title>
    <a href="http://github.com">GitHub</a>
    <script src="https://c.js"></script>
  </head>
  <body>
    <div>
      <p>Click <b>here</b> now.</p>
      <span>Frozen</span>
    </div>
    <table style="width:100%">
      <tr>
        <td>Kristen</td>
        <td>Bell</td>
      </tr>
    </table>
  <img src="http://ia.media-imdb.com/images.png"/>
  </body>
</html>
```
]
Find: (1) the text content `Frozen`; (2) the GitHub link and the text content `GitHub`

---

### HTML Tags: example

**Find the text content `Frozen`:**

```html
<span>Frozen</span>
```

* `<span></span>` - tag name
* `Frozen` - content embedded in the tag as text

**Find the GitHub link and text `GitHub` embedded within it:**

```html
<a href="http://github.com">GitHub</a>
```

* `<a></a>` - tag name
* `href` - tag attribute (argument)
* `"http://github.com"` - tag attribute (value)
* `GitHub` - content as text

---

### CSS: Cascading Style Sheet

Often a website has HTML tags (with attributes) and CSS elements:
* **HTML defines the content** 
* **CSS defines the format**

This is an example of CSS code. Notice: (1) it is different from HTML code; and (2) the `span` HTML tag can be styled using CSS `color`:

```css
span {
  color: #ffffff; }
```
--

**Most websites use HTML tags with `class` and `id` attributes to provide “hooks” for CSS**. This way the CSS "knows" where to apply CSS stylistic elements.

Example: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/class

<!-- https://plsc-31101.github.io/course/collecting-data-from-the-web.html#webscraping 
[CSS diner](http://flukeout.github.io)

### HTML + CSS: example

```html
<body>
    <table id="content">
        <tr class='name'>
            <td class='firstname'>
                Kurtis
            </td>
            <td class='lastname'>
                McCoy
            </td>
        </tr>
        <tr class='name'>
            <td class='firstname'>
                Leah
            </td>
            <td class='lastname'>
                Guerrero
            </td>
        </tr>
    </table>
</body>
```

]

Find the elements to use to extract:
1. The entire table
1. The general element(s) containing first names
1. Just the specific element "Kurtis"

]

-->

---

## The `rvest` package in R

---

### Using `rvest`

**The R package [`rvest`](`https://rvest.tidyverse.org/`) is a package designed for scraping project and allows us to:**

1. Read and save in an object the HTML source code of a webpage

1. Find the specific HTLM and CSS elements from a webpage (using HTML tags or attributes, and using CSS selectors)

---

### Main `rvest` functions

+ **`read_html()`**
+ **`html_elements()`**
+ **`html_text2()`**
+ **`html_attr()`** 
+ **`html_table()`**

Source: https://rvest.tidyverse.org/index.html#usage

---

### Main `rvest` functions

**`read_html()`** reads an html page into R:
```
library(rvest)
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")
```

**`html_elements()`** finds elements that match a CSS selector or XPath expression. Here each section corresponds to a different film:
```
films <- starwars %>% html_elements("section")
films
```

---

### Main `rvest` functions

**`html_text2()`** extracts text embedded within tags. Here the title is given by the text inside `<h2>`:
```
title <- films %>% 
  html_element("h2") %>% 
  html_text2()
title
```

**`html_attr()`** gets data out of attributes; always returns a string so we convert it to an integer using a `readr::parse_integer()`
```
episode <- films %>% 
  html_element("h2") %>% 
  html_attr("data-id")
episode
```

---

## Practice all of this in R with `rvest`

**We scrape this U.S. Presidential statements page: <https://www.presidency.ucsb.edu/documents/special-message-the-congress-relative-space-science-and-exploration>**

Through this example, we learn the following:
* use `rvest` functions to collect name, date, title, and text from that page
* test two methods to identify tags: examining the webpage directly, and using tools that assist us (e.g., SelectorGadget)
* write a function to scale up our scraper to handle multiple pages

Download today's class materials to follow along!

---

### Web Scraping Resources

* `rvest` documentation and tutorials:
  * Web scraping 101: <https://rvest.tidyverse.org/articles/rvest.html>
  * Tutorial: <https://rvest.tidyverse.org/articles/rvest.html#html-basics>

* Using HTML and CSS for scraping: https://sscc.wisc.edu/sscc/pubs/webscraping-r/html.html

* Handy references for your scraping projects:

* HTML overview: https://www.w3schools.com/html/html_intro.asp
  * List of tags: https://developer.mozilla.org/en-US/docs/Web/HTML/Element