Two ways to scrape
Intro to scraping using an API
Rectangling or Simplifying lists
Some of the content of today's slides is also in the previous lecture. I copied everything here for clarity.
An API (Application Programming Interface) is an interface provided by a website that allows users to collect data from that website.
The majority of web APIs use a particular style known as REST or RESTful, which stands for "Representational State Transfer." This style lets you query the website's database using URLs, just like you would construct a URL to view a web page.
A URL (Uniform Resource Locator) is a string of characters that identifies a resource on the web. Requests for data are sent to that URL over HTTP (HyperText Transfer Protocol).
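To make the idea of querying through a URL concrete, here is a minimal sketch in base R that assembles a query URL from a base address and a named list of parameters. The endpoint `https://api.example.com/v1/search` and the parameter names `q` and `page` are hypothetical placeholders, not part of any real API.

```r
# Build a query URL from a base endpoint and a named list of parameters.
# The endpoint and parameter names used below are hypothetical.
build_query_url <- function(base_url, params) {
  # Percent-encode each value so spaces and reserved characters are URL-safe
  encoded <- vapply(
    params,
    function(value) utils::URLencode(as.character(value), reserved = TRUE),
    character(1)
  )
  # Join the parameters as key=value pairs separated by "&"
  query <- paste0(names(params), "=", encoded, collapse = "&")
  paste0(base_url, "?", query)
}

url <- build_query_url(
  "https://api.example.com/v1/search",
  list(q = "web scraping", page = 2)
)
print(url)
```

In practice, packages such as httr can build the query string for you, but writing it by hand once shows that an API request is just a carefully constructed URL.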
The process described in the macss example (previous lectures) is very similar to how APIs work, with only a few changes:
The URL that you use for sending requests (queries) to an API will have two more things compared to the request you send using direct scraping: (1) search terms and/or filtering parameters, and (2) specific references to that API
The response you get back from the API is not formatted as HTML; it is usually formatted as raw text, for example .csv or .json
You need R to interact with the API
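As a rough illustration of what "raw text" means in practice, the sketch below parses a small JSON string and a small CSV string into data frames. Both strings are made-up stand-ins for an API response, and the example assumes the jsonlite package is installed.

```r
library(jsonlite)

# Hypothetical raw-text responses, standing in for what an API might return
json_text <- '[{"title": "Movie A", "year": 2001}, {"title": "Movie B", "year": 2010}]'
csv_text  <- "title,year\nMovie A,2001\nMovie B,2010"

# JSON: an array of objects is simplified into a data frame by default
movies_from_json <- fromJSON(json_text)

# CSV: read.csv() accepts the raw text through its `text` argument
movies_from_csv <- read.csv(text = csv_text)

str(movies_from_json)
str(movies_from_csv)
```

Either way, the result is an ordinary R data frame that you can work with as usual.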
When using an API in R, there are two approaches:
1. With an API wrapper:
A wrapper is a package written in a given language, like R, for an existing API (tailored to that API only)
Someone took the time to write a set of functions to interact with a specific API and put them together in a package, which we refer to as an "R wrapper" for that API
Useful because: reproducible, up-to-date (ideally), easy to access
How to use: each package should come with documentation (in the form of a pdf, GitHub repo, or both) that explains how to use it and its main functions. Reading the documentation is essential to understand how to effectively use the package.
Example: World Bank API with wbstats, an R wrapper package that returns results in a tidy data frame
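A minimal sketch of the wrapper approach, assuming the wbstats package (which wraps the World Bank API) is installed and you have an internet connection. The helper name `fetch_population` and its default arguments are my own illustrative choices; `SP.POP.TOTL` is the World Bank indicator code for total population.

```r
# Hypothetical helper around wbstats::wb_data().
# Requires the wbstats package and an internet connection when called.
fetch_population <- function(countries = c("US", "MX"),
                             start_year = 2015, end_year = 2020) {
  if (!requireNamespace("wbstats", quietly = TRUE)) {
    stop("Please install the 'wbstats' package first.")
  }
  # The wrapper builds the URL, sends the request, and parses the response
  wbstats::wb_data(
    indicator  = "SP.POP.TOTL",
    country    = countries,
    start_date = start_year,
    end_date   = end_year
  )
}

# Usage (not run here because it needs network access):
# population <- fetch_population()
# head(population)
```

Notice that no URL appears anywhere in this code: the wrapper hides the request details behind ordinary R function arguments.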
2. Without an API wrapper, aka direct API interaction:
If no R wrapper exists, you can directly use the API provided by the website. In this case, you use R to communicate directly with the website's API. This is generally more difficult.
Example: the OMDb movies API, which has no R wrapper: https://www.omdbapi.com/
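A minimal sketch of direct API interaction, assuming the httr and jsonlite packages are installed and that you have registered for an OMDb API key. The helper name `omdb_get_movie` is hypothetical, and the query parameters `apikey` and `t` (movie title) reflect my reading of the OMDb documentation, so check them against the website before relying on this.

```r
# Hypothetical helper: query OMDb for one movie by title.
# Requires httr, jsonlite, an OMDb API key, and an internet connection when called.
omdb_get_movie <- function(title, api_key) {
  # Send a GET request; httr builds the query string from the named list
  response <- httr::GET(
    "https://www.omdbapi.com/",
    query = list(apikey = api_key, t = title)
  )
  # Stop with an informative error on HTTP failures (4xx/5xx)
  httr::stop_for_status(response)
  # The body arrives as raw JSON text; parse it into an R list
  raw_text <- httr::content(response, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(raw_text)
}

# Usage (not run here; OMDB_API_KEY is an assumed environment variable name):
# movie <- omdb_get_movie("Interstellar", Sys.getenv("OMDB_API_KEY"))
# movie$Title
```

Compared to a wrapper, you are responsible for every step yourself: building the request, checking for errors, and parsing the response.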
Sometimes both approaches are available. Other times, only approach 2 is available.
Tip: Scraping with an API, compared to direct scraping, can be a hit-or-miss experience. The keys to success are using a well-designed R wrapper if one is available, or relying on an API with good documentation.
The class materials for today (download them from the website) include:
Four examples of APIs that have a wrapper package: wbstats, geonames, tidycensus, manifestoR
One example of an API without a wrapper: OMDb
We examine the first two in class. The others are available for you to explore!
World Bank database API: wbstats R wrapper package
wbstats is an R wrapper that simplifies using the API!
Let's move to today's in-class materials to learn more about this API!
The World Bank API is free and does not require registration. However, many APIs require you to register for access.
Why register for access? Registration allows APIs to track users, their queries, and manage demand.
If an API requires you to register and obtain a username, password, or key, you will need to provide this same information when using the corresponding R wrapper package.
GeoNames geographical database API: the geonames package provides a wrapper for R
Let's check today's in-class materials to learn how to store your username and other private info in R for this API!
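One common way to keep credentials out of your scripts is to store them in environment variables. The sketch below assumes a variable named `GEONAMES_USERNAME` (my own naming choice) and assumes the geonames package reads the username from the `geonamesUsername` option; the helper `get_credential` is hypothetical, and the in-class materials may take a different approach.

```r
# Hypothetical helper: read a credential from an environment variable,
# failing loudly if it is unset.
get_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", var_name, "' is not set. ",
         "Add it to your .Renviron file and restart R.")
  }
  value
}

# For demonstration only: set the variable in this session.
# In practice, put a line like GEONAMES_USERNAME=your_username in ~/.Renviron
Sys.setenv(GEONAMES_USERNAME = "demo_user")

# Hand the username to the geonames package through its option (assumed name)
options(geonamesUsername = get_credential("GEONAMES_USERNAME"))
getOption("geonamesUsername")
```

Because the username lives outside the script, you can share or publish your code without exposing private information.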
There are two more examples of using an API with a wrapper included in your class materials:
Census Bureau API: tidycensus R wrapper package
tidycensus provides an R wrapper for the US Census Bureau's Census and five-year American Community Survey APIs
Manifesto Project API: manifestoR R wrapper package
manifestoR provides a wrapper for R
These tutorials are in today's class materials for you to explore if you'd like. Notice they both require you to register for a free account!
OMDb stores info about movies. It currently provides a very good API but no R wrapper
Website: https://www.omdbapi.com/
See the api_omdb.Rmd tutorial in today's in-class materials for a walkthrough on how to use it!
In R, "rectangling" means transforming non-rectangular data (often nested lists) into a rectangular format (often a data frame).
The term is often associated with the tidyverse and its principles of tidy data.
In the context of web scraping, "rectangling" means transforming a deeply nested list (often obtained from raw JSON or XML data) into a tidy data set with rows and columns that is easier to work with!
If you need to simplify nested lists, check today's class materials for a tutorial and examples of rectangling!
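As a small illustration of rectangling, the sketch below turns a nested list into a flat tibble with tidyr. The `movies` list is made-up data shaped like a parsed JSON response, and the example assumes the tibble and tidyr packages are installed.

```r
library(tibble)
library(tidyr)

# Made-up nested list, shaped like parsed JSON from an API
movies <- list(
  list(title = "Movie A", year = 2001,
       ratings = list(source1 = 7.5, source2 = 8.0)),
  list(title = "Movie B", year = 2010,
       ratings = list(source1 = 6.1, source2 = 6.8))
)

# Step 1: put the list in a one-column tibble (one row per movie)
nested <- tibble(movie = movies)

# Step 2: spread each movie's named elements into columns;
# `ratings` is still a list-column at this point
wide <- unnest_wider(nested, movie)

# Step 3: spread the inner ratings list into its own prefixed columns
flat <- unnest_wider(wide, ratings, names_sep = "_")

print(flat)
```

Each call to `unnest_wider()` peels off one level of nesting, so deeply nested responses usually need several steps like these.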
API pros:
API cons:
Scraping pros:
Scraping cons: