MACS 30500 LECTURE 16

Topics: Getting data from the web: API.

1 / 21

Agenda

  • Two ways to scrape

  • Intro to scraping using an API

    • Terminology and sending queries to APIs
    • APIs with and without a wrapper package
    • Registering for access
    • API or direct scraping?
  • Rectangling or Simplifying lists

Some of the content of today's slides is also in the previous lecture. I copied everything here for clarity.

2 / 21

Two ways to scrape

3 / 21

Two ways to scrape


1. Directly scraping the website

  • Every website is built with code, typically a mix of HTML, CSS, and JavaScript.
  • To collect data from a website, we need to learn how to interact with this code.
  • Example: in theory, anything! In practice, start with simple sites such as Wikipedia or government websites.

2. Using a web API (Application Programming Interface)

  • Interface provided by the website that allows users to collect data from it.
  • To collect data from a website using an API, you need to learn how to use that API.
  • Example: OMDb API, see chapter 4 from today's readings.
4 / 21

Scraping using an API: Application Programming Interface

5 / 21

API: terminology

Interface provided by the website that allows users to collect data from that website.

The majority of web APIs use a particular style known as REST (or RESTful), which stands for "Representational State Transfer." This style lets you query the website's database using URLs, just as you would construct a URL to view a web page.

A URL (Uniform Resource Locator) is a string of characters that identifies a resource on the web. Requests for data are sent to that address using HTTP (HyperText Transfer Protocol).
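To make "querying with a URL" concrete, here is a minimal sketch in Python (the class materials use R; the idea is the same in either language). It only builds the URL and sends nothing. The base address is OMDb's; the parameter names `t` (title), `y` (year), and `apikey` are assumptions to check against the API's documentation, and `YOUR_KEY` is a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL of the API (OMDb's address, as listed on its website).
BASE_URL = "https://www.omdbapi.com/"


def build_query_url(base_url, params):
    """Return base_url followed by '?' and the URL-encoded parameters."""
    return base_url + "?" + urlencode(params)


# Assumed parameter names: t = title, y = year, apikey = your private key.
url = build_query_url(BASE_URL, {"t": "The Matrix", "y": 1999, "apikey": "YOUR_KEY"})
print(url)

# Splitting the URL again shows the search terms the API would read from it.
print(parse_qs(urlsplit(url).query))
```

Note how the space in the title is encoded for you; building query strings by hand is where many malformed requests come from.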

6 / 21

Sending queries to an API

The process described in the macss example (previous lectures) is very similar to how APIs work, with only a few changes:

  • The URL that you use for sending requests (queries) to an API will have two more things compared to the request you send when scraping directly: (1) search terms and/or filtering parameters, and (2) specific references to that API

  • The response you get back from the API is not formatted as HTML; it is usually raw text, often JSON or XML

  • You need R to interact with the API. For example:

    • parse that response, get the data from it
    • convert the data into a format that you like (dataframe, lists, etc.)
    • export the data, usually as .csv or .json
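
The parse, convert, and export steps above can be sketched as follows, in Python rather than R purely for illustration. The response text is made up for the example (it is not real API output), and the helper names `response_to_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
import json

# A made-up response body, standing in for the raw text an API sends back.
raw_text = (
    '{"Search": [{"Title": "Alien", "Year": "1979"}, '
    '{"Title": "Aliens", "Year": "1986"}]}'
)


def response_to_rows(text, key):
    """Parse JSON text and return the list of records stored under `key`."""
    parsed = json.loads(text)
    return parsed[key]


def rows_to_csv(rows, fieldnames):
    """Write a list of dicts to CSV text using the given column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = response_to_rows(raw_text, "Search")      # parse the response
csv_text = rows_to_csv(rows, ["Title", "Year"])  # convert and export
print(csv_text)
```

In a real project the last step would write to a file instead of a string, but the sequence is the same.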
7 / 21

API: with and without a wrapper package

When using an API in R, there are two approaches:

1. With an API wrapper:

  • A wrapper is a specific package written in a given language, like R, for an existing API (tailored to that API only)

  • Someone took the time to write a set of functions that interact with a specific API and bundled them into a package, which we refer to as an "R wrapper" for that API

  • Useful because: reproducible, up-to-date (ideally), easy to access

  • How to use: each package should come with documentation (in the form of a pdf, GitHub repo, or both) that explains how to use it and its main functions. Reading the documentation is essential to understand how to effectively use the package.

  • Example: the World Bank API with the wbstats R wrapper package, which returns results in a tidy data frame

8 / 21

API: with and without a wrapper package

When using an API in R, there are two approaches:

2. Without an API wrapper, aka direct API interaction:

If no R wrapper exists, you can directly use the API provided by the website. In this case, you use R to communicate directly with the website's API. This is generally more difficult.

Example: the OMDb movies API, which has no R wrapper: https://www.omdbapi.com/

9 / 21

API: with and without a wrapper package

Sometimes both approaches are available. Other times, only approach 2 is available.

Tip: Scraping with an API, compared to direct scraping, can be a hit-or-miss experience. The keys to success are using a well-designed R wrapper, if one is available, or relying on thorough API documentation.

10 / 21

Using an API with and without a wrapper package: Examples

The class materials for today (download them from the website) include:

  • Four examples of APIs that have a wrapper package:

    • World Bank
    • GeoNames
    • Census
    • Manifesto Project
  • One example of an API without a wrapper:

    • OMDb Open Movie Database

We examine the first two in class. The others are available for you to explore!

11 / 21

Using an API with a wrapper: World Bank API example

World Bank database API: wbstats R wrapper package

  • socioeconomic indicators spanning several decades and numerous topics
  • the full data are available for bulk download as CSV files from their website (see HW5); however, researchers often need only a handful of indicators or a subset of countries
  • for that reason the World Bank database has an API, and wbstats is an R wrapper that simplifies using it!

Let's move to today's in-class materials to learn more about this API!

12 / 21

Register for access: Overview

The World Bank API is free and does not require registration. However, many APIs require you to register for access:

  • Sometimes, registration is as simple as providing an email and password, then receiving an email with your username and private API key.
  • Other times, you need to submit an application and go through a review process.
  • Often, this process is free, but some APIs require paying a fee.

Why register for access? Registration allows APIs to track users, their queries, and manage demand.

If an API requires you to register and obtain a username, password, or key, you will need to provide this same information when using the corresponding R wrapper package.
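
One common way to supply a key without typing it into your script is to read it from an environment variable. The sketch below shows the idea in Python (in R, the analogous call is `Sys.getenv()`). The variable name `OMDB_API_KEY`, the `apikey` parameter name, and both helper functions are assumptions for illustration.

```python
import os


def get_api_key(var_name="OMDB_API_KEY", environ=os.environ):
    """Read a private key from the environment; fail loudly if it is missing."""
    key = environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before sending queries.")
    return key


def add_key(params, key):
    """Return a copy of the query parameters with the key attached."""
    return {**params, "apikey": key}


# Demo with a stand-in environment, so no real key appears in the code.
demo_env = {"OMDB_API_KEY": "demo-key"}
print(add_key({"t": "Alien"}, get_api_key(environ=demo_env)))
```

Keeping the key outside the script means you can share or publish your code without exposing your credentials.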

13 / 21

Register for access: GeoNames example

GeoNames geographical database API:

  • geographical information for all countries and other locations
  • the API requires you to register and set a username and key
  • the geonames package provides a wrapper for R

Let's check today's in-class materials to learn how to store your username and other private information in R for this API!

14 / 21

Using an API with a wrapper: More examples

There are two more examples of using an API with a wrapper included in your class materials:

Census Bureau API: tidycensus R wrapper package

  • statistical data from the US Census Bureau
  • the tidycensus package provides an R wrapper for the US Census Bureau’s Census and five-year American Community Survey APIs

Manifesto Project API: manifestoR R wrapper package

  • political party manifestos from around the world
  • covers 1,000+ parties from 1945 until today in 50+ countries on five continents.
  • the manifestoR package provides a wrapper for R

These tutorials are in today's class materials for you to explore if you'd like. Notice they both require you to register a free account!

15 / 21

Using an API without a wrapper: OMDb Movies example

OMDb stores information about movies. It currently provides a very good API, but no R wrapper.

Website: https://www.omdbapi.com/

See the api_omdb.Rmd tutorial in today's in-class materials for a walkthrough on how to use it!

16 / 21

Rectangling or Simplifying lists

17 / 21

Rectangling

In R, "rectangling" means transforming non-rectangular data (often nested lists) into a rectangular format (often a data frame).

The term is often associated with the tidyverse and its principles of tidy data.

In the context of web scraping, "rectangling" means transforming a deeply nested list (often obtained from raw JSON or XML data) into a tidy data set with rows and columns that is easier to work with!

If you need to simplify nested lists, check today's class materials for a tutorial and examples of rectangling!
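
As a language-neutral illustration of the idea (written in Python; the class tutorial uses R), here is a minimal sketch that flattens a small nested list into columns and rows. The data are made up, and the choices to build dotted column names and to join inner lists with ";" are assumptions of this sketch, not a standard.

```python
# A made-up nested list, shaped like parsed JSON from an API.
nested = [
    {"name": "Ada", "address": {"city": "London", "country": "UK"}, "langs": ["en", "fr"]},
    {"name": "Linus", "address": {"city": "Helsinki"}, "langs": ["fi"]},
]


def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # Collapse inner lists into a single cell.
            flat[name] = ";".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


def rectangle(records):
    """Flatten every record and fill missing columns with None."""
    flat_rows = [flatten(r) for r in records]
    columns = []
    for row in flat_rows:
        for col in row:
            if col not in columns:
                columns.append(col)
    return columns, [[row.get(col) for col in columns] for row in flat_rows]


columns, table = rectangle(nested)
print(columns)
for row in table:
    print(row)
```

The second record has no country, so its cell is filled with a missing value; deciding how to handle such gaps is a large part of rectangling real API output.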

18 / 21

API or direct scraping?

19 / 21

Using an API vs. direct scraping

API pros:

  • You comply with website preferences (if a website has an API, it wants you to use it)
  • Sometimes using the API is the only option you have (the website may make direct scraping difficult or impossible)
  • If the API has a good R wrapper, you can get data more easily than by scraping directly

API cons:

  • APIs usually require you to register
  • Requests are often rate-limited (capped at a certain number per time period)
  • It takes time to learn each API (every website has its own)
20 / 21

Using an API vs. direct scraping

Scraping pros:

  • can be powerful
  • basic rules can be applied to any scraping project

Scraping cons:

  • inconsistent and messy
  • susceptible to site changes (e.g., your scraping code breaks when the site's HTML changes)
21 / 21
