class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 16 ] .author[ ### Topics: Getting data from the web: API. ] --- class: inverse, middle <!-- TO IMPROVE FOR NEXT TIME NOTES FALL 2024 Maybe drop this as a standalone lecture and incorporate 20 min of it in the scraping 2 lecture. Instead than this lecture add a text mining in R intro class which could be more useful for HW7 and potentially Hw6 (although do no make it required bcs covered too late) Leave the material available for students. Leave last class as check in for students project no vote. Can have presentatoin instead --> ## Agenda + Two ways to scrape + Intro to scraping using an API + Terminology and sending queries to APIs + APIs with and without a wrapper package + Registering for access + API or direct scraping? + Rectangling or Simplifying lists *Some of the content of today's slides is also in the previous lecture. I copied everything here for clarity.* --- class: inverse, middle ## Two ways to scrape --- ## Two ways to scrape <br> ### 1. Directly scraping the website * Every website is built with code, typically a mix of HTML, CSS, and JavaScript. * To collect data from a website, we need to learn how to interact with this code. * Example: in theory, anything! in practice, start simple like Wikipedia or government sites. -- ### 2. Using a web API (Application Programming Interface) * Interface provided by the website that allows users to collect data from it. * To collect data from a website using an API, you need to learn how to use that API. * Example: [OMDb API](https://omdbapi.com/), see chapter 4 from today's readings. --- class: inverse, middle ## Scraping using an API: Application Programming Interface --- ### API: terminology > Interface provided by the website that allows users to collect data from that website. The majority of web APIs use a particular style know as **REST** or **RESTful** which stays for "Representational State Transfer." This style allows to query the website database using URLs, just like you would construct an URL to view a web page. An **URL** (Uniform Resource Location), is a string of characters that uses HTTP (HyperText Transfer Protocol) to send request for data. --- ### Sending queries to an API The process described in the `macss` example (previous lectures) is very similar to how APIs work, with only a few changes: * The URL that you use for sending requests (queries) to an API will have two more things compare to the request you send using direct scraping: (1) search terms and/or filtering parameters, and (2) specific references to that API * The response you get back from the API is not formatted as HTML, but it is usually formatted as raw text * You need R to interact with the API. For example: * parse that response, get the data from it * convert the data into a format that you like (dataframe, lists, etc.) * export the data, usually as `.csv` or `.json` --- ### API: with and without a wrapper package When using an API in R, there are two approaches: **1. With an API wrapper:** * A wrapper is a specific package written in a given language, like R, for an existing API (tailored to that API only) * Someone took the time to write a bunch of functions to interact with a specific API and put them together in a package. which we refer to as "R wrapper" for a specific API * Useful because: reproducible, up-to-date (ideally), easy to access * How to use: each package should come with documentation (in the form of a pdf, GitHub repo, or both) that explains how to use it and its main functions. Reading the documentation is essential to understand how to effectively use the package. * Example: Wordbank API with `wbstats` R wrapper package which returns results in a tidy dataframe --- ### API: with and without a wrapper package When using an API in R, there are two approaches: **2. Without an API wrapper, aka direct API interaction**: If no R wrapper exists, you can directly use the API provided by the website. In this case, you use R to communicate directly with the website's API. This is generally more difficult. Example: OMDb Movies example API without a wrapper ` https://www.omdbapi.com/` --- ### API: with and without a wrapper package Sometimes both approaches are available. Other times, only approach 2 is available. **Tip:** Scraping with an API, compared to direct scraping, can be a hit-or-miss experience. The keys to success are using a well-designed R wrapper if available, or relying on well-documented API documentation. --- ### Using an API with and without a wrapper package: Examples The class materials for today (download it from the website) include: * **Four examples of APIs that have a wrapper package:** * Wordbank * GeoNames * Census * Manifesto Project * **One example of an API without a wrapper:** * OMDb Open Movie Database **We examine the first two in class. The others are available for you to explore!** <!-- Not all APIs have an R wrapper package to facilitate interaction. If such a wrapper does not exist, we can interact directly with the API, though this can be though for beginner projects (depending on the API). Direct interaction typically requires more coding, such as writing custom functions, but it also offers greater flexibility to customize data requests. --> --- ### Using an API with a wrapper: Wordbank API example **Wordbank database API**: `wbstats` R wrapper package * socioeconomic indicators spanning several decades and numerous topics * full data are available for bulk download as CSV files from their website (see HW5); however, researcher often only need a handful of indicators or a subset of countries * thus the Wordbank database has an API and `wbstats` is an R wrapper that simplify using the API! *Let's move to today's in-class materials to learn more about this API!* --- ### Register for access: Overview The Wordbank API is free and does not require registration. However **many APIs require you to register for access:** * Sometimes, registration is as simple as providing an email and password, then receiving an email with your username and private API key. * Other times, you need to submit an application and go through a review process. * Often, this process is free, but some APIs require paying a fee. **Why register for access?** Registration allows APIs to track users, their queries, and manage demand. If an API requires you to register and obtain a username, password, or key, you will need to provide this same information when using the corresponding R wrapper package. --- ### Register for access: GeoNames example **GeoNames geographical database API**: * geographical information for all countries and other locations * the API requires you to register and set a username and key * the `geonames` package provides a wrapper for R *Let's check today's in-class materials to learn how store username and other private info in R for this API!* <!-- we are going to register to this API and show how to store our credientials in R --> --- ### Using an API with a wrapper: More examples There are two more examples of using an API with a wrapper included in your class materials: **Census Bureau API**: `tidycensus` R wrapper package * statistical data from the US Census Bureau * the `tidycensus` provides a R wrapper for the US Census Bureau’s Census and five-year American Community Survey APIs **Manifesto Project API**: `manifestoR` R wrapper package * political party manifestos from around the world * covers 1,000+ parties from 1945 until today in 50+ countries on five continents. * the `manifestoR` package provides a wrapper for R These tutorials are in today's class materials for you to explore if you'd like. Notice they both require you to register a free account! --- ### Using an API without a wrapper: OMDb Movies example **OMDb** stores info about movies, and currently does provide a very good API but not an R wrapper Website: https://www.omdbapi.com/ *See the `api_omdb.Rmd` tutorial in today's in-class materials for walkthrough on how to use it!* --- class: inverse, middle ## Rectangling or Simplifying lists --- ### Rectangling In R "Rectangling" means **transforming non-rectangular data (often nested lists) into a rectangular format (often a data frame).** The term is often associated with the `tidyverse` and its principles of tidy data. -- **In the context of web scraping,** "rectangling" means transforming a deeply nested list (often obtained from raw JSON or XML data) into a tidy data set with rows and columns that is easier to work with! If you need to simplify nested lists *check today's class materials for a tutorial and examples of rectangling!* --- class: inverse, middle ## API or direct scraping? --- ### Using an API VS. direct scraping API pros: * You comply with website preferences (if a website has an API, it wants you to use it) * Sometimes using the API is the only option you have (the website may make direct scraping difficult or impossible) * if the API has a good R wrapper, you can get data more easily than by scraping directly API cons: * API usually requires you to register * rate-limit * time invested to learn how to use the API (each website has its own API) --- ### Using an API VS. direct scraping Scraping pros: * can be powerful * basic rules can be applied to any scraping project Scraping cons: * inconsistent and messy * susceptible to site changes (e.g. your code for scraping will break) <!-- ### Acknowledgments APIs examples are drawn from Rochelle Terman’s "Finding APIs" page [here](https://plsc-31101.github.io/course/collecting-data-from-the-web.html#finding-apis) and Benjamin Soltoff’s “Computing for the Social Sciences” course materials, licensed under the CC BY NC 4.0 Creative Commons License. Any errors or oversights are mine alone. -->