class: center, middle, inverse, title-slide .title[ # MACS 30500 LECTURE 13 ] .author[ ### Topics: Working with Strings & Regular Expressions. ] --- class: inverse, middle # Agenda * Strings: definition * Regular Expressions: definition and uses * The `stringr()` package * Examples and practice (download today's class materials) <!-- we focus mostly on example and practice with Rmd file from in-class materials! --> --- class: inverse, middle ## Strings --- ### Introduction These are strings: ```r string1 <- "This is a string" string2 <- 'If I want to include a "quote" inside a string, I use single quotes' ``` -- #### Today * we focus on methods for manipulating strings: regular expressions! * readings for today <!-- Not 1:1 allignment but haver focus on Reg Expr because super important to know them! So formally a string is... --> --- ### String and Character Vector A **string** is a sequence of characters (a piece of text) enclosed in either double `"` or single quotes `'`. Strings can include letters, numbers, symbols, and whitespace characters. ```r string <- "Ciao, my name is Sabrina and I am 99 years old!!" class(string) ``` ``` ## [1] "character" ``` ```r length(string) ``` ``` ## [1] 1 ``` --- ### String and Character Vector A **character vector** is a collection of strings: each element of the vector is a string! ```r char_vect <- c("Ciao come stai?", "Ciao", "Hello", "etc.") class(char_vect) ``` ``` ## [1] "character" ``` ```r length(char_vect) ``` ``` ## [1] 4 ``` Thus single string, like in the previous examples, is also a character vector of length one. We use character vectors when we need to handle multiple pieces of text together (e.g., when we handle multiple strings vs. only one). --- class: inverse, middle ## Regular Expressions --- ### Regular Expressions: definitions and uses Regular expressions ("regex" or "regexes") **are strings containing normal characters and/or special meta-characters.** They describe a specific pattern to match in a given text. More formally, regex: * is a *language on its own right* * used for *pattern matching* * adopted by many programming languages, such as R, Python, and others! -- Our goal today is to decipher and write patterns like this: ```r email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}" ``` <!-- But then you might ask yourself... given our ability to manipulate strings and our ability to test for equivalence (with the == operator) or to test whether some string contains another (with the %in% operator), we don't technically need regular expressions... Not exactly! We want to use regex becasue they are POWERFUL --> --- ### Regular Expressions: definitions and uses Regular expressions are powerful! They allow us to find patterns **in any task that deals with text,** such as NLP (Natural Language Processing) or data-cleaning tasks that involve text. For example, you can use regular expressions to: * Extract specific elements from texts (e.g. dates, find words that include a given set of letters, find all past tenses in a text,) * Perform textual substitutions (e.g. find and replace HTML tags after you scraped a page) * Verify text format (e.g. "Is this email/phone-number/address valid?") Today we learn about the package `stringr` and regular expressions: particularly useful for those of you who plan to work with textual data in this course and beyond. --- class: inverse, middle ## Regular Expression with `stringr()` --- ### The `stringr()` package in R When you use regular expressions, most likely you will need to use your them together with one of the functions from the `stringr()` package. This package includes several functions that let you: detect matches in a string, count the number of matches, extract them, replace them with other values, split a string based on a match, etc. --- ### The `stringr()` package in R Fundamental `stringr()` functions and their use: * `str_view()` return the first regex match * `str_view_all()` return all regex matches (deprecated in the last version of `stringr`) * `str_detect()`: detect matches in a string * `str_count()`: count the number of matches * `str_extract()` and `str_extract_all()`: extract matches * `str_replace()` and `str_replace_all()`: replace matches * `str_split()`: split a string based on a match --- ### Resources * [`stringr()` documentation and cheatsheet](https://stringr.tidyverse.org/) * [Chapter 15](https://plsc-31101.github.io/course/strings-and-regular-expressions.html#applying-regex) by Rochelle Terman, explains `stringr()` (using the *R for Data Science* textbook) * Read [Chapter 14 "Strings"](https://r4ds.hadley.nz/strings) in "R for Data Science" 2nd Edition (read it all, but especially 14.4 "Extracting data from strings"). * Read [Chapter 17](https://bookdown.org/rdpeng/rprogdatascience/regular-expressions.html#the-stringr-package) from *R Programming for Data Science*. This book covers the entire range of regular expressions packages and functions: you do not need to understand everything, focus on the big picture. In-class we talk about `stringr()` * For more in-depth info on regular expressions check [this excellent tutorial](https://github.com/ziishaned/learn-regex/blob/master/README.md) --- class: inverse, middle ## Examples and Practice Download today's class material from the website to follow along