MACS 30500 LECTURE 13

---

# Agenda

* Strings: definition

* Regular Expressions: definition and uses

* The `stringr` package

* Examples and practice (download today's class materials)

---

## Strings

---

### Introduction

These are strings:

```r
string1 <- "This is a string"

string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```
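As a complement to the single-quote approach above, a double quote can also be escaped with a backslash inside a double-quoted string. This is a minimal sketch; the name `string3` is made up for illustration.

```r
# Escape each inner double quote with a backslash
string3 <- "A \"quote\" inside a double-quoted string"

# print() shows the string with its escapes
print(string3)

# writeLines() shows the raw contents, without the backslashes
writeLines(string3)
```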

#### Today

* we focus on methods for manipulating strings: regular expressions!
* readings for today

---

### String and Character Vector

A **string** is a sequence of characters (a piece of text) enclosed in either double quotes `"` or single quotes `'`. Strings can include letters, numbers, symbols, and whitespace characters.

```r
string <- "Ciao, my name is Sabrina and I am 99 years old!!"
class(string)
```

```
## [1] "character"
```

```r
length(string)
```

```
## [1] 1
```

---

### String and Character Vector

A **character vector** is a collection of strings: each element of the vector is a string!

```r
char_vect <- c("Ciao come stai?", "Ciao", "Hello", "etc.") 
class(char_vect)
```

```
## [1] "character"
```

```r
length(char_vect)
```

```
## [1] 4
```

Thus a single string, like the one in the previous example, is also a character vector of length one.

We use character vectors when we need to handle multiple pieces of text together rather than one string at a time.
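Because `length()` counts the elements of a vector, it does not tell us how long each string is. A short sketch of the difference, using base R's `nchar()` on the same vector as above:

```r
char_vect <- c("Ciao come stai?", "Ciao", "Hello", "etc.")

# length() counts the strings in the vector
length(char_vect)
# 4

# nchar() counts the characters in each string
nchar(char_vect)
# 15 4 5 4
```

The `stringr` function `str_length()` plays the same role as `nchar()`.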

---

## Regular Expressions

---

### Regular Expressions: definitions and uses

Regular expressions ("regex" or "regexes") **are strings containing normal characters and/or special meta-characters.** They describe a specific pattern to match in a given text.

More formally, regex:
* is a *language in its own right*
* used for *pattern matching* 
* adopted by many programming languages, such as R, Python, and others!

Our goal today is to decipher and write patterns like this:

```r
email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
```
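Read left to right, the pattern says: one or more allowed characters (`[a-zA-Z0-9._%+-]+`), an `@`, one or more domain characters (`[a-zA-Z0-9.-]+`), a literal dot (`\\.`, with the backslash doubled because it sits inside an R string), and at least two letters (`[a-zA-Z]{2,}`). The sketch below tries it on a few inputs; the `candidates` vector is made up for illustration.

```r
library(stringr)

email_pattern <- "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"

# Hypothetical inputs, made up for illustration
candidates <- c("sabrina@example.com", "not-an-email", "a.b@uni.edu", "missing@domain")

# TRUE where the string contains a match for the pattern
str_detect(candidates, email_pattern)
# TRUE FALSE TRUE FALSE
```

Note that `str_detect()` looks for a match anywhere in the string, so this checks that an email-like piece of text is present, not that the whole string is a valid address.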

---

### Regular Expressions: definitions and uses

Regular expressions are powerful! They allow us to find patterns **in any task that deals with text,** such as NLP (Natural Language Processing) or data-cleaning tasks that involve text.

For example, you can use regular expressions to:

* Extract specific elements from texts (e.g. dates, words that include a given set of letters, or all past tenses in a text)
* Perform textual substitutions (e.g. find and replace HTML tags after you have scraped a page)
* Verify text format (e.g. "Is this email/phone-number/address valid?")

Today we learn about the `stringr` package and regular expressions, which are particularly useful for those of you who plan to work with textual data in this course and beyond.
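The three uses listed above can be sketched roughly as follows. The `notes` vector and the patterns are made up for illustration and are deliberately simple; real dates, HTML, and phone numbers usually need more careful patterns.

```r
library(stringr)

# Hypothetical text, made up for illustration
notes <- c("Meeting on 2024-03-15", "<p>Hello <b>world</b></p>", "Call 773-555-0199")

# Extract: pull out a date written as YYYY-MM-DD (NA where there is no match)
str_extract(notes, "\\d{4}-\\d{2}-\\d{2}")
# "2024-03-15" NA NA

# Substitute: strip anything that looks like an HTML tag
str_replace_all(notes[2], "<[^>]+>", "")
# "Hello world"

# Verify: does the string end with a ###-###-#### number?
str_detect(notes, "\\d{3}-\\d{3}-\\d{4}$")
# FALSE FALSE TRUE
```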

---

## Regular Expressions with `stringr`

---

### The `stringr` package in R

When you use regular expressions, you will most likely use them together with one of the functions from the `stringr` package.

This package includes several functions that let you: detect matches in a string, count the number of matches, extract them, replace them with other values, split a string based on a match, etc.

---

### The `stringr` package in R

Fundamental `stringr` functions and their use:

* `str_view()`: display the regex matches (all matches since `stringr` 1.5.0; only the first match in earlier versions)
* `str_view_all()`: display all regex matches (deprecated as of `stringr` 1.5.0 in favor of `str_view()`)
* `str_detect()`: detect matches in a string
* `str_count()`: count the number of matches
* `str_extract()` and `str_extract_all()`: extract matches
* `str_replace()` and `str_replace_all()`: replace matches
* `str_split()`: split a string based on a match
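A quick sketch of most of these functions on one small vector. The `fruits` vector is made up for illustration; `str_view()` is left out because its output is a highlighted display rather than a value.

```r
library(stringr)

# Hypothetical character vector, made up for illustration
fruits <- c("apple", "banana", "pear")

str_detect(fruits, "an")           # is there a match?    FALSE TRUE FALSE
str_count(fruits, "a")             # how many matches?    1 3 1
str_extract(fruits, "[aeiou]")     # first match          "a" "a" "e"
str_replace(fruits, "a", "A")      # replace first match  "Apple" "bAnana" "peAr"
str_replace_all(fruits, "a", "A")  # replace every match  "Apple" "bAnAnA" "peAr"
str_split("a,b,c", ",")            # split on a match; returns a list
```

Each function takes the string (or character vector) first and the pattern second, which makes them easy to use with the pipe.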

---

### Resources

* [`stringr` documentation and cheatsheet](https://stringr.tidyverse.org/)

* [Chapter 15](https://plsc-31101.github.io/course/strings-and-regular-expressions.html#applying-regex) by Rochelle Terman, which explains `stringr` (based on the *R for Data Science* textbook)

* Read [Chapter 14 "Strings"](https://r4ds.hadley.nz/strings) in "R for Data Science" 2nd Edition (read it all, but especially 14.4 "Extracting data from strings").

* Read [Chapter 17](https://bookdown.org/rdpeng/rprogdatascience/regular-expressions.html#the-stringr-package) from *R Programming for Data Science*. This book covers the entire range of regular expression packages and functions: you do not need to understand everything, so focus on the big picture. In class we talk about `stringr`.

* For more in-depth information on regular expressions, check [this excellent tutorial](https://github.com/ziishaned/learn-regex/blob/master/README.md).

---

## Examples and Practice

Download today's class materials from the website to follow along.