class: center, middle, inverse, title-slide # Brookings Data Network: String Basics ### Sifan Liu ### 2020-09-30 --- # Hello! ??? My name is Sifan Liu, Senior Research Analyst at Brookings Metro. I am also the current co-chair of Brookings Data Network. Today, we are going to talk about how to work with strings in R. Our first fifteen minutes will focus on the basics - this will be great for someone who is new to R, but probably has some data analysis experience in Stata, Excel, or if you are already a R user, but mostly self-taught, and would like to have a refresher on what you could do with strings in R. -- ## What we will cover: ### What is a string? 1. how to create strings 2. how to recognize strings ??? What we will cover first is the definition. We will learn how to create strings in R, and how to tell if a object is a string -- ### What can we do with strings in R? 1. manage length: `str_trim()`, `str_pad()` 2. convert case: `str_to_lower()`, `str_to_upper()`, `str_to_title()` 3. subset and combine: `str_sub()`, `str_c()` 4. detect and replace: `str_detect()`, `str_replace()` ??? We will then go over a couple most commonly used functions that help you to work with strings. Any questions before we move on? --- # String basics - They are also known as "text", "characters", ("chr" for short), etc. ??? Okay. The first thing you need to know is that strings have many names. They are sometimes called "text" as in "text analysis", or "characters", and we use them interchangeably. The way to identify them in R is quite simple - they are always quoted with either single quotes, or double quotes, doesn't matter. -- - Strings are always quoted with single quotes `''` or double quotes `""`. ```r a <- "I am a string" a ``` ``` ## [1] "I am a string" ``` ```r typeof(a) ``` ``` ## [1] "character" ``` ??? For example, Here I created a string by using double quotes around the text. In R, you can use typeof() function to ask directly, what is the type of the object, and here it tells me, it's "character" --- # String basics ```r b <- c("We are", "also strings") b ``` ``` ## [1] "We are" "also strings" ``` ??? Here's another example - you can also create a vector of characters by using the C function in R, which combines all arguments in the funciton to form a vector. We just showed that when you print a character or a character vector, R always returns the strings with quotes. But that's not the case with dataframes. -- exclude: true #### data frame ``` ## b ## 1 We are ## 2 also strings ``` ??? In a dataframe, it no longer shows you the quotes. -- exclude: true #### tibble ``` ## # A tibble: 2 x 1 ## value ## <chr> ## 1 We are ## 2 also strings ``` ??? That's why I always like to use tibbles - also from the tidyverse collections. Tibbles are dataframes, but they have some nice features, including printing the data type of each column --- exclude: true # Quiz .panelset[ .panel[.panel-name[Quiz] Which of the following object is not a string? 1. TRUE 1. "16" 1. c("a", "30") 1. "\n" ] .panel[.panel-name[Answer] ```r list <- list(TRUE, "16", c("a", 30), "\n") purrr::map(list, typeof) ``` ``` ## [[1]] ## [1] "logical" ## ## [[2]] ## [1] "character" ## ## [[3]] ## [1] "character" ## ## [[4]] ## [1] "character" ``` ] ] ??? Now let's do a quick excercise - please type in the chat, which of the following object is not a string? The first one is not a string, because it does not have quotes. In fact, this is a special data type, called logical value. Can anyone tell me why number 3 is also a character - there are no quotes around number 30. (c function coerce all arguments to the simplest type required to represent all information) --- # Let's clean some strings! ```r library(stringr) # "All functions start with 'str_'" ``` ??? Now, let's move on to see what we can do with strings, and we will be using the stringr library, which is part of the tidyverse. All the functions in the stringr function starts with "str_". -- ```r df ``` ``` ## # A tibble: 6 x 4 ## state_code state_name county_code county ## <int> <chr> <int> <chr> ## 1 1 "Alabama " 1 Autauga County ## 2 1 "AL" 3 Baldwin county ## 3 1 " Alabama" 5 Barbour County ## 4 1 "Alabama " 7 Bibb County ## 5 1 "AL" 9 Blount COUNTY ## 6 1 "Alabama" 11 Bullock County ``` ??? Here's a dataframe containing the name of counties, states, and the respective state code and county code. Please type in the chat, what did you notice from this dataset? - spaces in state name (copy from Excel) - county name are in various case (we can use str_length() to quickly show the number of characters in each string - the uneven lengths suggest that there are indeed hidden spaces) - state code and county code are numeric values - why that might be a problem? (County FIPS codes are five digit, the first two are states, and the next three digits are county. You need all five digits to uniquely identify a county in the US) --- # Managing length ??? The easiest thing we could do is to manage the length of a string, you can trim the extra spaces from a string, or pad a string with spaces or other characters. -- ### str_trim() .pull-left[ ```r df["state_name"] ``` ``` ## # A tibble: 6 x 1 ## state_name ## <chr> ## 1 "Alabama " ## 2 "AL" ## 3 " Alabama" ## 4 "Alabama " ## 5 "AL" ## 6 "Alabama" ``` ] -- .pull-right[ ```r df$state_name <- str_trim(df$state_name) df["state_name"] ``` ``` ## # A tibble: 6 x 1 ## state_name ## <chr> ## 1 Alabama ## 2 AL ## 3 Alabama ## 4 Alabama ## 5 AL ## 6 Alabama ``` ] ??? Let's first look at trim. To remove whitespace from a string, we use str_trim(), you can specify if you want to trim the white space from left or right - and the default is both. Now we have all the state names without spaces. --- # Managing length ### str_pad() .pull-left[ ```r df["state_name"] ``` ``` ## # A tibble: 6 x 1 ## state_name ## <chr> ## 1 Alabama ## 2 AL ## 3 Alabama ## 4 Alabama ## 5 AL ## 6 Alabama ``` ] -- .pull-right[ ```r df$state_name_pad <- str_pad(df$state_name, width = 8, side = "left", pad = " ") df["state_name_pad"] ``` ``` ## # A tibble: 6 x 1 ## state_name_pad ## <chr> ## 1 " Alabama" ## 2 " AL" ## 3 " Alabama" ## 4 " Alabama" ## 5 " AL" ## 6 " Alabama" ``` ] ??? The opposite of str_trim is str_pad, which allows you to pad a string with a given length. You specify in order, the expected length of your string, which side to pad, and which characters to pad. In this case I'm adding "space" to the left to make all the string have a lenghth of eight, in order to create this nice right-aligning format. --- # Convert case ```r str_to_lower("Blount COUNTY") ``` ``` ## [1] "blount county" ``` ```r str_to_upper("Blount COUNTY") ``` ``` ## [1] "BLOUNT COUNTY" ``` ```r str_to_title("Blount COUNTY") ``` ``` ## [1] "Blount County" ``` ??? Another useful function is case converstion. Stringr supports three types of case: all lower case, all upper case, or title case, which capitalize the first character of each word. This can be very helpful when you want to make sure different variations of a word are recognized as the same word. --- # Subset and combine .pull-left[ ### str_sub() ```r df["county"] ``` ``` ## # A tibble: 6 x 1 ## county ## <chr> ## 1 Autauga County ## 2 Baldwin county ## 3 Barbour County ## 4 Bibb County ## 5 Blount COUNTY ## 6 Bullock County ``` ] -- .pull-right[ ```r df$county <- str_sub(df$county, start = 1, end = -8) df["county"] ``` ``` ## # A tibble: 6 x 1 ## county ## <chr> ## 1 Autauga ## 2 Baldwin ## 3 Barbour ## 4 Bibb ## 5 Blount ## 6 Bullock ``` | | | | | | | | | | | | | | | | |:-----|:---|:---|:---|:---|:---|:--|:--|:--|:--|:--|:--|:--|:--|:--| |left |1 |2 |3 |4 |5 |6 |7 |8 |9 |10 |11 |12 |13 |14 | |right |-14 |-13 |-12 |-11 |-10 |-9 |-8 |-7 |-6 |-5 |-4 |-3 |-2 |-1 | |chr |A |u |t |a |u |g |a | |C |o |u |n |t |y | ] ??? You could also subtract characters at specified positions - if I only want to keep the first part of the county name, I use str_sub to keep the string "starting from position 1", which is the first character, to the 8th character counting from the end. Note that you can use the negative sign to specify positions from the end. --- ### str_c() ```r df$full_name <- str_c(df$county, ", ",df$state_name,", US") df["full_name"] ``` ``` ## # A tibble: 6 x 1 ## full_name ## <chr> ## 1 Autauga, Alabama, US ## 2 Baldwin, AL, US ## 3 Barbour, Alabama, US ## 4 Bibb, Alabama, US ## 5 Blount, AL, US ## 6 Bullock, Alabama, US ``` ??? The opposite of this, you can combine strings by using str_c(). Here I am combining the county name, state name, and character "US" to create a full name. --- # Detect and replace ??? The final set of functions is detecting certain characters from the string, and you can replace it with another word. -- ### str_detect() .pull-left[ ```r df["state_name"] ``` ``` ## # A tibble: 6 x 1 ## state_name ## <chr> ## 1 Alabama ## 2 AL ## 3 Alabama ## 4 Alabama ## 5 AL ## 6 Alabama ``` ] -- .pull-right[ ```r df$state_is_AL <- str_detect(df$state_name, pattern = "AL") df["state_is_AL"] ``` ``` ## # A tibble: 6 x 1 ## state_is_AL ## <lgl> ## 1 FALSE ## 2 TRUE ## 3 FALSE ## 4 FALSE ## 5 TRUE ## 6 FALSE ``` ] ??? Here I'm using str_detect() to find which of the county names contain the letter "a". Notice that it is case sensitive, and it return a vector of T/F --- exclude: true ```r df %>% dplyr::filter(str_detect(state_name, pattern = "AL")) ``` ``` ## # A tibble: 2 x 7 ## state_code state_name county_code county state_name_pad full_name state_is_AL ## <int> <chr> <int> <chr> <chr> <chr> <lgl> ## 1 1 AL 3 Baldwin " AL" Baldwin,~ TRUE ## 2 1 AL 9 Blount " AL" Blount, ~ TRUE ``` ??? This can be combined with filter function to return the data that meets the criteria. --- # Detect and replace ### str_replace() .pull-left[ ```r df["state_name"] ``` ``` ## # A tibble: 6 x 1 ## state_name ## <chr> ## 1 Alabama ## 2 AL ## 3 Alabama ## 4 Alabama ## 5 AL ## 6 Alabama ``` ] .pull-right[ ```r df$state_name <- str_replace(string = df$state_name, pattern = "AL", replacement = "Alabama") df["state_name"] ``` ``` ## # A tibble: 6 x 1 ## state_name ## <chr> ## 1 Alabama ## 2 Alabama ## 3 Alabama ## 4 Alabama ## 5 Alabama ## 6 Alabama ``` ] ??? Similar to str_detect(), str_replace() allows you to modify the part of the string that has been detected with the replacement value you specified. --- # Let's code! .pull-left[ ```r df[c("state_code", "county_code")] ``` ``` ## # A tibble: 6 x 2 ## state_code county_code ## <int> <int> ## 1 1 1 ## 2 1 3 ## 3 1 5 ## 4 1 7 ## 5 1 9 ## 6 1 11 ``` ] .pull-right[ .center[![fips code explainer](fips.png)] ] --- # What's next? -- ## [stringi](https://stringi.gagolewski.com/) -- ## Regular Expressions Define more complex search patterns, for examples: - extract all numbers in the strings - remove all characters after the last `.` - find the string that ends with "s" -- ## Special forms of strings - Dates and time: [lubridate](https://lubridate.tidyverse.org/) - Human names: [humaniformat](https://cran.r-project.org/web/packages/humaniformat/vignettes/Introduction.html) - Addresses: [tidygeocoder](https://cran.r-project.org/web/packages/tidygeocoder/vignettes/tidygeocoder.html) - Extract data from PDFs: [tabulizer](https://cran.r-project.org/web/packages/tabulizer/vignettes/tabulizer.html), [pdftools](https://github.com/ropensci/pdftools)