Sometimes you dream big, but you just don’t have the data…
I had this plan for a text analysis by character of the Parks & Recreation series to celebrate Galentine’s Day, but getting the data was a struggle. Subtitles are available for pretty much every episode, but those don’t say which character spoke the lines. I needed scripts, which are a bit harder to come by. I found six usable episode scripts on the web, all from the first three seasons and all in PDF format; some other PDFs were image scans, so the text could not be extracted.
That’s not even 5% of the series’ episodes, but let’s use what we’ve got!

But first: what is Galentine’s Day, I hear you ask?

On Friday February 13th, the day before Valentine’s Day, Leslie Knope from the Pawnee Parks Department celebrates Galentine’s Day with her… gals.

Extracting the data from the PDFs


#getting the data from pdf
library(pdftools)   #pdf_text
library(tidyverse)  #stringr, dplyr, purrr, tibble, ggplot2, tidyr
library(tidytext)   #unnest_tokens, bind_tf_idf, get_sentiments

pdf_script <- pdf_text("data/parks-and-recreation-3x07-harvest-festival-2011.pdf")

#remove the first two pages that are cover sheets
pdf_script <- pdf_script[3:34]

As per usual, getting the data into good shape took quite a bit of effort and regex playtime, but as this isn’t very interesting, I will quickly gloss over it.

I needed to extract text from six PDFs, so I did the first one manually and then turned the steps into a function. The function does a few things in a row:

  • It removes the recurring header text at the top of each page, which contains the episode information.
  • It extracts only the indented text, which is where the actual lines are; everything else is descriptions of characters and scenes.
  • It removes some other text like “ACT ONE” and “END OF SHOW”. It also removes anything between brackets, which was usually there to indicate a character’s reaction, like “(shouting)”.
  • After moving everything into one long vector, I could split the text into lines by character. This was possible because the characters’ names were the only text in all uppercase.
  • And finally it moves everything into a dataframe.
clean_pr_script <- function(pdf_script, episodeinfo){

  indented_text <- pdf_script %>%
    #removing the first lines on every page that are just title text
    str_replace("PARKS.+\\\r\\n.+10", "") %>% 
    #extracting only the indented text
    str_extract_all(pattern = "\r\n\\s{5,}.+") %>% 
    unlist()  #str_extract_all returns a list per page

  clean_indented_text <- indented_text %>%
    str_replace("(END OF )?COLD OPEN", "") %>%  
    str_replace("(END OF )?ACT [A-Z]+", "") %>%  
    str_replace("TAG", "") %>%
    str_replace("END OF SHOW", "") %>%  
    str_replace("\\(.+\\)", "") %>%  #removes anything between brackets
    str_replace("\r\n\\s+", "")  #removes the new line and extra spaces 

  #pasting into one long text
  one_long_text <- paste(clean_indented_text, collapse = " ")

  #splitting by speaker: a run of 3+ capitals marks the start of a line
  clean_script <- str_extract_all(one_long_text, "[A-Z]{3,}(\\s*\\W*[A-Z]?\\W*[a-z]+\\W*\\d*)+")  
  clean_script <- unlist(clean_script)

  #making a dataframe
  script_split <- str_split(clean_script, pattern = " ", n = 2)
  script_split_t <- purrr::transpose(script_split)

  output <- tibble(episode = episodeinfo,
                   speaker = unlist(script_split_t[[1]]),
                   text = unlist(script_split_t[[2]]))

  mutate_all(output, str_trim)
}
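The uppercase-name trick at the heart of the function can be checked on a made-up mini-script (the string below is hypothetical, not a real episode excerpt):

```r
library(stringr)

#hypothetical mini-script: speaker names are the only all-caps runs
toy <- "LESLIE Hello, Donna. TOM Wet Dream. DONNA Yup."

#same regex as in clean_pr_script: a run of 3+ capitals starts a new line
pieces <- unlist(str_extract_all(toy, "[A-Z]{3,}(\\s*\\W*[A-Z]?\\W*[a-z]+\\W*\\d*)+"))

#then split each piece into speaker and text, as the function does
str_split(str_trim(pieces), pattern = " ", n = 2)
```

Each extracted piece splits neatly into the speaker’s name and everything they said up to the next all-caps name.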

Time to get the episode info out of all the PDFs. Since the number of pages, and especially the number of intro pages, varied per file, I did this part manually upfront rather than inside the function.

pdf_3x07 <- pdf_text("data/parks-and-recreation-3x07-harvest-festival-2011.pdf")
pdf_3x07 <- pdf_3x07[3:36]
script_3x07 <- clean_pr_script(pdf_3x07, "3x07 Harvest Festival")

pdf_3x05 <- pdf_text("data/parks-and-recreation-3x05-media-blitz-2011.pdf")
pdf_3x05 <- pdf_3x05[3:36] 
script_3x05 <- clean_pr_script(pdf_3x05, "3x05 Media Blitz")

pdf_2x07 <- pdf_text("data/Parks_and_Recreation_2x07.pdf")
pdf_2x07 <- pdf_2x07[3:41] 
script_2x07 <- clean_pr_script(pdf_2x07, "2x07 Greg Pikitis")
script_2x07 <- script_2x07[2:308,]
script_2x07$text <- str_replace(script_2x07$text, "take one.\\s+", "")

pdf_1x02 <- pdf_text("data/parks-and-recreation-1x02-canvassing-2009.pdf")
pdf_1x02 <- pdf_1x02[2:38] 
script_1x02 <- clean_pr_script(pdf_1x02, "1x02 Canvassing")

pdf_2x02 <- pdf_text("data/parks-and-recreation-2x02-the-stakeout-2009.pdf")
pdf_2x02 <- pdf_2x02[5:45] 
script_2x02 <- clean_pr_script(pdf_2x02, "2x02 The Stakeout")

pdf_2x04 <- pdf_text("data/parks-and-recreation-2x04-practice-date-2009.pdf")
pdf_2x04 <- pdf_2x04[3:36] 
script_2x04 <- clean_pr_script(pdf_2x04, "2x04 The Practice Date")

#binding them all together:
script <- bind_rows(script_1x02, script_2x02, script_2x04, script_2x07,
                    script_3x05, script_3x07)

Doing all that gives a dataframe of 1,841 lines split by character:

## # A tibble: 1,841 x 3
##    episode         speaker text                                           
##    <chr>           <chr>   <chr>                                          
##  1 1x02 Canvassing LESLIE  Hello, Donna. I have the conference room booke~
##  2 1x02 Canvassing DONNA   Yup.                                           
##  3 1x02 Canvassing LESLIE  And the large meeting room for tomorrow night? 
##  4 1x02 Canvassing DONNA   Yes.   Stop asking.                            
##  5 1x02 Canvassing LESLIE  Tomorrow night is our very first public forum ~
##  6 1x02 Canvassing TOM     Wet Dream.                                     
##  7 1x02 Canvassing LESLIE  Really? Okay. Like the proud Wamapoke tribes o~
##  8 1x02 Canvassing ANN     Mm. Hearty breakfast.                          
##  9 1x02 Canvassing MARK    It’s the most important meal of the day.       
## 10 1x02 Canvassing ANN     I think lunch is the most important--          
## # ... with 1,831 more rows

I’m an avid reader of Julia Silge’s blog on text analysis, and I noticed her using two functions from a personal package by David Robinson that are pretty nifty when you want to make ordered, faceted plots of words that appear multiple times.
I did not load the entire package, but just copied those two functions to use here:


#functions from David Robinson
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  stats::reorder(new_x, by, FUN = fun)
}

scale_x_reordered <- function(..., sep = "___") {
  reg <- paste0(sep, ".+$")
  ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}
Just a reminder for those who know Parks & Recreation: which episodes are in, and what are they about? Here are the top tf-idf words by episode; in many cases these really are enough to remember what the episode was about.

script %>%
  unnest_tokens(word, text) %>% 
  count(episode, word, sort = TRUE) %>% 
  bind_tf_idf(word, episode, n) %>% 
  group_by(episode) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder_within(word, tf_idf, episode)) %>% 
  ggplot(aes(word, tf_idf, fill = episode)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ episode, scales = "free_y", ncol = 3) +
  scale_x_reordered() +
  coord_flip()  #flip so the words are readable
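As an aside on what bind_tf_idf contributes here: words that appear in every episode get an idf (and therefore tf-idf) of zero, so only episode-specific words survive into the plot. A toy example with made-up counts for two fictional episodes:

```r
library(dplyr)
library(tidytext)

#made-up word counts for two fictional episodes
toy_counts <- tibble(
  episode = c("ep1", "ep1", "ep2", "ep2"),
  word    = c("festival", "the", "walrus", "the"),
  n       = c(10, 50, 8, 40)
)

toy_tfidf <- bind_tf_idf(toy_counts, word, episode, n)
toy_tfidf
#"the" occurs in both episodes, so its tf_idf is 0;
#"festival" and "walrus" are unique to one episode and score high
```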

Text analysis for the main characters

I wasn’t really interested in episode descriptions, though, but in the characters: who says what a lot in comparison to the others?
The tf-idf scores show some nice things. Leslie is obviously the most community-oriented, with words like festival, forum, our, community, Pawnee and park, but her arch-nemesis Greg Pikitis features as well.
In these early seasons Ben still has a lot of explaining to do about becoming mayor at 18 and building Ice Town.
And although not entirely readable, Andy’s set of words is the true star for me. What a collection of random words: sugar rush, rock, walrus, vegetables?

script %>%
  filter(speaker %in% c("LESLIE", "TOM", "ANN", "RON", "APRIL", "BEN", "ANDY")) %>%
  unnest_tokens(word, text) %>% 
  count(speaker, word, sort = TRUE) %>% 
  bind_tf_idf(word, speaker, n) %>% 
  group_by(speaker) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder_within(word, tf_idf, speaker)) %>% 
  ggplot(aes(word, tf_idf, fill = speaker)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ speaker, scales = "free", nrow = 2) +
  scale_x_reordered() +
  coord_flip()  #flip so the words are readable

What about sentiments? Who is the most positive person in the bunch?

ratio_posneg <- script %>%
  unnest_tokens(word, text) %>% 
  inner_join(get_sentiments("bing"), by="word") %>% 
  count(speaker, sentiment, sort=TRUE) %>% 
  spread(sentiment, n) %>% 
  mutate(total = positive + negative, ratio = positive/negative) %>% 
  filter(total > 12, speaker != "GREG") %>% 
  mutate(speaker = reorder(speaker, ratio))

ggplot(ratio_posneg, aes(x= ratio, y=speaker)) +
  geom_point(size = 3, col = "cadetblue4") +
  labs(title = "Characters according to their positive/negative words ratio",
       x = "Ratio positive to negative words using Bing lexicon",
       y = "Character")

Houston, we sure have no data interpretation issues here: by a long, long mile, the winner is…

Leslie-Knope-style compliments

There was so much more I wanted to do, but the data was really limited. Strangely enough, none of these episodes contained a single Leslie-Knope-style compliment to Ann!
Here are all of Leslie’s lines featuring the word ‘Ann’:

script %>%
  filter(speaker == "LESLIE") %>% 
  filter(str_detect(text, pattern = "Ann")) %>% 
  pull(text)  #just the lines themselves, as a character vector

##  [1] "Going door-to-door, talking to Ann’s neighbors, whipping them into an excited, positive frenzy, and getting them to come to the forum. Canvassing!"                               
##  [2] "Exactly. Teams are me and Ann, Tom and Mark.  And we’re gonna win!"                                                                                                               
##  [3] "Ann, wait--"                                                                                                                                                                      
##  [4] "Never wish that. Ann, the whole point of democracy is decisions are made by the people, as a group. This is a gift, to be able to talk to them. A gift from the founding fathers."
##  [5] "Hello there! I’m Leslie Knope, from the Parks Department, and this is my colleague Ann."                                                                                          
##  [6] "Jinx!  No backsies. We all had the same idea.  Mark and Ann -- power team."                                                                                                       
##  [7] "I just have to prove to Ann that it’s about perseverance. All I need is one strong “I’ll be there.” The next person I meet, I’m converting."                                      
##  [8] "Mark, Ann?"                                                                                                                                                                       
##  [9] "Ann, please. It is absolutely fine. The most important thing to me is our friendship. You know my code -- hoes before bros."                                                      
## [10] "You know Ann’s name."                                                                                                                                                             
## [11] "Ann is great. But I bet she misses you."                                                                                                                                          
## [12] "So, do you... spy on Ann from the pit?"                                                                                                                                           
## [13] "I’m not sure that’s healthy.   I mean, you’re not with Ann anymore.   She may even be dating other guys."                                                                         
## [14] "I hope Mark and Ann got home okay. I mean, I hope Ann is home, and okay, and Mark is also okay, in his home--"                                                                    
## [15] "Oh, that’s a way better idea. Awesome. Thanks, Ann.  Oh God -- I’m getting so nervous."                                                                                           
## [16] "Ann is so awesome. I’m lucky to have a friend who would spend her night off pretending to treat me so poorly."

“Ann is great”, “Ann is awesome” - yes, those are compliments, but this is not what Leslie is known for!

So I will leave you with a few quotes I got from this website:

“Ann, you beautiful tropical fish.”

“Oh, Ann, you beautiful, rule-breaking moth.”

“Ann, you are such a good friend, you’re a beautiful, talented, brilliant, powerful musk-ox. Thank you, ox.”

“Ann, you cunning, pliable, chestnut-haired sunfish.”

“Oh, Ann. You beautiful, naïve, sophisticated newborn baby.”

“Ann, you poetic, noble land-mermaid.”

Happy Galentine’s Day, gals!
