
tidy text mining: sentiment and most frequent words (MFW) analyses of a Star Trek DS9 first-season episode, "The Nagus"

A few months ago, the Revolution Analytics newsletter directed me to the 'tidy data' approach to text mining by Julia Silge and David Robinson. I began trying out their tidytext R package, first on The Federalist papers, attempting to roughly replicate the analysis of Mosteller and Wallace (1964), and then on inaugural addresses by U.S. presidents. The Federalist analysis ended up morphing into an application of Burrows's 'rolling delta' and the use of a different R text analytics package. More on that in a subsequent post.

Silge and Robinson's text mining examples include a lexicon-based sentiment analysis of Jane Austen's novels. One example plots the net positive versus negative sentiment over the progression of each novel. While mulling over what to do next with tidy text mining, I was re-watching my favorite Star Trek series, Deep Space Nine. I wondered to what extent the dialogue spoken by characters in screenplays such as DS9 would contain sentiment-laden terms, and whether sentiment over the course of an episode would follow a pattern, perhaps tracing the contour of the DS9 crew and residents confronting and resolving conflict. Of course, the screenplay I chose, "The Nagus" (a first-season episode, with a teleplay by Ira Steven Behr from a story by David Livingston, who also directed), might be somewhat of an outlier. The episode mostly features Quark, the Ferengi owner of a bar on DS9, and doesn't feature the kind of conflict with alien species that other episodes do. Still, I decided to process this screenplay (the script revision is dated 1/07/93): converting it to a tidy text format, filtering stop (or function) words, creating a word cloud, and visualizing episode sentiment. Below are some highlights.

There are various websites with DS9 scripts. Each screenplay is somewhat interesting in itself: it begins with writing and production credits, tables of characters and sets, and a "pronunciation guide". The episode unfolds in six parts, beginning with a "Teaser" (the opening scenes, prior to the first commercial break), followed by Acts One through Five.

Excerpt: Star Trek DS9 Screenplay

                          "The Nagus" 
                    (fka "Friends & Foes") 
                           Story by 
                       David Livingston 
                          Teleplay by 
                        Ira Steven Behr 
                          Directed by 
                       David Livingston
         [cast, sets, and pronunciation guide excluded.]




The door opens, out steps a Ferengi, KRAX.  Tough, authoritative, arrogant.  He checks the corridor. Satisfied  that no danger lurks, he gestures back into the airlock. Out steps ZEK, an ancient hunched-over Ferengi, features mysteriously obscured by a hooded cloak.  In one hand he  clutches the supporting arm of Maihar'du, a tall bald humanoid  alien.  In his other hand, he carries a staff, the handle of which is a smiling Ferengi head made of gold-press latinum. Together Krax and Maihar'du help maneuver the old fellow  slowly down the corridor.


 JAKE is hurrying to get ready for school.  He's sitting on  his bed, putting on his shoes, when SISKO ENTERS from the  living room grinning with anticipation.

   Hey Jake, I've got a terrific surprise    for you.

    (smiling)   Oh yeah, what is it?

   The two of us are going to Bajor for the start of the Gratitude Festival.

   What's the Gratitude Festival?                 

Pre-processing the screenplay

Prior to preparing a 'tidy' version of the screenplay, tokenized on words, I cleaned it up a bit. I removed page headers (such as "DEEP SPACE: "The Nagus" - REV. 1/07/93 - ACT ONE 8."), replacing the first header that identifies a section with just the name of the section, such as "ACT ONE". Numbers that appear to identify a scene within each Act, such as "11   CONTINUED:", were deleted as well.
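As a rough illustration of that clean-up, here is a minimal base R sketch. It is not the code I ran: the function name clean_script() and the regular expressions are assumptions based only on the two header formats quoted above, and the sample lines are made up for the example.

```r
# Hypothetical helper: drop page headers and scene markers, keeping one
# section label (e.g. "ACT ONE") in place of the first header of each section.
clean_script <- function(lines) {
  # Page headers are assumed to start with "DEEP SPACE:" and carry a "REV." date
  is_header <- grepl("^\\s*DEEP SPACE:.*REV\\.", lines, perl = TRUE)
  # Scene markers are assumed to be a scene number followed by "CONTINUED:"
  is_scene <- grepl("^\\s*[0-9]+[A-Z]?\\s+CONTINUED:", lines, perl = TRUE)

  # Pull the section name out of each header line
  label <- ifelse(
    is_header,
    sub(".*-\\s*(TEASER|ACT [A-Z]+).*", "\\1", lines, perl = TRUE),
    NA_character_
  )

  # Only the first header of each section is kept, reduced to its label
  first_of_section <- is_header & !duplicated(label)
  out <- ifelse(first_of_section, label, lines)
  out[!is_scene & (!is_header | first_of_section)]
}

# Made-up sample in the layout described above
raw <- c(
  "DEEP SPACE: \"The Nagus\" - REV. 1/07/93 - ACT ONE 8.",
  "11   CONTINUED:",
  "QUARK",
  "DEEP SPACE: \"The Nagus\" - REV. 1/07/93 - ACT ONE 9.",
  "You worthless, tiny eared fool!",
  "DEEP SPACE: \"The Nagus\" - REV. 1/07/93 - ACT TWO 15.",
  "ROM"
)

cleaned <- clean_script(raw)
print(cleaned)
```

Keeping the section labels as their own lines makes it easy to assign each later line of text to an Act.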

Tidying the screenplay and MFWs

After converting the screenplay to a tibble (a tidy data frame) and tokenizing on words, I followed the procedures in section 1.3 of Text Mining with R to produce a bar graph of the most frequent words, filtered for n > 30. One difference from Silge and Robinson's bar plot is the color palette. I used the wesanderson color palette package to select 11 colors based on set designs from my personal favorite Anderson film, Rushmore.
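The general shape of that step, following the pattern in Text Mining with R, might look like the sketch below. The sample lines, the variable names, and the min_n threshold are placeholders for illustration (the real plot used n > 30), and the default ggplot2 colors stand in for the Rushmore palette.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Made-up stand-in for the cleaned screenplay lines
script_lines <- c(
  "QUARK You worthless, tiny eared fool!",
  "ROM But Quark, the Nagus is here.",
  "Quark glares at Rom."
)

# One row per line of text
script_df <- tibble(line = seq_along(script_lines), text = script_lines)

# One row per word (lower-cased by unnest_tokens), with stop words removed
tidy_script <- script_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Most frequent words first
word_counts <- tidy_script %>%
  count(word, sort = TRUE)

# Threshold is 1 for this toy sample; the post's bar graph used 30
min_n <- 1

mfw_plot <- word_counts %>%
  filter(n > min_n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

print(word_counts)
```

Because unnest_tokens() lower-cases by default, character names such as QUARK and Quark are counted together.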

Quark appears almost twice as frequently as any other character. The screenplay text included both cast identifiers and scene directions, such as Quark speaking to Rom: "QUARK (exploding) You worthless, tiny eared fool! Don't you know the First Rule of Acquisition?" With stop words removed, an MFW bar graph appears to be a fairly good proxy for the appearances of characters, although of course not every frequent term is a name: the 11th most frequent term is "Ferengi". (In the figure, the words are in lower case due to text processing.)

The next visualization is a bar graph of sentiment based on Saif Mohammad's NRC lexicon, as implemented in the syuzhet package. (The color scheme is admittedly wrong and should at least be re-arranged, but it is from Moonrise Kingdom, a close second favorite Anderson film.) Counting the frequencies of sentiment-scored words across the screenplay reveals a tendency toward more positive than negative terms overall. Given that this early episode sets up plot developments in subsequent episodes, it would seem to make sense that the most common positive emotion is anticipation.
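For readers who want to try the NRC scoring themselves, a minimal sketch with syuzhet's get_nrc_sentiment() follows. The two sample lines come from the excerpts quoted in this post; everything else (variable names, the plain barplot() call) is an assumption for illustration rather than the code behind the figure.

```r
library(syuzhet)

# Two lines borrowed from the excerpts in this post, standing in for the script
script_lines <- c(
  "Hey Jake, I've got a terrific surprise for you.",
  "You worthless, tiny eared fool!"
)

# One row per input string; columns are the eight NRC emotions
# plus "negative" and "positive"
nrc_scores <- get_nrc_sentiment(script_lines)

# Total count of scored words per category across the whole text
nrc_totals <- sort(colSums(nrc_scores), decreasing = TRUE)

barplot(nrc_totals, las = 2, ylab = "Count of scored words")
print(nrc_totals)
```

Note that a word can be scored under several NRC categories at once, so the category totals do not sum to the number of scored words.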

Lastly, I have an analysis of net sentiment (positive minus negative) over each Act of the episode. This bar graph is modeled after the Tidy Text Mining comparison of Jane Austen's novels, which scores sections of 80 lines of text. I broke up the comparison by Acts instead, starting with the preliminary material in the screenplay, then Act Zero (the "Teaser") through Act Five.
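A per-Act version of the Austen calculation could be sketched as follows. The Bing lexicon is an assumption here (it is the one Silge and Robinson use for their net sentiment plot), as are the heading pattern and the made-up sample lines.

```r
library(dplyr)
library(tidytext)

# Made-up cleaned script: section labels on their own lines, then text
cleaned <- c(
  "TEASER",
  "Hey Jake, I've got a terrific surprise for you.",
  "ACT ONE",
  "You worthless, tiny eared fool!"
)

# Number the sections: the Teaser becomes Act 0, Act One becomes 1, and so on.
# Any preliminary material before the Teaser would get -1.
script_df <- tibble(text = cleaned) %>%
  mutate(
    is_heading = grepl("^(TEASER|ACT [A-Z]+)$", text),
    act = cumsum(is_heading) - 1
  ) %>%
  filter(!is_heading)

# Net sentiment per Act: positive word count minus negative word count
net_by_act <- script_df %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(act) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    net = positive - negative
  )

print(net_by_act)
```

Passing net_by_act to ggplot(aes(factor(act), net)) + geom_col() would give a bar per Act, analogous to the bars per 80-line section in the Austen example.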

It's somewhat intriguing that the positive sentiment declines through the middle acts of the episode, then in Act Five ends on a positive sentiment higher than the opening Act.  

Later, I'll post the R code, along with an analysis of the entire series. I'm still in the process of automating the R code to do so. Once the visualizations are prepared, I'll post the analyses of changes in sentiment over time.

