Voices of the Women’s March
On January 21, 2017, following the inauguration of President Trump, upwards of three million Americans participated in the Women’s March in hopes of sending a “bold message” to the new administration and to the world that “women’s rights are human rights”¹. Sparked by Teresa Shook’s Facebook post calling for a “pro-women march” after the results of the 2016 Election were announced², the March quickly grew into an event more massive than Shook herself could comprehend. As the largest single-day protest in American history to date³, the March drew the attention of news outlets such as The New York Times⁴ and The Guardian⁵. Articles documenting Women’s March events in cities across the country soon followed, complete with pictures and videos of the demonstrations taking place in the streets.
Aside from in-person events, however, the Women’s March also attracted many participants online, who engaged with the movement through their Tweets. By applying text analysis methods to these Tweets, this project aims to enrich the existing narrative surrounding the Women’s March by answering questions such as:
- Where: In which states and countries did participants in the online conversation surrounding the Women’s March reside?
- Who: Who participated in the Twitter conversation surrounding the Women’s March? Which participants were the most influential?
- What: What issues were raised most frequently in the Twitter conversation surrounding the Women’s March? What categories did these issues belong to?
- How: How did Twitter users feel about the topics they discussed? Which sentiments were expressed most frequently?
- When: How did the conversation surrounding the Women’s March evolve over time?
In doing so, this project also hopes to:
- Offer readers who are curious about the Women’s March a more in-depth understanding of the online discourse surrounding the movement. In particular, I hope that readers interested in expanding their understanding of the March, beyond the details reported by news outlets about the in-person demonstrations, will find this project informative.
- Suggest tools and strategies that can be employed to analyze social media data. More specifically, I hope that this project serves as a useful example for readers who are curious about techniques for extracting preliminary insights from large TweetSets.
About the Data
To answer the questions posed above, I made use of Justin Littman and Soomin Park’s “Women’s March” TweetSet, which I accessed through the George Washington University Library Dataverse. However, the original TweetSet is large – it contains 7,275,228 Tweets spanning December 19, 2016 to January 23, 2017 – so running analyses on it can be computationally expensive. As such, this project focuses only on Tweets from January 12 to January 22. This amounts to 3,293,053 observations of 30 variables: a dataset that, while still fairly large, is far more manageable. Of these remaining observations, 2,648,290 are Retweets, while 644,763 are unique Tweets.
About the Tools
Most of this project was executed in R, so as to leverage the many existing packages which make importing, cleaning, analyzing and visualizing data convenient. Some key packages used in this project include:
- data.table for importing the large original TweetSet efficiently.
- dplyr and tidyr for data manipulation.
- stringr and textclean for string manipulation.
- igraph for social network analysis.
- tidytext and vader for sentiment analysis.
- kableExtra, ggplot2, radarchart and gganimate for creating visualizations.
Additionally, Tableau was used to create maps and the wordcloud package in Python was used to create word clouds. For reference, I have made my R and Python code available here.
About the Process
Since the reduced dataset for this project still contained over three million observations, importing and preparing the data for analysis remained a challenge. To tackle it, I employed a two-pronged approach: hashtag-based segmentation and data aggregation.
To reduce the time taken to import and analyze the data, I chose to group and segment the Tweets according to the types of hashtags used in them. To achieve this, I first extracted a list of hashtags that were used in at least 30 unique English Tweets. Next, I manually assigned labels to each hashtag in the list. Each hashtag was assigned at least one of five primary labels, namely “Location”, “Women’s March”, “Sociopolitical Issues”, “American Politics” and “Themes and Slogans”. Some hashtags were also assigned secondary and/or tertiary categories within their primary labels. Finally, I applied my manually assigned labels to the hashtags found in each Tweet in the dataset. By segmenting my dataset according to these labels and centering each part of my analysis around a single primary label, I reduced the size of the data required in each part of my analysis and thus the associated computational time.
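Although the project itself was executed in R, the extraction-and-labeling step can be sketched in Python. The tweets, the threshold, and the label table below are all invented for illustration (the project’s actual threshold was 30 unique English Tweets):

```python
import re
from collections import Counter

def extract_hashtags(text):
    """Return all lowercased hashtags in a tweet, without the '#'."""
    return [h.lower() for h in re.findall(r"#(\w+)", text)]

# Toy tweets standing in for the real dataset
tweets = [
    "Marching tomorrow! #WomensMarch #WhyIMarch",
    "See you in DC #womensmarch",
    "#WomensMarch #resist",
]

# Count how many tweets each hashtag appears in, keep frequent ones
counts = Counter(h for t in tweets for h in extract_hashtags(t))
min_count = 2  # stand-in for the project's threshold of 30
frequent = {h for h, c in counts.items() if c >= min_count}

# Manually assigned primary labels (toy subset of the five labels)
labels = {"womensmarch": "Women's March", "resist": "Themes and Slogans"}
labeled = {h: labels.get(h, "Unlabeled") for h in frequent}
print(labeled)  # {'womensmarch': "Women's March"}
```

Segmenting by these labels then amounts to filtering the dataset to rows whose hashtag carries a given primary label before running each part of the analysis.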
The second strategy was data aggregation. More specifically, I organized and condensed the dataset of over three million observations into three separate sub-datasets, each intended for use in select parts of my analysis. Here is a brief overview of the contents of each sub-dataset:
- df1 contains information on Tweets in English only. Each observation corresponds to a unique Tweet-hashtag combination. Variables include the total Tweet count, the mean Retweet count, the cleaned text of the Tweet, a hashtag used in the Tweet and the labels that the hashtag has been assigned.
- df2 contains information about interactions, as measured by Retweets, between Twitter users in the dataset. Each observation contains a unique pair of “from” and “to” Twitter handles, and a “weight” variable which records the number of times the “to” handle Retweeted content from the “from” handle’s account.
- df3 contains information on hashtags marked with the “Themes and Slogans” label only. Each observation corresponds to a unique hashtag-date combination, and includes a “retweet” variable which reports the mean Retweet count for that hashtag, on that date.
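As an illustration of the aggregation step, here is how a df2-style edge list might be built in Python from raw Retweet records. The handles and records are invented; in df2, “from” is the original author and “to” is the user who Retweeted them:

```python
from collections import Counter

# Toy Retweet records: (retweeting_user, original_author)
retweets = [
    ("alice", "womensmarch"),
    ("bob", "womensmarch"),
    ("alice", "womensmarch"),
    ("bob", "alice"),
]

# Aggregate into unique ("from", "to") pairs with a "weight" variable
# counting how many times "to" Retweeted content from "from"
weights = Counter((author, rt_user) for rt_user, author in retweets)
edge_list = [
    {"from": frm, "to": to, "weight": w} for (frm, to), w in weights.items()
]
```

Collapsing millions of individual Retweets into a weighted edge list like this is what makes the later network analysis tractable.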
Where: “Location” Hashtags
To answer the question regarding the location of the participants in the Women’s March Twitter conversation, I created two maps in Tableau. In each map, a location’s color indicates its total Tweet count as a percentage of the highest total Tweet count on that map. As expected, the location with the highest total Tweet count in the map of American states is Washington, D.C., the site of the flagship march, while the location with the highest total Tweet count in the map of countries is the United States.
Taking a closer look at the map of American states, I found that the states with the five highest total Tweet counts were California (97.6% relative to Washington D.C.), New York (95.2%), Texas (92.9%), Massachusetts (90.5%) and Illinois (88.1%). I was unsurprised to find California and New York in this top-five list, since both are Blue states with large populations. I was, however, surprised to see Texas, a Red state, on this list.
Turning to the map of countries, I found that the countries with the five highest total Tweet counts, excluding the U.S., were the United Kingdom (96.6% relative to the U.S.), Canada (93.1%), Australia (89.7%), France (86.2%) and New Zealand (82.8%). That four out of five of these countries are English-speaking could be due, in part, to the fact that this map was created using only English Tweets. That said, countries that are not predominantly English-speaking are still rather well-represented, which suggests that the Women’s March truly resonated on a global level despite cultural and linguistic differences.
Who: “Women’s March” Hashtags
To gain insight into the most influential voices in the Twitter conversation surrounding the Women’s March, I obtained estimates for the betweenness and closeness centralities of each Twitter user in my dataset. The two tables above show the users with the 10 highest betweenness and closeness centrality estimates respectively. Each row is color-coded to indicate the type of user, with red, green, blue and purple representing “Female”, “Organization”, “Male” and “Other” respectively. Verified users are also indicated by the use of bold text.
Betweenness centrality is an indication of a user’s ability to control the flow of information, which results from the occupation of a central position in the network⁶. From the table on the left, the two users with the highest betweenness centrality estimates are Women’s March and Moms Demand Action. Both of these are verified accounts, run by organizations that seek to drive social change by inspiring members of the public to speak up and take action. It is thus unsurprising that these two users occupy central positions in a network of Twitter users who are engaged in discussions about the Women’s March, a large-scale social movement. Interestingly, three unverified, personal accounts belonging to women who post content related to sociopolitical issues (handles: MeghanAnomaly, bulldoghill and MmeEmmeline) were included in this top-10 list. This could suggest that the Twitter conversation surrounding the Women’s March was driven not only by organizations and public figures, but by ordinary women who are passionate about sociopolitical issues as well.
Closeness centrality is an indication of how efficiently a user spreads information. From the table on the right, eight of the accounts included in the top-10 list of closeness centrality estimates are verified, most of which belong to individuals with previous claims to fame. This makes sense: as public figures, these Twitter users have likely amassed large followings, which in turn enable them to disseminate information efficiently.
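The project estimated these centralities with igraph in R; to make the closeness definition concrete, here is a minimal stdlib-Python version using BFS shortest paths on a toy retweet network (the handles and edges are invented):

```python
from collections import deque

# Toy directed graph: an edge u -> v means v Retweeted u
graph = {
    "womensmarch": {"alice", "bob"},
    "alice": {"bob"},
    "bob": {"carol"},
    "carol": set(),
}

def closeness(graph, source):
    """Closeness centrality: (number of reachable nodes) divided by the
    sum of shortest-path distances from source to those nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search for shortest paths
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    reachable = len(dist) - 1
    if reachable == 0:
        return 0.0
    return reachable / sum(d for d in dist.values() if d > 0)

print(closeness(graph, "womensmarch"))  # 0.75
```

A user one short hop away from everyone else scores near 1, matching the intuition that public figures with large followings spread information efficiently.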
What: “Sociopolitical Issues” Hashtags
One question that I was particularly interested to explore was: what sociopolitical issues did Twitter conversations surrounding the March raise? This question was motivated in part by the “unapologetically progressive” policy platform put forth by the organizers of the flagship march in D.C., which expressed support not only for issues that can be labeled as the “usual feminist suspects”, but also for issues ranging from racial profiling to healthcare⁷. To determine whether the Twitter conversations surrounding the March raised issues as diverse as those discussed in the official policy platform, I created two word clouds of hashtags related to various sociopolitical issues. Hashtag sizes correspond to total Tweet counts in the word cloud on the left, and to mean Retweet counts in the word cloud on the right. Hashtags are also color-coded by issue category in both word clouds.
As one might expect, some of the most frequently used hashtags in both word clouds, for instance “women” and “womensrights”, were those under the Gender and Sex category (colored light red). Unsurprisingly, the Abortion category (light green) was also relatively well-represented, with some hashtags making reference to the landmark Roe v. Wade decision⁸.
However, as is evident from the word cloud on the right, a number of hashtags from other categories also enjoyed high levels of popularity. In particular, hashtags from the Race and Religion category (light blue), for example “indigenouswomenrise” and “hijab”, recorded high mean Retweet counts. Incidentally, both of these hashtags highlight the concept of intersectional feminism by referencing racial and religious minorities, as well as women.
Furthermore, the word cloud on the left shows that the hashtags used in the Twitter conversation surrounding the March cover a wide variety of issues, including Disability Rights (brown – the default color for issues with lower counts), Immigration (dark blue), Gun Reform (brown) and Queer Rights (dark green). This is consistent with the organizers’ policy platform.
How: “American Politics” Hashtags
Given that the Women’s March was held during an emotionally charged period – a few weeks after the conclusion of the divisive and “far more negative” 2016 Presidential Election⁹, and the day after Trump’s inauguration – I was curious about the sentiments that Twitter users expressed towards various topics in American politics. As such, I applied three sentiment analysis lexicons, namely AFINN, VADER and NRC, to explore Tweets in my dataset which contained hashtags that I had categorized under “American Politics”.
To begin, I used the AFINN lexicon to analyze Tweets containing hashtags related to seven topics in American politics, which I titled “Barack Obama”, “Bernie Sanders”, “Democrat and Liberal”, “Donald Trump”, “Elections”, “Hillary Clinton” and “Republican”. To do so, I:
- Used the AFINN lexicon to assign each word in the relevant Tweets a value between -5 and 5, with negative values indicating negative sentiments and positive values indicating positive sentiments. Words not found in the AFINN lexicon were removed.
- Constructed populations of words for each of the seven topics. For each topic, I first created a list of the remaining words, then either kept only unique words from this list, or created copies of each word based on Tweet and Retweet counts.
- Drew 15,000 samples of 35, 75 and 93 words from the “Unique Words”, “Weighted by Tweets” and “Weighted by Retweets” populations respectively. I then calculated the mean of the AFINN values of the words in each sample.
- Created boxplots to visualize the distributions of the sample means. Within each row of boxplots, topics are sorted in ascending order, with the most negative topics on the left and the most positive topics on the right.
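The sampling procedure above can be sketched as follows, in Python rather than R, with an invented mini-lexicon (real AFINN scores run from -5 to 5) and smaller sample counts than the project’s 15,000:

```python
import random
from statistics import mean

# Toy AFINN-style lexicon; entries and scores are illustrative only
afinn = {"love": 3, "win": 4, "miss": -2, "cry": -2, "mourn": -2, "hope": 2}

# Words from one topic's Tweets; words absent from the lexicon are dropped
words = ["love", "miss", "cry", "hope", "win", "mourn", "love", "hope"]
population = [afinn[w] for w in words if w in afinn]

random.seed(42)
sample_means = [
    mean(random.choices(population, k=4))  # draw 4 scored words per sample
    for _ in range(1000)                   # the project drew 15,000 samples
]
# The distribution of sample_means is what each boxplot visualizes
```

Repeating this for the “Unique Words”, “Weighted by Tweets” and “Weighted by Retweets” populations yields one row of boxplots per weighting scheme.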
In the boxplots above, my attention was particularly drawn to the topics “Barack Obama” and “Hillary Clinton”. For both of these topics, a large difference was observed when sampling from the “Weighted by Tweets” and “Weighted by Retweets” populations, as opposed to sampling from the “Unique Words” population. For the “Barack Obama” topic, the boxplot generated by sampling from the “Unique Words” population was the second-most positive, but the boxplots generated by sampling from the other two populations were among the most negative. This could suggest that the most popular Tweets with hashtags related to Obama tended to contain negative words. In contrast, for the “Hillary Clinton” topic, the boxplot generated by sampling from the “Unique Words” population was the most negative, while the boxplots generated by sampling from the other two populations were relatively neutral compared to the other topics. This could suggest that, although many negative words were used in Tweets related to Clinton, the most popular Tweets about her contained a more balanced mix of positive and negative words. To gain more insight into the differences between the “Unique Words” and weighted populations for these two topics, I looked at the top negative words in Tweets related to Obama and the top positive words in Tweets related to Clinton.
As can be seen from the table on the left, the negative words that were used most frequently in Tweets related to Obama include “miss”, “cry” and “mourn”. Given that Trump’s inauguration also marked the end of Obama’s time in office, many Americans took to Twitter to express their gratitude towards Obama, often stating that he would be missed. Some even suggested that the handover from Obama to Trump had prompted them to “cry” and “mourn”. Taking context into account, it appears that in this instance, the popularity of negative Tweets related to Obama did not indicate that Twitter users disapproved of him, but rather that Twitter users were fond of him and were reluctant to see his presidency come to an end.
Turning to the table on the right, the positive word most frequently used in Tweets related to Clinton was “honor”. Use of the word in relation to Clinton and the Women’s March skyrocketed when the organizers of the D.C. event released a list of 27 women who “paved the way” for the March, but left Clinton off this list¹⁰. Other positive words used in popular Tweets related to Clinton include “win”, “popular” and “won”, often in reference to Clinton losing the 2016 Election despite winning the popular vote.
As detailed above, the AFINN-based exploration produced some interesting results. However, given that each word was scored individually without taking context into account, I wondered if the results obtained accurately reflected the sentiments expressed by each full-length Tweet. Thus, I ran a similar analysis using VADER – a lexicon which is “specifically attuned to sentiments expressed in social media”¹¹, and which can produce a compound score to reflect the overall valence of a sentence. To do so, I:
- Used the VADER lexicon to assign a compound score between -1 and 1 to each Tweet.
- Constructed populations of Tweets for each topic, with one population being “Unweighted” and the other two being weighted by Tweet and Retweet counts.
- Drew 15,000 samples of 100, 250 and 430 Tweets from the “Unweighted”, “Weighted by Tweets” and “Weighted by Retweets” populations respectively. I then calculated the mean compound VADER scores of the Tweets in each sample.
- Created boxplots to visualize the distributions of the sample means.
When looking at this set of boxplots, I was curious about how different the positions of topics across rows were, relative to the positions across rows in the set of boxplots created using AFINN. I found that some topics were sorted into similar positions based on their mean compound VADER scores. For instance, the “Democrat and Liberal” topic was among the most negative, and the “Bernie Sanders” topic was among the most positive, in all three rows of both the AFINN and VADER boxplots.
On the other hand, topics such as “Republican” and “Elections” were sorted into noticeably different positions across the two sets of boxplots. In particular, the “Elections” topic produced relatively more negative compound VADER scores than AFINN values in all three rows. This could suggest that a significant number of the positive words in Tweets related to the 2016 Elections were preceded by words such as “not” that negated their meaning.
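The kind of context sensitivity at play can be shown with a toy scorer. To be clear, this is not VADER’s actual algorithm, and the mini-lexicon is invented; it only illustrates how negation can flip a word-level score that an AFINN-style sum would miss:

```python
# Invented mini-lexicon of word valences
lexicon = {"good": 2, "rigged": -3, "win": 3}

def naive_score(tokens):
    """AFINN-style: sum individual word scores, ignoring context."""
    return sum(lexicon.get(t, 0) for t in tokens)

def negation_aware_score(tokens):
    """Flip the score of any lexicon word directly preceded by 'not'."""
    total = 0
    for i, t in enumerate(tokens):
        score = lexicon.get(t, 0)
        if i > 0 and tokens[i - 1] == "not":
            score = -score
        total += score
    return total

tokens = "she did not win".split()
print(naive_score(tokens))           # 3: "win" counts as positive
print(negation_aware_score(tokens))  # -3: the negation flips it
```

Applied across thousands of election-related Tweets, differences of this sort could plausibly drag a topic’s compound scores below its word-level AFINN values.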
After exploring the sentiments expressed by Twitter users towards the seven broad topics in American politics above, I chose to take a closer look at Tweets containing hashtags related to Trump. One key factor that motivated this decision was the size of this topic – more than 38,000 Tweets contained hashtags related to Trump, whereas the next largest topic, “Hillary Clinton”, included only approximately 4,500 Tweets.
To further explore how Twitter users felt about Trump, I first classified the hashtags from the “Donald Trump” topic into seven sub-topics, namely “Criticism”, “Echoing Rhetoric”, “Impeachment”, “Inauguration”, “John Lewis”, “Reclaiming Rhetoric” and “Related Figures”. Next, I used the NRC lexicon to assign one of eight emotions – anger, anticipation, disgust, fear, joy, sadness, surprise and trust – to each word in the relevant Tweets. Words not found in the NRC lexicon were removed. Within each sub-topic, I then calculated the proportion of words expressing each emotion, relative to the total number of remaining words. Lastly, I created a radar chart to visualize these proportions.
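The NRC step can be sketched in Python with an invented mini-lexicon and invented Tweet words; the real NRC lexicon maps words to the eight emotions named above:

```python
from collections import Counter

# Toy NRC-style word-to-emotion lexicon (illustrative entries only)
nrc = {
    "outrage": "anger", "disgust": "disgust", "fear": "fear",
    "hope": "anticipation", "happy": "joy", "trust": "trust",
}

# Words from one sub-topic's Tweets
subtopic_words = ["outrage", "fear", "outrage", "banana", "hope"]

# Map words to emotions, dropping words not found in the lexicon
emotions = [nrc[w] for w in subtopic_words if w in nrc]
counts = Counter(emotions)

# Proportion of words expressing each emotion, relative to matched words
proportions = {e: c / len(emotions) for e, c in counts.items()}
print(proportions)  # {'anger': 0.5, 'fear': 0.25, 'anticipation': 0.25}
```

Each sub-topic’s eight proportions then become one polygon in the radar chart.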
Comparing the polygons for each sub-topic in the radar chart, the sentiments expressed in Tweets discussing the subject of “Impeachment” were the most distinct from other Trump-related Tweets. More specifically, the proportions of words that expressed anger, disgust and fear were the highest in Tweets calling for Trump’s impeachment, as compared to Tweets in all other sub-topics. This makes sense: given that impeachment is the “ultimate check” on those occupying positions of power¹², those calling for the initiation of such a severe process against the then newly-elected President Trump must have been strongly opposed to him.
Another sub-topic that piqued my interest was “John Lewis”. Based on the radar chart, Tweets related to the Georgia representative were mostly positive, with the proportions of words expressing joy and trust being the highest of all the sub-topics. To put this outpouring of support for the congressman and “civil rights icon” into context, Trump had published Tweets on January 14 claiming that Lewis was “all talk” and “no action”, and that his district was in “horrible shambles”¹³. In response to Trump’s criticism, Twitter users quickly came to Lewis’ defense, publishing Tweets with hashtags such as “istandwithjohnlewis” and “defendthe5th”.
When: “Themes and Slogans” Hashtags
To explore how the Twitter conversation surrounding the Women’s March evolved over time, I created an animated bar chart showing how the cumulative Retweet count for each of the 10 most Retweeted hashtags in the “Themes and Slogans” category varied by day. A logarithmic scale was used on the x-axis for clarity.
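The data behind one bar of the animation reduces to a running total of daily Retweet counts. The figures below are invented, but the accumulation step is the same:

```python
from itertools import accumulate

# Invented daily Retweet counts for one hashtag
daily = {"2017-01-18": 0, "2017-01-19": 120, "2017-01-20": 800,
         "2017-01-21": 5000, "2017-01-22": 2600}

# Accumulate daily counts into a cumulative total per date
dates = sorted(daily)
cumulative = dict(zip(dates, accumulate(daily[d] for d in dates)))
# Each animation frame shows cumulative[date] for every hashtag; a
# log-scale axis keeps fast- and slow-growing hashtags comparable
```

Hashtags with steady daily counts produce smoothly growing bars, while a hashtag that spikes on the day of the March jumps visibly between frames.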
Some of the hashtags in the chart saw a rather steady increase in cumulative Retweet counts across the 10 days. Examples include three hashtags which capture the idea of resistance, namely “theresistance”, “resist” and “resistance”. That three hashtags expressing the same idea appeared in this chart of the 10 most Retweeted hashtags, and each saw steady increases in cumulative Retweet counts, suggests that the theme of resistance was key to the Twitter conversation surrounding the March not only in aggregate, but also as a persistent motif.
For other hashtags, the rate of increase in cumulative Retweet counts varied with time. For instance, the hashtag “callingallwomen” started out with approximately 100 Retweets on January 12, but saw relatively slow growth in popularity thereafter. On the other hand, the hashtag “icantkeepquiet” did not receive any Retweets until January 17, but grew exponentially in popularity following the day of the March. By looking into the origin of this phrase, I discovered a potential reason for this trend. As the refrain of the chorus in MILCK’s “Quiet”, the phrase likely rose to prominence after the song was performed by MILCK and two female a cappella groups at the March in D.C.¹⁴.
Although most reports on the Women’s March focused on live, in-person demonstrations, this project suggests that analyzing the online conversation surrounding the movement is worthwhile. For example, this project offered new insight into the Women’s March by identifying key voices in the conversation, highlighting related events such as the conflict between John Lewis and Donald Trump, and drawing attention to key themes and phrases, including “I Can’t Keep Quiet” and “resist”, that were used to rally supporters, among other findings. Looking beyond this project, given that social media increasingly occupies a key role in large-scale demonstrations – the Women’s March, for instance, was sparked by a Facebook post – analyses using methods similar to those described above are likely to be increasingly significant in crafting a well-rounded picture of modern social movements.
- Salazar, Alejandra Maria. “Organizers Hope Women’s March On Washington Inspires, Evolves.” NPR, NPR, 21 Dec. 2016, www.npr.org/2016/12/21/506299560/womens-march-on-washington-aims-to-be-more-than-protest-but-will-it.
- Stein, Perry. “The Woman Who Started the Women’s March with a Facebook Post Reflects: ‘It Was Mind-Boggling’.” The Washington Post, WP Company, 28 Mar. 2019, www.washingtonpost.com/news/local/wp/2017/01/31/the-woman-who-started-the-womens-march-with-a-facebook-post-reflects-it-was-mind-boggling/.
- Broomfield, Matt. “2 Charts Which Show Just How Huge the Women’s Marches against Trump Were.” The Independent, Independent Digital News and Media, 23 Jan. 2017, www.independent.co.uk/news/world/americas/womens-march-anti-donald-trump-womens-rights-largest-protest-demonstration-us-history-political-scientists-a7541081.html.
- Schmidt, Kiersten, and Sarah Almukhtar. “Where Women’s Marches Are Happening Around the World.” The New York Times, The New York Times, 17 Jan. 2017, www.nytimes.com/interactive/2017/01/17/us/womens-march.html.
- Smith, David. “Women’s March on Washington Overshadows Trump’s First Full Day in Office.” The Guardian, Guardian News and Media, 22 Jan. 2017, www.theguardian.com/us-news/2017/jan/21/donald-trump-first-24-hours-global-protests-dark-speech-healthcare.
- “Capturing Value with Social Media Network Analytics.” Creating Value with Social Media Analytics: Managing, Aligning, and Mining Social Media Text, Networks, Actions, Location, Apps, Hyperlinks, Multimedia, & Search Engines Data, by Gohar F. Khan, CreateSpace, 2018, pp. 175-216.
- Cauterucci, Christina. “The Women’s March on Washington Has Released an Unapologetically Progressive Platform.” Slate Magazine, Slate, 12 Jan. 2017, slate.com/human-interest/2017/01/the-womens-march-on-washington-has-released-its-platform-and-it-is-unapologetically-progressive.html.
- “Roe v. Wade, 410 U.S. 113 (1973).” Justia Law, supreme.justia.com/cases/federal/us/410/113/.
- “Voters’ Evaluations of the 2016 Campaign.” Pew Research Center – U.S. Politics & Policy, Pew Research Center, 30 May 2020, www.pewresearch.org/politics/2016/11/21/voters-evaluations-of-the-campaign/#campaign-viewed-as-heavy-on-negative-campaigning-light-on-issues.
- Bellstrom, Kristen. “This Is Why Hillary Clinton Supporters Are Upset About the Women’s March.” Fortune, Fortune, 20 Jan. 2017, fortune.com/2017/01/20/hillary-clinton-womens-march/.
- Roehrick, Katherine. “Vader v0.2.1.” Vader Package | R Documentation, www.rdocumentation.org/packages/vader/versions/0.2.1.
- Williams, Pete, et al. “What Is Impeachment and How Does It Work? 10 Facts to Know.” NBCNews.com, NBCUniversal News Group, 19 Dec. 2019, www.nbcnews.com/politics/congress/what-impeachment-how-does-it-work-10-facts-know-n1072451.
- Scott, Eugene. “Trump Rips ‘All Talk,’ ‘No Action’ Civil Rights Icon Lewis.” CNN, Cable News Network, 15 Jan. 2017, edition.cnn.com/2017/01/14/politics/john-lewis-donald-trump/index.html.
- Blair, Elizabeth. “A Song Called ‘Quiet’ Struck A Chord With Women. Two Years Later, It’s Still Ringing.” NPR, NPR, 14 Jan. 2019, www.npr.org/2019/01/14/683694934/milck-quiet-womens-march-american-anthem.