
Strong Women Through the Lens of The New York Times: Research on gender equality and representation in historic print

This is research on gender representation in The New York Times, conducted with NLP tools.



Sasha Prokhorova


Photo by Giacomo Ferroni on Unsplash

The goal of this project is to investigate women’s representation in The New York Times throughout the past 70 years by means of sentiment analysis, frequent term visualization and topic modeling.

For this investigation I scraped The New York Times data through the Archive API of The New York Times Developer Portal. First, you have to obtain the API key here. It’s free! The NYT just likes the concept of a regulated flood gate. Since this type of API is meant for bulk data collection, it doesn’t allow for effective prior filtering. Please follow the instructions in the Jupyter notebooks posted on GitHub if you wish to re-create the experiment. If you prefer a video version of this post, you can access it here.
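Because the key ends up pasted into request URLs, it is worth keeping it out of the code itself. Here is a minimal sketch, assuming you export your key in an environment variable named NYT_API_KEY (the variable name is my choice, not part of the original write-up):

import os

# Read the Archive API key from the environment instead of hard-coding it.
API_KEY = os.environ.get('NYT_API_KEY')
if API_KEY is None:
    raise RuntimeError('Please set the NYT_API_KEY environment variable.')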

Analysis pipeline. Image by author. Icons by Freepik.

All the instructions, code notebooks and results can also be accessed through my project repository on GitHub for smoother replication.

Data Collection via Archive API and Topic Modeling with SpaCy and Gensim

Before proceeding any further with my analysis, I decided to run topic modeling on the bulk of articles from The New York Times between January 2019 and the present day (September 2020), analyzing the headlines, keywords and lead paragraphs. My goal was to distinguish the most prevalent issues and enduring topics in order to make sure that my research goes along the lines of the NYT mission statement and that I’m not misrepresenting their journalism style.

The data collection blueprint for this part of the analysis was inspired by a very informative tutorial by Brienna Herold.

Let’s import the necessary tools and libraries:

import os
import pandas as pd
import requests
import json
import time
import dateutil.parser  # dateutil.parser.parse() is used in parse_response() below
import datetime
from dateutil.relativedelta import relativedelta
import glob

Determine the timeframe of the analysis:

end = datetime.date.today()
start = datetime.date(2019, 1, 1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Breaking the data into monthly groups:

months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]
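The result is a list of [year, month] string pairs, which is why the helper functions below index into each date with date[0] and date[1]. For example:

print(months_in_range[:3])
# [['2019', '1'], ['2019', '2'], ['2019', '3']]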

The following set of helper functions (see the tutorial) extracts the NYT data through the API and saves it into monthly CSV files:

def send_request(date):
    '''Sends a request to the NYT Archive API for given date.'''
    base_url = 'https://api.nytimes.com/svc/archive/v1/'
    url = base_url + date[0] + '/' + date[1] + '.json?api-key=' + 'YOUR_API_KEY'  # replace with your own key, or read it from an environment variable as shown earlier
    try:
        response = requests.get(url, verify=False).json()
    except Exception:
        return None
    time.sleep(6)
    return response


def is_valid(article, date):
    '''An article is only worth checking if it is in range, and has a headline.'''
    is_in_range = date > start and date < end
    has_headline = type(article['headline']) == dict and 'main' in article['headline'].keys()
    return is_in_range and has_headline


def parse_response(response):
    '''Parses and returns response as pandas data frame.'''
    data = {'headline': [],
            'date': [],
            'doc_type': [],
            'material_type': [],
            'section': [],
            'keywords': [],
            'lead_paragraph': []}

    articles = response['response']['docs']
    for article in articles:  # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        if is_valid(article, date):
            data['date'].append(date)
            data['headline'].append(article['headline']['main'])
            if 'section_name' in article:
                data['section'].append(article['section_name'])
            else:
                data['section'].append(None)
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article:
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
            if 'lead_paragraph' in article:
                data['lead_paragraph'].append(article['lead_paragraph'])
            else:
                data['lead_paragraph'].append(None)
    return pd.DataFrame(data)


def get_data(dates):
    '''Sends and parses request/response to/from NYT Archive API for given dates.'''
    total = 0
    print('Date range: ' + str(dates[0]) + ' to ' + str(dates[-1]))
    if not os.path.exists('headlines'):
        os.mkdir('headlines')
    for date in dates:
        print('Working on ' + str(date) + '...')
        csv_path = 'headlines/' + date[0] + '-' + date[1] + '.csv'
        if not os.path.exists(csv_path):  # If we don't already have this month
            response = send_request(date)
            if response is not None:
                df = parse_response(response)
                total += len(df)
                df.to_csv(csv_path, index=False)
                print('Saving ' + csv_path + '...')
    print('Number of articles collected: ' + str(total))

Let’s take a closer look at the helper functions:

  • send_request(date) sends a request to the archive for the given date, converts the response to JSON and returns it.
  • is_valid(article, date) checks whether an article falls within the requested timeframe and confirms the presence of a headline, returning the combined is_in_range and has_headline verdict.
  • parse_response(response) transforms the response into a DataFrame. data is a dictionary that contains the columns of our DataFrame; they are empty at first, but get appended to by this function. The function returns the final DataFrame.
  • get_data(dates), where dates correspond to the range specified by the user, utilizes the send_request() and parse_response() functions. It saves headlines and other info to .csv files, one file per month per year within the range, as shown in the call below.
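Putting it together, a single call collects everything in the chosen range (a minimal driver, using the months_in_range list built earlier):

# Pull one month at a time and write each month to headlines/YYYY-M.csv
get_data(months_in_range)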

Once we get our monthly CSV files for each year within the range, we can concatenate them for further use. The glob library is an excellent tool for that. Make sure your path to the headlines folder matches the path in your code; I used a relative path rather than an absolute one.

# get data file names
path = "headlines/"
filenames = glob.glob(path + "*.csv")
print(filenames)

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

big_frame is a DataFrame that contains all the files from the headlines folder concatenated into one frame. This is the expected output:

135,954 articles and their data were pulled.

Now, we are ready for topic modeling. The purpose of the analysis below is to run topic modeling on headlines, keywords and lead paragraphs of The New York Times articles for the past year and a half. I want to make sure that headlines are consistent with the introductory paragraphs and keywords.

Importing tools and libraries:

from collections import defaultdict
import re, string
from gensim import corpora # this is the topic modeling library
from gensim.models import LdaModel

Let’s take a closer look:

  • defaultdict is useful for counting unique words and their occurrences.
  • re and string are useful when we’re looking for a match in the text, either an exact or a fuzzy one. Regular expressions are going to appear often if you’re interested in text analysis; here’s a handy tool to practice them.
  • gensim is the library we are going to use for topic modeling. It is user-friendly once you get the necessary dependencies sorted out.

Since we are looking at three different columns of the DataFrame, three separate corpora will be instantiated: one that holds the headlines, one for the keywords and one for the lead paragraphs. This serves as a sanity check to make sure that headlines, keywords and lead paragraphs are consistent with the articles’ content.

big_frame_corpus_headline = big_frame['headline']
big_frame_corpus_keywords = big_frame['keywords']
big_frame_corpus_lead = big_frame['lead_paragraph']

In order for the text data to be usable, it needs to be pre-processed. In general, the pipeline looks like this: lowercasing and punctuation removal, stemming, lemmatization and tokenization, then stop-word removal and vectorization. The first four operations are shown as a cluster because their order often depends on the data, and in certain cases it might make sense to swap them around.

Text pre-processing steps. Image by author. Icons by Freepik

Let’s talk about pre-processing.

from nltk.corpus import stopwords
# you may need nltk.download('stopwords') the first time you run this

headlines = [re.sub(r'[^\w\s]', '', str(item)) for item in big_frame_corpus_headline]
keywords = [re.sub(r'[^\w\s]', '', str(item)) for item in big_frame_corpus_keywords]
lead = [re.sub(r'[^\w\s]', '', str(item)) for item in big_frame_corpus_lead]

stopwords = set(stopwords.words('english'))
# please note: you can append to this list of pre-defined stopwords if needed

More pre-processing:

headline_texts = [[word for word in document.lower().split() if word not in stopwords] for document in headlines]
keywords_texts = [[word for word in document.lower().split() if word not in stopwords] for document in keywords]
lead_texts = [[word for word in document.lower().split() if word not in stopwords] for document in lead]
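The diagram above also includes stemming and lemmatization, which this particular pipeline skips. If you want to add that step, here is a minimal sketch using NLTK's WordNetLemmatizer (my addition for illustration, not part of the original notebooks):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # needed once for the lemmatizer's dictionary

lemmatizer = WordNetLemmatizer()
# Lemmatize each token in the already-tokenized headlines, e.g. 'doctors' -> 'doctor';
# the same pattern can be applied to keywords_texts and lead_texts.
headline_texts = [[lemmatizer.lemmatize(token) for token in document] for document in headline_texts]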

Removing infrequent words (those that appear only once across the corpora):

frequency = defaultdict(int)

for headline_text in headline_texts:
    for token in headline_text:
        frequency[token] += 1
for keywords_text in keywords_texts:
    for token in keywords_text:
        frequency[token] += 1
for lead_text in lead_texts:
    for token in lead_text:
        frequency[token] += 1

headline_texts = [[token for token in headline_text if frequency[token] > 1] for headline_text in headline_texts]
keywords_texts = [[token for token in keywords_text if frequency[token] > 1] for keywords_text in keywords_texts]
lead_texts = [[token for token in lead_text if frequency[token] > 1] for lead_text in lead_texts]

dictionary_headline = corpora.Dictionary(headline_texts)
dictionary_keywords = corpora.Dictionary(keywords_texts)
dictionary_lead = corpora.Dictionary(lead_texts)

# Build each bag-of-words corpus with its matching dictionary
headline_corpus = [dictionary_headline.doc2bow(headline_text) for headline_text in headline_texts]
keywords_corpus = [dictionary_keywords.doc2bow(keywords_text) for keywords_text in keywords_texts]
lead_corpus = [dictionary_lead.doc2bow(lead_text) for lead_text in lead_texts]

Let’s decide on the optimal number of topics for our case:

NUM_TOPICS = 5
ldamodel_headlines = LdaModel(headline_corpus, num_topics=NUM_TOPICS, id2word=dictionary_headline, passes=12)
ldamodel_keywords = LdaModel(keywords_corpus, num_topics=NUM_TOPICS, id2word=dictionary_keywords, passes=12)
ldamodel_lead = LdaModel(lead_corpus, num_topics=NUM_TOPICS, id2word=dictionary_lead, passes=12)
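There is no single right value for NUM_TOPICS; I settled on 5. If you want a more quantitative check before looking at the resulting topics, gensim's CoherenceModel can score a model for a given topic count. A small sketch of how one might compare candidates (my addition, not part of the original notebooks):

from gensim.models import CoherenceModel

# Compute the c_v coherence of the headline model; higher generally means more interpretable topics.
coherence = CoherenceModel(model=ldamodel_headlines,
                           texts=headline_texts,
                           dictionary=dictionary_headline,
                           coherence='c_v').get_coherence()
print('Headline model coherence: ' + str(coherence))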

Here’s the result:

topics_headlines = ldamodel_headlines.show_topics()
for topic_headlines in topics_headlines:
    print(topic_headlines)

topics_keywords = ldamodel_keywords.show_topics()
for topic_keywords in topics_keywords:
    print(topic_keywords)

topics_lead = ldamodel_lead.show_topics()
for topic_lead in topics_lead:
    print(topic_lead)

Let’s organize those into dataframes:

word_dict_headlines = {}
for i in range(NUM_TOPICS):
    words_headlines = ldamodel_headlines.show_topic(i, topn=20)
    word_dict_headlines['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words_headlines]
pd.DataFrame(word_dict_headlines)

word_dict_keywords = {}
for i in range(NUM_TOPICS):
    words_keywords = ldamodel_keywords.show_topic(i, topn=20)
    word_dict_keywords['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words_keywords]
pd.DataFrame(word_dict_keywords)

word_dict_lead = {}
for i in range(NUM_TOPICS):
    words_lead = ldamodel_lead.show_topic(i, topn=20)
    word_dict_lead['Topic # ' + '{:02d}'.format(i + 1)] = [word for word, _ in words_lead]
pd.DataFrame(word_dict_lead)

Remember: even though the algorithm can sort the words into the corresponding topics, it’s still up to a human to interpret and label them.

Topic modeling results. Image by author. Icons by Freepik.

A variety of topics showed up, all of them serious and important issues in our society. In this particular research, we are going to investigate gender representation.

1950 — Present: Data Collection and Keyword Analysis.

We will use the previously mentioned helper functions to get the data from January 1st, 1950 to the present day (September 2020). I recommend using smaller increments of time, e.g. a decade, to prevent the API from timing out.
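Here is a sketch of what that decade-by-decade collection could look like, reusing the month-list construction from above and the get_data() helper (the loop structure is my suggestion, not the original notebook's code):

# Widen the validity window used by is_valid() to cover the whole period.
start = datetime.date(1950, 1, 1)
end = datetime.date.today()

# Request the archive one decade at a time to keep each run manageable.
for decade_start_year in range(1950, end.year + 1, 10):
    decade_start = datetime.date(decade_start_year, 1, 1)
    decade_end = min(datetime.date(decade_start_year + 9, 12, 31), end)
    months = [x.split(' ') for x in
              pd.date_range(decade_start, decade_end, freq='MS').strftime("%Y %-m").tolist()]
    get_data(months)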

The data will be collected into monthly CSV files in the headlines folder and then concatenated into one dataframe using the methods illustrated above. Once you get the dataframe that you worked so hard to get, I suggest pickling it for further use:

import pickle

# frame here is the concatenated DataFrame built from the monthly CSV files
with open('frame_all.pickle', 'wb') as to_write:
    pickle.dump(frame, to_write)

Here’s how you extract the pickled files:

with open('frame_all.pickle', 'rb') as read_file:
    df = pickle.load(read_file)
Total articles found vs. relevant articles for the timeframe of 70 years. Image by author. Template by Slidesgo.

Let’s convert the date column into the datetime format so that the articles can be sorted chronologically. We will also be removing nulls and duplicates.

df['date'] = pd.to_datetime(df['date'])
df = df[df['headline'].notna()].drop_duplicates().sort_values(by='date')
df.dropna(axis=0, subset=['keywords'], inplace = True)

Examining the relevant keywords:

import ast

# Turn the stringified keyword lists back into Python lists, then count every keyword across all articles
df.keywords = df.keywords.astype(str).str.lower().transform(ast.literal_eval)
keyword_counts = pd.Series(x for l in df['keywords'] for x in l).value_counts(ascending=False)
len(keyword_counts)

58,298 unique keywords.

I used my personal judgement to determine which keywords are relevant to the topic of strong women and their representation: politics, social activism, entrepreneurship, science, technology, military achievement, athletic breakthroughs and female leadership. This analysis is not in any way meant to exclude any groups or individuals from the notion of strong women. I am open to additions and suggestions, so please don’t hesitate to reach out if you think there’s something that can be done to make this project more comprehensive. A quick reminder: if you find the code in the cells challenging to copy due to formatting issues, please refer to the code and instructions in my project repository.

project_keywords1 = [x for x in keyword_counts.keys() if 'women in politics' in x
or 'businesswoman' in x
or 'female executive' in x
or 'female leader' in x
or 'female leadership' in x
or 'successful woman' in x
or 'female entrepreneur' in x
or 'woman entrepreneur' in x
or 'women in tech' in x
or 'female technology' in x
or 'female startup' in x
or 'female founder' in x ]

Above is a sample query for relevant keywords. A more detailed explanation on relevant keyword search and article headline extraction can be found in this notebook.
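As a side note, the same filter can be written more compactly with a list of terms and any(), which makes it easier to extend (an equivalent sketch, not the notebook's original code):

relevant_terms = ['women in politics', 'businesswoman', 'female executive', 'female leader',
                  'female leadership', 'successful woman', 'female entrepreneur',
                  'woman entrepreneur', 'women in tech', 'female technology',
                  'female startup', 'female founder']

project_keywords1 = [x for x in keyword_counts.keys()
                     if any(term in x for term in relevant_terms)]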

Now, let’s examine the headlines that have to do with women in politics.

First, we normalize them by lowercasing:

df['headline'] = df['headline'].astype(str).str.lower()

Examine the headlines that contain words like woman, politics and power:

# str.contains() accepts a regex, so 'women|woman|female' matches headlines containing any of the three words
wip_headlines = df[df['headline'].str.contains('women|woman|female') & df['headline'].str.contains('politics|power|election')]

‘wip’ stands for ‘women in politics’.

Our search returned only 185 headlines. Let’s look at the keywords to supplement that.

# Join each article's keyword list into a single searchable string
df['keywords_joined'] = df.keywords.apply(', '.join)
df['keywords_joined'] = df['keywords_joined'].astype(str)
wip_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*politics)', regex=True)]
Women in politics: resulting DataFrame

The DataFrame above contains 2,579 articles based on relevant keywords. We will concatenate the keyword-based and headline-based dataframes to obtain a more comprehensive one:

wip_df = pd.concat([wip_headlines, wip_keywords], axis=0, sort = True)
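Articles that match on both the headline and the keywords will appear twice after this concatenation, so it may be worth deduplicating the result (a small optional step, not in the original code):

# Keep one row per article headline
wip_df = wip_df.drop_duplicates(subset=['headline'])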

Using the same techniques, we will be able to obtain more data about women in the military, science, sports, entrepreneurship and other forms of achievement. For example, if we were to look for the articles about feminism:

feminist_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*feminist)',regex=True)]
Articles based on the keyword search: feminism

The #MeToo movement:

metoo_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*metoo)(?=.*movement)',regex=True)]

Regular expressions and fuzzy matching allow for nearly endless possibilities. You can see more queries in this notebook.
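For instance, a hypothetical query along the same lines for women in science:

science_keywords = df[df['keywords_joined'].str.contains(r'(?=.*women)(?=.*science)', regex=True)]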

The final DataFrame, after all the querying is complete, will further be referred to as project_df in the code notebooks on GitHub and in this article.

Let’s look at the article distribution over the years:

import matplotlib.pyplot as plt

ax = df.groupby(df.date.dt.year)['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Total Published Every Year', fontdict={'fontsize': 24, 'fontweight': 'medium'})
plt.show()

ax = project_df.groupby('year')['headline'].count().plot(kind='bar', figsize=(20, 6))
ax.set(xlabel='Year', ylabel='Number of Articles')
ax.yaxis.set_tick_params(labelsize='large')
ax.xaxis.label.set_size(18)
ax.yaxis.label.set_size(18)
ax.set_title('Articles About Strong Women (based on relevant keywords) Published Every Year', \
             fontdict={'fontsize': 20, 'fontweight': 'medium'})
plt.show()

If we were to superimpose these two graphs, the blue one nearly disappears:

Relevant publications, based on keywords and headlines, are almost invisible once compared to the bulk of articles published over time.

The coverage of women’s issues appears to be modest. I believe this may be because the keywords weren’t always coded properly: some were either missing or misleading, making it more difficult for a researcher to find the relevant material through the Archive API.

Throughout my analysis, I made an interesting discovery. In the early 1950s, according to the analysis of n-grams, there were many mentions of professional opportunities for women. Many of them graduated from universities as doctors and went on to join the navy. I attribute this spike of publicity to the aftermath of World War II: women were encouraged to join the workforce in order to supplement the military effort. Remember the Rosie the Riveter poster?

These newspaper clippings were obtained through the TimesMachine, the NYT archive of publications. Image was created by author using those clippings.

Even though it’s heart-warming and uplifting to see those kinds of opportunities available to women during the times when not too many doors were open for them, I really wish it wasn’t due to warfare.

N-grams, WordCloud and Sentiment Analysis.

To explore overall term frequencies in headlines:

from sklearn.feature_extraction.text import CountVectorizer

# corpus here is the collection of headlines we want n-gram frequencies for
corpus = project_df['headline'].astype(str).tolist()

word_vectorizer = CountVectorizer(ngram_range=(1, 3), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(corpus)
frequencies = sum(sparse_matrix).toarray()[0]
# note: in newer scikit-learn versions, use get_feature_names_out() instead
ngram_df_project = pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])
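To peek at the most frequent terms, sort the resulting frame (a quick usage example):

# Top 20 uni-, bi- and tri-grams by frequency
print(ngram_df_project.sort_values(by='frequency', ascending=False).head(20))

Next, let’s visualize the most frequent headline terms with a word cloud: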
from wordcloud import WordCloud, STOPWORDS

all_headlines = ' '.join(project_df['headline'].str.lower())

stopwords = STOPWORDS
stopwords.add('will')
# Note: you can append your own stopwords to the existing ones.

wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_words=1000, width=480, height=480).\
    generate(all_headlines)

plt.figure(figsize=(20, 10))
plt.imshow(wordcloud)
plt.axis("off");
WordCloud created by the code above: most frequent terms are displayed in larger font.

We can also create wordclouds based on features such as various timeframes, or specific keywords. Refer to the notebook for more visuals.
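For example, here is a sketch of a word cloud restricted to a single decade (assuming project_df has the year column used elsewhere in this analysis):

# Headlines from the 1970s only
seventies = project_df[(project_df['year'] >= 1970) & (project_df['year'] <= 1979)]
seventies_headlines = ' '.join(seventies['headline'].str.lower())

wordcloud_70s = WordCloud(stopwords=stopwords, background_color="white",
                          max_words=1000, width=480, height=480).generate(seventies_headlines)
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud_70s)
plt.axis("off");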

Let’s talk about sentiment analysis. We are going to analyze the sentiment associated with the headlines using NLTK’s VADER module. Can we actually pick up on how the journalists felt about an issue while writing an article?

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
results = []
for line in project_df.headline:
    pol_score = sia.polarity_scores(line)
    pol_score['headline'] = line
    results.append(pol_score)

print(results[:3])

Output:

[{'neg': 0.0, 'neu': 0.845, 'pos': 0.155, 'compound': 0.296, 'headline': 'women doctors join navy; seventeen end their training and are ordered to duty'}, {'neg': 0.18, 'neu': 0.691, 'pos': 0.129, 'compound': -0.2732, 'headline': 'n.y.u. to graduate 21 women doctors; war gave them, as others, an opportunity to enter a medical school'}, {'neg': 0.159, 'neu': 0.725, 'pos': 0.116, 'compound': -0.1531, 'headline': 'greets women doctors; dean says new york medical college has no curbs'}]

Sentiment as a dataframe:

sentiment_df = pd.DataFrame.from_records(results)
dates = project_df['year']
sentiment_df = pd.merge(sentiment_df, dates, left_index=True, right_index=True)

The code above allows us to have a timeline for our sentiment. To simplify the sentiment analysis, we are going to create some new categories for positive, negative and neutral.

sentiment_df['label'] = 0
sentiment_df.loc[sentiment_df['compound'] > 0.2, 'label'] = 1
sentiment_df.loc[sentiment_df['compound'] < -0.2, 'label'] = -1
sentiment_df.head()

To visualize overall sentiment distribution:

sentiment_df.label.value_counts(normalize=True) * 100
Image by author. Template by Slidesgo.
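If you prefer to plot this distribution directly rather than in a slide template, a quick bar chart does the job (a minimal sketch using the matplotlib import from earlier):

(sentiment_df.label.value_counts(normalize=True) * 100).plot(kind='bar')
plt.xlabel('Sentiment label (-1 negative, 0 neutral, 1 positive)')
plt.ylabel('Share of headlines, %')
plt.show()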

To visualize sentiment over time:

import seaborn as sns

sns.lineplot(x="year", y="label", data=sentiment_df)
plt.show()
Sentiment is fluctuating due to the problem complexity

As you can see, the sentiment fluctuates. It’s not at all unexpected, since women’s issues often encompass heavy subject matter, such as violence and abuse. In these cases, we expect the sentiment to be skewed towards the negative end of the spectrum.

I created a Tableau Dashboard where viewers can interact with the visualization. It’s available through my Tableau Public profile. This dashboard illustrates the keyword distribution over the decades.

Image by author.

Conclusions

The New York Times has visibly improved on equal gender representation over the years. If I were to make a suggestion, I would recommend expanding the keyword listings: for the older parts of the archive, more comprehensive and robust keywords would make searching through the Archive API considerably easier.

It is important to keep showcasing female leadership until it becomes just leadership. Imagine a world where the adjective “female” is no longer needed to describe achievement because it has become redundant. Imagine a world where there are no “female doctors” or “female engineers”: just doctors and engineers. Founders and politicians. Entrepreneurs, scientists and trailblazers. Our goal as a society is to develop a solid mental model of these titles being held by diverse groups of people. Together, we can achieve that by constantly reminding ourselves and the society around us that no gender or nationality should be barred from those opportunities.


