Envision a nurse.
Now envision a welder.
Was there any change in gender?
Words carry many biases, embedded by the way we use them every day. U.S. adults spend more than half their day consuming media, so it is paramount that we recognize which gender labels are being perpetuated in the news and on the internet. This visualization looks at 3 billion words used in English news articles shared through Google News in 2019, and plots how closely each word is associated with the male or female gender.
This visualization looks at how commonly certain words are used with gendered pronouns such as “he” and “she” in the news. This was done by taking 3 billion words and analyzing 300 billion of their usage associations in articles shared through Google News (found here). Using Word2Vec, these 3 billion words were converted into high-dimensional vectors based on these associations, which makes word associations measurable.
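To illustrate what "measurable" means here, the following is a minimal sketch, not the code behind this visualization. It assumes cosine similarity as the measure of association, and the three-dimensional vectors and the names `toy_vectors`, `cosine_similarity`, and `gender_lean` are hypothetical values and helpers made up for the example; real Word2Vec vectors have far more dimensions.

```python
import math

# Hypothetical toy vectors for illustration only; these are not values
# from the Google News model.
toy_vectors = {
    "he": [1.0, 0.0, 0.2],
    "she": [-1.0, 0.0, 0.2],
    "nurse": [-0.6, 0.5, 0.3],
    "welder": [0.7, 0.4, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def gender_lean(word, vectors=toy_vectors):
    """Similarity to "she" minus similarity to "he".

    Positive values lean toward "she", negative values toward "he".
    """
    vec = vectors[word]
    return cosine_similarity(vec, vectors["she"]) - cosine_similarity(vec, vectors["he"])


for word in ("nurse", "welder"):
    print(word, round(gender_lean(word), 3))
```

With these made-up vectors, "nurse" gets a positive score and "welder" a negative one, which is the kind of difference the visualization plots.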
To visualize the gender association, the word vectors are reduced to two-dimensional space by a t-SNE algorithm and projected, using Pandas, onto the line between “he” and “she”. This allows any word to be placed on a gender spectrum. Each selected category is built by finding the category word in high-dimensional space and gathering the 500 words closest to it, which are then plotted on a beeswarm graph in D3.
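The projection and neighbor-gathering steps can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the original pipeline: the 2-D coordinates stand in for t-SNE output and are invented, Euclidean distance is assumed for "closest", and `points_2d`, `position_on_axis`, and `nearest_words` are hypothetical names. Plain Python is used instead of Pandas to keep the example self-contained.

```python
import math

# Hypothetical 2-D coordinates standing in for the output of a t-SNE step.
points_2d = {
    "he": (0.0, 0.0),
    "she": (4.0, 0.0),
    "nurse": (3.0, 1.5),
    "welder": (0.5, -2.0),
    "teacher": (2.5, 0.5),
}


def position_on_axis(word, points=points_2d, start="he", end="she"):
    """Project a word onto the line from `start` to `end`.

    Returns 0.0 at `start`, 1.0 at `end`, and values in between for
    words that fall between the two anchors.
    """
    sx, sy = points[start]
    ex, ey = points[end]
    px, py = points[word]
    axis = (ex - sx, ey - sy)
    offset = (px - sx, py - sy)
    axis_len_sq = axis[0] ** 2 + axis[1] ** 2
    return (offset[0] * axis[0] + offset[1] * axis[1]) / axis_len_sq


def nearest_words(word, points=points_2d, k=2):
    """Return the k words closest to `word` by Euclidean distance.

    The text gathers neighbors in the original high-dimensional space;
    math.dist accepts points of any dimension, so the same function
    would apply there.
    """
    origin = points[word]
    others = [w for w in points if w != word]
    return sorted(others, key=lambda w: math.dist(origin, points[w]))[:k]


for word in ("nurse", "welder", "teacher"):
    print(word, position_on_axis(word))
print(nearest_words("nurse", k=2))
```

In the visualization itself, the neighbor count would be 500 rather than 2, and the resulting positions would feed the beeswarm plot.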
The way this visualization was made sends a message as important as the one about gender labels in the news. The process used here is a common preprocessing step in text-based machine learning. Unfortunately, models learn to capture the biases present in the real-life data on which we train them. When machine learning models are trained on embeddings like these, a recruiter searching for "programmers" may find resumes from women left at the bottom of the pile. Gender bias needs to be mitigated in these models, or it will continue to be perpetuated.