Why is AI "a sea of dudes"? Using data science and NLP methods to understand gender imbalance in a scientific community.
Abstract: This dissertation carries out an in-depth study of gender in the field of Computational Linguistics. Our approach relies heavily on information that we extract directly from the
data, using tools that the very field we are investigating promotes.
We perform gender attribution on the authors present in a corpus and investigate new
gender classification methods, including character-level LSTMs and face recognition.
We then perform a quantitative analysis of the publication patterns of these authors,
focusing on career development over time, collaboration through coauthorship and
conference rankings. Most of our results are statistically significant and help paint
the landscape of the field. We find that women are underrepresented in the last author
position. What is more, men have a higher number of active years in the field and a
higher number of publications per active year. In terms of collaboration, women tend
to coauthor more papers with other female authors. Another concerning finding is that
women are underrepresented at the highest ranked conferences.
We employ topic modeling to capture how the shift in the field of Computational Linguistics affects the gender gap and contrast this with earlier findings. We report significant
differences in the topics that each gender is more likely to choose. Finally, we look
at the effect of an online publishing repository (arXiv), as opposed to a traditional
corpus (ACL).
Our analysis suggests that there are subtle ways in which gender differences can occur in
scholarly authorship and practitioners should be aware of the dangers of any unconscious
gender bias.