- Keywords: twitter, demographics, selection bias, self-report
- Abstract: Computational social science studies often contextualize content analysis within standard demographics. Since demographic attributes are unavailable on many social media platforms, such as Twitter, numerous studies have inferred demographic traits automatically. Although many studies present proof-of-concept inference of race and ethnicity, training practical systems remains elusive because few annotated datasets exist. Existing datasets are small or error-prone, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite the errors inherent in automated supervision, we train models accurate enough to identify demographics when evaluated against a gold-standard self-report survey. The result is a reproducible method for creating large-scale training resources for race and ethnicity.