Fairness Forensics

The term "Fairness Forensics" was coined by Kate Crawford in her NIPS 2017 keynote, "The Trouble with Bias" (available on YouTube). The phrase is a call for researchers to investigate possible bias in deployed systems. Below are two projects that answer this call.

Detecting Simpson's Paradox

Simpson’s paradox is the phenomenon in which the trend of an association in the whole population reverses within the subpopulations defined by a categorical variable. Detecting it can surface surprising and interesting patterns in a data set. The paradox is usually discussed in terms of binary variables; studies that explore it for continuous variables are relatively rare.

Our work describes a method to discover Simpson’s paradox for pairs of continuous variables. We use the correlation coefficient to measure the association between a pair of continuous variables, and categorical variables to partition the data set into groups. The algorithm’s goal is to find sign reversals between the correlation coefficient measured within a group and the one measured over the entire data set. We show that the approach detects cases in both real and synthetic data sets, and that these detected occurrences of Simpson’s paradox uncover hidden, surprising patterns.
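The sign-reversal check described above can be illustrated with a minimal sketch. This is not the implementation from the paper: the function names (`pearson`, `find_reversals`), the row format (a list of dictionaries), and the toy data are all assumptions made for illustration, and Pearson's coefficient is assumed as the correlation measure.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant variable has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def find_reversals(rows, x_key, y_key, group_key):
    """Return the overall correlation and the groups whose sign opposes it.

    Hypothetical helper: `rows` is a list of dicts, `x_key` and `y_key`
    name two continuous variables, `group_key` names a categorical one.
    """
    overall = pearson([r[x_key] for r in rows], [r[y_key] for r in rows])
    if overall is None or overall == 0:
        return overall, {}

    # Partition the rows by the categorical variable.
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_key]].append(r)

    # Keep only groups whose correlation has the opposite sign.
    reversed_groups = {}
    for label, members in groups.items():
        r_g = pearson([m[x_key] for m in members],
                      [m[y_key] for m in members])
        if r_g is not None and r_g * overall < 0:
            reversed_groups[label] = r_g
    return overall, reversed_groups


# Toy data (invented): y falls with x inside each group, but group "b"
# sits higher on both axes, so the pooled trend is positive.
rows = (
    [{"x": x, "y": 10 - x, "g": "a"} for x in range(0, 5)]
    + [{"x": x, "y": 20 - x, "g": "b"} for x in range(5, 10)]
)

overall, reversed_groups = find_reversals(rows, "x", "y", "g")
print("overall correlation:", overall)
print("groups with reversed sign:", reversed_groups)
```

In this toy example the pooled correlation is positive while both groups show a negative correlation, so both are reported. A fuller version would presumably loop over every pair of continuous variables and every categorical variable, and might filter out weak correlations; those choices are left open here.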

We also propose an approach that exploits sampled data for early Simpson’s paradox detection. We report the running time of the algorithm under combinations of different conditions. Detection is especially challenging because Simpson’s paradox can be hard to identify and interpret in large data sets. We want to develop new visual and interactive techniques for exploring data for Simpson’s paradox.


Chenguang Xu, Sarah M. Brown, Christan Grant. Detecting Simpson's Paradox. The 31st International Florida Artificial Intelligence Research Society (FLAIRS) Conference. Melbourne, Florida. 2018.

Demographic Inference over Social Media

The goal of this project is to systematically assess the representativeness and quality of social media data from Twitter from a public health perspective, based on specific indicators: demographic variables (gender, age, ethnicity, education, income, profession, religion) and geographic variables (county representation). We evaluate this data resource using prospectively and retrospectively collected data, with one example of a communicable disease (e.g. gastrointestinal disorders) and one of a non-communicable disease (e.g. coronary heart disease).


Nina Cesare, Christan Grant, Elaine Nsoesie. Estimating Obesity Prevalence By Age and Gender Using Social Media Data. The Annual Meeting for the Population Association of America (PAA). Denver, CO. 2018.

Nina Cesare, Christan Grant, Jared Hawkins, John S. Brownstein, Elaine O. Nsoesie. Demographics in Social Media Data for Public Health Research: Does it matter? Bloomberg Data for Good Exchange. New York, NY, USA. 2017.

Nina Cesare, Christan Grant, Elaine O. Nsoesie. Detection of User Demographics on Social Media: A Review of Methods and Recommendations for Best Practices. arXiv. 2017.