Fairness#
Lead Scribe: Chan
FAIRNESS AND MACHINE LEARNING#
Chapter 2: Classification#
Presented by Damon
Main Goal#
Determine a plausible value for an unknown variable Y given an observed variable X
Essentially trying to predict something based on other factors
Supervised learning#
The prevalent method for constructing classifiers from observed data
Goal: to identify meaningful patterns
Classifier: a mapping from the space of possible values for X to the space of values that the target Y can assume.
The essential idea is simple: we have labeled data of the form (x1, y1), ..., (xn, yn) drawn from a distribution, where each xi is an instance and each yi is its label.
The instances are then partitioned into positive and negative instances according to their labels.
Labels typically come from a discrete set such as {-1, 1} in the case of binary classification.
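As a minimal sketch of this setup (the toy data and the threshold rule below are invented for illustration), a classifier is just a mapping from the space of X values to the label space, learned from labeled pairs:

```python
# Hypothetical labeled data: pairs (instance, label) with labels in {-1, 1}.
labeled_data = [(0.2, -1), (0.4, -1), (0.6, 1), (0.9, 1)]

def train_threshold(data):
    """Pick the threshold with the fewest training errors (brute force)."""
    best_t, best_err = 0.0, float("inf")
    for t, _ in data:
        err = sum(1 for x, y in data if (1 if x >= t else -1) != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

threshold = train_threshold(labeled_data)

def classify(x):
    # The learned classifier: a mapping from X to the label space {-1, 1}.
    return 1 if x >= threshold else -1

print(classify(0.7))  # -> 1
```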
Statistical Classification Criteria#
How to identify which classifier is best for your purpose
Accuracy: the probability of correctly predicting the target variable
Conditional probability table
The true positive rate corresponds to the frequency with which the classifier correctly assigns a positive label to a positive instance.
Why do we need the weighted average?
P(Y=Ŷ) = P(Ŷ=1∣Y=1)P(Y=1) + P(Ŷ=0∣Y=0)P(Y=0)
To compute the overall accuracy, we weight each conditional probability by the probability of the corresponding outcome, the base rates P(Y=1) and P(Y=0), and sum.
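A quick worked example of this identity, with made-up rates and base rate:

```python
# Hypothetical numbers for the weighted-average accuracy identity.
tpr = 0.90          # P(Ŷ=1 | Y=1), the true positive rate
tnr = 0.80          # P(Ŷ=0 | Y=0), the true negative rate
p_pos = 0.30        # P(Y=1), the base rate of positives
p_neg = 1 - p_pos   # P(Y=0)

# Each conditional probability is weighted by how often its condition occurs.
accuracy = tpr * p_pos + tnr * p_neg
print(accuracy)     # 0.9 * 0.3 + 0.8 * 0.7 = 0.83
```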
The Conditional Expectation#
Score: a single real-valued summary variable produced by a regression model
A natural score function is the expectation of the target variable Y conditional on the features X we have observed
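A small sketch, on hypothetical data, of estimating this conditional-expectation score by averaging the observed labels at each feature value:

```python
from collections import defaultdict

# Hypothetical (feature x, label y) data with y in {0, 1}.
# Estimate the score r(x) = E[Y | X = x] as the mean label per feature value.
data = [(0, 1), (0, 0), (0, 0), (1, 1), (1, 1), (1, 0)]

sums, counts = defaultdict(float), defaultdict(int)
for x, y in data:
    sums[x] += y
    counts[x] += 1

score = {x: sums[x] / counts[x] for x in counts}
print(score)  # {0: 0.333..., 1: 0.666...}, the conditional mean of Y given X
```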
Sensitive Characteristics (A)#
In many classification tasks, the features X contain or implicitly encode sensitive characteristics of an individual
Simply ignoring these factors is dangerous: other features remain correlated with them, so the classifier can still implicitly encode the sensitive characteristic.
Independence – R⏊A#
Requires the sensitive characteristic to be statistically independent of the score
Definition 1: The random variables (A, R) satisfy independence if A⏊R
Independence simplifies to the condition:
P{R=1∣A=a}=P{R=1∣A=b}
Where R=1 is acceptance and the condition requires the acceptance rate to be the same in all groups.
There could also be a relaxation of the constraint that introduces a positive amount of slack ε > 0 and requires that:
P{R=1∣A=a} ≥ P{R=1∣A=b} − ε
Equivalently, ε ≥ P{R=1∣A=b} − P{R=1∣A=a}, where we want the difference in acceptance rates to be small.
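A sketch of auditing the relaxed independence condition; the groups, decisions, and slack value below are hypothetical:

```python
# Hypothetical (group A, decision R) records.
decisions = [("a", 1), ("a", 0), ("a", 1), ("b", 1), ("b", 0), ("b", 0)]

def acceptance_rate(group):
    # P{R=1 | A=group}, estimated from the data.
    rs = [r for g, r in decisions if g == group]
    return sum(rs) / len(rs)

eps = 0.05  # illustrative slack
gap = abs(acceptance_rate("a") - acceptance_rate("b"))
print(gap <= eps)  # relaxed independence holds iff the gap is at most eps
```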
Separation#
Acknowledges that in many scenarios, the sensitive characteristic may be correlated with the target variable.
This separation criterion allows correlation between the score and the sensitive attribute to the extent that it is justified by the target variable
Definition 2: Random variables (R,A,Y) satisfy separation if R⏊A|Y
In the case where R is a binary classifier, separation is equivalent to requiring for all groups a, b the two constraints:
P{R=1∣Y=1,A=a}=P{R=1∣Y=1,A=b}
P{R=1∣Y=0,A=a}=P{R=1∣Y=0,A=b}
Separation requires that all groups experience the same false negative rate and the same false positive rate.
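A sketch of auditing separation on hypothetical (group, target, decision) triples, comparing true and false positive rates across groups:

```python
# Hypothetical (group A, target Y, decision R) records.
records = [
    ("a", 1, 1), ("a", 1, 0), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 1), ("b", 1, 1), ("b", 0, 0), ("b", 0, 0),
]

def rate(group, y_value):
    # P{R=1 | Y=y_value, A=group}, estimated from the data.
    rel = [r for g, y, r in records if g == group and y == y_value]
    return sum(rel) / len(rel)

for g in ("a", "b"):
    print(g, "TPR:", rate(g, 1), "FPR:", rate(g, 0))
# Separation requires both rates to match across groups
# (here they do not: group b has higher TPR and lower FPR).
```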
Sufficiency – Y⏊A|R#
Formalizes the idea that the score already subsumes the sensitive characteristic for the purpose of predicting the target
Definition 3: We say that random variables (R, A, Y) satisfy sufficiency if Y⏊A|R
In this case, a random variable R is sufficient for A iff for all groups a, b and all values r in the support of R, we have:
P{Y=1∣R=r, A=a} = P{Y=1∣R=r, A=b}
When R only has 2 values we recognize this condition as requiring a parity of pos./neg. predictive values across all groups.
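A sketch of checking this parity-of-predictive-values reading of sufficiency, again on hypothetical data:

```python
# Hypothetical (group A, target Y, decision R) records.
records = [
    ("a", 1, 1), ("a", 0, 1), ("a", 0, 0), ("a", 1, 0),
    ("b", 1, 1), ("b", 1, 1), ("b", 0, 0), ("b", 1, 0),
]

def positive_fraction(group, r_value):
    # P{Y=1 | R=r_value, A=group}: for R=1 this is the positive predictive
    # value; for R=0 it is one minus the negative predictive value.
    ys = [y for g, y, r in records if g == group and r == r_value]
    return sum(ys) / len(ys)

for g in ("a", "b"):
    print(g, "P(Y=1|R=1):", positive_fraction(g, 1),
          "P(Y=1|R=0):", positive_fraction(g, 0))
# Sufficiency requires both quantities to match across groups.
```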
Calibration and Sufficiency#
It is sometimes desirable to be able to interpret the values of the score function as probabilities
A score R is calibrated if for all values r in the support of R we have P{Y=1∣R=r} = r
This condition means that the set of all instances assigned a score value r has an r fraction of positive instances among them.
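A sketch of checking calibration on hypothetical (score, label) pairs, computing the fraction of positives at each score value:

```python
from collections import defaultdict

# Hypothetical (score R, label Y) pairs with exact score bins.
pairs = [(0.2, 0), (0.2, 0), (0.2, 1), (0.8, 1), (0.8, 1), (0.8, 1), (0.8, 0)]

labels_by_score = defaultdict(list)
for r, y in pairs:
    labels_by_score[r].append(y)

for r, ys in sorted(labels_by_score.items()):
    frac = sum(ys) / len(ys)
    print(f"score {r}: positive fraction {frac:.2f}")  # calibrated iff frac == r
```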
Calibration by Group#
To formalize the connection between sufficiency and calibration, we say that the score R satisfies calibration by group if it satisfies:
P{Y=1∣R=r, A=a} = r
for all score values r and all groups a.
Proposition 1. If a score R satisfies sufficiency, then there exists a function 𝓵: [0,1] → [0,1] so that 𝓵(R) satisfies calibration by group.
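A sketch of the construction behind Proposition 1, estimated from hypothetical data: 𝓵 maps each score value r to an estimate of P{Y=1∣R=r}, and under sufficiency the same map is valid within every group:

```python
from collections import defaultdict

# Hypothetical (score R, label Y) pairs.
pairs = [(0.2, 0), (0.2, 0), (0.2, 1), (0.8, 1), (0.8, 1), (0.8, 0)]

labels_by_score = defaultdict(list)
for r, y in pairs:
    labels_by_score[r].append(y)

# l(r) = P{Y=1 | R=r}, estimated per score value; if R satisfies sufficiency,
# this map does not depend on the group, so l(R) is calibrated by group.
l = {r: sum(ys) / len(ys) for r, ys in labels_by_score.items()}
recalibrated = [(l[r], y) for r, y in pairs]  # the transformed score l(R)
print(l)  # {0.2: 0.333..., 0.8: 0.666...}
```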
Relationships Between Criteria#
Each criterion constrains the joint distribution in non-trivial ways, so we can suspect that imposing any two of them simultaneously over-constrains the space to the point where only degenerate solutions remain.
There is no easy path to establishing multiple fairness criteria at once: in general, any two of the three criteria are mutually exclusive except in degenerate cases.
Machine Bias#
Presented by Derek
The premise#
Brisha Borden & a friend - 18 years old
Scooter
Vernon Prater - “Seasoned”
Shoplifting
Both similar crimes
Computer program predicting likelihood of recidivism
Borden scored higher risk than Prater
Two years later, the opposite held true: Prater reoffended, Borden did not
Risk Assessments#
Becoming increasingly more popular in courtrooms
9 states provide scores to judges during sentencing
Following the Warnings of AG Holder#
The sentencing commission did not launch a study of risk scores
ProPublica did launch a study
>7,000 people arrested in Broward County, FL
Check how many were charged with new crimes over 2 years
Results
Unreliable
Biased
In Theory#
Risk scores are great
High scores mean more likely to commit crime
Vice versa for low scores
Simple…?
In Practice#
People are extremely complex
Countless factors go into recidivism such as
employment
housing status
financial situations
As a result, unless we individualize the risk assessment, the results will be biased
The Problem#
People tend to blindly trust tech.
Many jurisdictions adopted Northpointe's software without testing it
Using tech is the easy way out
“Easy to use and gives ‘simple/effective’ charts for judicial review”
2009 study reported a 68% accuracy rate for the scoring software
Overall#
Using tech in a social context is extremely complex
People are unique and countless factors go into individual behavior
With recidivism specifically:
Black people were more likely to be predicted high risk than White people
Software relied upon too much
Studies have concluded risk scores do not reflect reality.
Gender Shades#
Presented by Derek
Intro#
AI is widely used in society
Facial recognition software can realistically be used to identify suspects
Algorithms trained with biased data result in algorithmic discrimination
This work focuses on one facial analysis task: gender classification
Intersectional Benchmark#
Gender classification means we need defined classes
The dataset should contain varied physical attributes for subgroup accuracy analysis
Pilot Parliaments Benchmark dataset
high quality photos
reliable sources
Commercial Gender Classification Audit#
Classifiers perform best on lighter, male faces
Classifiers perform worst on darker female faces
Microsoft, IBM, Face++ analyzed
Max disparity in error rate between best/worst classified groups is 34.4%
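As a sketch of how such a disparity figure is computed (the subgroup error rates below are invented for illustration, not the paper's published numbers):

```python
# Max error-rate disparity across intersectional subgroups; rates are made up.
error_rates = {
    "lighter male": 0.01,
    "lighter female": 0.07,
    "darker male": 0.12,
    "darker female": 0.35,
}
max_disparity = max(error_rates.values()) - min(error_rates.values())
print(f"{max_disparity:.1%}")  # gap between best- and worst-classified groups
```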
Dissecting racial bias in an algorithm used to manage the health of populations#
Presented by Derek
At a given risk score, Black patients had more chronic conditions than White patients; they had to be sicker to receive the same score.
Training used total medical expenditure as a proxy for health needs; since less is spent on Black patients at the same level of need, the algorithm underestimated their needs.
Training on a direct measure of health (e.g., number of active chronic conditions) instead of cost would reduce the bias.