# Pre-processing Algorithms
Pre-processing algorithms make adjustments to the data to mitigate the unfairness of the downstream machine learning model. Pre-processing techniques vary greatly in complexity, ranging from simple techniques that can be executed manually to more complex optimization schemes. In this section, we will cover two pre-processing approaches: relabeling and representation learning.
Note
The goal of this section is not to present a comprehensive overview of the fairness-aware machine learning techniques proposed in the algorithmic fairness literature. Instead, these examples allow you to develop an intuition of the workings, (dis)advantages, and limitations of the fairness-aware machine learning paradigm.
## Relabeling
The relabeling technique was first introduced by Kamiran and Calders [1]. The authors set out to learn a classifier that optimizes predictive performance without reproducing the association between a sensitive feature \(A\) and a target variable \(Y\). They propose to pre-process the data set to remove any undesirable association between \(A\) and \(Y\), which can be viewed as an attempt at mitigating historical bias. To this end, the relabeling technique assigns a different label (i.e., class) to selected instances in the training data such that the base rates of the sensitive groups in the pre-processed data are equal. In particular, the technique changes the labels of some instances belonging to the group with the lower base rate from negative to positive (‘promotion’), and the labels of an equal number of instances belonging to the group with the higher base rate from positive to negative (‘demotion’). In this way, the overall class distribution in the training data is maintained.
Randomly selecting instances for relabeling likely results in a loss of predictive performance of a classifier trained on the pre-processed data. For example, relabeling a negative instance with an extremely low probability of belonging to the positive class almost certainly reduces the predictive performance of the classifier. In contrast, a negative instance that has a similar probability of belonging to either the positive or negative class could be given the benefit of the doubt.
Following this intuition, the instances closest to the decision boundary are selected for relabeling. A ranker \(R\) is learned on the training data, which estimates the probability of an instance belonging to the positive class. Using the ranker, instances are sorted according to their score. The highest-scoring negative instances of the group with a low base rate are up for promotion (i.e., relabeled from negative to positive), while the lowest-scoring positive instances of the group with a high base rate are up for demotion (i.e., relabeled from positive to negative). Relabeling is performed greedily until the base rates of the sensitive groups are equal.
**Relabeling algorithm**
Kamiran and Calders [1] restrict their approach to a single binary sensitive attribute \(A\) and a binary classification problem with target variable \(Y\), where a positive prediction (\(\hat{Y}=1\)) corresponds to the desirable output for data subjects, i.e., receiving a benefit or avoiding a burden.
The input of the algorithm is a labeled dataset \(D = \{X, Y\}\) and a sensitive feature \(A\). The algorithm proceeds in the following steps; a minimal code sketch is given after the list.
1. Compute the base rate, \(P(Y=1 \mid A = a)\), for each sensitive group \(a \in A\), and determine which sensitive group has the lower base rate:
   - \(A=a\): the group with the low base rate (i.e., \(P(Y=1 \mid A = a) < P(Y=1 \mid A = a')\));
   - \(A=a'\): the group with the high base rate.
2. Learn a ranker \(R\) over dataset \(D\).
3. Compute the confidence score, \(x(R)=P(Y=1 \mid x)\), for all instances in the training dataset (\(x \in D\)).
4. Identify relabeling candidates:
   - Promotion candidates: \(pr := \{ x \in D \mid A = a, Y=0 \}\), ordered descending w.r.t. \(x(R)\);
   - Demotion candidates: \(dem := \{ x \in D \mid A = a', Y=1 \}\), ordered ascending w.r.t. \(x(R)\).
5. Compute the number of instances that must be relabeled to achieve equal base rates. Let \(P_{a} := P(Y=1 \mid A = a)\) and \(P_{a'} := P(Y=1 \mid A = a')\) correspond to the ground-truth base rates of groups \(a\) and \(a'\) respectively. Furthermore, let \(n_{a} := |\{ x \in D \mid x(A) = a \}|\) and \(n_{a'} := |\{ x \in D \mid x(A) = a' \}|\) correspond to the number of instances that belong to groups \(a\) and \(a'\) respectively, and let \(N = n_{a} + n_{a'}\) be the total number of instances. Then, the number of instances that must be relabeled to achieve equal base rates is equal to:

   $$
   M = \frac{(P_{a'} - P_{a}) \times n_{a} \times n_{a'}}{N}
   $$

   If \(M\) is not a whole number, the authors propose to round it up.
6. Relabel instances:
   - Relabel the top-\(M\) of \(pr\) from \(Y=0\) to \(Y=1\) (‘promote’);
   - Relabel the top-\(M\) of \(dem\) from \(Y=1\) to \(Y=0\) (‘demote’).
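The following is a minimal sketch of these steps in Python, assuming the data are held in a pandas `DataFrame` with a single binary sensitive column and a binary label column. Steps 2–3 (learning the ranker and scoring the instances) are assumed to have been carried out beforehand, with the scores \(x(R)\) passed in as a column; the function name and column arguments are illustrative, not part of the original algorithm description.

```python
import math

import pandas as pd


def relabel(df: pd.DataFrame, sensitive: str, label: str, score: str) -> pd.DataFrame:
    """Return a copy of df with labels promoted/demoted until base rates are (roughly) equal."""
    df = df.copy()
    assert df[sensitive].nunique() == 2, "sketch assumes a single binary sensitive feature"

    # Step 1: base rate P(Y=1 | A=a) per group; the group with the lowest rate is 'low'.
    rates = df.groupby(sensitive)[label].mean()
    low, high = rates.idxmin(), rates.idxmax()

    # Step 5: number of relabelings M = (P_a' - P_a) * n_a * n_a' / N, rounded up.
    n_low = (df[sensitive] == low).sum()
    n_high = (df[sensitive] == high).sum()
    m = math.ceil((rates[high] - rates[low]) * n_low * n_high / len(df))

    # Step 4: promotion candidates are negatives of the low-base-rate group (highest scores first);
    #         demotion candidates are positives of the high-base-rate group (lowest scores first).
    promote = (df[(df[sensitive] == low) & (df[label] == 0)]
               .sort_values(score, ascending=False).head(m).index)
    demote = (df[(df[sensitive] == high) & (df[label] == 1)]
              .sort_values(score, ascending=True).head(m).index)

    # Step 6: relabel the selected instances.
    df.loc[promote, label] = 1
    df.loc[demote, label] = 0
    return df
```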
Let’s have a look at an example. Consider the toy data set displayed in Table 3. Assuming `nationality` is considered a sensitive feature and a positive prediction (i.e., \(\hat{Y}=1\)) corresponds to a benefit, we can determine which labels would be changed as follows.
|    | Nationality | Highest Degree | Job Type   | \(Y\) | \(x(R)\) |
|----|-------------|----------------|------------|-------|----------|
| 1  | citizen     | university     | board      | 1     | .99      |
| 2  | citizen     | high school    | board      | 1     | .90      |
| 3  | citizen     | university     | education  | 1     | .92      |
| 4  | noncitizen  | university     | healthcare | 1     | .72      |
| 5  | noncitizen  | none           | healthcare | 0     | .44      |
| 6  | noncitizen  | high school    | board      | 0     | .09      |
| 7  | citizen     | university     | education  | 0     | .66      |
| 8  | citizen     | none           | healthcare | 1     | .66      |
| 9  | noncitizen  | high school    | education  | 0     | .02      |
| 10 | citizen     | university     | board      | 1     | .92      |
First, we need to determine the base rates of the sensitive groups. Considering the \(Y\) column, 5 of the 6 `citizen` instances belong to class \(Y=1\), while only 1 of the 4 `noncitizen` instances belongs to class \(Y=1\). Consequently, low-scoring positive instances of `citizen` are up for demotion, while high-scoring negative instances of `noncitizen` are up for promotion.
Now we need to determine how many instances ought to be relabeled to achieve equal base rates. We have \(P_a = 1/4\), \(n_a = 4\), \(P_{a'} = 5/6\), \(n_{a'} = 6\), and \(N = 10\). As such, \(M = (5/6 - 1/4) \times 4 \times 6 / 10 = 1.4\) instances must be relabeled. Following the procedure suggested by the authors, we round up to 2.
In this case, therefore, instances 8 and 2 will be relabeled from positive to negative and instances 5 and 6 will be relabeled from negative to positive. In the pre-processed dataset, the base rates for `citizen` and `noncitizen` are 3/6 and 3/4 respectively. Note that by rounding up the number of relabeled instances, the overall class distribution remains the same (6/10), but the base rate of `noncitizen` is now higher than that of `citizen`.
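Reusing the hypothetical `relabel` sketch from above, the bookkeeping in this example can be reproduced as follows. Only the columns the procedure actually uses are encoded, the column names are again illustrative, and the scores are the \(x(R)\) values from the table.

```python
# Toy data set from Table 3 (rows indexed 1..10 as in the table).
toy = pd.DataFrame({
    "nationality": ["citizen", "citizen", "citizen", "noncitizen", "noncitizen",
                    "noncitizen", "citizen", "citizen", "noncitizen", "citizen"],
    "y":     [1, 1, 1, 1, 0, 0, 0, 1, 0, 1],
    "score": [.99, .90, .92, .72, .44, .09, .66, .66, .02, .92],
}, index=range(1, 11))

relabeled = relabel(toy, sensitive="nationality", label="y", score="score")
# Base rates after pre-processing: citizen 3/6 = 0.50, noncitizen 3/4 = 0.75.
print(relabeled.groupby("nationality")["y"].mean())
```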
## Representation Learning
Relabeling makes explicit adjustments to the training data to obscure the association between a sensitive feature and the target variable. Computer science researchers have also proposed more abstract pre-processing approaches, in which the problem is formulated as an optimization task to identify a new representation of the data. The first of this kind is the representation learning approach proposed by Zemel et al. [2].
Zemel et al. [2] approach unfairness mitigation as an optimization problem of finding a good representation of the data with two competing goals: (1) encoding the data as well as possible, while (2) obscuring information regarding sensitive group membership. The representation of choice is a probabilistic mapping from instances in the original dataset to a set of prototypes. A prototype can be viewed as an artificial instance that is representative of a cluster of instances. A prototype is represented by a vector \(v_k\), which is in the same space as the instances \(x \in X\). The learned mapping assigns a probability to each of the instances in the dataset of mapping onto each of the prototypes \(k \in Z\) (Fig. 4). Based on these probabilities, the final transformation of the data set stochastically assigns each input example to one of the prototypes.
The probabilistic mapping is defined based on the distance between an instance and a prototype, quantified via a distance function \(d(x,v_k)\) (Fig. 5). Typical choices of distance functions are Euclidean distance or cosine distance. The distances are transformed into probabilities via the softmax function, i.e., \(P(Z=k \mid x) = \frac{\exp(-d(x,v_k))}{\sum_{j=1}^K \exp(-d(x,v_j))}\).
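As a small illustration, the sketch below computes these probabilities with NumPy, assuming Euclidean distance; the function name and array shapes are not part of the original formulation.

```python
import numpy as np


def prototype_probabilities(X: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Return an (n, K) matrix of P(Z=k | x) for instances X (n, d) and prototypes V (K, d)."""
    # Pairwise Euclidean distances d(x, v_k) between every instance and every prototype.
    dists = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=-1)
    # Softmax over the negative distances: closer prototypes receive higher probability.
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)  # subtract row-wise max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```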
Learning the mapping \(X \rightarrow Z\) is formulated as an optimization task with three goals:

1. Information regarding sensitive group membership is lost, by requiring \(P(Z=k \mid A=a) = P(Z=k \mid A=a')\). That is, the probability of mapping onto a particular prototype should be independent of sensitive group membership.
2. Information in \(X\) is retained as well as possible.
3. The mapping \(X \rightarrow Z \rightarrow Y\) is close to the mapping \(f : X \rightarrow Y\).
Each of these aims is translated to a term in the objective function that is used to learn the representations. For the exact details of the formulation of the objective function, we refer the interested reader to Zemel et al. [2].
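To give a concrete impression of what such an objective can look like, the sketch below (reusing NumPy from the previous snippet) combines three illustrative terms that mirror the goals above: a group-parity term on the prototype probabilities, a squared reconstruction error, and a cross-entropy prediction loss, each weighted by a hyperparameter. This is a schematic sketch under those assumptions, not the exact formulation of Zemel et al. [2].

```python
def lfr_style_objective(probs, X, X_hat, y, y_hat, a, w_z=1.0, w_x=1.0, w_y=1.0):
    """Illustrative three-term objective; probs is (n, K), a is a binary group indicator per instance."""
    # Goal 1: group parity -- mean prototype probabilities should match across the two groups.
    L_z = np.abs(probs[a == 1].mean(axis=0) - probs[a == 0].mean(axis=0)).sum()
    # Goal 2: retain information in X -- squared reconstruction error.
    L_x = ((X - X_hat) ** 2).sum(axis=1).mean()
    # Goal 3: keep X -> Z -> Y close to f: X -> Y -- cross-entropy on the predicted labels.
    eps = 1e-12
    L_y = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)).mean()
    # Weighted combination of the three terms.
    return w_z * L_z + w_x * L_x + w_y * L_y
```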
## References
[1] Faisal Kamiran and Toon Calders. Data preprocessing techniques for classification without discrimination. *Knowledge and Information Systems*, 33(1):1–33, 2012.

[2] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In *International Conference on Machine Learning*, 325–333. PMLR, 2013.