Imbalanced Data and Algorithmic Fairness
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from matplotlib import pyplot as plt
from numpy import where
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
Handling the Imbalanced Data Problem
We will start by generating 2-D imbalanced data from two classes (9900 from class 0 and 100 from class 1).
X_org, y_org = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                   n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y_org)
print("Before Running SMOTE: ", counter)
# scatter plot of the samples, colored by class label
for label, _ in counter.items():
    row_ix = where(y_org == label)[0]
    plt.scatter(X_org[row_ix, 0], X_org[row_ix, 1], label=str(label))
plt.legend()
plt.show()
We will split the data into training and testing sets with a 70/30 ratio. We will then train a logistic regression model on the training data and evaluate its performance on the test data using three performance measures: F1-Score, Accuracy, and Balanced Accuracy.
X_train, X_test, y_train, y_test = train_test_split(X_org, y_org, test_size=0.30,
                                                    shuffle=True,
                                                    stratify=y_org)
# Plot the training set
counter = Counter(y_train)
print("In training: ", counter)
for label, _ in counter.items():
    row_ix = where(y_train == label)[0]
    plt.scatter(X_train[row_ix, 0], X_train[row_ix, 1], label=str(label))
plt.legend()
plt.show()
# Plot the testing set
counter = Counter(y_test)
for label, _ in counter.items():
    row_ix = where(y_test == label)[0]
    plt.scatter(X_test[row_ix, 0], X_test[row_ix, 1], label=str(label))
plt.legend()
plt.show()
len(X_train), len(X_test)
# Train a logistic regression model
# Train the model with the training subset
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Test the model on the test subset
result = logreg.predict(X_test)
# Display the confusion matrix to check the performance of your model
# (sklearn expects the true labels first, then the predictions)
confusion_matrix(y_test, result)
f1 = f1_score(y_test, result, average=None)
acc = accuracy_score(y_test, result, normalize=True)
bacc = balanced_accuracy_score(y_test, result)
print("F1-Score = {}\nAccuracy = {}\nBalanced Accuracy = {}".format(f1, acc, bacc))
To address the imbalanced data problem, we will practice two techniques:
- Generating more examples from the minority class.
- Using cost-sensitive classification.
Synthetic Data Generation
We start by generating synthetic data for the minority class using SMOTE and GANs; a minimal SMOTE sketch is shown below, and a GAN trained on the minority samples can be used in the same train-on-augmented-data fashion.
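As a minimal sketch, assuming imblearn's default SMOTE settings and the X_train/X_test split from above (variable names such as logreg_smote are illustrative): we oversample only the training set, retrain the logistic regression model, and re-evaluate with the same measures.
oversample = SMOTE(random_state=1)
X_smote, y_smote = oversample.fit_resample(X_train, y_train)
print("After Running SMOTE: ", Counter(y_smote))
# Retrain on the balanced training set; the test set stays untouched
logreg_smote = LogisticRegression()
logreg_smote.fit(X_smote, y_smote)
result_smote = logreg_smote.predict(X_test)
f1 = f1_score(y_test, result_smote, average=None)
acc = accuracy_score(y_test, result_smote, normalize=True)
bacc = balanced_accuracy_score(y_test, result_smote)
print("F1-Score = {}\nAccuracy = {}\nBalanced Accuracy = {}".format(f1, acc, bacc))
Note that resampling is applied after the train/test split, so no synthetic points leak into the test set.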
Cost-sensitive classification
Specifying the cost of misclassification is another technique for forcing a classifier to predict the class label of the minority more accurately. Use the original imbalanced data and check the sklearn documentation on setting class_weight to define a cost-sensitive classifier. Test the performance of the classifier on the test set using the same performance measures (F1-Score, Accuracy, and Balanced Accuracy); a sketch follows below.
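A minimal sketch, assuming the built-in 'balanced' weighting, which weights classes inversely to their frequencies; an explicit dictionary such as {0: 1, 1: 99} would work as well.
# class_weight='balanced' penalizes mistakes on the rare class more heavily
logreg_cs = LogisticRegression(class_weight='balanced')
logreg_cs.fit(X_train, y_train)
result_cs = logreg_cs.predict(X_test)
f1 = f1_score(y_test, result_cs, average=None)
acc = accuracy_score(y_test, result_cs, normalize=True)
bacc = balanced_accuracy_score(y_test, result_cs)
print("F1-Score = {}\nAccuracy = {}\nBalanced Accuracy = {}".format(f1, acc, bacc))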
Algorithmic Fairness
In this exercise, we will measure the bias in a given dataset and apply some bias mitigation algorithms. You will need to install the AIF360 library using:
!pip install aif360
This library contains implementations of a wide variety of bias mitigation algorithms, including pre-, in-, and post-processing algorithms. However, you will need to upload the dataset that you would like to work with into the AIF360 directory on Google Colab. To do so, download the dataset from the specified link on GitHub; it is recommended to use the German Credit Data. After downloading the data, go to Google Colab and click on the button that points up, as shown in the figure below. When you see the folder tree, navigate to usr/local/lib/pythonX.XX/dist-packages/aif360/data/raw/german, click on the three vertical dots next to that name, and select Upload. Browse to the location where you saved the dataset and upload it.
Since different users may have different versions of Python on Google Colab, pythonX.XX is used in the path where you need to upload the data.
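As a minimal sketch of measuring and then mitigating bias, assuming the German Credit files have been uploaded as described above: the snippet below loads the dataset with age as the protected attribute (as in the AIF360 credit-scoring tutorial), reports the statistical parity difference between the groups, and applies the Reweighing pre-processing algorithm. The group definitions and the choice of Reweighing are illustrative; other protected attributes and mitigation algorithms can be substituted.
from aif360.datasets import GermanDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Load German Credit with age as the protected attribute (age >= 25 is privileged)
dataset = GermanDataset(protected_attribute_names=['age'],
                        privileged_classes=[lambda x: x >= 25],
                        features_to_drop=['personal_status', 'sex'])
privileged_groups = [{'age': 1}]
unprivileged_groups = [{'age': 0}]

# Measure bias: mean difference in favorable outcomes between the two groups
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
print("Statistical parity difference =", metric.mean_difference())

# Mitigate with Reweighing (a pre-processing algorithm) and re-measure
rw = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_rw = rw.fit_transform(dataset)
metric_rw = BinaryLabelDatasetMetric(dataset_rw,
                                     unprivileged_groups=unprivileged_groups,
                                     privileged_groups=privileged_groups)
print("After reweighing =", metric_rw.mean_difference())
A value of 0 for the mean difference indicates parity between the groups; after reweighing, the value should move closer to 0.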