Imbalanced Data and Algorithmic Fairness
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from matplotlib import pyplot as plt
from numpy import where
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
Handling the Imbalanced Data Problem
We will start by generating 2-D imbalanced data from two classes (9900 from class 0 and 100 from class 1).
X_org, y_org = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                                   n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y_org)
print("Before Running SMOTE: ", counter)
for label, _ in counter.items():
    row_ix = where(y_org == label)[0]
    plt.scatter(X_org[row_ix, 0], X_org[row_ix, 1], label=str(label))
plt.legend()
plt.show()
We will split the data into training and test sets with a 70/30 ratio. We will then train a logistic regression model on the training data and evaluate it on the test data using three performance measures: F1-score, accuracy, and balanced accuracy.
X_train, X_test, y_train, y_test = train_test_split(X_org, y_org, test_size=0.30,
                                                    shuffle=True,
                                                    stratify=y_org)
# Plot the training set
counter = Counter(y_train)
print("In training: ", counter)
for label, _ in counter.items():
    row_ix = where(y_train == label)[0]
    plt.scatter(X_train[row_ix, 0], X_train[row_ix, 1], label=str(label))
plt.legend()
plt.show()
# Plot the testing set
counter = Counter(y_test)
for label, _ in counter.items():
    row_ix = where(y_test == label)[0]
    plt.scatter(X_test[row_ix, 0], X_test[row_ix, 1], label=str(label))
plt.legend()
plt.show()
len(X_train), len(X_test)
# Train a logistic regression model on the training subset
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# Test the model on the test subset
result = logreg.predict(X_test)
# Display the confusion matrix to check the performance of your model
# scikit-learn expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(y_test, result)
f1 = f1_score(y_test, result, average=None)
acc = accuracy_score(y_test, result, normalize=True)
bacc = balanced_accuracy_score(y_test, result)
print("F1-Score = {}\nAccuracy = {}\nBalanced Accuracy = {}".format(f1, acc, bacc))
To address the imbalanced data problem, we will practice two techniques:
- Generating more examples of the minority class.
- Using cost-sensitive classification.
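Before trying either technique, it helps to see why accuracy alone is misleading on data this skewed. The sketch below uses hypothetical labels with the same 99:1 ratio as the generated data (2970 and 30 are illustrative test-set counts, not output from the cells above) and a trivial predictor that always outputs the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical test labels with a 99:1 class ratio (illustrative counts).
y_true = np.array([0] * 2970 + [1] * 30)

# A trivial "classifier" that always predicts the majority class 0.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
bacc = balanced_accuracy_score(y_true, y_pred)
# zero_division=0 silences the warning raised because class 1 is never predicted.
f1_per_class = f1_score(y_true, y_pred, average=None, zero_division=0)

print("Accuracy          =", acc)           # high, despite missing every minority example
print("Balanced accuracy =", bacc)          # mean of per-class recall
print("F1 per class      =", f1_per_class)  # class 1 scores 0
```

Accuracy comes out at 0.99 even though no minority example is detected, while balanced accuracy drops to 0.5 and the F1-score of class 1 is 0. This is why all three measures are reported together in this exercise.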
Synthetic Data Generation
We start by generating synthetic data for the minority class using SMOTE and GANs.
Cost-Sensitive Classification
Specifying the cost of misclassification is another technique for pushing a classifier to predict the minority class more accurately. Using the original imbalanced data, consult the scikit-learn documentation on the class_weight parameter to define a cost-sensitive classifier. Evaluate the classifier on the test set using the same performance measures (F1-score, accuracy, and balanced accuracy).
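One possible way to set this up is sketched below with `class_weight="balanced"`, which weights each class inversely to its frequency in the training labels. The data is regenerated so that the block runs on its own. The `random_state` values and the `scores` dictionary are illustrative assumptions, and an explicit dictionary such as `{0: 1, 1: 99}` would be an alternative way to state the costs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Same generator settings as the exercise data; no resampling is applied.
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1)

# "balanced" makes errors on the rare class cost more than errors on the common one.
weighted = LogisticRegression(class_weight="balanced")
weighted.fit(X_train, y_train)
pred = weighted.predict(X_test)

scores = {
    "f1": f1_score(y_test, pred, average=None),
    "accuracy": accuracy_score(y_test, pred),
    "balanced_accuracy": balanced_accuracy_score(y_test, pred),
}
print(scores)
```

Comparing these scores with the unweighted model trained earlier shows how the cost setting trades overall accuracy against minority-class performance.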
Algorithmic Fairness
In this exercise, we will measure the bias in a given dataset and apply some bias mitigation algorithms. You will need to install the AIF360 library using:
!pip install aif360
This library contains implementations of a wide variety of bias mitigation algorithms, including pre-, in-, and post-processing algorithms. However, you will need to upload the dataset that you would like to work with to the AIF360 directory in Google Colab. To do so, download the dataset from the link specified on GitHub. We recommend the German Credit Data. After downloading the data, go to Google Colab and click the button that points up, as in the figure below. When you see the folder tree, navigate to /usr/local/lib/pythonX.XX/dist-packages/aif360/data/raw/german, click the three vertical dots next to that name, and select Upload. Browse to the location where you saved the dataset and upload it.
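If you are unsure which directory name applies to your session, the following sketch builds the upload path from the running interpreter's version. It assumes AIF360 was installed with pip into the dist-packages location named above; `version_dir` and `german_dir` are illustrative names.

```python
import sys

# Assumption: pip installed AIF360 under /usr/local/lib/<python version>/dist-packages,
# as in the path given in the text.
version_dir = "python{}.{}".format(sys.version_info.major, sys.version_info.minor)
german_dir = "/usr/local/lib/{}/dist-packages/aif360/data/raw/german".format(version_dir)
print(german_dir)
```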
Because the Python version on Google Colab may differ between users, pythonX.XX in the path above stands for the version in your own environment.
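To make "measuring bias in a dataset" concrete before working with AIF360, the sketch below computes two common group fairness measures by hand with pandas: statistical parity difference and disparate impact. The toy data, column names, and group encoding are hypothetical and are not taken from the German Credit Data.

```python
import pandas as pd

# Hypothetical toy data: group 1 = privileged, group 0 = unprivileged;
# label 1 = favorable outcome (for example, credit approved).
toy = pd.DataFrame({
    "group": [1] * 10 + [0] * 10,
    "label": [1] * 7 + [0] * 3 + [1] * 4 + [0] * 6,
})

# Rate of favorable outcomes within each group.
rate_priv = toy.loc[toy["group"] == 1, "label"].mean()
rate_unpriv = toy.loc[toy["group"] == 0, "label"].mean()

# Statistical parity difference: 0 means both groups receive the favorable
# outcome at the same rate; negative values disadvantage the unprivileged group.
spd = rate_unpriv - rate_priv

# Disparate impact: ratio of the two rates; 1 means parity.
di = rate_unpriv / rate_priv

print("Statistical parity difference =", round(spd, 3))
print("Disparate impact              =", round(di, 3))
```

Here the unprivileged group receives the favorable outcome 40% of the time against 70% for the privileged group, giving a difference of -0.3 and a ratio of about 0.571. Bias mitigation algorithms aim to move these values toward 0 and 1 respectively.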