A Few Faces of Bias

One of the biggest killers of any data analysis project is bias. Bias in data can appear in many forms. In this post I will briefly describe three types of bias and how to avoid them.

Algorithm Bias

This is the kind of bias that results from selecting a machine learning model that is too simple for the problem at hand. A classic example is using a linear regression model on data where no linear relationship exists between the predictors and the attribute being predicted. The model is unable to capture all of the signal in the dataset. You'll know you're dealing with this kind of bias when you see high error in both your training and test sets.
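
As a minimal sketch of that symptom (using made-up quadratic data and scikit-learn), a straight-line model fit to curved data scores poorly on both the training and the test split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Made-up data with a clearly non-linear (quadratic) relationship
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A straight line cannot capture the curve in the data...
model = LinearRegression().fit(X_train, y_train)

# ...so the fit is poor on BOTH splits -- the signature of high bias
print("train R^2:", r2_score(y_train, model.predict(X_train)))
print("test  R^2:", r2_score(y_test, model.predict(X_test)))
```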

To avoid this error:

Experiment with models that have more complexity. Use cross-validation techniques to measure the performance of your models and select the one that strikes an optimal balance between bias and variance.
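
Here is a rough sketch of that idea (again with hypothetical data): compare candidate models of increasing complexity with cross-validation and keep the one with the best held-out score.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=500)

# Try increasingly complex models and let cross-validation pick the winner
for degree in (1, 2, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"degree {degree:2d}: mean CV R^2 = {scores.mean():.3f}")

# Too simple (degree 1) underfits; very high degrees start to overfit.
```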

Measurement Bias

This kind of bias happens as a result of errors in the data collection process. A few examples of this kind of bias are:

  • An image classification system where training data is collected from a camera with much higher image quality than the one that will be used in production.
  • Incorrectly labeled audio files for a project whose goal is to build a model that distinguishes male voices from female voices.
  • A survey containing leading questions that influence answers in a particular direction.

To avoid this error:

  • Compare the outputs of the measurement tools used for data collection to make sure they are consistent with those of the tools used in production (see the sketch after this list).
  • Properly train labelers and annotation workers before putting them to work on data.
  • Comply with survey best practices.
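
For the first point, a quick sanity check might look something like this sketch. The numbers are placeholders for whatever your tools actually record (here, a hypothetical "mean image brightness" per camera); the idea is simply to compare summary statistics or run a two-sample test between devices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder readings: e.g. mean image brightness from each camera
collection_device = rng.normal(120, 10, size=1000)  # camera used to build the dataset
production_device = rng.normal(100, 25, size=1000)  # camera used in production

print("collection mean/std:", collection_device.mean(), collection_device.std())
print("production mean/std:", production_device.mean(), production_device.std())

# A two-sample Kolmogorov-Smirnov test flags whether the two devices
# produce measurably different distributions
statistic, p_value = stats.ks_2samp(collection_device, production_device)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3g}")

# A tiny p-value here is a warning that the training data may not
# reflect what the model will see in production.
```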

Sample Bias

This kind of bias is also a result of faults in the data collection process. Whereas measurement bias stems from errors in how the data is measured, sample bias arises when the data comes from a sample of individuals that is not representative of the population of interest.

Amazon's failed AI Recruiting Tool, which I wrote about in my last post, is a great example of this type of bias. Because the dataset used for the product came from resumes submitted mostly by male applicants, the tool developed a strong preference for male candidates over female candidates.

To avoid this error:

  • Clearly identify the goals of the project the data will be collected for and the audience the project is meant to serve.
  • Use random sampling techniques to make sure that every member of the population has an equal chance of being selected (see the sketch after this list).
  • Double-check your work to make sure no mistakes were made.
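
As a rough illustration of the second point, here is a sketch with a hypothetical pandas DataFrame: draw a simple random sample and compare it against the full population on a key attribute to see whether it looks representative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical population frame -- in practice this would be your sampling frame
population = pd.DataFrame({
    "applicant_id": range(10_000),
    "gender": rng.choice(["female", "male"], size=10_000, p=[0.5, 0.5]),
})

# Simple random sample: every row has an equal chance of selection
sample = population.sample(n=1_000, random_state=0)

# Compare key attribute proportions between sample and population
print("population:\n", population["gender"].value_counts(normalize=True))
print("sample:\n", sample["gender"].value_counts(normalize=True))

# Large gaps between the two would suggest the sample is not representative.
```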