A case for, and an approach to, stronger data science model governance

In light of recent news about credit limit discrimination, this article addresses the public concern that sensitive attributes such as gender may be taken into account by credit scoring models, examines the scenarios that could have produced this outcome, and discusses how to better govern model management so that these models adhere to legal, privacy, and ethical requirements.

Background of Credit Score/Credit Risk

Credit scoring methods are statistical tools employed by banks, other financial institutions, and marketing and advertising companies. At its core, credit scoring aims to determine whether a customer should receive a favorable lending decision. The decision is typically modeled as a function of several input variables: banks collect information about the applicant from sources such as historical data, questionnaires, and interviews, gathering demographic and financial details such as income, age, type of loan, nationality, and occupation. Accurate prediction of consumer credit risk is indispensable for lending organizations: credit scoring helps a financial institution evaluate the likelihood that an applicant will default on a financial obligation and decide whether to grant credit. Precise judgment of creditworthiness allows institutions to increase the volume of granted credit while minimizing possible risk and losses.
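To make "a decision modeled as a function of several input variables" concrete, here is a minimal sketch in Python. The feature names, weights, bias, and threshold are invented for illustration; a real scoring model would learn its weights from historical data rather than hard-code them.

```python
import math

# Hypothetical feature weights and bias; a real model would learn these
# from historical data. A negative weight means the feature lowers default risk.
WEIGHTS = {"income": -0.8, "age": -0.1, "existing_debt": 0.9}
BIAS = 0.2

def default_probability(applicant: dict) -> float:
    """Probability of default as a logistic function of weighted inputs."""
    z = BIAS + sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

def lending_decision(applicant: dict, threshold: float = 0.5) -> bool:
    """Grant credit when the predicted default probability is below threshold."""
    return default_probability(applicant) < threshold
```

Note that nothing in this function is magic: every input variable and weight is visible, which is exactly why the choice of inputs deserves governance.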

Possible scenarios that could explain the above situation

Keeping in mind the credit limit scenario that generated news and awareness around credit scoring algorithms (a.k.a. models), any model may appear opaque in explaining how it arrived at a credit limit decision, but it isn't black magic. It was built, trained, and tested with a set of features as input, using an algorithm along with various parameters, to arrive at a result.

With that said, taking a 'credit limit decisioning model' as an example, the following possibilities could explain the apparent gender discrimination:

  • Credit score, credit report, and income were taken as prominent features without including gender at all. Income could have played a crucial and coincidental role in deciding most individuals' credit limits irrespective of gender, with the majority of women falling into the likely-defaulter category, thereby surfacing as discrimination. In other words, gender could correlate highly with some other set of features, such as income. This correlation could be causal, or it could be spurious.
  • "Gender" was deliberately taken as a feature to target men as the primary audience for the credit card, on the assumption that even when two members of the same family hold the same credit card, the wives in particular would remain guided/dependent on/directed by their husbands. (This is discussed only as a theoretical possibility; the Equal Credit Opportunity Act (ECOA) outright prohibits anyone from doing this.)
  • Gender might not have been fed into the algorithm as a feature at all, but there might have been proxies for the gender attribute within the data set that caused the bias unintentionally. The algorithm used does not necessarily cause the bias; sometimes it's the correlated or proxy data that leads to it. (Even though we are NOT passing gender as a variable into the model, by looking at transaction history the model could bucket transaction behavior into two categories, one predominantly exhibited by men and the other by women.)

"Gender as an input variable could either be taken in or left out, explicitly. Alternatively, there could be other proxy variables that enabled the system to determine and group consumers into male and female."
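One way to surface such proxy variables is a simple correlation audit: hold the sensitive attribute out of the model entirely, but retain it for testing, and check how strongly each candidate input correlates with it. A minimal sketch follows; the audit data and the 0.7 flagging threshold are invented assumptions for illustration.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical audit data: gender is NOT a model input, but we hold it out
# separately to check whether any input feature acts as a proxy for it.
gender = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = male, 1 = female (held-out attribute)
spend_category = [0.1, 0.2, 0.1, 0.3, 0.8, 0.9, 0.7, 0.8]  # candidate proxy

r = pearson(gender, spend_category)
if abs(r) > 0.7:  # assumed audit threshold
    print(f"WARNING: feature correlates with gender (r={r:.2f}); review as a proxy")
```

A check like this would not prove causation, but it flags features that deserve human review before they are approved as model inputs.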

Inference drawn from this case to a larger context

The issue with the credit limit generated by a credit scoring model (a.k.a. data science models, machine learning models, AI/ML models) can be related to by a layperson, as it is a prominent decision faced by every credit cardholder. But there are many other data science models, ranging from a firm's risk/health checks to consumer marketing emails, product recommendations, and the images we see against a movie suggestion. These data science models, for good (or bad), are hyper-personalized to deliver a rich and contextualized experience to the user.

"Data Science Models, for good (or bad), are hyper-personalized to deliver a rich and contextualized experience to a user. While the algorithms are not inherently biased, what goes in and how we train them has a strong say in what comes out."

Irrespective of the AI/ML model, at a high level we analyze the raw data set, determine the key attributes needed, derive features from the raw data, finalize the input variables, decide on the algorithm, and finally build, train, test, and operationalize the model(s). While some AI/ML models are completely opaque as to how they arrive at a result, the key step of determining the input variables is completely driven by human users. As mentioned above, while the models can be black boxes, they are not black magic. What goes in has a strong say in what comes out.

Ways to institutionalize governance

Often, we arrive at a scenario where the input variable selection, or the data set used to build, train, test, and monitor the model, isn't black-and-white. Added to that, model operationalization has to be transparent and auditable at the nth level.

  • Review and approve all input variables: In the case of credit scoring, explicitly ensure that gender is NOT taken into account. Extending this principle, neither should ethnicity, religious inclination, political affiliation, etc. be taken into account; cite the appropriate regulations in each country of operation.
  • Caution while using unstructured/social data: While it's easy to ensure gender as an attribute is NOT used in the model, when extracting information from social media and unstructured data, details related to religious inclination or political affiliation may be taken into account unintentionally. While this may NOT be the case for a credit scoring model, models related to marketing and product recommendation could potentially use unstructured/social data, and the consequences could include reputational damage to the organization.
  • Using features for the intended reason: Say we are running a campaign for Women's Day and we want to target women with exclusive offers. Then we may have to allow gender to be used, but the scope of that variable should not exceed this particular use case/campaign. (In practice, we won't use the feature inside the model; we will filter the data even before serving it to the model.)
  • Need to obfuscate the exact value: Say your organization wants to take a risk by giving 15% more credit than the credit model's stated value to the younger generation, to increase their lifelong affinity to the organization. In this case, you need to know the age; but knowing the exact age may result in unfairly discriminating against the elderly. Instead, we can bucket consumers into three age groups – 18 to 24, 25 to 44, and 45 to 70. This ensures that the model rightly rewards the younger generation without unintentionally penalizing the elderly.
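The age-bucketing idea above can be sketched as follows. The bucket boundaries and the 15% uplift come from the scenario; the function names and the idea of applying the uplift outside the model are illustrative assumptions.

```python
def age_bucket(age: int) -> str:
    """Obfuscate the exact age into coarse, policy-approved buckets."""
    if 18 <= age <= 24:
        return "18-24"
    if 25 <= age <= 44:
        return "25-44"
    if 45 <= age <= 70:
        return "45-70"
    raise ValueError(f"age {age} outside supported range")

def adjusted_credit_limit(model_limit: float, age: int) -> float:
    """Apply the assumed 15% uplift for the youngest bucket only.

    The model itself never sees the exact age, only the bucket; the uplift
    is a post-model business rule, kept outside the trained model.
    """
    return model_limit * 1.15 if age_bucket(age) == "18-24" else model_limit
```

Keeping the uplift as an explicit post-model rule, rather than a learned behavior, also makes it easy to audit and to switch off.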

"With regard to input variable selection, the rules aren't black and white. Aspects related to permissible use, intended use, regulation, and reputational risk all go into it."

  • Nature of the training data set: The data set used to train the model is as important (if not more so) than the underlying algorithm itself. These models (deep learning for document scanning, fraud models to predict anomalies) need to be trained with data sets that represent both positive and negative scenarios. This is vital to ensure the right coverage in the training data set, without bias, because the model makes predictions based on what it is trained with.
  • Model retraining: Since we expect the world to change over time, model deployment should be treated as a continuous process. We need to retrain the models (continuous training and deployment) if we find that the predicted outputs have deviated significantly from those of the original (or newer) test set. This is referred to as model drift and is mitigated by monitoring the model output and comparing predicted vs. actual values. The data set used to train the model needs to be refreshed at regular intervals, which also calls for the model to be retrained to avoid model decay as much as possible.
  • Model correctness: Based on the nature of the model, there are various measures like confidence score, F1 score, RMSE, etc. that validate the correctness of the model by taking into account accuracy, precision, error rate, and so on. In layman's terms, they determine how far the model's prediction/classification/recommendation was from the actual outcome.
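The retraining and correctness points above can be combined into a small monitoring sketch: compute F1 from predicted vs. actual defaults, and flag drift when it falls well below a score recorded at deployment time. The baseline value and tolerance here are invented assumptions; real thresholds would come from the organization's model risk policy.

```python
def f1_score(actual, predicted):
    """F1 computed from binary actual vs. predicted labels (1 = default)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

BASELINE_F1 = 0.90     # assumed score recorded at initial deployment
DRIFT_TOLERANCE = 0.10  # assumed acceptable degradation before retraining

def needs_retraining(actual, predicted) -> bool:
    """Flag model drift when live F1 falls well below the deployment baseline."""
    return f1_score(actual, predicted) < BASELINE_F1 - DRIFT_TOLERANCE
```

In practice the actuals arrive with a delay (a default is only known months later), so a check like this would run on a rolling window of matured outcomes.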

"Beyond the model build, training, testing, and operationalizing the model are as important as the build itself."

Besides this rigor and governance, while selecting features we also have to take consumer consent and regulation into account. While we may have data related to consumers, consent from the individual to 'use that data for a particular reason' is essential. Many regulations, including GDPR and CCPA, put consumers at the center of managing the use of data related to an individual.

Ensuring end-to-end accountability

Establishing end-to-end lineage for auditing and reporting is essential to any system, let alone critical systems like credit scoring models. To build an efficient, legally compliant, insight-based technology platform, linking raw data to input variables, to the data science model, and to overall model management is essential. Below are the key steps to take into consideration while building such a governance protocol:

  • Understand the use case: Gain a clear understanding of the requirements, the business problem we are intending to solve, input variables, hypotheses, nature of the insights to be produced, the permissible usage of data and who has the authority to sign off on decisions.
  • Audit and track all data inputs: Define, tag, and version all raw and processed data (features) that produce the insights.
  • Audit the insight models: All trained model versions, including the training data sets used and any hyper-parameter values, have to be recorded.
  • Retrieve the insight output: This is the final output the model returns. Whether we are sending that output to a consumer or not, it has to be stored for compliance purposes. Wherever applicable, track the feedback from the end user (or the actuals) to gauge model performance, and observe whether the output (unintentionally) correlates with sensitive features like gender.
  • Monitor the end-to-end pipeline: Build a process to monitor, store, and analyze the end-to-end pipeline, providing complete traceability into the insight generation. As in the credit limit case, the raw data, features, model algorithm, and trained model versions together determined the credit limit. This also covers the continuous training and decommissioning of models.
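The steps above can be sketched as a single lineage record that ties the data version, hyper-parameters, input features, and output together in one auditable entry. All field names and the checksum scheme here are illustrative assumptions, not a prescribed format.

```python
import hashlib
import json
import time

def audit_record(model_name, model_version, training_data_id,
                 hyperparams, features, output, consumer_id):
    """Assemble one end-to-end lineage record: data -> features -> model -> output."""
    record = {
        "timestamp": time.time(),
        "model": model_name,
        "version": model_version,
        "training_data_id": training_data_id,  # tag/version of the data set used
        "hyperparameters": hyperparams,
        "input_features": features,
        "output": output,                      # stored even if never sent out
        "consumer_id": consumer_id,
    }
    # A content hash gives tamper evidence when records are stored long-term.
    payload = json.dumps(record, sort_keys=True)
    record["checksum"] = hashlib.sha256(payload.encode()).hexdigest()
    return record
```

With one such record written per scoring decision, an auditor can walk back from any credit limit to the exact model version and data that produced it.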

"We have to connect the links among the data stewards, data engineers, data scientists, and operational support personnel to bring comprehensive end-to-end visibility across the model pipeline."

In Summary

Technology and compliance go hand in hand, and it is important to have the best technology along with proper governance criteria in place to avoid discrimination and reputationally damaging issues. Along with the excitement of what these machine learning algorithms can do comes the governance: defining the business domains for all input attributes, the permissible algorithms that can be used, and tracing the data transformations of raw attributes, feature engineering jobs, model training, and model monitoring. This balance ensures that we can continue to apply advancements in deep learning, machine learning, and neural networks to regulatory, compliance, legal, and marketing matters, allowing organizations to improve their overall effectiveness and efficiency in satisfying regulators, mitigating risk, and preserving their reputations as ethical citizens and institutions.