When I first got into data science, I was blown away by how many hidden relationships you can uncover in data. I felt that I was making the world a better and more efficient place by optimising human behaviour. But there came a point when I started to face more and more moral dilemmas. Even after getting rid of all sensitive data, you still can't just sit back and enjoy the modelling. That's Kaggle; real-life problems come with real-life responsibilities. You might unleash algorithms that are good by data science standards but horrible on a social level. Here are some dilemmas that you are doomed to face sooner or later as a data scientist.
Feedback loops

When an algorithm is deployed in production, it inevitably creates a feedback loop.
For example, there are many models for predicting the probability of loan default. A customer who's considered too risky by such a model will be offered a loan with worse conditions than their less risky counterpart, which in turn raises their probability of default. This means that people like them will be considered even more unreliable by future models: a self-reinforcing feedback loop. Meanwhile, the lucky ones who are deemed trustworthy by the model will receive bonuses and discounts, and eventually become even less likely to default, so the same loop reinforces itself in the opposite direction.
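To make the loop concrete, here is a toy simulation in Python. Every number and function in it (the pricing rule, the risk uplift) is invented for illustration; the sketch only shows the direction of the effect, not any real scoring model:

```python
# Toy simulation of the credit-scoring feedback loop described above.
# All numbers are invented for illustration.

def default_probability(base_risk: float, interest_rate: float) -> float:
    """Higher rates make repayment harder, raising the true default risk."""
    return min(1.0, base_risk + 0.5 * interest_rate)

def offered_rate(perceived_risk: float) -> float:
    """The model prices riskier-looking customers with a higher rate."""
    return 0.05 + 0.20 * perceived_risk

# Two customers with the same true risk, perceived differently by the model.
perceived = {"flagged": 0.6, "trusted": 0.2}
true_base_risk = 0.10

for label, risk_score in perceived.items():
    rate = offered_rate(risk_score)
    p_default = default_probability(true_base_risk, rate)
    print(f"{label}: rate={rate:.0%}, default probability={p_default:.0%}")
```

The flagged customer ends up with a higher default probability purely because of the worse terms, and that outcome is exactly what the next model will be trained on.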
Biased proxy variables

Data usually doesn't measure exactly what we want it to measure, so we are forced to use proxy variables. And proxy variables are often biased. Should we use proxy features that are strongly correlated with features we would find unethical, sometimes even illegal, to include in any analysis?
For example, you might be tempted to improve your credit scoring model by including housing conditions. But in many parts of the world, housing conditions are strongly correlated with race. In fact, the correlation with race can be much stronger than the correlation with whether you've actually repaid your debts or not. Including this feature might improve the model's accuracy, but only at the cost of implicitly introducing a strong racial bias.
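One pragmatic defence is to audit every candidate feature against protected attributes before letting it anywhere near the model. A minimal sketch of such a check, with entirely synthetic data and an arbitrary correlation threshold:

```python
from math import sqrt

# Audit candidate features for proxy bias before adding them to a model.
# All data here is synthetic and purely illustrative.

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation, kept dependency-free for the sketch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

protected = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]            # e.g. a race indicator
housing_score = [0.9, 0.8, 0.2, 0.3, 0.7, 0.1, 0.8, 0.2, 0.3, 0.9]
repayment_history = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]    # actually repaid or not

THRESHOLD = 0.5  # arbitrary cut-off for this sketch

for name, feature in [("housing_score", housing_score),
                      ("repayment_history", repayment_history)]:
    r = pearson(protected, feature)
    verdict = "reject: proxy for protected attribute" if abs(r) > THRESHOLD else "keep"
    print(f"{name}: corr with protected = {r:+.2f} -> {verdict}")
```

In this invented data, the housing feature correlates far more strongly with the protected attribute than the repayment history does, so the audit rejects it even though it might boost accuracy.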
Accuracy versus interpretability
Black box models like XGBoost or neural networks are great in terms of accuracy, and they can discover hidden relationships across many dimensions. But being a black box means that if the model classifies you as a false positive, you can't really tell why that happened. You're basically collateral damage without any further explanation.
White box models like linear or logistic regression, on the other hand, offer excellent interpretability, but they are very bad at discovering hidden or nonlinear correlations.
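To see what interpretability buys you, here is a sketch with hypothetical logistic regression coefficients (all values invented): in a linear model, each feature's contribution to the log-odds is simply coefficient times value, so a rejected customer can be shown exactly which factor drove the decision.

```python
import math

# A white-box explanation, sketched with hypothetical coefficients.
# In logistic regression, each feature's contribution to the log-odds
# is coefficient * value, so a rejection can be decomposed feature by feature.
coefficients = {"income": 0.8, "debt_ratio": -1.5, "late_payments": -0.9}
intercept = 0.2

customer = {"income": 0.4, "debt_ratio": 0.9, "late_payments": 2.0}

log_odds = intercept
for feature, coef in coefficients.items():
    contribution = coef * customer[feature]
    log_odds += contribution
    print(f"{feature}: {contribution:+.2f}")

probability = 1 / (1 + math.exp(-log_odds))
print(f"approval probability: {probability:.2f}")
```

The per-feature printout is the explanation: this customer can be told that late payments and a high debt ratio, not income, pushed the decision. No comparable decomposition exists for a deep ensemble or a neural network without extra tooling.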
The essence of the moral dilemma is this: do you want to be fair and transparent or opaque and efficient?
Biased target KPIs
Usually, our goals are only vaguely defined (by business or marketing), so we need to decide on the target KPI that measures it. This target is often seriously biased, too.
For example, if you want to evaluate customer engagement for a bank's mobile app, you might define a KPI that measures the time a customer spends in the app. However, if you blindly follow this KPI, you could end up sticking with an app that has long loading times and frequently crashes, because both inflate the time spent in it. Customers will most probably not churn just because of the bank's mobile app, so you will see better and better numbers while the KPI measures the exact opposite of what it was supposed to measure.
Weapons of Math Destruction
These data science nightmare scenarios are aptly summarised in Cathy O’Neil’s book, Weapons of Math Destruction. She argues that blindly relying on data-based algorithms creates the risk of a society that entrenches existing inequalities and destroys unlucky people who just happen to be false positives. Being a data scientist herself, she calls the algorithms she has created WMDs. (You should read the book for the pun alone.)
For every company that uses data, there lies the question: how to deal with this complex problem? You can go fully transparent, but then your competitiveness will suffer. Or you can go fully efficient and build WMDs that exploit and eventually destroy your customer base.
Here’s a different perspective for you to consider: the tactical de-escalation of WMDs.
Who controls the data
Even though public trust in traditional banks has never been particularly high, they've been able to survive so far because there has been no real challenge to their monopolies. But with the rise of challenger banks like Revolut, TransferWise or Monzo, the game is changing rapidly. And traditional banks have a good chance of losing their competitive edge.
There is a very real possibility that traditional banks will lose access to all useful data about their customers. You will only see “€200 transferred to Revolut”, but you’ll have no idea what your customer needed that money for, which restaurants she likes or when she plans to go on a holiday. Instead, traditional banks will end up with all the “boring” data like utility payments and loan repayments. And after a while nothing but pension transfers.
Traditional banks still have an advantage in terms of the amount of data they have, but this advantage is melting away fast. We can already see this happening and we can safely say that this trend is only going to accelerate. Preventing this from happening is make or break for traditional banks. Regaining trust and improving customer engagement will become a means of survival.
Engaging customers through education
You can take the approach of not unleashing WMDs on your customers.
In the short term, this might mean some “missed opportunities”. For example, refraining from sending credit card offers to people who don’t actually need them. Or instead of exploiting customers in temporary financial straits and offering them payday loans, you could give them advice on how to avoid financial troubles and teach them the ins and outs of budget planning and keeping track of expenses.
Guiding your customers’ behaviour towards healthy and responsible financial decisions is one way of building trust. Plus, they might also become more resistant to the effects of the WMDs of your competitors who want to lure them away.
US Senator Elizabeth Warren worked for many years as a bankruptcy lawyer. In her book All Your Worth (co-authored with her daughter), she explains the importance of budget planning. Her main point is that a middle-class income doesn't automatically ensure a middle-class lifestyle. It did until the '80s, but not anymore. Responsible financial planning is a must to achieve and maintain that lifestyle.
Her advice is simple: plan your budget with one rule of thumb in mind. Separate your spending into three groups: needs (bills, loan installments, groceries etc.), wants (entertainment, hobbies, going out, holidays etc.) and savings (both for retirement and for unexpected events). Try to keep these categories in a 50-30-20 proportion, regardless of your income.
The numbers are specific to the US (think low taxes and low welfare services) but the core idea is global: planning and keeping a reasonable budget month after month ensures financial stability and a sustainable lifestyle.
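The 50-30-20 rule is simple enough to turn into a few lines of code. A minimal sketch in Python, with invented sample figures:

```python
# A minimal 50-30-20 budget check, following the rule of thumb above.
# The sample income and spending figures are invented for illustration.
TARGETS = {"need": 0.50, "want": 0.30, "save": 0.20}

def budget_check(income: float, spending: dict[str, float]) -> dict[str, float]:
    """Return how far each category's share of income deviates from its target."""
    return {cat: spending.get(cat, 0.0) / income - target
            for cat, target in TARGETS.items()}

monthly = {"need": 1900.0, "want": 1300.0, "save": 300.0}
for category, gap in budget_check(3500.0, monthly).items():
    status = "over" if gap > 0 else "within"
    print(f"{category}: {status} target by {abs(gap):.0%}")
```

For this sample customer, the check flags overspending on wants and undersaving: exactly the kind of concrete, actionable feedback the rule is meant to give.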
A good algorithm
Warren’s book was written based on personal counselling examples she had encountered as a bankruptcy lawyer. Unfortunately, you will not get any more personal advice from her these days as she’s too busy running for president.
But this is where a data scientist sees an opportunity to create a WMD of a different kind: automating the senator's book. We can build an automated budget planning algorithm that helps customers keep their 50-30-20 budget, find opportunities for saving, get rid of money-wasting habits and optimise spending patterns for a sustainable, long-term budget plan. We can identify, and sometimes even predict, anomalies that jeopardise the long-term balance of those plans, and identify banking products that the customer actually needs.
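One piece of such an algorithm, the anomaly flagging mentioned above, can be sketched very simply: flag a month when spending jumps more than two standard deviations above the customer's own history. The threshold and all the figures here are invented for illustration:

```python
from statistics import mean, stdev

# Sketch of the anomaly flagging mentioned above: mark a month as an
# anomaly when spending jumps more than two standard deviations above
# the customer's own history. All numbers are invented.
history = [1850.0, 1900.0, 1820.0, 1880.0, 1910.0, 1860.0]

def is_anomaly(new_amount: float, past: list[float], z: float = 2.0) -> bool:
    return new_amount > mean(past) + z * stdev(past)

print(is_anomaly(2600.0, history))  # a sudden jump in essential spending
print(is_anomaly(1905.0, history))  # within the usual range
```

A real system would need seasonality, category-level baselines and a lot more care, but the principle is the same: compare each customer against their own plan, not against what maximises product sales.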
We can deploy new kinds of WMDs that are fair and profitable at the same time. The only thing is that it’s not really fair to call these algorithms WMDs anymore. We’ll need to find a new, fitting name for our tamed, cute little algorithms.