When you’re analysing huge volumes of data, an algorithm that’s 99% accurate may not be sufficient. We use Bayes’ theorem to explain why you need better algorithms and better-quality data to accurately pinpoint threats and opportunities.
In this article, I explain why data quality and data lineage are so important for anybody looking for needles in haystacks using advanced analytics or Big Data techniques. This is especially relevant when you’re trawling through large datasets looking for a rarely occurring phenomenon – and when the impact of getting it wrong could be substantial. Ultimately, it comes down to this: the error rate in your end result needs to be smaller than the proportion of the population you’re looking for. If it isn’t, your results will contain more errors than valid results. Bayes’ theorem is the key to understanding why.
By: Mark Humphries, IPL, Mark.Humphries@ipl.com
I discovered this phenomenon last year when I applied business analytics to a problem where I felt the approach would yield real results. The idea was to predict a certain kind of (undesirable) customer behaviour as early as possible, and then try to prevent it before it happened. While the predictive models that came out of this exercise were good, they were ultimately unusable, because the number of false positives was unacceptably high. What was most surprising was that the false positives outnumbered the true positives.
In other words, when the algorithm predicted the undesirable outcome we were looking for, the chances were that it was wrong… and acting on its predictions would effectively have meant unacceptable discrimination against customers. I was surprised and disappointed: this was a good algorithm with a good lift factor, yet it was ultimately unusable.

Bayes’ theorem

At the same time, I was reading a book called Super Crunchers, by Ian Ayres (which I thoroughly recommend, especially if you’re looking for a readable account of what can and can’t be done through number-crunching).
Towards the end of the book is an explanation of Bayes’ theorem and how to apply it when interpreting any kind of predictive algorithm or test. I learnt Bayes’ theorem at school, and this chapter was a really useful reminder. The theorem enables you to infer a hidden probability from known, measured probabilities. When I applied it to the problem described above, it made a lot of sense.

How 99% accuracy can lead to a 50% error rate

What I learnt was this: if you’re applying an algorithm to a large population with the aim of finding something that only affects a small minority of that population, you need a good idea of how reliable your end result is in order to work out how many of its predictions are wrong.
The end result is highly dependent on the quality of the data you start with, the quality of the analytics algorithms you use, and any intermediate processing you do. What’s more, if the error rate is about the same as the size of the population you’re looking for, then around half of the predictions will be false. So, if you’re looking for something that affects 1% of your population and you have an algorithm that is 99% accurate, then half of its predictions will be wrong.
To demonstrate this, I’ll use a hypothetical credit screening scenario. Imagine that a bank has a creditworthiness screening test that is 99% accurate, and it applies it to everyone who applies for loans. The bank knows from experience that 1% of its customers will default. The question then is, of those customers identified as at risk of default, how many will actually default? This is exactly the sort of problem that Bayes’ theorem answers.
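To make that calculation concrete, here is a minimal sketch in Python (the function and variable names are my own, and it assumes the simple two-outcome model described above, where the test is equally accurate for defaulters and non-defaulters):

def p_default_given_flagged(accuracy, default_rate):
    """P(customer defaults | flagged as a risk), via Bayes' theorem."""
    flagged_and_defaults  = default_rate * accuracy              # true positives
    flagged_but_pays_back = (1 - default_rate) * (1 - accuracy)  # false positives
    return flagged_and_defaults / (flagged_and_defaults + flagged_but_pays_back)

print(p_default_given_flagged(accuracy=0.99, default_rate=0.01))  # 0.5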
Let’s see how this would work for screening 10,000 applicants. Of the 10,000 customers, 100 will default on their loan. The test is 99% accurate, so it will make one mistake in this group: one customer will pass the screening and still default later. The other 99 will receive a correct result and be identified as future defaulters. This would look pretty attractive to anyone trying to reduce default rates. Of those 10,000 applicants, 9,900 will not default. The test is 99% accurate, so 99 of those customers will wrongly be identified as credit risks.
This would look unattractive to anyone who is paid a commission on selling loans to customers. So, we have a total of 198 customers identified as credit risks, of which 99 will default and 99 will not. So in this case, if you are identified as a credit risk, then you still have a 50% chance of being a good payer… and that’s with a test that is 99% accurate. Incidentally, the chances of a customer passing the credit check and then defaulting are now down to 1 in 10,000, as the table in Figure 1 shows.
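The same arithmetic can be written out with whole customers rather than probabilities; the short sketch below simply reproduces the counts from the scenario above (again assuming the test is equally accurate for both groups):

applicants   = 10_000
default_rate = 0.01   # 1% of customers default
accuracy     = 0.99   # the screening test is 99% accurate

defaulters = round(applicants * default_rate)      # 100 future defaulters
payers     = applicants - defaulters               # 9,900 good payers

caught       = round(defaulters * accuracy)        # 99 defaulters correctly flagged
missed       = defaulters - caught                 # 1 passes the screening and defaults anyway
false_alarms = round(payers * (1 - accuracy))      # 99 good payers wrongly flagged

flagged = caught + false_alarms                    # 198 customers flagged as credit risks
print(caught / flagged)      # 0.5: half of the flagged customers would have paid
print(missed / applicants)   # 0.0001: 1 in 10,000 passes the check and still defaults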
This logic is valid in any situation where you’re looking for needles in haystacks, and this is the kind of thing people are doing today under the banner of Big Data: trawling through large volumes of data to find hidden gems or potential threats. Other examples would include mass screening of DNA samples to find people susceptible to cancer or heart disease, or trawling through emails looking for evidence of criminal behaviour.
Now, in the example I gave, I deliberately used clean numbers, where the size of the minority we’re looking for (1%) was equal to the error rate. In reality, the two are unlikely to be equal. Figure 1 shows how the rate of false positives varies as the size of the minority and the overall error rate change. What this graph shows very clearly is that, to be useful, your prediction algorithm needs to generate fewer errors in total than the size of the target population you’re trying to find.
Furthermore, if you’re looking for small populations in large datasets, you need to know how reliable your prediction is, and that is highly dependent on the reliability of the data you’re starting with. If you’re looking for 1 in 1,000, then 99% accuracy isn’t good enough, because around 90% of your results will be wrong.
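To see how quickly the picture deteriorates as the target population shrinks, here is a short sketch that sweeps the prevalence while holding the accuracy at 99% (the same simple model as before; the exact figure behind the “around 90%” above is roughly 91%):

accuracy = 0.99
for prevalence in (0.05, 0.01, 0.001):
    true_pos  = prevalence * accuracy
    false_pos = (1 - prevalence) * (1 - accuracy)
    wrong = false_pos / (true_pos + false_pos)
    print(f"{prevalence:.1%} of population: {wrong:.0%} of positive results are wrong")

# 5.0% of population: 16% of positive results are wrong
# 1.0% of population: 50% of positive results are wrong
# 0.1% of population: 91% of positive results are wrong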
What makes a good predictive model?
So let’s imagine we want to apply Big Data techniques to find some needles in a large haystack of data, and we’re expecting these to occur at around 0.1% of our total population. Bayes’ theorem tells us that to be useful, our predictive algorithm needs to be more than 99.9% accurate. There are two ways of improving the accuracy: by using the best possible algorithm, or by using the best possible data available.
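A quick check of that claim, using the same simple model as the earlier sketches: with a prevalence of 0.1%, the false positives stop outnumbering the true positives only once the accuracy passes 99.9%.

prevalence = 0.001   # the needles make up 0.1% of the haystack
for accuracy in (0.99, 0.999, 0.9999):
    true_pos  = prevalence * accuracy
    false_pos = (1 - prevalence) * (1 - accuracy)
    print(f"accuracy {accuracy:.2%}: {false_pos / (true_pos + false_pos):.0%} of positives are false")

# accuracy 99.00%: 91% of positives are false
# accuracy 99.90%: 50% of positives are false
# accuracy 99.99%: 9% of positives are false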
There’s been a lot of work done on algorithms, and nearly all of it is readily available. Market-leading analytical software is affordable, so if you really are looking for needles in haystacks and there is value in finding them, the tools are available to you. What’s less obvious is the quality of the input data: while algorithms are reproducible, good data isn’t.
Each new problem that someone needs to tackle requires its own dataset. Each organisation has its own data for solving its own problems, and just because one bank has good data doesn’t mean that all banks do.
Conclusion
The chances are that if you’re looking for needles in haystacks, what you’re finding aren’t the needles you were looking for at all. If you haven’t assessed the reliability of your predictive model, you may even be wildly over-confident in the results. You can invest in better algorithms, but if you really want better results, you’ll probably only get them by using better-quality data.
Mark will be presenting the following session at the Enterprise Data and BI Conference Europe 2014, 3-5 November, London: Big Data: Hidden Gems or Fool’s Gold?
About the Author
Mark Humphries is a Senior Business Consultant for IPL who has applied Business Analytics to complex problems facing energy suppliers in a highly competitive market. Mark has also been a Data Manager and an Operations Manager for a European energy supplier, so he knows the value of actionable, reliable data.