Use This Tool to Look for Fraud in Big Data
The April 2017 Journal of Accountancy had an interesting article regarding the use of Benford’s Law to spot fraud. I experimented with real-life data and determined that Benford's Law can be used to quickly scan huge datasets for potential fraud indicators. The underlying idea is that natural patterns are everywhere, even in data that might appear, at first blush, to be random.
In the 1930s, engineer and physicist Frank Benford was looking at a book of logarithmic tables and noticed that the early pages were more tattered than the later ones, suggesting that numbers beginning with low digits were looked up more often. Over time, Benford studied many other datasets and observed the same pattern: when evaluating the first (left-most) digit of each number in a column, ‘1’ appeared most frequently and ‘9’ appeared least frequently. The first digit is generally distributed in a pattern that resembles a logarithmic scale.
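According to Benford’s Law, the expected share of numbers whose first (left-most) digit is d equals log10(1 + 1/d). That works out to roughly 30.1% for 1, 17.6% for 2, 12.5% for 3, 9.7% for 4, 7.9% for 5, 6.7% for 6, 5.8% for 7, 5.1% for 8, and 4.6% for 9.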
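For readers who want to reproduce these expected proportions, here is a minimal Python sketch. It uses the standard statement of Benford's Law, P(d) = log10(1 + 1/d); the function name benford_expected is only illustrative and does not come from the Journal of Accountancy article.

```python
import math


def benford_expected():
    """Return {digit: expected proportion} for first digits 1 through 9."""
    # Standard Benford formula: P(d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


# Print the expected share of each leading digit as a percentage.
for digit, share in benford_expected().items():
    print(f"{digit}: {share:.1%}")
```

The nine proportions sum to 1, which is a quick sanity check that the formula was typed correctly.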
How Can Benford’s Law Spot Potential Fraud?
If a dataset is real (naturally generated), and if the dataset spans multiple orders of magnitude (e.g., it is not a narrow range of numbers between, say, 1 and 5), then Benford’s Law will likely apply. It is a useful tool for either Big Data or an organization’s smaller datasets. The process of evaluating a dataset is straightforward:
The first (left-most) non-zero digit of each number in a column (e.g. transaction amount) is stripped off and placed into a new column; the other digits in each number are discarded
Using the newly created column of numbers (each number is now only 1 digit), count each occurrence of 1, 2, 3, 4, 5, 6, 7, 8, and 9
Plot the results and compare them to Benford’s expected distribution
If the result differs significantly from Benford’s expected distribution, then raise a red flag
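The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Excel method from the Journal of Accountancy article: the sample amounts are hypothetical, the helper names are my own, and the mean absolute deviation is just one possible way to summarize how far the observed digits sit from the expected ones.

```python
import math
from collections import Counter

# Expected first-digit shares under Benford's Law: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the first non-zero digit of a number, or None if there is none."""
    # Scanning the text form skips signs, leading zeros, and decimal points.
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_shares(values):
    """Count leading digits 1-9 and return each digit's observed share."""
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        raise ValueError("no non-zero values to analyze")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def mean_abs_deviation(shares):
    """Average absolute gap between observed and expected shares."""
    return sum(abs(shares[d] - BENFORD[d]) for d in range(1, 10)) / 9


# Hypothetical transaction amounts, far too few for a real test.
amounts = [12.50, 149.99, 0.89, 1020.00, 23.10, 7.45, 310.00, 18.75]
shares = first_digit_shares(amounts)
for d in range(1, 10):
    print(f"{d}: observed {shares[d]:.1%}, expected {BENFORD[d]:.1%}")
print(f"mean absolute deviation: {mean_abs_deviation(shares):.4f}")
```

How large a deviation counts as "significantly different" is a judgment call that depends on the dataset and its size, so any cutoff should be chosen with care.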
It is unlikely that someone committing fraud will have the awareness, or make the effort, to ensure that manufactured data follows naturally occurring patterns.
Example of Benford's Law Using Real Data
Credit Card Transactions
I obtained a dataset containing 282,982 European credit card transactions from a two-day period in 2013 from Kaggle.com (personal information was not included in the dataset). The evaluation process outlined above was applied to the purchase amount column of the dataset with the following result:
The real credit card transactions, illustrated above, closely resemble what would be expected according to Benford’s Law.
But what if I manufacture data using Excel’s random number function to generate 282,982 credit card purchase amounts? Here is that result: the fake data looks nothing like Benford's Law!
A red flag should be raised and the data should be reviewed with a skeptical eye.
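A similar experiment can be approximated in Python. The sketch below is not the original Excel workbook: it assumes the fake amounts are drawn uniformly between 0.01 and 1,000.00 (the range is my assumption), uses a fixed seed so the run is repeatable, and reuses the same illustrative first-digit helper idea described earlier.

```python
import math
import random
from collections import Counter

# Expected first-digit shares under Benford's Law: P(d) = log10(1 + 1/d)
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the first non-zero digit of a number, or None if there is none."""
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_shares(values):
    """Return the observed share of each leading digit 1-9."""
    digits = [d for d in map(first_digit, values) if d is not None]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


# Uniform random "purchase amounts"; the range and seed are assumptions.
rng = random.Random(2017)
fake_amounts = [round(rng.uniform(0.01, 1000.00), 2) for _ in range(282_982)]

fake_shares = first_digit_shares(fake_amounts)
for d in range(1, 10):
    print(f"{d}: fake {fake_shares[d]:.1%}, Benford {BENFORD[d]:.1%}")

# Average absolute gap between the fake data and Benford's expectation.
mad = sum(abs(fake_shares[d] - BENFORD[d]) for d in range(1, 10)) / 9
print(f"mean absolute deviation: {mad:.4f}")
```

Because uniformly generated amounts give each leading digit about the same share (close to one in nine), the digit 1 appears far less often than the roughly 30% that Benford's Law predicts.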
This technique has many potential applications including:
Part of the validation process for automated data feeds
Analytical review of large balance sheet accounts such as accounts receivable
Reviewing the trial balance of an acquisition target
Examining a database of insurance claims
Evaluating any number of Big Data metrics, including data that you may have bought for research or sales analysis
The list of applications is essentially endless. The technique applies to all industries and to both financial and operational metrics. It is important to use this tool carefully because false positives may occur, or certain datasets may not yield meaningful results. Add this to your toolbox and see what you can discover.
Collins, C., CPA. (April, 2017). Using Excel and Benford's Law to Detect Fraud. Journal of Accountancy, 44-50.
Dal Pozzolo, A., Caelen, O., Johnson, R., & Bontempi, G. (2015). Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM). Retrieved June 08, 2017, from www.kaggle.com.