Invoice Fraud Detection

Project Title: Invoice Fraud Detection

Sponsoring Company/Organization: Defense Finance and Accounting Service (DFAS)
Contracting Organization: Federal Data Systems (now Logicon Fed Data), Elder Research, Abbott Analytics
Technical Lead: Abbott Analytics

The Business Problem

The Defense Finance and Accounting Service (DFAS) is responsible for disbursing nearly all of the Department of Defense (DoD) funds. In its effort to be an outstanding steward of these funds, DFAS resolved to minimize fraud against DoD financial assets and chose data mining as one of its strategies to detect, and ultimately deter, fraud. The data mining models ranked the likelihood that each individual invoice was suspicious and should therefore receive further investigation by a human examiner. Because each flagged invoice went to a human examiner, false alarms also had to be minimized so that examiners were not unduly burdened with cases unlikely to be fraudulent.

The Analytics Problem

The data consisted of fields extracted from millions of invoices and stored in a database. Analysts then added a label to each record indicating whether the invoice was "fraudulent" or "not fraudulent" (also called "unlabelled"). Fraudulent invoices were labeled only after they had been investigated by examiners, prosecuted, and judged to be fraudulent. The vast majority of invoices, however, had never been examined, and were assumed not to be fraudulent.

Several problems confronted the analysts. First, because so few records were labeled "fraudulent," the analysts could not perform the standard split of the data into training, testing, and validation sets. Second, invoices that were unlabelled were not necessarily cleared of suspicion. Although the vast majority of these were likely not fraudulent, some of them were, and they would pollute the labels if a large sample of the data was taken. Third, the known fraudulent transactions did not all represent the same fraud scheme; they belonged to several different types of schemes, and labeling all of them the same (as "fraud") could therefore confuse modeling algorithms.

The Approach

For the data mining process, DFAS extracted transactions for several adjudicated fraud cases and used source documents to recreate other transactions as they would have appeared in the database. These known fraudulent transactions were added to a moderate number of unlabelled transactions, the vast majority of which were not fraudulent. By limiting the sample to only thousands of these unlabelled transactions, the analysts kept the likelihood of selecting a fraudulent transaction and mistakenly calling it non-fraudulent relatively small. This combined data set was split into 11 data subsets via cross-validation sampling. Finally, instead of trying to predict an outcome of merely "fraud" or "not fraud," examiners and analysts were interviewed to provide expert insight into how many types of fraud they had seen. As a result of the interviews, a five-level target variable was created containing four distinct fraud types and a non-fraud type.
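The splitting step described above can be illustrated with a minimal Python sketch. The source does not say how the cross-validation sampling was carried out, so the stratified round-robin assignment, the class names, and the toy record counts below are all assumptions made for illustration only.

```python
import random
from collections import defaultdict

# Hypothetical class names: the source states only that the target had four
# fraud types and one non-fraud type.
CLASSES = ["non_fraud", "fraud_type_1", "fraud_type_2",
           "fraud_type_3", "fraud_type_4"]


def split_into_subsets(labels, n_subsets=11, seed=0):
    """Assign each record index to one of n_subsets, stratified by label.

    Returns a list of n_subsets lists of record indices.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    subsets = [[] for _ in range(n_subsets)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class
        # stopped, so that subset sizes stay balanced overall.
        for i, idx in enumerate(indices):
            subsets[(offset + i) % n_subsets].append(idx)
        offset += len(indices)
    return subsets


# Toy data (made-up counts): many unlabelled records, few known fraud records.
labels = ["non_fraud"] * 2200 + [c for c in CLASSES[1:] for _ in range(22)]
subsets = split_into_subsets(labels)
```

Stratifying by class is one way to make sure the scarce fraud records are spread across all 11 subsets rather than clustering in a few of them by chance.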

Results and Delivery

There is strength in diversity. This adage is supported both by the literature and by the results of this project. Here, ensemble diversity had two main sources: sample diversity and algorithm diversity. Eleven random samples of the data were used in building models, and each of the six independent analysts was given full freedom to use any algorithm on the random samples assigned to them. The best 11 models are shown in the figure below, where "best" was defined as those models achieving the highest model accuracy while keeping false alarms low.
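To make the trade-off between detection and false alarms concrete, here is a minimal Python sketch that computes sensitivity and a false alarm rate for a single model. The source does not give its exact metric definitions, so this uses one common convention (false alarms as a fraction of all non-fraudulent invoices); the function name and the toy data are hypothetical.

```python
def sensitivity_and_false_alarm_rate(actual, flagged):
    """Return (sensitivity, false_alarm_rate) for binary outcomes.

    actual:  list of bools, True if the invoice is known to be fraudulent.
    flagged: list of bools, True if the model flagged the invoice.
    """
    tp = fn = fp = tn = 0
    for is_fraud, is_flagged in zip(actual, flagged):
        if is_fraud and is_flagged:
            tp += 1   # fraud correctly flagged
        elif is_fraud:
            fn += 1   # fraud missed
        elif is_flagged:
            fp += 1   # false alarm
        else:
            tn += 1   # non-fraud correctly passed over
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return sensitivity, false_alarm_rate


# Toy data: 4 known fraud invoices and 6 non-fraud invoices.
actual = [True, True, True, True, False, False, False, False, False, False]
flagged = [True, True, True, False, True, False, False, False, False, False]
sens, far = sensitivity_and_false_alarm_rate(actual, flagged)
```

In this toy case the model catches 3 of the 4 fraud invoices and raises 1 false alarm among the 6 non-fraud invoices.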

The figure below shows the relative performance of the 11 models and an additional “ensemble” model that combined the predictions of the other 11 into a final decision (actual sensitivity and false alarm rates of the models are not shown to protect the confidentiality of the models). The ensemble performed better than any of the individual models, and because of this, we avoided the necessity of “handcrafting” models via multiple iterations of the model building process.
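The source does not describe how the ensemble combined the 11 models' predictions. As one simple possibility, the Python sketch below averages per-invoice suspicion scores across models and ranks invoices by the combined score. The averaging rule, the function names, and the scores are assumptions for illustration, not the project's actual method.

```python
from statistics import mean


def ensemble_scores(model_scores):
    """Average per-invoice suspicion scores across models.

    model_scores: one list of scores per model, aligned by invoice.
    """
    n_invoices = len(model_scores[0])
    if any(len(scores) != n_invoices for scores in model_scores):
        raise ValueError("every model must score the same invoices")
    return [mean(column) for column in zip(*model_scores)]


def rank_invoices(scores):
    """Return invoice indices ordered from most to least suspicious."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy scores from three hypothetical models for three invoices.
model_scores = [
    [0.9, 0.2, 0.4],
    [0.7, 0.1, 0.6],
    [0.8, 0.3, 0.2],
]
combined = ensemble_scores(model_scores)
ranking = rank_invoices(combined)
```

Averaging tends to smooth out the idiosyncratic errors of individual models, which is one reason a combined ranking can outperform each of its members.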

In addition, the ensemble detected 97% of known fraud cases in the validation data, and as a result of the models, 1,217 payments were selected for further investigation by DFAS.