ML_AccountingFraud

This code repository is for a project that utilises machine learning techniques to detect accounting fraud.

Warning

This repo does not contain the scripts used in the final version of the article. See Wiki for more info.

Dataset:

The data is a merged table built from COMPUSTAT (via WRDS) and the SEC AAER Dataset (available here). The terms of access of COMPUSTAT and the AAER Dataset apply, so we are prohibited from disclosing the original dataset. All GVKEYs in this repository are randomized and must not be treated as a primary source of data.

Replication:

In a terminal, run the following commands to install the required dependencies:

conda install conda-build
# create env (reads environment.yml in current directory)
conda env create -f environment.yml
# activate now
conda activate ml-fraud
# to make the package accessible you can uncomment below
# conda develop /path/to/your/package

In a Python environment run the following to load the module:

from MLFraud_module import ML_Fraud as mf

Define your settings as:

a = mf(PARAMETERS), where PARAMETERS (shown with their default values) are:

sample_start = 1991: calendar year in which the sample starts;

test_sample = range(2001, 2011): testing (out-of-sample) period, i.e. calendar years 2001 to 2010;

OOS_per = 1: length of each out-of-sample rolling period, in years;

OOS_gap = 0: gap between the training and testing samples, in years;

sampling = "expanding": sampling style, either "expanding" or "rolling";

adjust_serial = True: boolean; whether to adjust for serial frauds;

cv_flag = False: boolean; whether to replicate the cross-validation;

cv_k = 10: number of folds (k) in the cross-validation;

write = True: boolean; whether to write the results to CSV files; and

IS_per = 10: number of calendar years in the training sample when a rolling training sample is used.
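To make the interaction between these settings concrete, the sketch below derives training and testing year windows from them. It is only one plausible reading of the parameter descriptions above, not the module's actual implementation: the helper name year_windows and the exact window arithmetic (training ends OOS_gap + 1 years before the first test year; a rolling window spans IS_per years) are assumptions made for illustration.

```python
def year_windows(sample_start=1991, test_sample=range(2001, 2011),
                 OOS_per=1, OOS_gap=0, sampling="expanding", IS_per=10):
    """Return a list of (train_years, test_years) pairs.

    Hypothetical helper: an illustrative interpretation of the settings,
    not code taken from MLFraud_module.
    """
    if sampling not in ("expanding", "rolling"):
        raise ValueError('sampling must be "expanding" or "rolling"')

    test_years = list(test_sample)
    windows = []
    # Step through the test period in blocks of OOS_per years.
    for i in range(0, len(test_years), OOS_per):
        test_block = test_years[i:i + OOS_per]
        # Assumed: training stops OOS_gap years before the test block begins.
        train_end = test_block[0] - OOS_gap - 1
        if sampling == "expanding":
            # Expanding: always start from the first sample year.
            train_start = sample_start
        else:
            # Rolling: keep only the most recent IS_per years.
            train_start = max(sample_start, train_end - IS_per + 1)
        windows.append((range(train_start, train_end + 1), test_block))
    return windows


expanding = year_windows()
rolling = year_windows(sampling="rolling")

for train, test in expanding[:2]:
    print(f"expanding: train {train.start}-{train.stop - 1}, test {test}")
for train, test in rolling[-2:]:
    print(f"rolling:   train {train.start}-{train.stop - 1}, test {test}")
```

Under these assumptions, the default settings give ten one-year test windows; with "expanding" every training window starts in 1991, while with "rolling" the training window slides forward so that it never covers more than IS_per years.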

Choose any of the following methods:

a.sumstats(): to generate summary statistics and compute variance inflation factors;

a.mc_analysis(adjust_serial=True): to conduct a Monte Carlo simulation that evaluates the impact of serial frauds; adjust_serial can be True, False, or "biased";

a.analyse_ratio(): to generate classification forecasts based on 11 financial ratios;

a.analyse_raw(): to generate classification forecasts based on 28 raw financial figures as in Bao et al. (2020);

a.analyse_fk(): to generate classification forecasts based on 23 raw financial figures as in Cecchini et al. (2010);

a.analyse_forward(): to generate forward-looking performance results;

a.compare_ada(): to compare LogitBoost with AdaBoost, as in the appendix; and

a.compare_logit(): to compare Logit with the Rare Event Logit, as in the appendix.
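Putting the steps together, a session might look like the sketch below. It assumes that the constructor accepts the settings as keyword arguments under the names listed earlier, which the parameter list suggests but does not confirm; the settings dictionary is purely illustrative. The import is guarded so the snippet still runs where the package is not on the Python path.

```python
# Settings mirroring the defaults documented in this README.
settings = dict(
    sample_start=1991,
    test_sample=range(2001, 2011),  # calendar years 2001 to 2010
    OOS_per=1,
    OOS_gap=0,
    sampling="expanding",
    adjust_serial=True,
    cv_flag=False,
    cv_k=10,
    write=True,
    IS_per=10,
)

try:
    from MLFraud_module import ML_Fraud as mf
except ImportError:
    mf = None  # package not installed or not on the path

if mf is not None:
    # Assumption: the constructor takes the settings as keyword arguments.
    a = mf(**settings)
    a.sumstats()       # summary statistics and variance inflation factors
    a.analyse_ratio()  # forecasts based on 11 financial ratios
else:
    print("MLFraud_module not found; activate the ml-fraud environment first.")
```

With write=True the results are written to CSV files, so run the methods from a directory where creating files is acceptable.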

Third party resources:

These scripts use the freely available Python packages NumPy, pandas, statsmodels, Matplotlib, scikit-learn, and imbalanced-learn.
The script extra_codes.py contains the class relogit, which is based on the Rare Event Logistic Regression of King and Zeng (2001).
The script extra_codes.py also contains a function that was adapted from the Bao et al. (2020) code repository.
