Data Drift Analysis Tool - Instructions

1. Prerequisites

Before using this tool, ensure you have:

AWS account with appropriate permissions
S3 bucket set up
SNS topic configured for notifications
Python 3.8+ environment

2. S3 Bucket Setup

Create the following folder structure in your S3 bucket: s3://///

your-bucket/
├── config/
│   └── config.json
├── data/
│   ├── reference/
│   │   └── reference_dataset.csv
│   ├── predictions/
│   └── current_dataset.csv
└── results/
    └── feature_analysis.csv
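
As a quick sanity check of the layout above, the sketch below compares a list of object keys against the input files the tree shows. It assumes that `results/feature_analysis.csv` is written by the tool and so does not need to exist beforehand. `EXPECTED_KEYS` and `missing_keys` are hypothetical names used only for illustration. The key listing would typically come from an S3 listing call such as boto3's `list_objects_v2`; here it is hard-coded so the example runs without AWS access.

```python
from typing import Iterable, List

# Input objects taken from the folder tree above (keys are relative to the bucket).
# Assumption: results/feature_analysis.csv is produced by the tool, so it is not required up front.
EXPECTED_KEYS = [
    "config/config.json",
    "data/reference/reference_dataset.csv",
    "data/current_dataset.csv",
]


def missing_keys(existing_keys: Iterable[str],
                 expected: Iterable[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected object keys that are absent from a bucket listing."""
    present = set(existing_keys)
    return [key for key in expected if key not in present]


# Hard-coded listing standing in for a real S3 listing.
listing = ["config/config.json", "data/current_dataset.csv"]
print(missing_keys(listing))  # ['data/reference/reference_dataset.csv']
```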

3. Data Drift Analysis

The tool automatically:

Loads reference and current datasets
Performs drift detection tests
Calculates feature importance
Generates detailed reports
Sends email notifications

4. Available Drift Tests

You can assign any of the following tests to individual columns in config.json:

'ks' (Kolmogorov-Smirnov test)
'chisquare' (Chi-square test)
'wasserstein' (Wasserstein distance)
'jensen_shannon' (Jensen-Shannon divergence)
'psi' (Population Stability Index)
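
To give a feel for what two of these tests measure, here is a minimal pure-Python sketch of the Population Stability Index and the two-sample Kolmogorov-Smirnov statistic. It is an illustration under stated assumptions, not the tool's own implementation, which is not shown in this document. The function names, the equal-width binning, the default of 10 bins, and the small floor for empty bins are all choices made for this example.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when the reference is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        eps = 1e-6  # floor so that empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    # PSI = sum over bins of (current - reference) * ln(current / reference)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


# Synthetic data for illustration only.
reference = [float(i) for i in range(100)]
shifted = [v + 30 for v in reference]

print("PSI, identical data:", psi(reference, reference))
print("PSI, shifted data:  ", psi(reference, shifted))
print("KS,  shifted data:  ", ks_statistic(reference, shifted))
```

Both functions return a score only. How a score is compared with a threshold to set the drift flag depends on the test and on your configuration.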

5. Output Format

The tool generates two types of output:

5.1 CSV Results File

A feature_analysis.csv file containing:

Feature name
Timestamp
Feature type
Month
Feature importance method
Feature importance score
Drift test used
Drift test score
Drift test threshold
Drift test p-value
Drift detected flag
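
The sketch below shows one way a row with these fields could be written using Python's `csv` module. The header names in `FIELDNAMES` are hypothetical, since the exact column names in feature_analysis.csv are not given here, and the values in the example row are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable

# Hypothetical header names, one per field listed above.
FIELDNAMES = [
    "feature_name", "timestamp", "feature_type", "month",
    "importance_method", "importance_score",
    "drift_test", "drift_score", "drift_threshold", "drift_p_value",
    "drift_detected",
]


def rows_to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize result rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue()


# Illustrative values only; they are not real results.
example_row = {
    "feature_name": "age",
    "timestamp": "2024-11-30T00:00:00Z",
    "feature_type": "numerical",
    "month": "Nov",
    "importance_method": "example_method",
    "importance_score": 0.12,
    "drift_test": "ks",
    "drift_score": 0.31,
    "drift_threshold": 0.05,
    "drift_p_value": 0.002,
    "drift_detected": True,
}

output = rows_to_csv([example_row])
print(output)
```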

5.2 Email Notifications

You'll receive formatted ASCII tables showing:

Overall summary statistics
List of drifted columns with scores
List of non-drifted columns
Feature importance rankings
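
For reference, a plain-text table of this kind can be produced with a few lines of Python. This is a sketch of the general technique; the tool's actual table layout is not documented here, and `ascii_table` and the sample values are hypothetical.

```python
from typing import Sequence


def ascii_table(headers: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render headers and rows as a bordered, left-aligned ASCII table."""
    cells = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]
    rule = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def fmt(row: Sequence[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    lines = [rule, fmt(cells[0]), rule]
    lines += [fmt(row) for row in cells[1:]]
    lines.append(rule)
    return "\n".join(lines)


# Illustrative values only.
table = ascii_table(["Column", "Score"], [["age", 0.31]])
print(table)
```

The resulting string could then be used as the message body when publishing to the SNS topic, for example with boto3's `publish` call on an SNS client.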

8. Limitations

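Lambda timeout constraints: a single invocation can run for at most 15 minutes, so very large datasets may not finish in one run
Memory limitations: available memory depends on the Lambda function's configured memory size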
S3 bucket permissions must be properly configured
SNS topic must have appropriate access policies
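
One common way to work within the Lambda timeout is to check the remaining time before each unit of work and stop early if too little is left. The sketch below assumes the analysis can be split per column, which this document does not confirm. `get_remaining_time_in_millis()` is the method exposed by the Lambda context object; `process_with_deadline`, `FakeContext`, and the 10-second safety margin are hypothetical and exist only so the example runs locally.

```python
from typing import Callable, List, Sequence, Tuple


def process_with_deadline(columns: Sequence[str],
                          analyze: Callable[[str], None],
                          context,
                          safety_margin_ms: int = 10_000) -> Tuple[List[str], List[str]]:
    """Analyze columns until the remaining Lambda time drops to the safety margin."""
    done: List[str] = []
    pending = list(columns)
    while pending and context.get_remaining_time_in_millis() > safety_margin_ms:
        column = pending.pop(0)
        analyze(column)
        done.append(column)
    return done, pending


class FakeContext:
    """Local stand-in for the Lambda context; returns a scripted time budget."""

    def __init__(self, budgets_ms):
        self._budgets = iter(budgets_ms)

    def get_remaining_time_in_millis(self):
        return next(self._budgets)


analyzed: List[str] = []
done, pending = process_with_deadline(
    ["age", "income", "score"],
    analyzed.append,                       # placeholder for the real per-column analysis
    FakeContext([60_000, 30_000, 5_000]),  # third check falls below the margin
)
print(done, pending)  # ['age', 'income'] ['score']
```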

9. Troubleshooting

If you encounter issues:

Check CloudWatch logs for detailed error messages
Verify S3 bucket permissions
Ensure SNS topic ARN is correct
Validate config.json format
Check that the data types in the CSV files match the configuration
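
A small script can catch two of the problems above before deployment: config.json that is not valid JSON, and a test name that is not one of the five supported values. The schema of config.json is not described in this document, so the layout assumed here (a top-level `columns` object mapping each column name to an object with a `test` key) is hypothetical; adapt the checks to your actual file.

```python
import json
from typing import List

# Test names from the "Available Drift Tests" section.
SUPPORTED_TESTS = {"ks", "chisquare", "wasserstein", "jensen_shannon", "psi"}


def validate_config(text: str) -> List[str]:
    """Return a list of problems found; an empty list means none were detected."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"config.json is not valid JSON: {exc}"]

    # Assumed (hypothetical) layout: {"columns": {"<name>": {"test": "<test>"}}}
    columns = config.get("columns") if isinstance(config, dict) else None
    if not isinstance(columns, dict):
        return ["expected a top-level 'columns' object (assumed layout)"]

    problems = []
    for name, settings in columns.items():
        test = settings.get("test") if isinstance(settings, dict) else None
        if not isinstance(test, str) or test not in SUPPORTED_TESTS:
            problems.append(f"column {name!r}: unsupported test {test!r}")
    return problems


print(validate_config('{"columns": {"age": {"test": "ks"}}}'))   # []
print(validate_config('{"columns": {"age": {"test": "kss"}}}'))  # one problem reported
```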
