This project was developed for Eleveo with the objective of generating synthetic lactation data that closely resembles real-world datasets while ensuring proper anonymization.
To achieve this, a Python script is used to read the production database, extract key statistical characteristics, and generate new data that preserves the original distributions and proportions.
The project aims to:
- Ensure full anonymization of real data
- Preserve statistical properties such as mean and standard deviation for all lactation parameters
- Generate a SQLite database that maintains the original data distribution and proportions
- Is the generated data fully anonymized?
- Does the generated data accurately preserve the statistical characteristics of the original dataset?
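The second question can be checked column by column. The following is a minimal sketch, not the project's actual validation code: the function name `compare_stats`, the 5% relative tolerance, and the sample values are all assumptions made for illustration.

```python
import statistics


def compare_stats(original, synthetic, rel_tol=0.05):
    """Compare mean and standard deviation of two numeric samples.

    rel_tol is a hypothetical acceptance threshold (5% by default).
    """
    report = {}
    for name, func in (("mean", statistics.fmean), ("std", statistics.stdev)):
        ref = func(original)
        new = func(synthetic)
        # Relative difference, falling back to absolute if the reference is 0.
        rel_diff = abs(new - ref) / abs(ref) if ref else abs(new - ref)
        report[name] = {
            "original": ref,
            "synthetic": new,
            "rel_diff": rel_diff,
            "ok": rel_diff <= rel_tol,
        }
    return report


# Made-up milk quantities, not taken from the project data.
original_milk = [24.0, 26.0, 28.0, 30.0, 32.0]
synthetic_milk = [24.5, 25.5, 28.0, 30.5, 31.5]

report = compare_stats(original_milk, synthetic_milk)
for stat, values in report.items():
    print(stat, values)
```

Running the same comparison for each lactation parameter gives a quick pass/fail overview before looking at the full distributions.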
- SQLite 3 database
- Python (scikit-learn)
- Power BI
This section outlines the most significant statistical findings derived from the Exploratory Data Analysis (EDA) phase.
65% of the living cows belong to breed "4".
Milk quantity per lactation follows a normal distribution.
Number of lactations for each living cow
SQLite database provided by the client
N/A in this project
Three tables are directly copied from the input database to the output database:
- Breed: contains all breeds present in the database
- ETAPE_CTRL_TEST: contains all possible steps in a lactation control process
- CTRL_TYPE: contains all possible types of lactation control
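A direct copy of these reference tables can be sketched with the standard `sqlite3` module. This is an illustration under assumptions, not the project's script: the helper name `copy_tables`, the file names, and the demo `Breed` columns are hypothetical.

```python
import os
import sqlite3
import tempfile

TABLES_TO_COPY = ("Breed", "ETAPE_CTRL_TEST", "CTRL_TYPE")


def copy_tables(src_path, dst_path, tables=TABLES_TO_COPY):
    """Copy whole tables (schema and rows) from one SQLite file to another."""
    dst = sqlite3.connect(dst_path)
    try:
        dst.execute("ATTACH DATABASE ? AS src", (src_path,))
        for table in tables:
            row = dst.execute(
                "SELECT sql FROM src.sqlite_master "
                "WHERE type = 'table' AND name = ?",
                (table,),
            ).fetchone()
            if row is None:
                raise ValueError(f"table {table!r} not found in source database")
            dst.execute(row[0])  # recreate the table with its original schema
            dst.execute(f'INSERT INTO main."{table}" SELECT * FROM src."{table}"')
        dst.commit()
        dst.execute("DETACH DATABASE src")
    finally:
        dst.close()


# Demo on a throwaway source database; the columns and rows are made up.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.db")
dst_path = os.path.join(workdir, "output.db")

src = sqlite3.connect(src_path)
src.execute("CREATE TABLE Breed (ID INTEGER PRIMARY KEY, NAME TEXT)")
src.executemany("INSERT INTO Breed VALUES (?, ?)", [(1, "A"), (4, "B")])
src.commit()
src.close()

copy_tables(src_path, dst_path, tables=("Breed",))
```

Reusing the `CREATE TABLE` statement stored in `sqlite_master` keeps column types and constraints identical in the output database.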
Two tables must be generated to ensure anonymization:
Create the desired number of farms, following the input database's distribution of farms by municipality (first two digits of the postal code, see https://fr.wikipedia.org/wiki/Code_postal_en_Belgique)
Create the desired number of farms, following the input database's distribution of the number of cows per postal code and per herd
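Sampling new farms in proportion to the input distribution could look like the sketch below. It only covers the postal-code part, and every name in it (`postal_prefix_weights`, `sample_farm_prefixes`, the sample postal codes) is hypothetical rather than taken from the project.

```python
import random
from collections import Counter


def postal_prefix_weights(postal_codes):
    """Count input farms per two-digit postal-code prefix."""
    return Counter(str(code)[:2] for code in postal_codes)


def sample_farm_prefixes(postal_codes, n_farms, seed=None):
    """Draw n_farms prefixes with probabilities proportional to the input counts."""
    rng = random.Random(seed)
    weights = postal_prefix_weights(postal_codes)
    prefixes = list(weights)
    return rng.choices(
        prefixes,
        weights=[weights[p] for p in prefixes],
        k=n_farms,
    )


# Made-up postal codes standing in for the input farm table.
input_codes = [4000, 4020, 4100, 5000, 5100, 6600]
new_prefixes = sample_farm_prefixes(input_codes, n_farms=1000, seed=0)

print(Counter(new_prefixes).most_common())
```

Because the draw is random, the generated proportions only approximate the input ones, and the gap shrinks as the number of farms grows.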
The match is not perfect, because random sampling does not always reproduce the original distribution exactly; the client accepted this.
All data were generated with GradientBoostingRegressor models, where MILK depends on:
- Breed
- NOLACT
- Veil Month
- Veil Duration

MG and PROT depend on MILK.
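The dependency chain described above can be sketched with scikit-learn's `GradientBoostingRegressor`. This is a minimal illustration on made-up numbers, not the project's training code: the numeric encoding of the four features, the value ranges, and the synthetic relationships between columns are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 200

# Hypothetical training features, in this column order:
# Breed, NOLACT, Veil Month, Veil Duration (all encoded as integers).
X = np.column_stack([
    rng.integers(1, 6, n),
    rng.integers(1, 8, n),
    rng.integers(1, 13, n),
    rng.integers(270, 300, n),
])

# Made-up targets, only there to give the models something to fit.
milk = 20 + 2 * X[:, 0] + rng.normal(0, 1, n)
mg = 0.04 * milk + rng.normal(0, 0.05, n)
prot = 0.033 * milk + rng.normal(0, 0.05, n)

# MILK is predicted from the four features.
milk_model = GradientBoostingRegressor(random_state=0).fit(X, milk)

# MG and PROT are each predicted from MILK alone.
milk_column = milk.reshape(-1, 1)
mg_model = GradientBoostingRegressor(random_state=0).fit(milk_column, mg)
prot_model = GradientBoostingRegressor(random_state=0).fit(milk_column, prot)

# Generate values for new rows by chaining the models.
new_X = X[:5]
new_milk = milk_model.predict(new_X)
new_mg = mg_model.predict(new_milk.reshape(-1, 1))
new_prot = prot_model.predict(new_milk.reshape(-1, 1))

print(np.column_stack([new_milk, new_mg, new_prot]))
```

Note that Breed is treated here as an ordinary number; whether the project encoded it that way is not stated in the source.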
The lowest accuracy among the models is 65%:
- That model does not capture the importance of NOLACT and depends only on Breed, but this is good enough for the customer
- Anonymization achieved
- Accuracy sufficient (lowest model at 65%)
- Data consistency accepted by customer
- Data Solution accepted and used by customer



