Reads DDL (CREATE TABLE) files and automatically generates realistic INSERT statements with type-aware and context-aware dummy data. Built for any database tables works with any PostgreSQL schema.
- Overview
- Quick Start
- How It Works
- Project Structure
- Supported Column Patterns
- Usage
- Example Output
- Adding New Tables
- Customizing Dummy Values
Manually writing INSERT statements with realistic dummy data for 80-column healthcare tables is tedious and error-prone. This tool automates it:
- Drop a DDL file into
schema_table/ - Run the script — it parses column names and types
- Get INSERT statements — smart enough to generate emails for
*_emailcolumns, phone numbers for*_telp, proper timestamps, gender codes, etc.
| Scenario | Behavior |
|---|---|
| Serial / auto-increment columns | Skipped — handled by the database |
varchar(255) email column |
Generates user_N@example.com |
varchar(50) phone column |
Generates 08xxxxxxxxxx format |
bpchar(1) gender column |
Randomly picks L or P |
timestamp created_date |
Uses now() |
date birth date column |
Generates dates in 1950-2023 range |
int4 foreign key column |
Random integer in plausible FK range |
| DEFAULT columns | Included in INSERT (overriding defaults) |
| Columns with no pattern match | Falls back to type-based generator |
- Python 3.8+ (no external packages — stdlib only)
- PostgreSQL (to execute the generated SQL)
Copy the example schemas or place your own CREATE TABLE .sql files in schema_table/:
# Start with the provided examples
cp -r example/schema_table/* schema_table/
# Or add your own
cp /path/to/your/ddl/*.sql schema_table/schema_table/
├── users.sql # (from example)
├── orders.sql # (from example)
└── your_table.sql # (your own)
# Generate 1000 rows per table (default)
python generate_dummy_sql.py
# Generate 10,000 rows per table
python generate_dummy_sql.py --rows 10000
# Use a different DDL folder
python generate_dummy_sql.py --schema-folder my_ddls --output-folder my_output# Against your PostgreSQL database
psql -U postgres -d mydb -f generate_dummy/insert_t_pendaftaran.sql┌────────────────────────────────┐
│ schema_table/*.sql │
│ (CREATE TABLE DDL files) │
│ │
│ CREATE TABLE public.t_xxx ( │
│ id serial4 NOT NULL, │
│ name varchar(100), │
│ email varchar(50), │
│ created_date timestamp, │
│ ... │
│ ); │
└──────────────┬─────────────────┘
│
▼
┌────────────────────────────────┐
│ 1. parse_create_table() │
│ • Regex extract CREATE │
│ • Find schema & table name │
│ • Parse each column: │
│ name, type, size, │
│ is_serial, has_default, │
│ is_not_null │
└──────────────┬─────────────────┘
│
▼
┌────────────────────────────────┐
│ 2. _get_dummy_value() │
│ For each column: │
│ │
│ a. Serial? → skip │
│ b. Name matches pattern? │
│ email → 'user_N@...' │
│ phone → '08xxxxxxxxxx' │
│ kelamin → 'L' or 'P' │
│ created_by → 'generator' │
│ created_date → now() │
│ c. Fall back to type: │
│ varchar → md5 hash │
│ int4 → random 1-1000 │
│ date → 2020-2025 range │
│ bool → random true/false │
└──────────────┬─────────────────┘
│
▼
┌────────────────────────────────┐
│ 3. generate_insert_statement() │
│ │
│ INSERT INTO public.t_xxx ( │
│ name, email, created_date │
│ ) │
│ SELECT │
│ 'Name_' || md5(g::text), │
│ 'user_' || g || '@ex.com', │
│ now() │
│ FROM generate_series(1,1000)g; │
└──────────────┬─────────────────┘
│
▼
┌────────────────────────────────┐
│ generate_dummy/insert_t_xxx.sql│
│ (Ready to execute) │
└────────────────────────────────┘
All dummy expressions run inside SELECT ... FROM generate_series(1, N) g. The g represents the current row number (1, 2, 3, ... N). This is used to create variation — e.g., 'user_' || g produces user_1, user_2, etc.
generate_data/
├── README.md # This documentation
├── generate_dummy_sql.py # Main generator script
├── requirements.txt # No external deps (stdlib only)
├── .gitignore # Ignores generated SQL + __pycache__
│
├── example/ # Public example schemas (safe to share)
│ ├── schema_table/
│ │ ├── users.sql # Example: generic users table
│ │ └── orders.sql # Example: generic orders table
│ └── generate_dummy/
│ └── .gitkeep
│
├── schema_table/ # INPUT: Your private DDL files (gitignored)
│ └── *.sql # Copy from example/ or add your own
│
└── generate_dummy/ # OUTPUT: Generated INSERT statements (gitignored)
└── insert_*.sql
The generator recognizes these naming conventions (common in Indonesian HIS schemas):
| Column Name Pattern | Example Column | Generated Value |
|---|---|---|
*email* |
pasien_email |
user_42@example.com |
*phone*, *telp*, *hp* |
pendaftaran_pasien_notelp |
08123456789 |
*nama*, *name* |
pendaftaran_pasien_nama |
Name_a1b2c3d4... |
*alamat*, *address* |
pendaftaran_pasien_alamat |
Address_e5f6... |
*kode*, *code* |
kode_barang |
CODE000042 |
*_no, *_number |
pendaftaran_no |
NO00000042 |
*kelamin*, *gender* |
pendaftaran_pasien_kelamin |
L or P (random) |
*aktif*, *active* |
pendaftaran_aktif |
y (default active) |
*created_by, *updated_by |
pendaftaran_created_by |
generator |
*created_date, *updated_date |
pendaftaran_created_date |
now() |
*created_at, *updated_at |
— | now() |
*tgl*lahir* |
pendaftaran_pasien_tgllahir |
Random date 1950-2023 |
All PostgreSQL data types now supported:
Numeric types
| PostgreSQL Type | Generated Value |
|---|---|
serial, serial2, serial4, serial8 |
Skipped (auto-increment) |
int2, smallint |
Random 1-100 |
int4, integer |
Random 1-1000 |
int8, bigint |
Random 1-10000 |
float4, real |
Random 0-1000 |
float8, double precision |
Random 0-1000 |
numeric, decimal |
Random 0-1000 |
money |
Random 0-10000 formatted as currency |
String types
| PostgreSQL Type | Generated Value |
|---|---|
varchar(N), character varying(N) |
data_ + MD5 hash (truncated to size) |
text |
text_ + MD5 hash |
char(N), character(N) |
X or Y |
bpchar |
Y |
Date/Time types
| PostgreSQL Type | Generated Value |
|---|---|
date |
Random date 2020-2025 |
timestamp, timestamptz |
Random timestamp 2020-2025 |
time, timetz |
Random time 00:00-23:59 |
interval |
Random interval 0-365 days |
Boolean
| PostgreSQL Type | Generated Value |
|---|---|
boolean, bool |
Random true/false |
Binary
| PostgreSQL Type | Generated Value |
|---|---|
bytea |
16-byte MD5 hash as binary |
Network types
| PostgreSQL Type | Generated Value |
|---|---|
inet |
Random 192.168.x.x address |
cidr |
Random 10.0.x.0/24 network |
macaddr |
Random 08:00:27:xx:xx:xx MAC |
macaddr8 |
Random 08:00:27:00:xx:xx:xx MAC-8 |
Bit strings
| PostgreSQL Type | Generated Value |
|---|---|
bit(N), bit varying(N), varbit |
B'0101' pattern |
JSON / XML / Full-text
| PostgreSQL Type | Generated Value |
|---|---|
json |
{"key": "val_N"} |
jsonb |
{"key": "val_N", "count": N} |
jsonpath |
$.dummy.N path expression |
xml |
<root><id>N</id><name>dummy_N</name></root> |
tsvector |
to_tsvector('english', ...) |
tsquery |
to_tsquery('english', 'dummy') |
UUID / System
| PostgreSQL Type | Generated Value |
|---|---|
uuid |
gen_random_uuid() |
pg_lsn |
Random LSN like 16/ABCDEF |
Geometric types
| PostgreSQL Type | Generated Value |
|---|---|
point |
Random (x, y) 0-100 |
line |
Line through (0,0) and random point |
lseg |
Segment from (0,0) to random point |
box |
Box from (0,0) to random point |
path |
2-point path with random endpoint |
polygon |
Triangle with 2 random vertices |
circle |
Circle centered at (0,0), radius 0-10 |
Range types
| PostgreSQL Type | Generated Value |
|---|---|
int4range |
Random range [1, 1-100] |
int8range |
Random range [1, 1-1000] |
numrange |
Random numeric range |
tsrange |
Timestamp range now → now+30d |
tstzrange |
TimestampTZ range now → now+30d |
daterange |
Date range today → today+30d |
Array types (any base type)
| PostgreSQL Type | Generated Value |
|---|---|
integer[] |
ARRAY[1, 5, 42] (2-4 random values) |
text[] |
ARRAY['data_abc', 'data_def'] |
varchar(100)[] |
ARRAY['data_...', 'data_...'] |
int4[][] |
Multi-dimensional array |
Any type[] |
Detects [] suffix and wraps element generator |
python generate_dummy_sql.py [OPTIONS]| Flag | Default | Description |
|---|---|---|
--rows N |
1000 |
Number of dummy rows per table |
--schema-folder PATH |
schema_table |
Folder with CREATE TABLE .sql files |
--output-folder PATH |
generate_dummy |
Where to write generated INSERT files |
ℹ Confidential data is gitignored.
schema_table/andgenerate_dummy/are your private working directories. Example schemas are inexample/schema_table/— copy them to get started.
# Default: 1000 rows from schema_table/ → generate_dummy/
python generate_dummy_sql.py
# Generate 100K rows per table
python generate_dummy_sql.py --rows 100000
# Custom folders
python generate_dummy_sql.py --schema-folder ../hisv2/ddl --output-folder ../hisv2/seed
# Process only one specific file by creating a temp folder
mkdir temp_schema
cp schema_table/t_pendaftaran.sql temp_schema/
python generate_dummy_sql.py --schema-folder temp_schema --rows 5000Input: schema_table/t_pendaftaran.sql (80+ column patient registration table)
Output: generate_dummy/insert_t_pendaftaran.sql
-- Generated dummy data for public.t_pendaftaran
-- Total rows: 1000
INSERT INTO public.t_pendaftaran (
pendaftaran_no,
pendaftaran_nourut,
m_unit_id,
pendaftaran_mrs,
pendaftaran_shift,
m_pasien_id,
...
)
SELECT
'NO' || lpad(g::text, 8, '0') AS pendaftaran_no,
(1 + floor(random() * 100))::int2 AS pendaftaran_nourut,
(1 + floor(random() * 1000))::int4 AS m_unit_id,
timestamp '2020-01-01' + (...) AS pendaftaran_mrs,
(ARRAY['L','P'])[ceil(random() * 2)]::bpchar(1) AS pendaftaran_shift,
...
FROM generate_series(1, 1000) g;- Export the DDL from your PostgreSQL database:
-- In psql:
\o schema_table/my_new_table.sql
\d+ my_new_table
\oOr use pg_dump:
pg_dump -U user -d db -t my_new_table --schema-only > schema_table/my_new_table.sql- Run the generator:
python generate_dummy_sql.py- Execute the output:
psql -U postgres -d mydb -f generate_dummy/insert_my_new_table.sqlThe parser handles:
CREATE TABLE table_name (...)CREATE TABLE IF NOT EXISTS schema.table_name (...)- Multi-line column definitions
- DEFAULT, NOT NULL, PRIMARY KEY modifiers
- CONSTRAINT lines (automatically skipped)
- COMMENT ON COLUMN (ignored)
- CREATE INDEX (ignored)
- ALTER TABLE / GRANT (ignored)
Edit the type_generators dictionary in generate_dummy_sql.py:
self.type_generators = {
...
'varchar': "'custom_prefix_' || md5(g::text)",
'int4': "(1000 + floor(random() * 9000))::int4", # 4-digit numbers
...
}Add an elif block in _get_dummy_value():
# Custom: detect ICD code columns
elif col_name.startswith('icd_') and 'char' in col_type:
return f"'A' || lpad(g::text, 3, '0') || '.0'" # A001.0 formatThe generator is table-agnostic by design, but you can extend _get_dummy_value() to accept the table name and add table-specific logic:
def _get_dummy_value(self, column, table_name):
if table_name == 't_pendaftaran' and column['name'] == 'pendaftaran_jenis':
return "(ARRAY['RJ','RI','IGD','PM'])[ceil(random() * 4)]"
# ... rest of the logic| Package | Required | Purpose |
|---|---|---|
| Python 3.8+ | Yes | Runtime |
| PostgreSQL | Yes (target) | To execute generated SQL |
| External packages | None | Pure stdlib: re, argparse, pathlib, typing |
MIT