Skip to content

Pranestya-GW/generate_data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Generate Data — Dummy Data Generator for PostgreSQL

Reads DDL (CREATE TABLE) files and automatically generates realistic INSERT statements with type-aware and context-aware dummy data. Built for any database tables works with any PostgreSQL schema.


Table of Contents


Overview

Manually writing INSERT statements with realistic dummy data for 80-column healthcare tables is tedious and error-prone. This tool automates it:

  1. Drop a DDL file into schema_table/
  2. Run the script — it parses column names and types
  3. Get INSERT statements — smart enough to generate emails for *_email columns, phone numbers for *_telp, proper timestamps, gender codes, etc.

What it handles

Scenario Behavior
Serial / auto-increment columns Skipped — handled by the database
varchar(255) email column Generates user_N@example.com
varchar(50) phone column Generates 08xxxxxxxxxx format
bpchar(1) gender column Randomly picks L or P
timestamp created_date Uses now()
date birth date column Generates dates in 1950-2023 range
int4 foreign key column Random integer in plausible FK range
DEFAULT columns Included in INSERT (overriding defaults)
Columns with no pattern match Falls back to type-based generator

Quick Start

Prerequisites

  • Python 3.8+ (no external packages — stdlib only)
  • PostgreSQL (to execute the generated SQL)

1. Add your DDL files

Copy the example schemas or place your own CREATE TABLE .sql files in schema_table/:

# Start with the provided examples
cp -r example/schema_table/* schema_table/

# Or add your own
cp /path/to/your/ddl/*.sql schema_table/
schema_table/
├── users.sql           # (from example)
├── orders.sql          # (from example)
└── your_table.sql      # (your own)

2. Run the generator

# Generate 1000 rows per table (default)
python generate_dummy_sql.py

# Generate 10,000 rows per table
python generate_dummy_sql.py --rows 10000

# Use a different DDL folder
python generate_dummy_sql.py --schema-folder my_ddls --output-folder my_output

3. Execute the generated SQL

# Against your PostgreSQL database
psql -U postgres -d mydb -f generate_dummy/insert_t_pendaftaran.sql

How It Works

┌────────────────────────────────┐
│  schema_table/*.sql             │
│  (CREATE TABLE DDL files)       │
│                                 │
│  CREATE TABLE public.t_xxx (    │
│    id serial4 NOT NULL,         │
│    name varchar(100),           │
│    email varchar(50),           │
│    created_date timestamp,      │
│    ...                          │
│  );                             │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  1. parse_create_table()        │
│     • Regex extract CREATE      │
│     • Find schema & table name  │
│     • Parse each column:        │
│       name, type, size,         │
│       is_serial, has_default,   │
│       is_not_null               │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  2. _get_dummy_value()          │
│     For each column:            │
│                                  │
│     a. Serial? → skip           │
│     b. Name matches pattern?    │
│        email → 'user_N@...'     │
│        phone → '08xxxxxxxxxx'   │
│        kelamin → 'L' or 'P'     │
│        created_by → 'generator' │
│        created_date → now()     │
│     c. Fall back to type:       │
│        varchar → md5 hash       │
│        int4 → random 1-1000     │
│        date → 2020-2025 range   │
│        bool → random true/false │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  3. generate_insert_statement() │
│                                 │
│  INSERT INTO public.t_xxx (     │
│    name, email, created_date    │
│  )                              │
│  SELECT                         │
│    'Name_' || md5(g::text),     │
│    'user_' || g || '@ex.com',   │
│    now()                        │
│  FROM generate_series(1,1000)g; │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  generate_dummy/insert_t_xxx.sql│
│  (Ready to execute)             │
└────────────────────────────────┘

The g variable

All dummy expressions run inside SELECT ... FROM generate_series(1, N) g. The g represents the current row number (1, 2, 3, ... N). This is used to create variation — e.g., 'user_' || g produces user_1, user_2, etc.


Project Structure

generate_data/
├── README.md                     # This documentation
├── generate_dummy_sql.py         # Main generator script
├── requirements.txt              # No external deps (stdlib only)
├── .gitignore                    # Ignores generated SQL + __pycache__
│
├── example/                          # Public example schemas (safe to share)
│   ├── schema_table/
│   │   ├── users.sql                 # Example: generic users table
│   │   └── orders.sql                # Example: generic orders table
│   └── generate_dummy/
│       └── .gitkeep
│
├── schema_table/                     # INPUT: Your private DDL files (gitignored)
│   └── *.sql                         # Copy from example/ or add your own
│
└── generate_dummy/                   # OUTPUT: Generated INSERT statements (gitignored)
    └── insert_*.sql

Supported Column Patterns

The generator recognizes these naming conventions (common in Indonesian HIS schemas):

Column Name Pattern Example Column Generated Value
*email* pasien_email user_42@example.com
*phone*, *telp*, *hp* pendaftaran_pasien_notelp 08123456789
*nama*, *name* pendaftaran_pasien_nama Name_a1b2c3d4...
*alamat*, *address* pendaftaran_pasien_alamat Address_e5f6...
*kode*, *code* kode_barang CODE000042
*_no, *_number pendaftaran_no NO00000042
*kelamin*, *gender* pendaftaran_pasien_kelamin L or P (random)
*aktif*, *active* pendaftaran_aktif y (default active)
*created_by, *updated_by pendaftaran_created_by generator
*created_date, *updated_date pendaftaran_created_date now()
*created_at, *updated_at now()
*tgl*lahir* pendaftaran_pasien_tgllahir Random date 1950-2023

Type-Based Fallbacks

All PostgreSQL data types now supported:

Numeric types

PostgreSQL Type Generated Value
serial, serial2, serial4, serial8 Skipped (auto-increment)
int2, smallint Random 1-100
int4, integer Random 1-1000
int8, bigint Random 1-10000
float4, real Random 0-1000
float8, double precision Random 0-1000
numeric, decimal Random 0-1000
money Random 0-10000 formatted as currency

String types

PostgreSQL Type Generated Value
varchar(N), character varying(N) data_ + MD5 hash (truncated to size)
text text_ + MD5 hash
char(N), character(N) X or Y
bpchar Y

Date/Time types

PostgreSQL Type Generated Value
date Random date 2020-2025
timestamp, timestamptz Random timestamp 2020-2025
time, timetz Random time 00:00-23:59
interval Random interval 0-365 days

Boolean

PostgreSQL Type Generated Value
boolean, bool Random true/false

Binary

PostgreSQL Type Generated Value
bytea 16-byte MD5 hash as binary

Network types

PostgreSQL Type Generated Value
inet Random 192.168.x.x address
cidr Random 10.0.x.0/24 network
macaddr Random 08:00:27:xx:xx:xx MAC
macaddr8 Random 08:00:27:00:xx:xx:xx MAC-8

Bit strings

PostgreSQL Type Generated Value
bit(N), bit varying(N), varbit B'0101' pattern

JSON / XML / Full-text

PostgreSQL Type Generated Value
json {"key": "val_N"}
jsonb {"key": "val_N", "count": N}
jsonpath $.dummy.N path expression
xml <root><id>N</id><name>dummy_N</name></root>
tsvector to_tsvector('english', ...)
tsquery to_tsquery('english', 'dummy')

UUID / System

PostgreSQL Type Generated Value
uuid gen_random_uuid()
pg_lsn Random LSN like 16/ABCDEF

Geometric types

PostgreSQL Type Generated Value
point Random (x, y) 0-100
line Line through (0,0) and random point
lseg Segment from (0,0) to random point
box Box from (0,0) to random point
path 2-point path with random endpoint
polygon Triangle with 2 random vertices
circle Circle centered at (0,0), radius 0-10

Range types

PostgreSQL Type Generated Value
int4range Random range [1, 1-100]
int8range Random range [1, 1-1000]
numrange Random numeric range
tsrange Timestamp range now → now+30d
tstzrange TimestampTZ range now → now+30d
daterange Date range today → today+30d

Array types (any base type)

PostgreSQL Type Generated Value
integer[] ARRAY[1, 5, 42] (2-4 random values)
text[] ARRAY['data_abc', 'data_def']
varchar(100)[] ARRAY['data_...', 'data_...']
int4[][] Multi-dimensional array
Any type[] Detects [] suffix and wraps element generator

Usage

CLI Options

python generate_dummy_sql.py [OPTIONS]
Flag Default Description
--rows N 1000 Number of dummy rows per table
--schema-folder PATH schema_table Folder with CREATE TABLE .sql files
--output-folder PATH generate_dummy Where to write generated INSERT files

Confidential data is gitignored. schema_table/ and generate_dummy/ are your private working directories. Example schemas are in example/schema_table/ — copy them to get started.

Examples

# Default: 1000 rows from schema_table/ → generate_dummy/
python generate_dummy_sql.py

# Generate 100K rows per table
python generate_dummy_sql.py --rows 100000

# Custom folders
python generate_dummy_sql.py --schema-folder ../hisv2/ddl --output-folder ../hisv2/seed

# Process only one specific file by creating a temp folder
mkdir temp_schema
cp schema_table/t_pendaftaran.sql temp_schema/
python generate_dummy_sql.py --schema-folder temp_schema --rows 5000

Example Output

Input: schema_table/t_pendaftaran.sql (80+ column patient registration table)

Output: generate_dummy/insert_t_pendaftaran.sql

-- Generated dummy data for public.t_pendaftaran
-- Total rows: 1000

INSERT INTO public.t_pendaftaran (
    pendaftaran_no,
    pendaftaran_nourut,
    m_unit_id,
    pendaftaran_mrs,
    pendaftaran_shift,
    m_pasien_id,
    ...
)
SELECT
    'NO' || lpad(g::text, 8, '0') AS pendaftaran_no,
    (1 + floor(random() * 100))::int2 AS pendaftaran_nourut,
    (1 + floor(random() * 1000))::int4 AS m_unit_id,
    timestamp '2020-01-01' + (...) AS pendaftaran_mrs,
    (ARRAY['L','P'])[ceil(random() * 2)]::bpchar(1) AS pendaftaran_shift,
    ...
FROM generate_series(1, 1000) g;

Adding New Tables

  1. Export the DDL from your PostgreSQL database:
-- In psql:
\o schema_table/my_new_table.sql
\d+ my_new_table
\o

Or use pg_dump:

pg_dump -U user -d db -t my_new_table --schema-only > schema_table/my_new_table.sql
  1. Run the generator:
python generate_dummy_sql.py
  1. Execute the output:
psql -U postgres -d mydb -f generate_dummy/insert_my_new_table.sql

Supported DDL formats

The parser handles:

  • CREATE TABLE table_name (...)
  • CREATE TABLE IF NOT EXISTS schema.table_name (...)
  • Multi-line column definitions
  • DEFAULT, NOT NULL, PRIMARY KEY modifiers
  • CONSTRAINT lines (automatically skipped)
  • COMMENT ON COLUMN (ignored)
  • CREATE INDEX (ignored)
  • ALTER TABLE / GRANT (ignored)

Customizing Dummy Values

Modify type-based generators

Edit the type_generators dictionary in generate_dummy_sql.py:

self.type_generators = {
    ...
    'varchar': "'custom_prefix_' || md5(g::text)",
    'int4': "(1000 + floor(random() * 9000))::int4",  # 4-digit numbers
    ...
}

Add a new column name pattern

Add an elif block in _get_dummy_value():

# Custom: detect ICD code columns
elif col_name.startswith('icd_') and 'char' in col_type:
    return f"'A' || lpad(g::text, 3, '0') || '.0'"  # A001.0 format

Change default values for specific tables

The generator is table-agnostic by design, but you can extend _get_dummy_value() to accept the table name and add table-specific logic:

def _get_dummy_value(self, column, table_name):
    if table_name == 't_pendaftaran' and column['name'] == 'pendaftaran_jenis':
        return "(ARRAY['RJ','RI','IGD','PM'])[ceil(random() * 4)]"
    # ... rest of the logic

Dependencies

Package Required Purpose
Python 3.8+ Yes Runtime
PostgreSQL Yes (target) To execute generated SQL
External packages None Pure stdlib: re, argparse, pathlib, typing

License

MIT

About

Reads DDL (CREATE TABLE) files and automatically generates realistic INSERT statements with type-aware and context-aware dummy data. Built for any database tables works with any PostgreSQL schema.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages