Generate Data — Dummy Data Generator for PostgreSQL

Reads DDL (CREATE TABLE) files and automatically generates realistic INSERT statements with type-aware and context-aware dummy data. Built for any database tables works with any PostgreSQL schema.

Overview

Manually writing INSERT statements with realistic dummy data for 80-column healthcare tables is tedious and error-prone. This tool automates it:

Drop a DDL file into schema_table/
Run the script — it parses column names and types
Get INSERT statements — smart enough to generate emails for *_email columns, phone numbers for *_telp, proper timestamps, gender codes, etc.

What it handles

Scenario	Behavior
Serial / auto-increment columns	Skipped — handled by the database
`varchar(255)` email column	Generates `user_N@example.com`
`varchar(50)` phone column	Generates `08xxxxxxxxxx` format
`bpchar(1)` gender column	Randomly picks `L` or `P`
`timestamp` created_date	Uses `now()`
`date` birth date column	Generates dates in 1950-2023 range
`int4` foreign key column	Random integer in plausible FK range
DEFAULT columns	Included in INSERT (overriding defaults)
Columns with no pattern match	Falls back to type-based generator

Quick Start

Prerequisites

Python 3.8+ (no external packages — stdlib only)
PostgreSQL (to execute the generated SQL)

1. Add your DDL files

Copy the example schemas or place your own CREATE TABLE .sql files in schema_table/:

# Start with the provided examples
cp -r example/schema_table/* schema_table/

# Or add your own
cp /path/to/your/ddl/*.sql schema_table/

schema_table/
├── users.sql           # (from example)
├── orders.sql          # (from example)
└── your_table.sql      # (your own)

2. Run the generator

# Generate 1000 rows per table (default)
python generate_dummy_sql.py

# Generate 10,000 rows per table
python generate_dummy_sql.py --rows 10000

# Use a different DDL folder
python generate_dummy_sql.py --schema-folder my_ddls --output-folder my_output

3. Execute the generated SQL

# Against your PostgreSQL database
psql -U postgres -d mydb -f generate_dummy/insert_t_pendaftaran.sql

How It Works

┌────────────────────────────────┐
│  schema_table/*.sql             │
│  (CREATE TABLE DDL files)       │
│                                 │
│  CREATE TABLE public.t_xxx (    │
│    id serial4 NOT NULL,         │
│    name varchar(100),           │
│    email varchar(50),           │
│    created_date timestamp,      │
│    ...                          │
│  );                             │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  1. parse_create_table()        │
│     • Regex extract CREATE      │
│     • Find schema & table name  │
│     • Parse each column:        │
│       name, type, size,         │
│       is_serial, has_default,   │
│       is_not_null               │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  2. _get_dummy_value()          │
│     For each column:            │
│                                  │
│     a. Serial? → skip           │
│     b. Name matches pattern?    │
│        email → 'user_N@...'     │
│        phone → '08xxxxxxxxxx'   │
│        kelamin → 'L' or 'P'     │
│        created_by → 'generator' │
│        created_date → now()     │
│     c. Fall back to type:       │
│        varchar → md5 hash       │
│        int4 → random 1-1000     │
│        date → 2020-2025 range   │
│        bool → random true/false │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  3. generate_insert_statement() │
│                                 │
│  INSERT INTO public.t_xxx (     │
│    name, email, created_date    │
│  )                              │
│  SELECT                         │
│    'Name_' || md5(g::text),     │
│    'user_' || g || '@ex.com',   │
│    now()                        │
│  FROM generate_series(1,1000)g; │
└──────────────┬─────────────────┘
               │
               ▼
┌────────────────────────────────┐
│  generate_dummy/insert_t_xxx.sql│
│  (Ready to execute)             │
└────────────────────────────────┘

The `g` variable

All dummy expressions run inside SELECT ... FROM generate_series(1, N) g. The g represents the current row number (1, 2, 3, ... N). This is used to create variation — e.g., 'user_' || g produces user_1, user_2, etc.

Project Structure

generate_data/
├── README.md                     # This documentation
├── generate_dummy_sql.py         # Main generator script
├── requirements.txt              # No external deps (stdlib only)
├── .gitignore                    # Ignores generated SQL + __pycache__
│
├── example/                          # Public example schemas (safe to share)
│   ├── schema_table/
│   │   ├── users.sql                 # Example: generic users table
│   │   └── orders.sql                # Example: generic orders table
│   └── generate_dummy/
│       └── .gitkeep
│
├── schema_table/                     # INPUT: Your private DDL files (gitignored)
│   └── *.sql                         # Copy from example/ or add your own
│
└── generate_dummy/                   # OUTPUT: Generated INSERT statements (gitignored)
    └── insert_*.sql

Supported Column Patterns

The generator recognizes these naming conventions (common in Indonesian HIS schemas):

Column Name Pattern	Example Column	Generated Value
`email`	`pasien_email`	`user_42@example.com`
`phone`, `telp`, `hp`	`pendaftaran_pasien_notelp`	`08123456789`
`nama`, `name`	`pendaftaran_pasien_nama`	`Name_a1b2c3d4...`
`alamat`, `address`	`pendaftaran_pasien_alamat`	`Address_e5f6...`
`kode`, `code`	`kode_barang`	`CODE000042`
`_no`, `_number`	`pendaftaran_no`	`NO00000042`
`kelamin`, `gender`	`pendaftaran_pasien_kelamin`	`L` or `P` (random)
`aktif`, `active`	`pendaftaran_aktif`	`y` (default active)
`created_by`, `updated_by`	`pendaftaran_created_by`	`generator`
`created_date`, `updated_date`	`pendaftaran_created_date`	`now()`
`created_at`, `updated_at`	—	`now()`
`tgllahir*`	`pendaftaran_pasien_tgllahir`	Random date 1950-2023

Type-Based Fallbacks

All PostgreSQL data types now supported:

Numeric types

PostgreSQL Type	Generated Value
`serial`, `serial2`, `serial4`, `serial8`	Skipped (auto-increment)
`int2`, `smallint`	Random 1-100
`int4`, `integer`	Random 1-1000
`int8`, `bigint`	Random 1-10000
`float4`, `real`	Random 0-1000
`float8`, `double precision`	Random 0-1000
`numeric`, `decimal`	Random 0-1000
`money`	Random 0-10000 formatted as currency

String types

PostgreSQL Type	Generated Value
`varchar(N)`, `character varying(N)`	`data_` + MD5 hash (truncated to size)
`text`	`text_` + MD5 hash
`char(N)`, `character(N)`	`X` or `Y`
`bpchar`	`Y`

Date/Time types

PostgreSQL Type	Generated Value
`date`	Random date 2020-2025
`timestamp`, `timestamptz`	Random timestamp 2020-2025
`time`, `timetz`	Random time 00:00-23:59
`interval`	Random interval 0-365 days

Boolean

PostgreSQL Type	Generated Value
`boolean`, `bool`	Random `true`/`false`

Binary

PostgreSQL Type	Generated Value
`bytea`	16-byte MD5 hash as binary

Network types

PostgreSQL Type	Generated Value
`inet`	Random `192.168.x.x` address
`cidr`	Random `10.0.x.0/24` network
`macaddr`	Random `08:00:27:xx:xx:xx` MAC
`macaddr8`	Random `08:00:27:00:xx:xx:xx` MAC-8

Bit strings

PostgreSQL Type	Generated Value
`bit(N)`, `bit varying(N)`, `varbit`	`B'0101'` pattern

JSON / XML / Full-text

PostgreSQL Type	Generated Value
`json`	`{"key": "val_N"}`
`jsonb`	`{"key": "val_N", "count": N}`
`jsonpath`	`$.dummy.N` path expression
`xml`	`<root><id>N</id><name>dummy_N</name></root>`
`tsvector`	`to_tsvector('english', ...)`
`tsquery`	`to_tsquery('english', 'dummy')`

UUID / System

PostgreSQL Type	Generated Value
`uuid`	`gen_random_uuid()`
`pg_lsn`	Random LSN like `16/ABCDEF`

Geometric types

PostgreSQL Type	Generated Value
`point`	Random (x, y) 0-100
`line`	Line through (0,0) and random point
`lseg`	Segment from (0,0) to random point
`box`	Box from (0,0) to random point
`path`	2-point path with random endpoint
`polygon`	Triangle with 2 random vertices
`circle`	Circle centered at (0,0), radius 0-10

Range types

PostgreSQL Type	Generated Value
`int4range`	Random range `[1, 1-100]`
`int8range`	Random range `[1, 1-1000]`
`numrange`	Random numeric range
`tsrange`	Timestamp range now → now+30d
`tstzrange`	TimestampTZ range now → now+30d
`daterange`	Date range today → today+30d

Array types (any base type)

PostgreSQL Type	Generated Value
`integer[]`	`ARRAY[1, 5, 42]` (2-4 random values)
`text[]`	`ARRAY['data_abc', 'data_def']`
`varchar(100)[]`	`ARRAY['data_...', 'data_...']`
`int4[][]`	Multi-dimensional array
Any `type[]`	Detects `[]` suffix and wraps element generator

Usage

CLI Options

python generate_dummy_sql.py [OPTIONS]

Flag	Default	Description
`--rows N`	`1000`	Number of dummy rows per table
`--schema-folder PATH`	`schema_table`	Folder with CREATE TABLE .sql files
`--output-folder PATH`	`generate_dummy`	Where to write generated INSERT files

ℹ Confidential data is gitignored. schema_table/ and generate_dummy/ are your private working directories. Example schemas are in example/schema_table/ — copy them to get started.

Examples

# Default: 1000 rows from schema_table/ → generate_dummy/
python generate_dummy_sql.py

# Generate 100K rows per table
python generate_dummy_sql.py --rows 100000

# Custom folders
python generate_dummy_sql.py --schema-folder ../hisv2/ddl --output-folder ../hisv2/seed

# Process only one specific file by creating a temp folder
mkdir temp_schema
cp schema_table/t_pendaftaran.sql temp_schema/
python generate_dummy_sql.py --schema-folder temp_schema --rows 5000

Example Output

Input: schema_table/t_pendaftaran.sql (80+ column patient registration table)

Output: generate_dummy/insert_t_pendaftaran.sql

-- Generated dummy data for public.t_pendaftaran
-- Total rows: 1000

INSERT INTO public.t_pendaftaran (
    pendaftaran_no,
    pendaftaran_nourut,
    m_unit_id,
    pendaftaran_mrs,
    pendaftaran_shift,
    m_pasien_id,
    ...
)
SELECT
    'NO' || lpad(g::text, 8, '0') AS pendaftaran_no,
    (1 + floor(random() * 100))::int2 AS pendaftaran_nourut,
    (1 + floor(random() * 1000))::int4 AS m_unit_id,
    timestamp '2020-01-01' + (...) AS pendaftaran_mrs,
    (ARRAY['L','P'])[ceil(random() * 2)]::bpchar(1) AS pendaftaran_shift,
    ...
FROM generate_series(1, 1000) g;

Adding New Tables

Export the DDL from your PostgreSQL database:

-- In psql:
\o schema_table/my_new_table.sql
\d+ my_new_table
\o

Or use pg_dump:

pg_dump -U user -d db -t my_new_table --schema-only > schema_table/my_new_table.sql

Run the generator:

python generate_dummy_sql.py

Execute the output:

psql -U postgres -d mydb -f generate_dummy/insert_my_new_table.sql

Supported DDL formats

The parser handles:

CREATE TABLE table_name (...)
CREATE TABLE IF NOT EXISTS schema.table_name (...)
Multi-line column definitions
DEFAULT, NOT NULL, PRIMARY KEY modifiers
CONSTRAINT lines (automatically skipped)
COMMENT ON COLUMN (ignored)
CREATE INDEX (ignored)
ALTER TABLE / GRANT (ignored)

Customizing Dummy Values

Modify type-based generators

Edit the type_generators dictionary in generate_dummy_sql.py:

self.type_generators = {
    ...
    'varchar': "'custom_prefix_' || md5(g::text)",
    'int4': "(1000 + floor(random() * 9000))::int4",  # 4-digit numbers
    ...
}

Add a new column name pattern

Add an elif block in _get_dummy_value():

# Custom: detect ICD code columns
elif col_name.startswith('icd_') and 'char' in col_type:
    return f"'A' || lpad(g::text, 3, '0') || '.0'"  # A001.0 format

Change default values for specific tables

The generator is table-agnostic by design, but you can extend _get_dummy_value() to accept the table name and add table-specific logic:

def _get_dummy_value(self, column, table_name):
    if table_name == 't_pendaftaran' and column['name'] == 'pendaftaran_jenis':
        return "(ARRAY['RJ','RI','IGD','PM'])[ceil(random() * 4)]"
    # ... rest of the logic

Dependencies

Package	Required	Purpose
Python 3.8+	Yes	Runtime
PostgreSQL	Yes (target)	To execute generated SQL
External packages	None	Pure stdlib: `re`, `argparse`, `pathlib`, `typing`

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Generate Data — Dummy Data Generator for PostgreSQL

Table of Contents

Overview

What it handles

Quick Start

Prerequisites

1. Add your DDL files

2. Run the generator

3. Execute the generated SQL

How It Works

The `g` variable

Project Structure

Supported Column Patterns

Type-Based Fallbacks

Usage

CLI Options

Examples

Example Output

Adding New Tables

Supported DDL formats

Customizing Dummy Values

Modify type-based generators

Add a new column name pattern

Change default values for specific tables

Dependencies

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
example		example
.gitignore		.gitignore
README.md		README.md
generate_dummy_sql.py		generate_dummy_sql.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Generate Data — Dummy Data Generator for PostgreSQL

Table of Contents

Overview

What it handles

Quick Start

Prerequisites

1. Add your DDL files

2. Run the generator

3. Execute the generated SQL

How It Works

The g variable

Project Structure

Supported Column Patterns

Type-Based Fallbacks

Usage

CLI Options

Examples

Example Output

Adding New Tables

Supported DDL formats

Customizing Dummy Values

Modify type-based generators

Add a new column name pattern

Change default values for specific tables

Dependencies

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

The `g` variable

Packages