Map primary keys, foreign keys, and table relationships across any data source.
Cartographer is an internal data tool built by Common Gene Labs. It takes tabular data — from files or live database connections — and surfaces the hidden structure within it: which columns are primary keys, which are foreign keys, and how tables relate to each other. It produces an interactive entity-relationship diagram alongside column-level metadata, an exportable relationships report, and ready-to-use dbt and Mermaid output.
Originally conceived and designed by Dr. Amelia Miramonti, PhD.
Upload CSV, Excel, Parquet, JSON, or a range of other tabular formats, or connect directly to a database. Cartographer runs a multi-signal inference engine across your tables — examining column names, value overlap, cardinality, format patterns, and statistical distributions — to identify likely relationships. Results are shown as an interactive ERD you can drag, zoom, focus, search, and export.
When a database connection is used, declared foreign key constraints are imported directly, giving you an accurate diagram without inference.
File formats
| Format | Extensions |
|---|---|
| CSV | .csv |
| TSV | .tsv |
| Excel | .xlsx, .xlsm, .xls |
| OpenDocument Spreadsheet | .ods |
| Parquet | .parquet |
| JSON array | .json |
| Newline-delimited JSON | .ndjson |
| SPSS | .sav, .por |
| SAS | .sas7bdat, .xpt |
| R data | .rds, .rdata, .rda |
| Stata | .dta |
| MATLAB | .mat |
| HDF5 | .h5, .hdf5 |
Multi-sheet Excel files, multi-object R files, and multi-dataset HDF5 files are loaded as separate tables automatically.
Database connections
| Database | Notes |
|---|---|
| PostgreSQL | Standard credentials |
| MySQL / MariaDB | Standard credentials |
| SQL Server | Requires Microsoft ODBC Driver 13, 17, or 18 |
| Snowflake | Account identifier + optional warehouse and role |
| Google BigQuery | Project + dataset; service account JSON or Application Default Credentials |
| Amazon Redshift | Standard Redshift cluster endpoint |
Cartographer runs up to 7 signals in parallel on each candidate column pair and combines their scores into a composite confidence rating.
| Signal | What it measures |
|---|---|
| Naming | Exact name patterns: id, {table}_id, {table}_key |
| Name similarity | Fuzzy match between column names across tables |
| Value overlap | Fraction of values in column A that exist in column B |
| Cardinality | Whether column A has FK-like cardinality (many values → few unique) |
| Format fingerprint | Whether both columns share the same value format (UUID, date, integer, etc.) |
| Distribution similarity | KS-test on numeric distributions |
| Null-pattern correlation | Pearson correlation of null positions |
Each relationship gets a low / medium / high confidence label. A minimum confidence threshold can be set in the sidebar to filter noise. Individual relationships can be promoted, demoted, or suppressed directly from the Relationships tab — edits persist across reruns and are included in session saves.
Schema-defined relationships and database FK constraints always take priority over inferred ones.
Large tables: tables over 100,000 rows are automatically sampled to 10,000 rows for analysis. The original row count is preserved and displayed wherever the table appears.
The ERD tab renders an interactive vis.js network with the following capabilities:
Layout modes
| Mode | Behaviour |
|---|---|
| Force | Physics-based spring layout (default) |
| Hierarchical | Top-down directed tree, physics disabled |
| Circular | Nodes arranged in a ring post-stabilization |
| Compact | Tighter spring constants for dense schemas |
Interaction
- Drag nodes to reposition them — positions are pinned and restored on every subsequent render, including after page refresh
- Double-click a node to unpin it and let physics take over
- Scroll to zoom
- Hover a node for a full column list with PK/FK annotations, row count, and source
- Hover an edge for the full relationship path, detection method, confidence score, and contributing signals
- Search — type in the search box above the diagram to highlight matching tables and zoom to them
- Focus mode — click the Focus button, then click any table to isolate its immediate neighbourhood; all other nodes and edges are dimmed
Toolbar buttons
| Button | Action |
|---|---|
| Focus | Toggle focus-on-click mode |
| Map | Toggle minimap overlay (bottom-right corner) |
| PNG ↓ | Export the current canvas as cartographer_erd.png |
Visual encoding
- Node borders glow yellow for tables matching the current search query
- Solid edges indicate naming/schema/cardinality signals; dashed edges indicate statistical signals (value overlap, distribution, null pattern, format)
- Edge thickness and opacity both encode confidence: high → thick/opaque, low → thin/faint
- Source-group background ellipses appear when tables come from more than one source, labelled with the source name
Each table is shown as a card with pills indicating row count, column count, source format, detected PKs, and inferred FK references.
Below each card:
- A column summary dataframe lists every column with its type, non-null count, unique count, and PK/FK flags
- A Show preview checkbox reveals the first 5 rows of the actual data
- A ⚠ Possible unlinked columns banner appears when a column name is shared with another table but no relationship has been detected, flagging it as a potential missed FK
Tables that were auto-sampled show a ⚠ SAMPLED N of M rows banner above the column summary.
Relationships are grouped by detection method. Each row shows:
- Source and target table + column
- Detection method badge (colour-coded by signal type)
- Confidence dot and score for auto-detected relationships
- Signal chips listing each contributing signal
Inline confidence editing — each auto-detected relationship has ▲ Promote / ▼ Demote / ✕ Suppress controls. Suppressed relationships are hidden from the ERD and all exports. A Clear overrides button resets all edits.
| Export | Format | Contents |
|---|---|---|
| Relationships CSV | .csv |
All relationships with confidence, score, signals, and reasons |
| dbt schema.yml | .yml |
dbt models with unique, not_null, and relationships tests |
| Mermaid ERD | .mmd |
erDiagram block ready to paste into GitHub, Notion, or Confluence |
| Save session | .json |
Full workspace snapshot: tables, relationships, node positions, confidence overrides |
A Restore session uploader lets you reload a previously saved workspace, including all node positions and manual edits.
Running locally is the most privacy-preserving option — your data never leaves your machine. See DATA_PRIVACY.md for a full audit of exactly where uploaded files and database data are stored and processed.
Prerequisites: Python 3.10 or later. SQL Server additionally requires the Microsoft ODBC Driver 17 or 18.
# 1. Clone and enter the directory
git clone <repo-url>
cd cartographer
# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate # macOS / Linux
venv\Scripts\activate # Windows (cmd)
# venv\Scripts\Activate.ps1 # Windows (PowerShell)
# 3. Install dependencies
pip install -r requirements.txt
# 4. Run
streamlit run app.pyOpens at http://localhost:8501.
Positron is a data science IDE from Posit. The workflow is the same as any terminal-based Python project:
- Open the
cartographerfolder (File > Open Folder). - Open the Positron terminal (View > Terminal) and create the virtual environment:
python -m venv venv
- Select the interpreter — use the picker in the status bar and choose
venv/bin/python(macOS/Linux) orvenv\Scripts\python.exe(Windows). - Install dependencies:
pip install -r requirements.txt
- Launch the app:
streamlit run app.py
- Open
http://localhost:8501in your browser. PressCtrl+Cto stop.
The steps are identical to Positron. Use the integrated terminal (Ctrl+</kbd>) and the Python interpreter picker in the status bar to select your venv`.
streamlit>=1.32.0
pandas>=2.0.0
networkx>=3.0
pyvis>=0.3.2
pyyaml>=6.0
scipy>=1.10.0
openpyxl>=3.0.0
xlrd>=2.0.0
odfpy>=1.4.0
pyarrow>=12.0.0
pyodbc>=4.0.0
snowflake-connector-python>=3.0.0
psycopg2-binary>=2.9.0
pymysql>=1.1.0
google-cloud-bigquery>=3.0.0
google-cloud-bigquery-storage>=2.0.0
db-dtypes>=1.0.0
redshift-connector>=2.0.0
pyreadstat>=1.2.0
pyreadr>=0.4.0
scipy>=1.10.0
h5py>=3.0.0
SQL Server also requires the Microsoft ODBC Driver installed on the host machine.
- Push to GitHub
- Connect at share.streamlit.io
- Set main file to
app.py - Deploy — no additional configuration needed
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app.py .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]docker build -t cartographer .
docker run -p 8501:8501 cartographerNo external service dependencies. Start command:
streamlit run app.py --server.port=$PORT --server.address=0.0.0.0
Relationships can be declared explicitly via a JSON or YAML schema file, without needing row data. Schema-defined relationships take priority over all inference.
{
"tables": [
{
"name": "orders",
"columns": [
{ "name": "order_id", "type": "integer", "primary_key": true },
{ "name": "customer_id", "type": "integer",
"foreign_key": { "table": "customers", "column": "customer_id" } }
]
},
{
"name": "customers",
"columns": [
{ "name": "customer_id", "type": "integer", "primary_key": true },
{ "name": "name", "type": "text" }
]
}
]
}An example_schema.json is included — a 5-table e-commerce model ready to load.
.
├── app.py # Main Streamlit application
├── inference.py # Multi-signal FK inference engine
├── schema_parser.py # JSON/YAML schema parser
├── db_connectors.py # Database connector classes
├── requirements.txt # Python dependencies
├── example_schema.json # Sample schema for testing
└── README.md
Cartographer was originally conceived and designed by Dr. Amelia Miramonti, PhD.
Built by Common Gene Labs — an independent lab building human-centered tools for starting, focus, and emotional friction.