Live demo: https://grapho2c.onrender.com
SAP's Order-to-Cash process spans at least a dozen tables. A sales order becomes a delivery, a delivery becomes an invoice, an invoice becomes a journal entry, and a journal entry becomes a payment. When this data lives in flat tables, tracing those connections requires complex SQL joins across every step. GraphO2C solves that by loading all 20 SAP tables into a single knowledge graph and letting you explore and query the full O2C flow visually and in plain English.
You type a question. The system figures out what SQL to run, executes it against DuckDB, and returns a grounded answer with the relevant graph nodes highlighted on screen.
- Ingests 20 JSONL tables of SAP-style O2C data into DuckDB
- Builds a directed NetworkX graph with 1,262 nodes and 2,059 edges
- Visualises the graph in the browser using Cytoscape.js — click any node to inspect its properties and expand its neighbours
- Accepts natural language questions, generates SQL dynamically, executes it, and returns data-backed answers
- A separate trace mode walks the graph directly for questions like "show me the full flow of sales order 740506", returning the complete chain: SalesOrder → DeliveryItem → BillingDocument → JournalEntry → Payment
- Highlights the nodes referenced in each answer on the graph automatically
The O2C process is inherently relational and directional. A sales order spawns items, items spawn delivery items, delivery items spawn billing documents, billing documents post to journal entries, journal entries clear into payments. Representing this as a directed graph makes traversal and gap detection natural — finding "orders delivered but not billed" is a graph reachability query, not a five-table join.
The dataset is read-only and analytical. DuckDB reads multi-part JSONL files with a single glob expression (read_json_auto('folder/part-*.jsonl')) and runs fast columnar aggregations with no server process, no connection pooling, and no schema migration. The .duckdb file travels with the project, making the deployment self-contained. PostgreSQL would add a server to manage and a migration pipeline to maintain with zero benefit for this workload.
At 1,262 nodes and 2,059 edges, the graph fits comfortably in memory. NetworkX gives shortest-path, BFS, and neighbour lookup as single function calls. Neo4j makes sense when you have tens of millions of nodes, concurrent writes, or need a persistent graph database that survives process restarts. For this scale, Neo4j would add significant operational overhead — a separate process, a driver, a query language (Cypher) — without any measurable benefit.
Groq's inference hardware runs llama-3.3-70b at speeds that make a two-step pipeline (SQL generation then answer synthesis) feel interactive rather than slow. The model is large enough to follow a complex schema prompt reliably and generate correct DuckDB SQL on the first try in the majority of cases.
Separating SQL generation from answer synthesis means each step can be tuned independently. SQL generation runs at temperature 0.0 (deterministic, same question always produces the same SQL). Answer synthesis runs at temperature 0.3 (slightly looser for natural language). If they were combined, tuning one would compromise the other.
DuckDB is used as the analytical store. Key decisions:
- Storage: A single
.duckdbfile committed to the repository. Render serves it without any database server. - Ingestion:
ingestion.pyreads allpart-*.jsonlfiles from each Dataset folder using DuckDB'sread_json_autoglob. Twenty raw tables are loaded, then two denormalised views (v_customer,v_product) are created to flatten joins the LLM would otherwise have to reconstruct. - Smart cold start: On startup, the code checks whether tables already exist. If yes, ingestion is skipped. This means the first deploy is slow (full ingest) but every subsequent restart is fast.
- Graph construction: After ingestion,
graph_builder.pyqueries DuckDB to join tables and build all nodes and edges in memory using NetworkX. DuckDB is then used only at query time for SQL-mode answers.
The graph has 11 node types and 13 directed edge types.
| Node Type | Source | Primary Key |
|---|---|---|
| Customer | v_customer (business_partners + addresses + assignments) | businessPartner |
| Product | v_product (products + product_descriptions) | product |
| Plant | plants | plant |
| SalesOrder | sales_order_headers | salesOrder |
| SalesOrderItem | sales_order_items | salesOrder + salesOrderItem |
| OutboundDelivery | outbound_delivery_headers | deliveryDocument |
| DeliveryItem | outbound_delivery_items | deliveryDocument + deliveryDocumentItem |
| BillingDocument | billing_document_headers | billingDocument |
| BillingDocItem | billing_document_items | billingDocument + billingDocumentItem |
| JournalEntry | journal_entry_items | companyCode + fiscalYear + accountingDocument + item |
| Payment | payments | companyCode + fiscalYear + accountingDocument + item |
SalesOrder ──PLACED_BY──► Customer
SalesOrder ──HAS_ITEM──► SalesOrderItem
SalesOrderItem ──ORDERS_PRODUCT──► Product
SalesOrderItem ──PRODUCED_AT──► Plant
SalesOrderItem ──FULFILLED_BY──► DeliveryItem
DeliveryItem ──PART_OF──► OutboundDelivery
DeliveryItem ──BILLED_AS──► BillingDocItem
BillingDocItem ──BILLED_IN──► BillingDocument
BillingDocument ──BILLED_TO──► Customer
BillingDocument ──POSTED_TO──► JournalEntry
BillingDocument ──CANCELS──► BillingDocument (cancellation link)
JournalEntry ──CLEARED_BY──► Payment
Payment ──PAID_BY──► Customer
Every node ID follows the pattern TypeName_PrimaryKey — for example SalesOrder_740506 or DeliveryItem_80738076_000010. This convention means SQL result columns can be mapped back to graph nodes for automatic highlighting without any additional lookup.
The pipeline has two distinct LLM calls per query.
The system prompt is approximately 180 lines and contains:
- Full schema listing — every table name, every column name, primary keys, and whether the column is a boolean or VARCHAR in DuckDB (this distinction matters because DuckDB booleans do not accept
'X'or'true', onlyTRUE/FALSE) - Join chain — the exact sequence of foreign key relationships from
sales_order_headersthrough topayments - SAP status code reference — what
A,B,Cmean foroverallDeliveryStatus,overallGoodsMovementStatus, andoverallOrdReltdBillgStatus - Execution rules — use DuckDB syntax (
ILIKE,TRY_CAST,strftime), no recursive CTEs, defaultLIMIT 50on SELECT queries, return only a JSON envelope:{"query_type": "sql", "sql": "...", "explanation": "..."} - Off-topic instruction — if the question has nothing to do with the O2C dataset, return the literal string
OFF_TOPICwith no JSON envelope
Temperature 0.0 means the same question always produces the same SQL. This makes regression testing straightforward.
The SQL executes against DuckDB. The results (capped at 100 rows) are sent back to the LLM with a short prompt that says: answer based only on what the data shows, not on general knowledge. If exactly 100 rows are returned, the model is warned that results may be truncated and must not state a definitive count.
For questions containing words like "trace", "flow", "journey", or specific document IDs, the pipeline detects the pattern and switches to graph traversal instead of SQL. It finds the matching node in NetworkX using BFS, collects all reachable O2C nodes in flow order (SalesOrder → SalesOrderItem → DeliveryItem → OutboundDelivery → BillingDocument → JournalEntry → Payment), and sends that structured data to the LLM for synthesis. This gives a richer, path-aware answer that SQL cannot easily produce.
If Step 1 produces SQL that fails on execution, the error message is sent back to the LLM for one correction attempt. Only if the corrected SQL also fails does the error surface to the user.
The system has six layers of guardrail:
| Layer | What it does |
|---|---|
| Input length cap | Questions longer than 500 characters are rejected before any LLM call |
| OFF_TOPIC — raw output check | If the LLM returns the literal string OFF_TOPIC, the user gets: "This system is designed to answer questions related to the Order-to-Cash dataset only." |
| OFF_TOPIC — JSON field check | If the parsed JSON has "sql": "OFF_TOPIC", the same rejection fires — prevents envelope manipulation |
| JSON parse guard | Malformed LLM output triggers an error response rather than a crash |
| Result truncation warning | If results hit 100 rows, the synthesis prompt warns the model not to state a definitive count |
| Rate limit handling | Groq RateLimitError returns a friendly capacity message rather than a 500 error |
Tested examples:
- "What is the capital of France?" → guardrail rejects correctly
- "Write me a poem" → guardrail rejects correctly
- "Which products have the most billing documents?" → SQL generated and executed correctly
| Method | Endpoint | Description |
|---|---|---|
| GET | /health |
Node and edge counts — confirms graph loaded |
| GET | /graph/summary |
Counts broken down by node type and edge type |
| GET | /graph/nodes?type=X&limit=50 |
Paginated list of nodes of a given type |
| GET | /graph/node/{node_id} |
All properties of one node plus its 1-hop neighbours |
| GET | /graph/path?from=X&to=Y |
Shortest directed path between two nodes |
| POST | /query |
Natural language question → grounded answer + nodes referenced |
| GET | /metrics |
Live uptime, query count, error rate, memory usage |
Interactive API docs: https://grapho2c.onrender.com/docs
| Question | Query mode |
|---|---|
| Which products are associated with the highest number of billing documents? | SQL — multi-table aggregation |
| Trace the full flow of sales order 740506 | Graph traversal trace |
| How many orders were delivered but not billed? | SQL — gap analysis |
| Which customers have the highest total order value? | SQL — revenue ranking |
| Show all billing documents that were cancelled | SQL — status filtering |
| What is the journal entry linked to billing document 90504204? | SQL — document lookup |
Requires Python 3.10+ and a free Groq API key from console.groq.com.
git clone https://github.com/yerramsettysuchita/GraphO2C.git
cd GraphO2C/backend
pip install -r requirements.txt
cp .env.example .env
# Edit .env and set: GROQ_API_KEY=gsk_...
python main.pyOpen http://localhost:8000 in your browser.
The first run ingests all Dataset files and writes grapho2c.duckdb. Every subsequent run reuses the file. Delete it to force a full re-ingestion.
GraphO2C/
├── Dataset/ ← 20 JSONL table folders (SAP O2C data)
├── backend/
│ ├── main.py ← Entry point: ingest, build graph, start API
│ ├── db.py ← DuckDB singleton connection
│ ├── ingestion.py ← Glob-based JSONL ingestion + denormalised views
│ ├── graph_builder.py ← Builds the NetworkX DiGraph (13 edge types)
│ ├── llm.py ← Two-step Groq pipeline + graph trace mode
│ ├── api.py ← FastAPI routes + static file serving
│ ├── metrics.py ← In-process query/error counters
│ ├── constants.py ← Node type colours shared with tests
│ ├── requirements.txt
│ └── .env.example
├── frontend/
│ ├── src/
│ │ ├── App.jsx ← Root layout
│ │ ├── api.js ← Fetch wrapper for all API calls
│ │ ├── hooks/
│ │ │ ├── useServer.js ← Polls /health until graph is ready
│ │ │ ├── useGraph.js ← Cytoscape lifecycle: load, expand, highlight, search
│ │ │ └── useChat.js ← Chat state, 45s timeout, slow-query indicator
│ │ └── components/
│ │ ├── TopBar.jsx
│ │ ├── GraphCanvas.jsx
│ │ ├── ChatPanel.jsx
│ │ ├── NodeInspector.jsx
│ │ └── LoadingOverlay.jsx
│ └── vite.config.js
├── frontend-dist/ ← Pre-built bundle (committed — served directly by FastAPI)
├── render.yaml ← Render deployment config
└── .github/workflows/test.yml ← CI: pytest + bundle size check