A web app for benchmarking any model on the DigitalOcean Serverless Inference engine. Pick models by ID, set your token mix and cache-hit ratio, and compare latency, throughput, and behavior under concurrency side by side. Built with Next.js so it deploys to App Platform in a few clicks.
Speed (single request, run sequentially):
- Time to first token (TTFT), p50 and p95
- Output throughput (tokens/sec)
- End-to-end latency, p50 and p95
- Actual prompt / completion / cached token counts from the API
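
The p50 and p95 figures above are percentiles over the per-request samples. As a rough illustration, here is a minimal nearest-rank sketch; `percentile` is a hypothetical helper, and the app's actual percentile method (nearest-rank, interpolated, or otherwise) is not documented here.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
// Hypothetical helper for illustration; not necessarily how the app computes it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile() needs at least one sample");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: the smallest sample with at least p% of samples at or below it.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Example: ten TTFT samples from sequential requests.
const ttftMs = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
const ttftP50 = percentile(ttftMs, 50);
const ttftP95 = percentile(ttftMs, 95);
```

With only a handful of samples, p95 collapses onto the slowest request, so tail numbers are more meaningful with larger request counts.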
Concurrency sweep (load test):
- p50 / p95 / p99 end-to-end latency at each concurrency level
- Completed requests per second
- Aggregate output tokens/sec across all in-flight requests
- Error rate (where rate limits start to bite)
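
To make the per-level metrics concrete, the sketch below derives them from raw request results. The `RequestResult` and `LevelSummary` shapes and the `summarizeLevel` name are assumptions for illustration, as is the choice to count only successful requests toward throughput; the app's own result types may differ.

```typescript
// Hypothetical shape of one finished request in a load-test level.
interface RequestResult {
  ok: boolean;              // false for failures, including rate-limit responses
  latencyMs: number;        // end-to-end latency of this request
  completionTokens: number; // from the API's reported usage; 0 on failure
}

// Hypothetical summary for one concurrency level.
interface LevelSummary {
  concurrency: number;
  requestsPerSec: number;
  outputTokensPerSec: number;
  errorRate: number;
}

function summarizeLevel(
  concurrency: number,
  results: RequestResult[],
  wallClockMs: number,
): LevelSummary {
  if (wallClockMs <= 0) {
    throw new RangeError("wallClockMs must be positive");
  }
  const seconds = wallClockMs / 1000;
  const succeeded = results.filter((r) => r.ok);
  const tokens = succeeded.reduce((sum, r) => sum + r.completionTokens, 0);
  return {
    concurrency,
    // Assumption: only successful requests count as "completed".
    requestsPerSec: succeeded.length / seconds,
    // Aggregate across all requests in the level, not per request.
    outputTokensPerSec: tokens / seconds,
    errorRate:
      results.length === 0
        ? 0
        : (results.length - succeeded.length) / results.length,
  };
}

// Example: four requests at concurrency 4, finished in 2 seconds of wall-clock time.
const levelSummary = summarizeLevel(
  4,
  [
    { ok: true, latencyMs: 1800, completionTokens: 100 },
    { ok: true, latencyMs: 1900, completionTokens: 100 },
    { ok: true, latencyMs: 2000, completionTokens: 100 },
    { ok: false, latencyMs: 150, completionTokens: 0 },
  ],
  2000,
);
```

Dividing by wall-clock time for the whole level, rather than summing per-request rates, is what makes the tokens/sec figure an aggregate across in-flight requests.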
Results render as live charts and a comparison table, and export to CSV or JSON.
- Models: any model ID from the DigitalOcean model list. Pick from the catalog or paste any ID. Up to 6 at once.
- Input tokens / Output tokens: target prompt size and max generation.
- Cache hit ratio: the fraction of the input sent as a reused, cacheable prefix. On cache-enabled models a higher ratio lowers TTFT and shows up as cached tokens in the usage data. See "How cache modeling works" below.
- Concurrency levels and requests per level for the load test.
- Streaming on/off, temperature, and mock mode.
The app builds each prompt from two parts:
- A system message that is identical across every request in a run. This is the cacheable prefix. Its size is `inputTokens * cacheHitRatio`.
- A user message that carries a unique nonce so it never caches. Its size is the remainder.
Reusing an identical prefix lets whatever prompt caching the endpoint supports kick in, so the TTFT and cached-token numbers you see reflect real caching behavior rather than a formula. Token sizing of the synthetic prompt is approximate, but every number shown comes from the model's reported usage, not from our estimate.
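
The two-part prompt described above can be sketched as follows. This is an illustration under stated assumptions, not the app's actual code: `planPrompt` and `buildMessages` are hypothetical names, the split is rounded to whole tokens, and the filler assumes roughly one token per repeated word, which is only approximate.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Split the target input size into a shared prefix and a unique remainder.
function planPrompt(
  inputTokens: number,
  cacheHitRatio: number,
): { prefixTokens: number; uniqueTokens: number } {
  if (!Number.isInteger(inputTokens) || inputTokens < 0) {
    throw new RangeError("inputTokens must be a non-negative integer");
  }
  if (cacheHitRatio < 0 || cacheHitRatio > 1) {
    throw new RangeError("cacheHitRatio must be between 0 and 1");
  }
  // Assumption: round to a whole number of tokens.
  const prefixTokens = Math.round(inputTokens * cacheHitRatio);
  return { prefixTokens, uniqueTokens: inputTokens - prefixTokens };
}

// Assumption: one filler word is roughly one token. Real tokenizers differ.
function filler(tokens: number): string {
  return "lorem ".repeat(tokens).trim();
}

function buildMessages(
  inputTokens: number,
  cacheHitRatio: number,
  nonce: string,
): [ChatMessage, ChatMessage] {
  const { prefixTokens, uniqueTokens } = planPrompt(inputTokens, cacheHitRatio);
  return [
    // Identical for every request in a run, so it can be served from cache.
    { role: "system", content: filler(prefixTokens) },
    // The nonce comes first so requests diverge right after the shared prefix.
    { role: "user", content: `nonce:${nonce} ${filler(uniqueTokens)}` },
  ];
}

const promptPlan = planPrompt(1000, 0.75);
const firstRequest = buildMessages(8, 0.5, "a1");
const secondRequest = buildMessages(8, 0.5, "b2");
```

Here `promptPlan` asks for a 750-token shared prefix and a 250-token unique remainder, and the two example requests share a system message while differing in the user message.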

    npm install
    cp .env.example .env.local   # add your Model Access Key
    npm run dev                  # http://localhost:3000

Mock mode is on by default, so you can explore the UI with simulated timings and no API key. Uncheck "mock mode" to hit the live endpoint.

| Variable | Required | Default | Notes |
|---|---|---|---|
| `DO_MODEL_ACCESS_KEY` | for live runs | none | Create one in the DO Control Panel. You can also paste a key in the UI instead. |
| `DO_INFERENCE_BASE_URL` | no | `https://inference.do-ai.run/v1` | OpenAI-compatible base URL. |
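
As an illustration of how the two variables and the UI-pasted key could fit together, here is a minimal sketch. `resolveConfig` and `InferenceConfig` are hypothetical names, and letting the UI key take precedence over the environment variable is an assumption, not documented behavior.

```typescript
interface InferenceConfig {
  apiKey: string | null; // null means no key is available, e.g. mock mode only
  baseUrl: string;
}

// Default taken from the table above.
const DEFAULT_BASE_URL = "https://inference.do-ai.run/v1";

function resolveConfig(
  env: Record<string, string | undefined>,
  uiKey?: string,
): InferenceConfig {
  const rawBaseUrl = env["DO_INFERENCE_BASE_URL"]?.trim() || DEFAULT_BASE_URL;
  // Assumption: a key pasted in the UI overrides the environment variable.
  const apiKey = uiKey?.trim() || env["DO_MODEL_ACCESS_KEY"]?.trim() || null;
  return {
    apiKey,
    // Drop trailing slashes so paths can be appended without doubling them.
    baseUrl: rawBaseUrl.replace(/\/+$/, ""),
  };
}

const emptyEnvConfig = resolveConfig({});
const uiKeyConfig = resolveConfig({ DO_MODEL_ACCESS_KEY: "env-key" }, "ui-key");
```

On the server this would typically be called with `process.env`; taking the environment as a parameter keeps the function easy to test.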
Option A: dashboard. Push this folder to a Git repo, then in App Platform choose Create App, import the repo, and App Platform auto-detects Next.js. Add `DO_MODEL_ACCESS_KEY` as an encrypted environment variable.

Option B: spec file. Edit the `github` block in `.do/app.yaml`, then:

    doctl apps create --spec .do/app.yaml

Set the `DO_MODEL_ACCESS_KEY` secret in the dashboard or with `doctl apps update`.

- The app calls the inference endpoint server-side, so a key pasted in the UI is sent only to this app's own backend; the browser never contacts the inference endpoint directly.
- Long load tests run as a streamed response. Keep request counts reasonable on small instance sizes.
- Cost modeling and a small quality-eval mode are natural next additions; the result types already carry the token counts needed for cost math.