aksprat/do-inference-benchmark

DigitalOcean Inference Benchmark

A web app for benchmarking any model on the DigitalOcean Serverless Inference engine. Pick models by ID, set your token mix and cache-hit ratio, and compare latency, throughput, and behavior under concurrency side by side. Built with Next.js so it deploys to App Platform in a few clicks.

What it measures

Speed (single request, run sequentially):

  • Time to first token (TTFT), p50 and p95
  • Output throughput (tokens/sec)
  • End-to-end latency, p50 and p95
  • Actual prompt / completion / cached token counts from the API
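
As a rough illustration of how the speed metrics above could be derived from per-request timestamps, here is a minimal sketch. The names `RequestTiming`, `percentile`, and `summarizeSpeed` are hypothetical and not taken from the app's source. It assumes nearest-rank percentiles and assumes output throughput is measured over the generation window after the first token; the app may define either differently.

```typescript
// Hypothetical timing record for one streamed request (not the app's actual type).
interface RequestTiming {
  startMs: number;          // when the request was sent
  firstTokenMs: number;     // when the first streamed token arrived
  endMs: number;            // when the stream finished
  completionTokens: number; // completion tokens reported in the API usage data
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function summarizeSpeed(timings: RequestTiming[]) {
  const ttft = timings.map((t) => t.firstTokenMs - t.startMs);
  const e2e = timings.map((t) => t.endMs - t.startMs);
  // Assumption: throughput = completion tokens / seconds spent generating.
  const perRequestTps = timings.map((t) => {
    const seconds = (t.endMs - t.firstTokenMs) / 1000;
    return seconds > 0 ? t.completionTokens / seconds : 0;
  });
  const meanTps =
    perRequestTps.reduce((sum, v) => sum + v, 0) / perRequestTps.length;
  return {
    ttftP50: percentile(ttft, 50),
    ttftP95: percentile(ttft, 95),
    e2eP50: percentile(e2e, 50),
    e2eP95: percentile(e2e, 95),
    tokensPerSec: meanTps,
  };
}
```

With only a handful of sequential requests, p95 under nearest-rank is simply the slowest sample, so larger request counts give a more meaningful tail.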

Concurrency sweep (load test):

  • p50 / p95 / p99 end-to-end latency at each concurrency level
  • Completed requests per second
  • Aggregate output tokens/sec across all in-flight requests
  • Error rate (where rate limits start to bite)
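
One way to picture a single concurrency level is a fixed pool of workers pulling from a shared request counter. The sketch below is an assumption about the general technique, not the app's implementation: `Outcome`, `summarizeLevel`, `runLevel`, and the `send` callback (which should resolve to the number of output tokens, or reject on failure) are all hypothetical names. Latency percentiles are omitted to keep it short.

```typescript
// Hypothetical per-request outcome (not the app's actual type).
interface Outcome {
  ok: boolean;
  latencyMs: number;
  outputTokens: number;
}

// Pure aggregation over one concurrency level.
function summarizeLevel(outcomes: Outcome[], wallMs: number) {
  const seconds = wallMs / 1000;
  const completed = outcomes.filter((o) => o.ok);
  const tokens = completed.reduce((sum, o) => sum + o.outputTokens, 0);
  return {
    requestsPerSec: seconds > 0 ? completed.length / seconds : 0,
    tokensPerSec: seconds > 0 ? tokens / seconds : 0,
    errorRate:
      outcomes.length > 0
        ? (outcomes.length - completed.length) / outcomes.length
        : 0,
  };
}

// Runs `total` requests with at most `concurrency` in flight at once.
async function runLevel(
  concurrency: number,
  total: number,
  send: () => Promise<number>, // hypothetical: resolves to output token count
) {
  const outcomes: Outcome[] = [];
  let next = 0;
  const started = Date.now();

  async function worker(): Promise<void> {
    // The check and increment happen without an await in between,
    // so workers never claim the same request slot.
    while (next < total) {
      next++;
      const t0 = Date.now();
      try {
        const outputTokens = await send();
        outcomes.push({ ok: true, latencyMs: Date.now() - t0, outputTokens });
      } catch {
        outcomes.push({ ok: false, latencyMs: Date.now() - t0, outputTokens: 0 });
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return summarizeLevel(outcomes, Date.now() - started);
}
```

Counting failures separately from completions is what lets the error rate rise visibly once rate limits start rejecting requests, while completed requests per second flattens out.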

Results render as live charts and a comparison table, and export to CSV or JSON.

Inputs

  • Models: any model ID from the DigitalOcean model list. Pick from the catalog or paste any ID. Up to 6 at once.
  • Input tokens / Output tokens: target prompt size and max generation.
  • Cache hit ratio: the fraction of the input sent as a reused, cacheable prefix. On cache-enabled models a higher ratio lowers TTFT and shows up as cached tokens in the usage data. See "How cache modeling works" below.
  • Concurrency levels and requests per level for the load test.
  • Streaming on/off, temperature, and mock mode.

How cache modeling works

The app builds each prompt from two parts:

  1. A system message that is identical across every request in a run. This is the cacheable prefix. Its size is inputTokens * cacheHitRatio.
  2. A user message that carries a unique nonce so it never caches. Its size is the remainder.

Reusing an identical prefix lets whatever prompt caching the endpoint supports kick in, so the TTFT and cached-token numbers you see reflect real caching behavior rather than a formula. Token sizing of the synthetic prompt is approximate, but every number shown comes from the model's reported usage, not from our estimate.
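
The two-part prompt described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's code: `buildMessages` and `filler` are hypothetical names, and the 4-characters-per-token sizing is a crude assumption standing in for whatever approximation the app uses.

```typescript
// Assumption: about 4 characters per token. Real tokenizers vary by model.
const CHARS_PER_TOKEN = 4;

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Deterministic filler text of approximately `tokens` tokens.
function filler(tokens: number): string {
  const word = "lorem ";
  const chars = tokens * CHARS_PER_TOKEN;
  return word.repeat(Math.ceil(chars / word.length)).slice(0, chars);
}

function buildMessages(
  inputTokens: number,
  cacheHitRatio: number,
  nonce: string, // must differ on every request, e.g. a random ID
): ChatMessage[] {
  const ratio = Math.min(Math.max(cacheHitRatio, 0), 1);
  const prefixTokens = Math.round(inputTokens * ratio);
  const uniqueTokens = inputTokens - prefixTokens;
  return [
    // Identical across the run: the cacheable prefix.
    { role: "system", content: filler(prefixTokens) },
    // The nonce comes first so nothing after the prefix can match a cache entry.
    { role: "user", content: `[${nonce}] ` + filler(uniqueTokens) },
  ];
}
```

For example, `buildMessages(1000, 0.75, "req-1")` and `buildMessages(1000, 0.75, "req-2")` share the same system message of roughly 750 tokens and differ only in the user message.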

Run locally

npm install
cp .env.example .env.local   # add your Model Access Key
npm run dev                  # http://localhost:3000

Mock mode is on by default, so you can explore the UI with simulated timings and no API key. Uncheck "mock mode" to hit the live endpoint.

Environment variables

Variable | Required | Default | Notes
DO_MODEL_ACCESS_KEY | for live runs | none | Create one in the DO Control Panel. You can also paste a key in the UI instead.
DO_INFERENCE_BASE_URL | no | https://inference.do-ai.run/v1 | OpenAI-compatible base URL.
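
A backend might resolve these settings along the following lines. This is a hedged sketch: `resolveConfig` is a hypothetical helper, and letting a key pasted in the UI take precedence over the environment variable is an assumption, since the precedence is not documented here.

```typescript
const DEFAULT_BASE_URL = "https://inference.do-ai.run/v1";

interface InferenceConfig {
  baseUrl: string;
  apiKey: string | undefined; // undefined means only mock mode can run
}

// `env` would typically be process.env; it is passed in to keep this testable.
function resolveConfig(
  env: Record<string, string | undefined>,
  uiKey?: string,
): InferenceConfig {
  const baseUrl = (env.DO_INFERENCE_BASE_URL || DEFAULT_BASE_URL).replace(/\/+$/, "");
  // Assumption: a key supplied through the UI overrides the environment variable.
  const apiKey = uiKey || env.DO_MODEL_ACCESS_KEY || undefined;
  return { baseUrl, apiKey };
}
```

Stripping trailing slashes from the base URL avoids a double slash when a path such as `/chat/completions` is appended.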

Deploy to App Platform

Option A: dashboard. Push this folder to a Git repo, then in App Platform choose Create App and import the repo. App Platform auto-detects Next.js. Add DO_MODEL_ACCESS_KEY as an encrypted environment variable.

Option B: spec file. Edit the github block in .do/app.yaml, then:

doctl apps create --spec .do/app.yaml

Set the DO_MODEL_ACCESS_KEY secret in the dashboard or with doctl apps update.

Notes

  • The app calls the inference endpoint server-side, so a key pasted in the UI is sent only to this app's own backend. The browser never contacts the inference endpoint directly.
  • Long load tests run as a streamed response. Keep request counts reasonable on small instance sizes.
  • Cost modeling and a small quality-eval mode are natural next additions; the result types already carry the token counts needed for cost math.
