A production-grade, asynchronous, AI-driven video interview platform.
- Candidates receive a UUID link, complete a hardware check, then talk live with an AI interviewer (Deepgram Voice Agent) that asks questions, listens to their answers, and verbally responds with Groq-generated feedback.
- Recruiters create templates, issue interview links, and review per-question video + transcript + AI score + proctoring flags.
Problem. Manual first-round screening is the most expensive, least scalable part of hiring. A recruiter doing 8 phone screens a day burns 6 hours producing the same low-signal output for each candidate.
System need. We want to:
- Let recruiters set a question bank once, then issue async links to N candidates.
- Capture high-fidelity candidate signal: actual video, actual transcript, actual answer audio.
- Provide immediate AI feedback during the interview (so it feels like a real conversation to the candidate, not a webcam-into-the-void).
- Surface a recruiter-ready scorecard with proctoring flags so the recruiter only has to review the top-N.
┌────────────────────────────────────────────────────────┐
│ Browser │
│ │
│ HardwareCheck → InterviewPage │
│ │ │ │
│ │ getUserMedia │ │
│ ▼ ▼ │
│ <video> MediaRecorder ──► 5s .webm chunks │
│ │ │ │
│ │ ├──► POST /api/chunks/:token ───┐│
│ │ │ ││
│ │ ┌────────────┴───────────────┐ ││
│ │ │ Deepgram Voice Agent (WS) │ ││
│ │ │ audio in/out + JSON │ ││
│ │ └────────────┬───────────────┘ ││
│ │ │ wss://backend/api/voice-agent/ ││
│ │ Socket.IO ───┴──► proctoring_event ││
└──────┼─────────────────────────────────────────────────┘│
│ │
▼ │
┌──────────────────────────────────────────────────────┐ │
│ Express + ws │ ◄────┘
│ REST: /api/auth, /api/sessions, /api/chunks, │
│ /api/recruiter │
│ WS proxy: /api/voice-agent/:token → Deepgram │
│ Socket.IO: proctoring + recruiter live │
└────────────┬─────────────────┬───────────────────────┘
│ │
▼ ▼
┌─────────┐ ┌──────────┐
│ MongoDB │ │ S3 │ (raw chunks + merged.webm)
└─────────┘ └────┬─────┘
▲ │
│ ┌────┴────┐
└────────────┤ BullMQ │── Worker process
│ Redis │ FFmpeg merge + Groq final score
└─────────┘
- Capture (client): `getUserMedia` is called exactly once on the HardwareCheck page. The resulting `MediaStream` is passed via React Router state into `InterviewPage` so the video track is never re-acquired (re-acquiring it can end the existing track on some browsers).
- Recording (client): `MediaRecorder` records video+audio in WebM. It emits a Blob every 5 seconds (`timeslice`). Each Blob is POSTed to `/api/chunks/:token` as multipart form data.
- Voice (client ↔ server ↔ Deepgram): A separate `WebSocket` opens to our backend's `/api/voice-agent/:token` proxy. Microphone audio is captured separately (via an inline `AudioWorklet` that produces 16 kHz linear PCM) and pushed upstream as binary WS frames. Deepgram returns binary TTS audio (24 kHz PCM), which is played back through the Web Audio API, plus JSON control frames (transcripts, speech state).
- Chunk storage (server): Each chunk is written to S3 with the deterministic key `${sessionId}/${questionIndex}/chunk_${pad3(chunkIndex)}.webm`. The chunk's existence is also recorded on the session document so the worker doesn't depend on S3 listing.
- Async processing (worker): When the candidate finishes, the session is locked and a BullMQ job is enqueued. The worker downloads chunks for each question, concatenates them with FFmpeg (`-f concat -c copy`, no re-encode), uploads the merged `.webm` to S3, and writes the merged key back to the session.
- Scoring (immediate, per question): As soon as a question's transcript is captured, the client calls `/api/sessions/:token/answer/review`, which calls Groq (`llama-3.1-8b-instant`) twice in parallel, once for `{score, feedback}` JSON and once for a human-sounding 2-3 sentence spoken review, and persists both. The spoken review is then handed back to the Voice Agent to speak aloud.
- Recruiter retrieval: Recruiter dashboard polls the session detail every 5s while in `processing`. Merged videos are served via pre-signed S3 URLs (1h TTL).
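
The chunk-storage and async-processing steps above can be sketched as two small helpers. This is a minimal illustration, not the project's actual code: `pad3` is the name used in the key template, while `chunkKey` and `buildConcatList` are hypothetical names, and the list-file format assumes FFmpeg's concat demuxer (`-f concat`).

```javascript
// Zero-pad a non-negative integer to three digits, e.g. 7 -> "007".
function pad3(n) {
  return String(n).padStart(3, '0');
}

// Build the deterministic S3 key for one chunk.
function chunkKey(sessionId, questionIndex, chunkIndex) {
  if (!Number.isInteger(chunkIndex) || chunkIndex < 0) {
    throw new RangeError('chunkIndex must be a non-negative integer');
  }
  return `${sessionId}/${questionIndex}/chunk_${pad3(chunkIndex)}.webm`;
}

// Build the text of a concat-demuxer list file from local chunk paths.
// Sorting works because the zero-padded names sort in numeric order.
// Paths containing a single quote would need escaping (not handled here).
function buildConcatList(localPaths) {
  return [...localPaths].sort().map((p) => `file '${p}'`).join('\n') + '\n';
}
```

Because the key depends only on the session, question, and chunk index, a retried upload lands on the same object rather than creating a second one.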
- Socket.IO (proctoring): Client emits `join_session({token})` then `proctoring_event(...)`. Server saves to MongoDB and fans out to `recruiter_${sessionId}` rooms for live dashboards.
- Native WS (voice agent proxy): Two-way piping between client and `agent.deepgram.com/agent`. The server sniffs JSON frames to extract transcripts and persist them per-question, so even if the client disconnects mid-answer, the transcript is in MongoDB.
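
As a rough sketch of the frame sniffing described above: binary frames are treated as audio and never parsed, while text frames are parsed as JSON and checked for a transcript. The `ConversationText` type with `role` and `content` fields is an assumption about the Deepgram message shape (the real hook accepts several variants), and `extractTranscript` is a hypothetical name.

```javascript
// Inspect one WebSocket frame passing through the proxy.
// Returns { role, text } for a transcript frame, otherwise null.
function extractTranscript(data, isBinary) {
  if (isBinary) return null; // audio: forward untouched, never parse
  let msg;
  try {
    msg = JSON.parse(data.toString());
  } catch {
    return null; // not JSON: nothing to persist
  }
  if (msg === null || typeof msg !== 'object') return null;
  if (msg.type !== 'ConversationText') return null; // assumed message type
  if (typeof msg.content !== 'string' || msg.content.trim() === '') return null;
  return {
    role: msg.role === 'user' ? 'user' : 'assistant',
    text: msg.content.trim(),
  };
}
```

The `(data, isBinary)` pair matches what the `ws` library passes to a `message` listener, so the function could be called on each upstream frame before it is forwarded to the client.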
| Decision | Why |
|---|---|
| Streaming chunks over a single end-of-interview upload | If a candidate drops at minute 19 of 20, we still have 19 minutes of usable video. Also avoids a multi-hundred-MB blob blocking the UI thread when stopping MediaRecorder. |
| Separating Voice Agent audio from MediaRecorder | They serve different masters. Voice Agent gives us a live transcript + bidirectional speech; MediaRecorder gives us a recording for recruiter playback. Trying to do both with the same audio path means choosing between "transcript quality" and "recording quality" — we get both. |
| Backend WS proxy for Deepgram | The DEEPGRAM_API_KEY never reaches the browser. The proxy also gives us a free instrumentation point — every transcript is logged + persisted server-side. |
| Deterministic chunk keys (`chunk_NNN`) | Lets FFmpeg concat using a simple sorted listing even if chunks arrived out-of-order or are re-uploaded after retry. Zero-padding to 3 digits makes lexicographic sort = numeric sort. |
| BullMQ + Upstash Redis | Decouples request latency from FFmpeg/Groq latency. Upstash gives us a managed Redis with TLS that BullMQ supports natively. Multiple worker replicas can be added horizontally. |
| Atomic lock on `completed` transition | A `findOneAndUpdate({ isLocked: false }, {$set: {isLocked: true, ...}})` guarantees we never enqueue the worker twice if the client retries the status POST. |
| Per-question Groq during interview (vs. only at the end) | Makes the interview feel like a conversation, which dramatically improves candidate experience and engagement. Also amortises Groq cost across the interview rather than spiking at the end. |
| face-api.js TinyFaceDetector | Smallest model (~190 KB), runs at 224 input size on CPU at >30 FPS on a modern laptop. Good enough for "is there exactly one face?", which is all proctoring needs. Models served from /public/models so we don't depend on an external CDN. |
| JWT auth (recruiter) | Simple, stateless, works behind any reverse proxy. Token is stored in zustand/persist (localStorage). |
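
The lock-then-enqueue decision from the table can be illustrated as follows. This is a sketch under stated assumptions: `FakeSessions` is an in-memory stand-in so the example runs without MongoDB, and `completeSession` and the `process-session` job name are hypothetical. Against a real collection, the same filter and `$set` are applied atomically by the database.

```javascript
// In-memory stand-in for a Mongo collection, only so the sketch runs on its own.
class FakeSessions {
  constructor(docs) {
    this.docs = docs;
  }

  // Find the first doc matching every filter field, apply $set, return it.
  async findOneAndUpdate(filter, update) {
    const doc = this.docs.find((d) =>
      Object.entries(filter).every(([key, value]) => d[key] === value));
    if (!doc) return null;
    Object.assign(doc, update.$set);
    return doc;
  }
}

// Lock the session and enqueue the job only if this call won the lock.
async function completeSession(sessions, queue, token) {
  const locked = await sessions.findOneAndUpdate(
    { token, isLocked: false },
    { $set: { isLocked: true, status: 'completed' } },
  );
  if (!locked) return { enqueued: false }; // already locked: a retried POST
  await queue.add('process-session', { token });
  return { enqueued: true };
}
```

A second call with the same token finds no document with `isLocked: false`, so it returns without touching the queue.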
| Scenario | Mitigation |
|---|---|
| Network drop during chunk POST | Client retries the chunk once after 1s. If both fail, the chunk is logged & dropped; the worker continues with the remaining chunks. |
| Duplicate chunks (client retried after partial success) | S3 PutObject is idempotent on a key; Mongo upsert checks (questionIndex, chunkIndex) and overwrites. |
| Camera/mic disconnected mid-answer | MediaStreamTrack.onended fires → emits CAMERA_DISCONNECT / MIC_DISCONNECT proctoring event. The MediaRecorder keeps producing the chunks it can; the recruiter sees the disconnect in the flag list. |
| WebSocket reconnect | Socket.IO client has built-in exponential reconnect (10 attempts). On reconnect we re-emit join_session and emit a synthetic RECONNECT event so the recruiter can see the gap. Voice Agent WS is more sensitive; if it drops, the agent stops speaking — the candidate is shown the agent status chip and can continue (chunks still upload and transcript will be reconstructed by the worker if needed). |
| Empty / corrupted media chunk | Two-layer guard: client refuses to POST if blob.size < 100; server refuses with 204 No Content if req.file.size < 100. During merge, individual unreadable chunks are skipped, not allowed to crash the whole job. |
| Partial upload failure (some chunks for Q3 made it, some didn't) | Worker still merges whatever's present. Mongo session.uploadedChunks is the source of truth — the worker reads from it, falling back to an S3 ListObjectsV2 prefix scan if it's empty. |
| Browser tab killed mid-interview | Session stays in in_progress. The recruiter sees it. If the candidate re-opens the link AND isLocked === false, they can resume from currentQuestionIndex (this scaffolding is present — resume UI can be added without schema changes). If isLocked === true (e.g. they already hit "completed"), they see the locked screen forever. |
| Deepgram outage | The agent WS errors → status becomes error, candidate sees the chip. Chunks still upload; the worker still produces a video for the recruiter. The recruiter will see no transcript but full video — better than nothing. |
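
The client-side chunk handling from the table (skip tiny blobs, retry once after a delay, then log and drop) might look like the sketch below. `uploadChunk` and the injected `send` function are hypothetical; `send` stands for whatever performs the multipart POST and throws on failure.

```javascript
const MIN_CHUNK_BYTES = 100; // same threshold as the empty-chunk guard

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Upload one chunk, retrying once after retryDelayMs.
// Returns 'skipped', 'uploaded', or 'dropped'.
async function uploadChunk(send, blob, { retryDelayMs = 1000 } = {}) {
  if (blob.size < MIN_CHUNK_BYTES) return 'skipped';
  try {
    await send(blob);
    return 'uploaded';
  } catch (firstError) {
    await sleep(retryDelayMs);
    try {
      await send(blob);
      return 'uploaded';
    } catch (secondError) {
      // Both attempts failed: log and move on so recording is not blocked.
      console.warn('chunk dropped after retry:', secondError.message);
      return 'dropped';
    }
  }
}
```

Returning a status instead of throwing keeps the recording loop simple: a dropped chunk is a logged gap, not an interview-ending error.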
- State persistence. Every meaningful event is written to MongoDB before responding 200. Session document is the central brain.
- Chunk recovery. Deterministic keys + Mongo `uploadedChunks` + S3 prefix fallback means the merge step is robust to any ordering / retry pattern.
- Retry/recovery logic.
  - Chunk POST: 1 retry on the client.
  - BullMQ job: 5 attempts with exponential backoff (5s base).
  - Worker per-question loop: errors in one question never abort others; we degrade gracefully.
- Socket reconnection. Built-in Socket.IO reconnect + synthetic `RECONNECT` proctoring event so the recruiter has a precise timeline.
- Failure handling.
  - If the BullMQ job exhausts all attempts → session status moves to `failed` so the recruiter sees it.
  - Mongo writes use `$set` with `findOneAndUpdate` for the locking critical section; no read-modify-write race.
  - Worker uses temp directories per-question and `fs.rm(..., {recursive: true, force: true})` in `finally` blocks so disk doesn't fill up on partial failure.
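
For the BullMQ retry policy listed above, the job options could be written as below. `attempts` and `backoff` are standard BullMQ job options. The delay formula in `retryDelays` (a hypothetical helper) assumes BullMQ's built-in exponential strategy doubles the base delay on each retry, which is worth checking against the BullMQ version in use.

```javascript
// Job options matching "5 attempts with exponential backoff (5s base)".
const jobOptions = {
  attempts: 5,
  backoff: { type: 'exponential', delay: 5000 },
};

// Delay in ms before each retry, assuming delay * 2^(attemptsMade - 1).
// With 5 attempts there are 4 retries.
function retryDelays({ attempts, backoff }) {
  const delays = [];
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade += 1) {
    delays.push(backoff.delay * 2 ** (attemptsMade - 1));
  }
  return delays;
}
```

Under that assumption, a job that keeps failing is retried after roughly 5s, 10s, 20s, and 40s before it is marked failed.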
- Stats row at the top of the dashboard so the recruiter can sanity-check at a glance ("did 12 interviews go out, did any fail?").
- Filter tabs to triage: almost everyone wants to see `done` first, then `failed`, then everything else.
- Per-question card with video + transcript + score side-by-side so the recruiter can spot-check questions they care about without scrubbing through a 20-minute monolith.
- Pre-signed URLs (1h TTL) so we don't leak our S3 bucket but the recruiter doesn't need separate credentials.
- Live polling while `processing` so the recruiter can open the link immediately after the candidate finishes and watch results stream in.
- Mandatory hardware check with live face detection — catches the "I forgot to plug in my webcam" / "my flatmate is in the room" cases before time pressure kicks in.
- Real conversation feel — the agent speaks the question, listens, then speaks honest feedback. Far less alienating than a webcam interview with no other human (or AI).
- Silence detection — candidate doesn't have to find a "next" button when they're done; they just stop talking for 4 seconds and the interview advances.
- Progress bar at the bottom so they always know how much is left.
- Persistent lock so the link can't be re-attempted — a clear signal that the recording is final.
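
The silence-detection rule above (advance after 4 seconds without speech) can be sketched as a small state tracker. The RMS threshold, the class name `SilenceDetector`, and the choice to require some speech before the silence window counts are all assumptions made for illustration.

```javascript
// Tracks how long the input has been quiet and reports when the
// silence window has elapsed after the candidate has spoken.
class SilenceDetector {
  constructor({ threshold = 0.01, windowMs = 4000 } = {}) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.silentSince = null; // timestamp when the current quiet run began
    this.heardSpeech = false;
  }

  // Feed one audio level (RMS, 0..1) with its timestamp in ms.
  // Returns true once speech was heard and the input then stayed
  // below the threshold for at least windowMs.
  update(rms, nowMs) {
    if (rms >= this.threshold) {
      this.heardSpeech = true;
      this.silentSince = null;
      return false;
    }
    if (this.silentSince === null) this.silentSince = nowMs;
    return this.heardSpeech && nowMs - this.silentSince >= this.windowMs;
  }
}
```

Passing timestamps in explicitly keeps the logic testable without real audio; in the browser the levels would come from the microphone path.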
All proctoring events go through Socket.IO to a single sink (ProctoringEvent collection). The collection is append-only — we never edit or delete an event. Recruiter dashboard groups them by session and shows them in a timeline against question index. Categories tracked: tab switch, window blur, face absent (>3s), multiple faces, fullscreen exit, copy/paste, camera/mic disconnect, socket reconnect.
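
One way the dashboard grouping could work is shown below: events are bucketed by question and ordered by time within each bucket. The field names (`questionIndex`, `type`, `timestamp`) and the function name `buildTimeline` are assumptions, not the actual schema.

```javascript
// Group proctoring events by question and order each group by time.
function buildTimeline(events) {
  const byQuestion = new Map();
  for (const event of events) {
    const list = byQuestion.get(event.questionIndex) ?? [];
    list.push(event);
    byQuestion.set(event.questionIndex, list);
  }
  return [...byQuestion.entries()]
    .sort(([a], [b]) => a - b)
    .map(([questionIndex, list]) => ({
      questionIndex,
      events: list.sort((x, y) => x.timestamp - y.timestamp),
    }));
}
```

Since the collection is append-only, this view can be recomputed from the stored events at any time without losing information.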
- Dark mode + Sora font + a single accent color (#3b82f6). Looks calm, doesn't fatigue the eye over a 20-min interview.
- Transcript shown live during the answer phase — candidates can self-correct if they realise the system misheard a critical word.
- Voice Agent status chip always visible — if something is wrong with the AI, the candidate knows immediately rather than wondering why it stopped responding.
- Locked screen wording is reassuring, not punitive ("This interview is closed" + "contact your recruiter if you think this is a mistake").
- WebSocket fan-in on the API. Each candidate keeps a Voice Agent WS open + a Socket.IO connection. At ~5k concurrent candidates per Node process, the event loop starts to drown.
- S3 chunk PUT throughput. Default S3 limits are generous, but if you have hundreds of candidates all uploading 5s chunks every 5s, you can hit per-prefix request rate limits if all chunks share a prefix. Our keys are `${sessionId}/...` so per-prefix fanout is natural.
- FFmpeg memory. The worker buffers the merged file into RAM before `putObject`. For long interviews (>30 min) we should stream the merged output directly to S3 via `lib-storage` `Upload`.
- Mongo write contention on the session document. Every chunk POST and every Voice Agent transcript chunk does a write. The chunk list grows unbounded.
- The Voice Agent playback path is the most CPU-intensive thing in the browser (decoding 24k PCM frames). On low-end laptops this can cause MediaRecorder hiccups.
- The Voice Agent upstream WS proxy in the server is currently piping all frames through a single Node process. A binary fast path (no JSON parsing) would help.
- Move the Voice Agent proxy onto its own Node service so REST traffic and WS traffic don't share an event loop.
- Run multiple BullMQ worker pods; the queue already supports this.
- Switch chunk uploads to direct-to-S3 with pre-signed POST so the API server is removed from the data path entirely. The server only records the chunk metadata.
- Move proctoring events to a write-optimised store (e.g. Timescale / Mongo time-series collection) — they grow fast.
- CDN-cache the static client (Vite build → Cloudfront / Cloudflare).
- Use Mongo `$push` with a `$slice` cap for `uploadedChunks` and overflow into a separate collection for very long interviews.
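
The capped `$push` from the last item could be built like this. `$push`, `$each`, and `$slice` are standard MongoDB update operators; the cap of 500 and the helper name `buildChunkPush` are illustrative only.

```javascript
// Build a Mongo update that appends one chunk record and keeps only the
// newest `cap` entries in uploadedChunks (a negative $slice keeps the tail).
function buildChunkPush(chunk, cap = 500) {
  if (!Number.isInteger(cap) || cap <= 0) {
    throw new RangeError('cap must be a positive integer');
  }
  return {
    $push: {
      uploadedChunks: { $each: [chunk], $slice: -cap },
    },
  };
}
```

Entries trimmed by `$slice` are gone from the session document, so the overflow write to the separate collection would need to happen for every chunk, not only after the cap is reached.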
- Winston logger, configured in `server/utils/logger.js`. Dev = colorized console; prod = JSON one-line-per-event for shipping into CloudWatch / Datadog / Loki.
- Every request has a `requestId` (UUID, honors incoming `x-request-id`). It's set as a response header and included in every log line for that request via a child logger.
- Every meaningful domain event is logged with structured metadata:
  - `chunk_uploaded`: sessionId, questionIndex, chunkIndex, size, s3Key
  - `proctoring_event`: sessionId, type, questionIndex
  - `voice_agent_*`: client_connected, upstream_open, transcript_received, upstream_close, etc.
  - `worker_job_started` / `ffmpeg_complete` / `groq_per_question_complete` / `worker_job_done` / `worker_job_failed`
  - `session_created` / `session_status_changed` / `session_locked`
  - `auth_login` / `auth_register` / `auth_invalid_token`
- HTTP access middleware (`middleware/logger.js`) logs `method, path, status, durationMs` per request.
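
A minimal sketch of the request-ID behaviour described above, written as an Express-style middleware. The function body is an assumption about how `middleware/requestId.js` might work, not a copy of it; only `crypto.randomUUID` from Node's standard library is used.

```javascript
const { randomUUID } = require('node:crypto');

// Reuse an incoming x-request-id or mint a UUID, expose it on the
// request, and echo it back as a response header.
function requestId(req, res, next) {
  const incoming = req.headers['x-request-id'];
  const id = typeof incoming === 'string' && incoming.trim() !== ''
    ? incoming.trim()
    : randomUUID();
  req.requestId = id;
  res.setHeader('x-request-id', id);
  next();
}
```

A later middleware could then create a per-request child logger carrying `req.requestId` so every log line for that request includes it.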
- Unhandled errors hit a single Express error middleware which logs with stack trace and returns a sanitised JSON body (`message` is suppressed in prod).
- `process.on('unhandledRejection')` and `uncaughtException` both forward to the logger so nothing dies silently.
- BullMQ failed jobs emit `worker_job_failed` with `jobId` and `failedReason`.
- Filter logs by `sessionId`; every event for that interview is tagged with it.
- Filter logs by `requestId` for a single REST call.
- Failed BullMQ jobs are kept (`removeOnFail: 1000`) so you can inspect them via BullMQ's admin tools or by re-running the worker on the failed jobs.
- Pre-signed URLs for failed sessions still work; you can manually inspect chunks in S3 at `${sessionId}/${questionIndex}/`.
- Mongo's session document is the central brain: `db.interviewsessions.findOne({token: "..."})` shows you status, chunks, answers, transcripts, lock state.
This codebase was built with heavy AI assistance — here's the honest breakdown.
- Claude for the overall architecture discussion (Voice Agent proxy, the audio worklet, the phase machine in `InterviewPage`).
- Documentation lookups for the Deepgram Voice Agent WS protocol and BullMQ + Upstash TLS quirks.
- Generating the initial approach.
- Getting familiar with Redis, BullMQ workers, AWS, and FFmpeg, and going deeper into AI engineering.
- Claude Sonnet and Llama for scaffolding, debug logging, and pinpointing the root cause of each bug.
- We started with a tight, opinionated spec (the one in this README + the original task brief). The AI was given the full spec up front so it could make consistent decisions across modules rather than re-deciding architectures per file.
- For each module the AI was asked to: (a) follow the spec literally, (b) leave inline `// ...` comments only where a future maintainer would need them, (c) preserve the same names / paths the spec mentions.
- The riskiest part, the Voice Agent message protocol, was written defensively: the hook accepts several variants of transcript / speech-state messages, because Deepgram's Voice Agent shapes have shifted across API revisions.
- Our decisions: the schema, the deterministic chunk-key convention, the "lock + enqueue atomically" pattern, the choice to do per-question Groq during the interview (not after), the choice to keep MediaRecorder separate from Voice Agent audio capture, prompt architecture, exact workflow.
- AI-assisted: boilerplate (routes, Mongoose schemas, MUI dialog forms), the audio worklet for PCM downsampling, the dual-channel playback approach in the Voice Agent hook, the ScoreRing SVG component, README structuring.
- Node 18+
- MongoDB Atlas cluster (or local mongo)
- Upstash Redis (or local redis)
- AWS S3 bucket in `ap-south-1`
- Deepgram API key (Voice Agent enabled)
- Groq API key
- `ffmpeg` installed on the worker host (`brew install ffmpeg` / `apt install ffmpeg`)
cd server
cp .env.example .env # fill in values
npm install
npm run dev # api server on :5000
# in a separate terminal:
npm run worker:dev # BullMQ worker

cd client
cp .env.example .env
npm install
# face-api.js model weights (one-time):
cd public/models
BASE=https://raw.githubusercontent.com/justadudewhohacks/face-api.js/master/weights
curl -sLO "$BASE/tiny_face_detector_model-weights_manifest.json"
curl -sLO "$BASE/tiny_face_detector_model-shard1"
cd ../..
npm run dev # vite dev server on :5173

Open http://localhost:5173/recruiter/login, create an account, create a template, create an interview link, open the link in another browser, complete the interview, watch it process.
- Recruiter creates template + interview → dashboard shows the row in `pending`.
- Candidate opens link → hardware check → enters name → "I'm ready" → InterviewPage.
- Voice Agent connects, speaks Q1 → thinking timer → recording starts → candidate answers → silence detected → "Reviewing..." → agent speaks Groq feedback → next question.
- After last question → agent speaks goodbye → POST status:completed → session is locked + worker enqueued → candidate sees Thank You page → /interview/:token is locked forever.
- Recruiter dashboard auto-refreshes → status moves `completed → processing → done` → recruiter opens the detail page and sees per-question merged videos + transcripts + AI scores + proctoring flags.
ai-interview/
├── server/
│ ├── index.js # Express + Socket.IO + WS upgrade
│ ├── config/{db,queue}.js
│ ├── models/index.js
│ ├── routes/
│ │ ├── auth.js # /api/auth/*
│ │ ├── sessions.js # /api/sessions/* (candidate)
│ │ ├── chunks.js # /api/chunks/:token
│ │ ├── recruiter.js # /api/recruiter/* (JWT)
│ │ └── voiceAgent.js # WS proxy /api/voice-agent/:token
│ ├── middleware/{auth,logger,requestId}.js
│ ├── utils/{s3,logger}.js
│ ├── workers/processor.js # BullMQ worker (FFmpeg + Groq)
│ └── .env.example
└── client/
├── public/models/ # face-api.js weights live here
└── src/
├── pages/{HardwareCheck,InterviewPage,InterviewComplete,InterviewLocked,RecruiterLogin,RecruiterDashboard,SessionDetail}.jsx
├── hooks/{useMediaRecorder,useProctoring,useFaceDetection,useVoiceAgent}.js
├── store/authStore.js
├── utils/api.js
├── components/shared/ProtectedRoute.jsx
├── theme.js
├── main.jsx
└── index.css