Context
You’re interviewing with FinPilot, a fintech SaaS platform that provides automated bookkeeping and cash-flow forecasting for ~120K businesses (SMBs + mid-market). FinPilot has launched a new AI Agent feature that can: (1) reconcile transactions, (2) draft invoices, (3) answer “why did my cash drop?” questions, and (4) generate monthly close checklists. The agent is monetized via credits: customers buy credit packs and each agent run consumes credits based on model, tokens, tools used, and runtime.
Today, product analytics is stitched together from application logs and a daily billing export. This has caused multiple incidents: customers disputing invoices because retries double-counted credits, the support team lacking a consistent view of what happened in a given run, and engineering being blind to rising tool-call error rates that correlated with a specific model rollout. Leadership wants a single source of truth for agent usage that supports billing accuracy, near-real-time monitoring, and self-serve analytics.
Scale & Freshness Requirements
- Traffic: ~2.5M agent runs/day average, 10M/day peak during month-end close.
- Event volume: Each run emits 10–200 events (start, tool calls, token usage, retries, errors, finish). Peak ~50K events/sec.
- Latency:
  - Operational dashboards (errors, latency, credit burn): P95 < 2 minutes end-to-end.
  - Finance/billing correctness: daily close with T+1 finalization, but must support late-arriving events up to 7 days.
- Storage: retain raw events 90 days; curated fact tables 2+ years.
- Correctness: credits must be billed exactly once per run, even with retries and partial failures.
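To put these numbers in perspective, a back-of-envelope throughput check (assuming an illustrative average of 50 events per run, within the stated 10–200 range):

```python
# Back-of-envelope throughput check for the stated traffic numbers.
# AVG_EVENTS_PER_RUN is an assumption for illustration (brief gives 10-200).

AVG_RUNS_PER_DAY = 2_500_000
PEAK_RUNS_PER_DAY = 10_000_000
AVG_EVENTS_PER_RUN = 50
SECONDS_PER_DAY = 86_400

avg_events_per_sec = AVG_RUNS_PER_DAY * AVG_EVENTS_PER_RUN / SECONDS_PER_DAY
peak_events_per_sec = PEAK_RUNS_PER_DAY * AVG_EVENTS_PER_RUN / SECONDS_PER_DAY

print(f"average: ~{avg_events_per_sec:,.0f} events/sec")        # ~1,447
print(f"peak day (uniform): ~{peak_events_per_sec:,.0f} events/sec")  # ~5,787
# The stated 50K events/sec peak is ~10x the uniform peak-day rate,
# so ingestion must be sized for bursts, not for the mean.
```

The gap between the uniform rate and the 50K/sec burst figure is the key sizing input for the Kafka and streaming layers.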
Data Characteristics & Quality Issues
Events come from multiple producers:
- Agent Orchestrator service (Kubernetes): emits lifecycle events (run_started, run_completed, run_failed).
- LLM Gateway: emits token usage (prompt_tokens, completion_tokens), model name/version, and request IDs.
- Tool execution service: emits tool_call_started/tool_call_finished, tool name, latency, and error codes.
- Billing service: emits credit ledger mutations (credit_granted, credit_reserved, credit_charged, credit_refunded).
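One way to unify these four producers is a shared event envelope. A minimal sketch, with field names that are illustrative rather than prescribed by the brief:

```python
from dataclasses import dataclass, field
from typing import Optional
import hashlib
import uuid

@dataclass
class AgentEvent:
    """Illustrative canonical envelope shared by all four producers."""
    event_id: str            # globally unique; natural dedup key
    event_type: str          # e.g. run_started, tool_call_finished, credit_charged
    run_id: str              # ties every event back to one agent run
    attempt: int             # retry attempt number within the run
    occurred_at: str         # producer-side event time (ISO 8601), not arrival time
    producer: str            # orchestrator | llm_gateway | tool_exec | billing
    schema_version: int = 1  # bumped as new tools/models add fields
    payload: dict = field(default_factory=dict)  # producer-specific fields
    prompt_sha256: Optional[str] = None          # hash only; raw prompt never stored

def redact_prompt(prompt: str) -> str:
    """Store only a hash of the prompt, per the PII requirement."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

evt = AgentEvent(
    event_id=str(uuid.uuid4()),
    event_type="run_started",
    run_id="run_123",
    attempt=1,
    occurred_at="2025-01-31T10:00:00Z",
    producer="orchestrator",
    prompt_sha256=redact_prompt("why did my cash drop?"),
)
```

Keeping event time (`occurred_at`) separate from arrival time is what later makes late-event handling and watermarking possible.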
Common issues you must design for:
- Retries: same run_id may have multiple attempts; tool calls can be retried with the same idempotency key.
- Out-of-order delivery: tool events can arrive before run_started due to buffering.
- Schema evolution: new tools and new models add fields frequently.
- PII/regulated data: prompts may contain customer financial details; raw prompts must not land in analytics without redaction.
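The retry and duplicate issues above are usually addressed by deduplicating on an idempotency key rather than on `run_id` alone. A toy sketch; in the real pipeline this would be keyed state in a streaming job or a `ROW_NUMBER()`-style dedup in dbt:

```python
# Toy deduplicator: keep the first event seen per (run_id, idempotency_key),
# so a retried tool call with the same key is counted once.
# Field names are illustrative.

def dedupe(events):
    seen = set()
    out = []
    for e in events:
        key = (e["run_id"], e.get("idempotency_key") or e["event_id"])
        if key in seen:
            continue  # duplicate delivery or retry of the same tool call
        seen.add(key)
        out.append(e)
    return out

events = [
    {"run_id": "r1", "event_id": "a", "idempotency_key": "tool-x-1"},
    {"run_id": "r1", "event_id": "b", "idempotency_key": "tool-x-1"},  # retry, same key
    {"run_id": "r1", "event_id": "c", "idempotency_key": "tool-y-1"},
]
print(len(dedupe(events)))  # 2
```

Falling back to `event_id` for events without an idempotency key keeps the dedup total: every event has exactly one key.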
Your Task
Design a complete pipeline + data model to track AI Agent usage, credit consumption, and error rates. Your design must support both near-real-time monitoring and billing-grade accuracy.
Functional Requirements
- Canonical event model for agent usage that supports lifecycle, attempts, tool calls, token usage, and errors.
- Credit accounting model that ties credit charges/refunds to a run and prevents double billing.
- Error-rate analytics by tool, model, customer tier, and agent workflow, with drill-down to run/attempt.
- Late-arriving and out-of-order events: correct aggregates when events arrive up to 7 days late.
- Idempotent ingestion: duplicates must not inflate usage or credits.
- Support backfills for historical reprocessing (e.g., model pricing change, bug fix in token accounting).
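For the exactly-once billing requirement, a common pattern is an append-only ledger with a uniqueness constraint on (run_id, entry_type). A minimal in-memory sketch; the real version would enforce the constraint in the billing database and verify it with a dbt uniqueness test:

```python
class CreditLedger:
    """Append-only ledger: each entry type is recorded at most once per run."""

    def __init__(self):
        self._entries = {}  # (run_id, entry_type) -> signed credit amount

    def charge(self, run_id: str, credits: int) -> bool:
        key = (run_id, "credit_charged")
        if key in self._entries:
            return False  # duplicate/retried charge: ignored, not double-billed
        self._entries[key] = credits
        return True

    def refund(self, run_id: str, credits: int) -> bool:
        key = (run_id, "credit_refunded")
        if key in self._entries:
            return False
        self._entries[key] = -credits
        return True

    def balance_delta(self, run_id: str) -> int:
        """Net credits billed for a run; auditable from the individual entries."""
        return sum(v for (r, _), v in self._entries.items() if r == run_id)

ledger = CreditLedger()
ledger.charge("run_123", 10)
ledger.charge("run_123", 10)  # retry of the same run: no-op
ledger.refund("run_123", 4)
print(ledger.balance_delta("run_123"))  # 6
```

Because entries are never mutated, refunds are new rows rather than updates, which is also what gives Finance the raw-event-to-invoice audit trail.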
Non-Functional Requirements
- Auditability: finance can trace an invoice line item back to the raw events and ledger entries.
- Data privacy: do not store raw prompts; store hashes/metadata only; enforce least privilege.
- High availability: pipeline should tolerate a Kafka broker outage and a streaming job restart without data loss.
- Cost control: keep incremental platform spend under $40K/month.
Constraints
- Cloud: AWS. Warehouse: Snowflake (already used by Finance).
- Existing tools: Kafka (MSK), Airflow 2.x, dbt, S3.
- Team: 5 data engineers; strong SQL/dbt skills, moderate Spark experience.
- Compliance: SOC2; prompts and tool payloads may contain sensitive financial info.
Deliverables (What the interviewer expects)
- A proposed warehouse schema (facts/dimensions) with keys, grain, and example columns.
- How you’ll compute:
  - credits consumed per run/day/customer
  - error rates (by tool/model/workflow)
  - token usage and cost attribution
- How you’ll handle deduplication, retries, late events, and backfills.
- Monitoring, alerting, and failure recovery strategies.
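As a concrete example of the error-rate computation expected here, a pure-Python stand-in for what would be a dbt model grouping over the tool-call fact table (field names are illustrative):

```python
from collections import defaultdict

def error_rate_by_tool(tool_calls):
    """Failed-call fraction per tool; mirrors a GROUP BY in the warehouse."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for c in tool_calls:
        totals[c["tool"]] += 1
        if c["error_code"] is not None:
            errors[c["tool"]] += 1
    return {tool: errors[tool] / totals[tool] for tool in totals}

calls = [
    {"tool": "reconcile", "error_code": None},
    {"tool": "reconcile", "error_code": "TIMEOUT"},
    {"tool": "draft_invoice", "error_code": None},
    {"tool": "draft_invoice", "error_code": None},
]
print(error_rate_by_tool(calls))  # {'reconcile': 0.5, 'draft_invoice': 0.0}
```

The same grouping extends to model, customer tier, and workflow by widening the key, with `run_id`/`attempt` retained on the fact rows for drill-down.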