Agent Observability
Every agent-native app gets observability out of the box. Traces, automated evals, user feedback, and A/B experiments work with zero configuration — all data lives in the app's own SQL database.
What's Captured Automatically
When a user sends a message, the framework automatically records:
- Token usage — input, output, cache read, cache write
- Cost — computed from token counts and model pricing
- Latency — total duration and time per tool call
- Tool calls — which actions were invoked, success/error status, duration
- Automated evals — 5 quality scores computed after every run
No code changes needed. The instrumentation hooks into production-agent.ts transparently.
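As a rough illustration of how a cost figure can be derived from the four token counts, here is a minimal sketch. The `ModelPricing` shape, the per-million-token units, and the `computeCostUsd` name are assumptions made for this example, not the framework's actual pricing table or API.

```typescript
// Token counts for one run, matching the four categories listed above.
interface TokenUsage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

// Hypothetical pricing shape: USD per one million tokens.
interface ModelPricing {
  inputPerMTok: number;
  outputPerMTok: number;
  cacheReadPerMTok: number;
  cacheWritePerMTok: number;
}

// Multiply each token count by its rate, then convert from per-million units.
function computeCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  const weighted =
    usage.input * pricing.inputPerMTok +
    usage.output * pricing.outputPerMTok +
    usage.cacheRead * pricing.cacheReadPerMTok +
    usage.cacheWrite * pricing.cacheWritePerMTok;
  return weighted / 1_000_000;
}
```

The rates passed in are whatever the caller supplies; nothing here reflects real model prices.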
The Dashboard
Add the dashboard to any template with a single route:
// app/routes/observability.tsx
import { ObservabilityDashboard } from "@agent-native/core/client";

export default function ObservabilityPage() {
  return (
    <div className="min-h-screen bg-background p-6">
      <ObservabilityDashboard />
    </div>
  );
}

The dashboard has 5 tabs:
| Tab | What it shows |
|---|---|
| Overview | Key metrics — runs, cost, latency, tool success rate, satisfaction, eval score |
| Conversations | Trace list with drill-down to individual spans (agent_run, llm_call, tool_call) |
| Evals | Automated eval scores by criteria, trends over time |
| Experiments | A/B test list with status badges, variant results with confidence intervals |
| Feedback | Thumbs up/down stream, category breakdown, frustration scores |
User Feedback
Explicit Feedback
Thumbs up/down buttons render inline on every agent message in the chat UI. Thumbs down opens a category popover (Inaccurate, Not helpful, Wrong tool, Too slow). This is wired into AssistantChat.tsx automatically.
Implicit Feedback (Frustration Index)
The framework computes a Frustration Index (0-100) from conversation signals:
| Signal | Weight | What it detects |
|---|---|---|
| Rephrasing | 30% | User repeats similar messages |
| Retry patterns | 20% | "Try again", "no that's wrong" |
| Abandonment | 20% | Session ends shortly after response |
| Sentiment | 15% | Negative language patterns |
| Length trend | 15% | Declining message lengths |
Score interpretation: 0-20 = healthy, 20-40 = friction, 40-60 = dissatisfied, 60+ = broken session.
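The weighting in the table can be sketched as a simple weighted sum. This assumes each signal has already been normalized to a value between 0 and 1, and that each band's upper bound is exclusive (so a score of exactly 20 counts as friction); the source does not specify either detail, and the function names are hypothetical.

```typescript
// Each signal is assumed to be normalized to the range 0..1 before weighting.
interface FrustrationSignals {
  rephrasing: number;
  retryPatterns: number;
  abandonment: number;
  sentiment: number;
  lengthTrend: number;
}

// Weights from the table above (they sum to 1).
const WEIGHTS: Record<keyof FrustrationSignals, number> = {
  rephrasing: 0.3,
  retryPatterns: 0.2,
  abandonment: 0.2,
  sentiment: 0.15,
  lengthTrend: 0.15,
};

// Weighted sum of clamped signals, scaled to 0..100 and rounded.
function frustrationIndex(signals: FrustrationSignals): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof FrustrationSignals)[]) {
    const clamped = Math.min(1, Math.max(0, signals[key]));
    total += clamped * WEIGHTS[key];
  }
  return Math.round(total * 100);
}

type FrustrationBand = "healthy" | "friction" | "dissatisfied" | "broken session";

// Assumption: upper bounds are exclusive, so 20 is "friction" and 60 is "broken session".
function interpretFrustration(score: number): FrustrationBand {
  if (score < 20) return "healthy";
  if (score < 40) return "friction";
  if (score < 60) return "dissatisfied";
  return "broken session";
}
```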
Automated Evals
Five deterministic scorers run after every agent run:
| Criteria | What it measures | Score range |
|---|---|---|
| tool_success_rate | % of tool calls without errors | 0-1 |
| step_efficiency | Penalizes excessive LLM iterations for tool-using runs | 0-1 |
| latency_score | Normalized against 10s/tool baseline | 0-1 |
| cost_efficiency | Normalized against cost baseline | 0-1 |
| error_recovery | Did the agent recover from tool errors? | 0 or 1 |
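To make two of these scorers concrete, here is a hedged sketch. The `ToolCall` shape is hypothetical, and the exact normalization the framework uses for latency_score is not documented here; this version assumes a full score at or under 10 seconds per tool call, decaying as baseline divided by actual duration above that.

```typescript
// Hypothetical record of one tool call within a run.
interface ToolCall {
  ok: boolean; // true when the call finished without an error
  durationMs: number;
}

// Fraction of tool calls that finished without an error.
// Assumption: a run with no tool calls scores 1.
function toolSuccessRate(calls: ToolCall[]): number {
  if (calls.length === 0) return 1;
  return calls.filter((c) => c.ok).length / calls.length;
}

// Assumed normalization against a 10s-per-tool baseline:
// 1 at or under the baseline, baseline / actual above it.
function latencyScore(totalMs: number, toolCallCount: number): number {
  const baselineMs = 10_000 * Math.max(1, toolCallCount);
  if (totalMs <= baselineMs) return 1;
  return baselineMs / totalMs;
}
```

Both functions are deterministic: the same trace always yields the same score, which is what makes trends over time comparable.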
LLM-as-Judge (Optional)
Enable sampled LLM-based evaluation by setting evalSampleRate:
import { putSetting } from "@agent-native/core/settings";

await putSetting("observability-config", {
  enabled: true,
  evalSampleRate: 0.05, // 5% of runs
});

Custom criteria use natural language rubrics:

const criteria = {
  name: "helpfulness",
  description: "Was the response helpful and complete?",
  rubric: "0.0 = unhelpful, 0.5 = partially helpful, 1.0 = fully resolved",
};

A/B Experiments
Test different models, temperatures, or agent configurations:
// Create via API
POST /_agent-native/observability/experiments
{
  "name": "sonnet-vs-haiku",
  "variants": [
    { "id": "control", "weight": 50, "config": { "model": "claude-sonnet-4-6" } },
    { "id": "treatment", "weight": 50, "config": { "model": "claude-haiku-4-5-20251001" } }
  ],
  "metrics": ["cost", "latency", "satisfaction"]
}

// Start the experiment
PUT /_agent-native/observability/experiments/:id
{ "status": "running" }

The agent loop automatically resolves the user's variant and applies the config override. Assignment uses consistent hashing, so the same user always gets the same variant.
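One way to get stable, weight-respecting assignment is deterministic hash bucketing, sketched below. The FNV-1a hash, the `experimentName:userId` key format, and the `assignVariant` name are assumptions for illustration; the framework's actual hashing scheme may differ.

```typescript
interface Variant {
  id: string;
  weight: number; // relative weight, e.g. 50
}

// FNV-1a 32-bit hash over UTF-16 code units; chosen here for brevity.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

function assignVariant(experimentName: string, userId: string, variants: Variant[]): string {
  const totalWeight = variants.reduce((sum, v) => sum + v.weight, 0);
  if (totalWeight <= 0) throw new Error("variants must have a positive total weight");
  // Map the hash onto [0, totalWeight) and walk the cumulative weights.
  const point = (fnv1a(`${experimentName}:${userId}`) / 0x100000000) * totalWeight;
  let cumulative = 0;
  for (const v of variants) {
    cumulative += v.weight;
    if (point < cumulative) return v.id;
  }
  throw new Error("no variant matched; check for negative weights");
}
```

Because the result depends only on the experiment name and user ID, no assignment table needs to be stored, and including the experiment name in the key keeps a user's bucket in one experiment independent of their bucket in another.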
Configuration
All settings are stored in the observability-config key:
{
  enabled: true,             // Master switch
  capturePrompts: false,     // Store prompt content in traces
  captureToolArgs: false,    // Store action input arguments
  captureToolResults: false, // Store action results
  evalSampleRate: 0,         // 0-1, fraction of runs to LLM-judge
  exporters: []              // OTLP export targets
}

Content is redacted by default: only token counts, costs, and timing are stored. Opt in to content capture when needed for debugging.
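The redact-by-default behavior can be pictured as a filter applied before a span is stored. The `RawSpan` shape and `redactSpan` function below are hypothetical and only illustrate how the three capture flags might gate content fields; the real trace schema may differ.

```typescript
// The three content-capture flags from the config above.
interface CaptureConfig {
  capturePrompts: boolean;
  captureToolArgs: boolean;
  captureToolResults: boolean;
}

// Hypothetical span shape for illustration only.
interface RawSpan {
  name: string;
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
  prompt?: string;
  toolArgs?: unknown;
  toolResult?: unknown;
}

// Always keep metrics; keep each content field only when its flag is on.
function redactSpan(span: RawSpan, config: CaptureConfig): RawSpan {
  const { prompt, toolArgs, toolResult, ...metrics } = span;
  return {
    ...metrics,
    ...(config.capturePrompts && prompt !== undefined ? { prompt } : {}),
    ...(config.captureToolArgs && toolArgs !== undefined ? { toolArgs } : {}),
    ...(config.captureToolResults && toolResult !== undefined ? { toolResult } : {}),
  };
}
```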
API Endpoints
All auto-mounted at /_agent-native/observability/:
| Method | Path | Purpose |
|---|---|---|
| GET | / | Overview stats |
| GET | /traces | List trace summaries |
| GET | /traces/:runId | Trace detail (summary + spans) |
| GET | /traces/:runId/evals | Evals for a run |
| POST | /feedback | Submit feedback |
| GET | /feedback | List feedback |
| GET | /feedback/stats | Feedback aggregation |
| GET | /satisfaction | Satisfaction scores |
| GET | /evals/stats | Eval statistics |
| POST | /experiments | Create experiment |
| GET | /experiments | List experiments |
| PUT | /experiments/:id | Update experiment |
| POST | /experiments/:id/results | Compute results |
| GET | /experiments/:id/results | Get results |
All endpoints support ?since=N (ms timestamp) and ?limit=N query params.
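As a small sketch of calling these endpoints, the helper below builds a URL with the since and limit query params. The `buildObservabilityUrl` name and the `http://localhost:3000` origin used in the usage note are placeholders, and the response shapes are not documented in this section, so the example stops at URL construction.

```typescript
const BASE_PATH = "/_agent-native/observability";

interface ListOptions {
  since?: number; // millisecond timestamp
  limit?: number;
}

// Build an absolute URL for an observability endpoint with optional query params.
function buildObservabilityUrl(origin: string, path: string, options: ListOptions = {}): string {
  const url = new URL(`${BASE_PATH}${path}`, origin);
  if (options.since !== undefined) url.searchParams.set("since", String(options.since));
  if (options.limit !== undefined) url.searchParams.set("limit", String(options.limit));
  return url.toString();
}
```

For example, `buildObservabilityUrl("http://localhost:3000", "/traces", { limit: 20 })` yields a URL that can be passed to `fetch` from a script or an internal tool.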
Export to External Platforms
Send traces to Langfuse, Datadog, Grafana, or any OTel-compatible backend:
await putSetting("observability-config", {
  enabled: true,
  exporters: [
    {
      type: "otlp",
      endpoint: "https://cloud.langfuse.com/api/public/otel",
      headers: { Authorization: "Bearer sk-..." },
    },
  ],
});

The framework emits gen_ai.* semantic convention spans compatible with the OpenTelemetry GenAI spec.
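To give a sense of what gen_ai.* attributes look like on an exported span, here is a hedged sketch. The attribute keys are taken from the OpenTelemetry GenAI semantic conventions, but which of them the framework actually emits is an assumption, and the `LlmCallSpan` shape and `toGenAiAttributes` name are hypothetical.

```typescript
// Hypothetical internal record of one LLM call.
interface LlmCallSpan {
  model: string;
  inputTokens: number;
  outputTokens: number;
}

// Map the record to OpenTelemetry GenAI attribute keys.
// Assumption: the exact attribute set emitted by the framework may differ.
function toGenAiAttributes(span: LlmCallSpan): Record<string, string | number> {
  return {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": span.model,
    "gen_ai.usage.input_tokens": span.inputTokens,
    "gen_ai.usage.output_tokens": span.outputTokens,
  };
}
```

Backends that understand these conventions can group spans by model and sum token usage without any vendor-specific mapping.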
Error Reporting (Sentry)
Server-side errors that escape Nitro route handlers are reported to Sentry when a DSN is configured. Without it the SDK silently no-ops, so it's safe to leave the env vars unset in dev. Browser and server events can go to the same Sentry project; split them into separate projects only when you want operational separation for ownership, volume, quotas, or alert routing.
| Surface | SDK | Env var | Notes |
|---|---|---|---|
| Browser / SPA | @sentry/browser | VITE_SENTRY_CLIENT_DSN, SENTRY_CLIENT_DSN, or SENTRY_DSN | Captures unhandled errors and route-change breadcrumbs in the client. |
| Nitro server | @sentry/node | SENTRY_SERVER_DSN or SENTRY_DSN | Captures 5xx responses and Nitro lifecycle errors. Per-request user. |
| agent-native CLI | @sentry/node | hardcoded | Crash reports from the published CLI binary; not user-configurable. |
Server-side configuration
Set SENTRY_SERVER_DSN or the shared SENTRY_DSN in the deploy environment (Netlify dashboard, Cloudflare secrets, etc.). The framework auto-mounts a Nitro plugin that:
- Calls Sentry.init once at startup (idempotent, so it is safe to call from multiple plugins).
- Resolves the user via getSession(event) on every API/framework request and attaches id / email / username plus an orgId tag to Sentry's per-request isolation scope. Static-asset paths are skipped to avoid extra DB hits.
- Captures every framework-route 5xx with searchable route, method, and userAgent tags.
Optional knobs:
- SENTRY_SERVER_TRACES_SAMPLE_RATE (float 0–1): opt in to performance tracing. Defaults to 0 (errors only). Invalid values clamp to 0.
- AGENT_NATIVE_RELEASE: overrides the release tag. Defaults to agent-native-server@<core-version>.
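The sample-rate rule can be sketched as a small parser. This is an interpretation, not the framework's code: it assumes "invalid" covers unset, non-numeric, and out-of-range values alike, and the `parseTracesSampleRate` name is hypothetical.

```typescript
// Parse a traces sample rate from an env var string.
// Assumption: anything that is not a finite number in [0, 1] falls back to 0.
function parseTracesSampleRate(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return 0;
  const value = Number(raw);
  if (!Number.isFinite(value) || value < 0 || value > 1) return 0;
  return value;
}
```

Falling back to 0 rather than throwing means a typo in the deploy environment disables tracing instead of crashing server startup.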
Templates
Every template inherits this automatically — there's nothing to import. For SSR apps, the server injects a tiny browser config script when SENTRY_CLIENT_DSN, VITE_SENTRY_CLIENT_DSN, or shared SENTRY_DSN is available at runtime, so browser capture is not limited to Vite build-time env. Templates that want custom behavior (extra tags, different DSN per template, hard-disable Sentry) can override by exporting their own plugin from server/plugins/sentry.ts:
// server/plugins/sentry.ts
import { createSentryPlugin } from "@agent-native/core/server";

export default createSentryPlugin();

The CLI's hardcoded DSN is intentional: the published binary needs to phone home crashes regardless of which environment runs it. The server module never hardcodes a DSN because it runs inside customer environments where operators decide whether errors should reach Sentry at all.
Privacy & PII
Both server and CLI initialize with sendDefaultPii: false and a beforeSend hook that strips:
- request.headers.authorization, cookie, set-cookie, proxy-authorization
- request.cookies
- user.ip_address (auto-collected without consent)
- contexts.runtime_env (process env snapshot)
- Any event whose top-level exception type is ValidationError (treated as expected user-input rejection, not a bug).
Identity fields explicitly set via setUser({ id, email, username }) are preserved.
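A scrubber with the behavior described above could look roughly like the sketch below. It uses a minimal local `MinimalEvent` type rather than the SDK's own event type, the `scrubEvent` name is hypothetical, and it assumes "top-level" means the last entry in exception.values; the framework's real hook may differ on all three points.

```typescript
// Minimal subset of the Sentry event shape; the SDK's real type has many more fields.
interface MinimalEvent {
  request?: { headers?: Record<string, string>; cookies?: Record<string, string> };
  user?: { id?: string; email?: string; username?: string; ip_address?: string };
  contexts?: Record<string, unknown>;
  exception?: { values?: { type?: string }[] };
}

const SENSITIVE_HEADERS = new Set(["authorization", "cookie", "set-cookie", "proxy-authorization"]);

// Returns null to drop the event, or a scrubbed copy to send.
function scrubEvent(event: MinimalEvent): MinimalEvent | null {
  // Assumption: the top-level exception is the last entry in exception.values.
  const values = event.exception?.values ?? [];
  const top = values.length > 0 ? values[values.length - 1] : undefined;
  if (top?.type === "ValidationError") return null;

  const scrubbed: MinimalEvent = { ...event };
  if (event.request) {
    const headers: Record<string, string> = {};
    for (const [name, value] of Object.entries(event.request.headers ?? {})) {
      if (!SENSITIVE_HEADERS.has(name.toLowerCase())) headers[name] = value;
    }
    scrubbed.request = { headers }; // request.cookies is dropped entirely
  }
  if (event.user) {
    const user = { ...event.user };
    delete user.ip_address; // id, email, and username are preserved
    scrubbed.user = user;
  }
  if (event.contexts) {
    const contexts = { ...event.contexts };
    delete contexts["runtime_env"];
    scrubbed.contexts = contexts;
  }
  return scrubbed;
}
```

Header names are compared case-insensitively, since the same header can arrive as `Authorization` or `authorization` depending on the runtime.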