
Architecture · Long-form

Why we killed the UI — Inside HONO Zero UI's Architecture

Draft · 2026 · ~12 min read

Working draft. The architecture is real and in production. This page is the ongoing public write-up — feedback welcome on LinkedIn.

1. The premise

For 30 years, enterprise HR software competed on interface density. More screens. More fields. More dashboards. HONO Zero UI takes the opposite bet: the interface is not the product. The execution layer is.

If any HR action can be expressed as an intent, and any intent can be routed to an API, then the entire HRMS becomes headless. The UI shrinks to a single conversational surface — one chat window, slash commands, generative components rendered inline. Everything else — payroll, leave, attendance, payslips, recruitment, reimbursement, holidays, employee directory — becomes a backend capability the execution engine composes on demand.

The tagline on the README is honest: “Talk to your HRMS. No forms. No clicks. Just conversation.” What follows is how that actually works.

2. The stack at a glance

Zero UI runs on a deliberately small, modern stack — every choice serves the headless thesis.

  • Frontend: React 19, TypeScript, Vite, TailwindCSS v4, PWA-ready
  • Backend: Express.js, CopilotKit v2 runtime (AG-UI protocol), monolithic by design for deployment simplicity
  • Models: Anthropic Claude (Haiku 4.5, Sonnet 4.6, Opus 4.7), OpenAI GPT-4o family, self-hosted open-source LLMs, Azure AI — routed per-tenant
  • Data: PostgreSQL (Neon serverless) via Prisma ORM
  • Observability: Langfuse for end-to-end AI traces, an internal journey bus for in-app step visibility, an LLM cost dashboard at /admin/llm-costs
  • Embeddings: OpenAI text-embedding-3-small for tool retrieval and the HONO Any-API catalog (~1,429 GraphQL operations indexed)

The product is branded Era — by HONO externally; the engineering repo retains the technical identifier HONO_ZERO_UI.

3. Dual-model routing — cheap classifier, smart executor

Most chat turns don't need a frontier model. Looking at a typical HONO conversation:

  • 60–70% are simple lookups — “what's my leave balance?”, “mark me present.” One tool, format the result, done. A lightweight model handles this perfectly.
  • 20–30% are multi-tool synthesis — “morning briefing.” Four tool calls plus a written summary. A mid-tier model is fine.
  • 5–10% need the smart model — policy-nuance reasoning, ambiguous questions where the wrong tool pick costs three retries.

Zero UI runs every request through two tiers:

Tier 1 · Classification

Lightweight model (default Sonnet 4.6, optional Haiku 4.5). Max 256 tokens. Returns intent, confidence, and suggested tools. The system prompt is marked with cache_control: { type: "ephemeral" } for Anthropic prompt caching, which gives a 5-minute TTL and a roughly 90% discount on cached prefix tokens.

Tier 2 · Execution

Efficient model (default Sonnet 4.6, optional Opus 4.7 for complex tenants). Full tool access via the CopilotKit runtime. The tools the classifier suggests inform tool filtering for this layer.

Tenants configure both tiers from the admin Settings page; cache hit/miss telemetry is logged on every request and surfaced in the LLM cost dashboard. The split is feature-flagged so a tenant can fall back to single-model if a workload shape changes.
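
A minimal sketch of the split, assuming a hypothetical tenant config shape. The system block with cache_control and the max_tokens field follow Anthropic's documented Messages request format; TierConfig, buildClassifierRequest, planExecution, the 0.7 confidence floor, and the tool names in the usage note are illustrative assumptions, not Zero UI's actual code. Model ids are left as placeholders.

```typescript
// Hypothetical per-tenant settings; field names are assumptions for this sketch.
interface TierConfig {
  twoTierEnabled: boolean; // feature flag: false means single-model fallback
  classifierModel: string; // placeholder model id
  executorModel: string;   // placeholder model id
}

interface Classification {
  intent: string;
  confidence: number; // 0..1
  suggestedTools: string[];
}

// Tier 1 request body in the Anthropic Messages API shape. The stable system
// prompt carries the cache marker; the per-turn user text stays outside it.
function buildClassifierRequest(cfg: TierConfig, systemPrompt: string, userText: string) {
  return {
    model: cfg.classifierModel,
    max_tokens: 256,
    system: [
      { type: "text" as const, text: systemPrompt, cache_control: { type: "ephemeral" as const } },
    ],
    messages: [{ role: "user" as const, content: userText }],
  };
}

// Chooses the tool surface for Tier 2. The confidence floor is an assumed knob.
function planExecution(
  cfg: TierConfig,
  allTools: string[],
  c: Classification | null,
  minConfidence = 0.7,
): { model: string; tools: string[] } {
  if (!cfg.twoTierEnabled || c === null || c.confidence < minConfidence) {
    return { model: cfg.executorModel, tools: allTools };
  }
  const suggested = new Set(c.suggestedTools);
  const filtered = allTools.filter((t) => suggested.has(t));
  // Never hand the executor an empty tool list; fall back to the full surface.
  return { model: cfg.executorModel, tools: filtered.length > 0 ? filtered : allTools };
}
```

With the flag on and a confident classification that suggests get_leave_balance, planExecution narrows the executor to that one tool. With the flag off, or a low-confidence result, the executor sees the full tool list.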

4. Hybrid tool routing — static intent map + embedding RAG

Once intent is classified, the executor needs a focused tool surface. Showing every tool to every request is wasteful (and hurts accuracy — LLMs get worse, not better, with too many tools in scope). Zero UI uses two routing strategies:

  • Narrow intents (≤5 tools) — direct hit via a static intent → tools allowlist. “Apply leave” deterministically resolves to the three or four tools needed. Zero embedding cost, instant.
  • Broad intents (≥6 tools) — embedding-based top-K retrieval. The full tool corpus is embedded once at boot into an in-memory Map<toolName, vector>; queries are embedded on the fly and matched by cosine similarity.

Both paths run side-by-side in a shadow-mode evaluator that logs which tools each strategy selected. An admin dashboard at /admin/rag-evaluation compares them so we can tighten the cutover threshold based on production data, not guesses.

Layered on top: per-tenant module gating. The enabled_hrms_modules setting filters which tool modules ship to the LLM at all — a hospitality tenant doesn't see manufacturing tools, a payroll-only tenant doesn't see recruitment tools. Module-agnostic tools (marked module: '*') always pass.
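
The routing described above can be sketched in a few functions. This is an illustration under assumptions: selectTools, ToolMeta, the toy two-dimensional vectors, and the tool names in the usage note are hypothetical, and real embeddings from text-embedding-3-small would be far larger. Module gating is applied before retrieval here so top-K never spends slots on tools the tenant cannot use; the production ordering may differ.

```typescript
type Vector = number[];

interface ToolMeta {
  name: string;
  module: string; // '*' marks a module-agnostic tool
}

// Cosine similarity; assumes both vectors have the same dimension.
function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Top-K tool names by similarity to the query embedding.
function topK(query: Vector, index: Map<string, Vector>, k: number): string[] {
  return Array.from(index, ([name, vec]) => ({ name, score: cosine(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((r) => r.name);
}

// Static allowlist for narrow intents (at most 5 tools), embedding retrieval otherwise.
function selectTools(
  intent: string,
  queryVec: Vector,
  staticMap: Record<string, string[]>,
  index: Map<string, Vector>,
  tools: ToolMeta[],
  enabledModules: string[],
  k = 5,
): string[] {
  const modules = new Set(enabledModules);
  const allowed = new Set(
    tools.filter((t) => t.module === "*" || modules.has(t.module)).map((t) => t.name),
  );
  const staticHit = staticMap[intent];
  if (staticHit !== undefined && staticHit.length <= 5) {
    return staticHit.filter((n) => allowed.has(n)); // deterministic, no embedding cost
  }
  // Gate the index first so retrieval only ranks tools this tenant may see.
  const gated = new Map(Array.from(index).filter(([name]) => allowed.has(name)));
  return topK(queryVec, gated, k);
}
```

A tenant with only leave and payroll enabled never receives a recruitment tool from either path, even when that tool is the closest embedding match.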

5. The Intelligent Execution Engine

The IEE is what makes Zero UI “headless” rather than “a chatbot wrapper around an HRMS.” It composes tool calls, validates inputs, talks to the underlying HONO GraphQL/REST backend, and renders structured results back to the conversation.

Authoring a new capability follows a thin-wrapper pattern. The bulk of HONO operations are GraphQL queries with a single shape — we describe them declaratively via a defineTool factory that bakes in: employee-code gating, per-request throttling, GraphQL invocation, default response wrapping, follow-up suggestion generation, and catch-all error handling. Complex tools that need OTP gating, multi-step orchestration, or external HTTP stay hand-written.

Three safety invariants govern every tool handler:

  • 45-second hard ceiling. The agent stream requires every tool_call to be paired with a tool_result. A hung handler strands the LLM transcript and silently breaks every subsequent message in that thread. The adapter races every handler against a 45-second timeout and synthesises an error result on expiry. Individual network calls have their own shorter, AbortController-driven timeouts inside.
  • Date validation on every write. LLMs reliably misencode dates (“9th April” → 2026-09-04). Every write action funnels user-supplied dates through a shared validator that enforces both format and business window (e.g. futureDays, pastDays). Successful responses echo a human-readable date back (“Friday, September 4, 2026”) so the renderer can surface it prominently, which lets the user catch residual errors that prompt engineering can't.
  • OTP gating on sensitive operations. Payslip access and similar high-trust actions route through an otpService before the underlying API is even called. Conversation is the interface; security is not optional because of it.
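
The first two invariants can be sketched as follows. This is an illustration under assumptions, not the production adapter: withDeadline, validateWriteDate, the ToolResult and DateCheck shapes, and the error strings are hypothetical names. Only the 45-second ceiling and the pastDays / futureDays window come from the description above.

```typescript
interface ToolResult {
  ok: boolean;
  data?: unknown;
  error?: string;
}

// Races a handler against a hard deadline so every tool_call gets a tool_result.
// Handler failures are converted to error results, so the returned promise never rejects.
function withDeadline(handler: () => Promise<ToolResult>, ms = 45000): Promise<ToolResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ToolResult>((resolve) => {
    timer = setTimeout(() => resolve({ ok: false, error: `Tool timed out after ${ms} ms` }), ms);
  });
  const guarded = Promise.resolve()
    .then(handler)
    .catch((e: unknown): ToolResult => ({ ok: false, error: String(e) }));
  return Promise.race([guarded, timeout]).then((result) => {
    clearTimeout(timer); // do not leave the timer pending once a result exists
    return result;
  });
}

interface DateWindow {
  pastDays: number;
  futureDays: number;
}

interface DateCheck {
  ok: boolean;
  display?: string; // human-readable echo for the renderer
  error?: string;
}

// Enforces YYYY-MM-DD format, a real calendar date, and the business window.
function validateWriteDate(iso: string, window: DateWindow, today: Date): DateCheck {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(iso);
  if (!m) return { ok: false, error: "Expected YYYY-MM-DD" };
  const y = Number(m[1]);
  const mo = Number(m[2]);
  const d = Number(m[3]);
  const date = new Date(Date.UTC(y, mo - 1, d));
  // Reject rollovers such as 2026-02-30, which Date would silently shift into March.
  if (date.getUTCFullYear() !== y || date.getUTCMonth() !== mo - 1 || date.getUTCDate() !== d) {
    return { ok: false, error: "Not a real calendar date" };
  }
  const todayUtc = Date.UTC(today.getUTCFullYear(), today.getUTCMonth(), today.getUTCDate());
  const diffDays = Math.round((date.getTime() - todayUtc) / 86400000);
  if (diffDays > window.futureDays || diffDays < -window.pastDays) {
    return { ok: false, error: `Date is outside the allowed window (${diffDays} days from today)` };
  }
  const display = date.toLocaleDateString("en-US", {
    weekday: "long", year: "numeric", month: "long", day: "numeric", timeZone: "UTC",
  });
  return { ok: true, display };
}
```

With a 90-day future window and "today" set to 1 April 2026, the swapped date 2026-09-04 is rejected outright, while 2026-04-09 passes and is echoed back in long form. A swapped date that happens to fall inside the window still passes, which is why the echo matters.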

6. MCP topology — the external surface

Internally, every HONO subsystem already speaks the same tool vocabulary. The MCP layer exposes that vocabulary to external AI assistants — Claude Desktop, Cursor, Zapier, any MCP-compliant client — so they can act on HONO directly.

Two URL shapes are supported, mounted on the same router:

  • Path-based: /api/:tenant/mcp — tenant resolved from the path parameter.
  • Subdomain-based: <tenant>.mcp.hono.ai/mcp — tenant resolved from the host header. Cleaner to paste into external clients; the preferred form. Slug rules: exactly one level, case-insensitive, tight regex, reserved-subdomain list for confused-deputy safety.

Each MCP API key has a read_only flag. Classification is fail-closed by name prefix: tools named get_, list_, fetch_, search_, view_, or preview_ are read; everything else is mutating by default. New actions default to “protected” until a reviewer explicitly opts them into the read scope. This is layered on top of the tenant module filter, so an external read-only key sees only (tenant's enabled modules) ∩ (read-scope tools).
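
A sketch of both checks, under stated assumptions. The prefix list and the mcp.hono.ai host come from the text above; tenantFromHost, isReadTool, visibleTools, the slug regex, the reserved-subdomain list, and the tool names are hypothetical. Treating "prefix match or explicit reviewer opt-in" as the read test is one reading of the rule, not a confirmed implementation.

```typescript
const READ_PREFIXES = ["get_", "list_", "fetch_", "search_", "view_", "preview_"];

// Illustrative reserved list; the real one is not published.
const RESERVED_SLUGS = new Set(["www", "api", "admin"]);

// Resolves the tenant from a host such as "acme.mcp.hono.ai". Returns null on any doubt.
function tenantFromHost(host: string, base = "mcp.hono.ai"): string | null {
  const h = host.toLowerCase().replace(/:\d+$/, ""); // case-insensitive, port stripped
  const suffix = "." + base;
  if (!h.endsWith(suffix)) return null;
  const slug = h.slice(0, -suffix.length);
  // No dots allowed in the slug, so exactly one subdomain level can match.
  if (!/^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/.test(slug)) return null;
  return RESERVED_SLUGS.has(slug) ? null : slug;
}

// Fail-closed: a tool is read-scope only if it matches a read prefix
// or a reviewer has explicitly opted it in. Everything else is mutating.
function isReadTool(name: string, reviewedReadTools: Set<string> = new Set()): boolean {
  return READ_PREFIXES.some((p) => name.startsWith(p)) || reviewedReadTools.has(name);
}

interface ToolMeta {
  name: string;
  module: string; // '*' marks a module-agnostic tool
}

// Intersection of the tenant's enabled modules and, for read-only keys, the read scope.
function visibleTools(readOnlyKey: boolean, tools: ToolMeta[], enabledModules: string[]): string[] {
  const modules = new Set(enabledModules);
  return tools
    .filter((t) => t.module === "*" || modules.has(t.module))
    .filter((t) => !readOnlyKey || isReadTool(t.name))
    .map((t) => t.name);
}
```

Because the default is "mutating", a newly added tool with an unrecognised name is hidden from read-only keys until someone deliberately widens the scope.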

7. The cost levers — making AI-native economics work

Conversational HRMS sounds expensive until you do the math carefully. Four levers are wired into Zero UI:

| Lever | Effect |
| --- | --- |
| Prompt caching | 50% off matched input on OpenAI (automatic, ≥1,024-token stable prefix); ~90% off cached reads on Anthropic (opt-in via cache_control). Achieved by moving the date-interpolated block out of the cached prefix into a small dynamic suffix. |
| Two-tier model routing | Cheap classifier handles ~60% of turns at a fraction of the execution model's rate. |
| Budget + alerting | Monthly cap per tenant with 80% / 100% alerts. Visible in the admin LLM cost dashboard with per-day, per-user, per-model breakdowns. |
| Prompt & tool description diet | A prompt:audit script that surfaces stable prefix size and per-tool token cost; a COMPACT_TOOL_DESCRIPTIONS flag that trims schemas the LLM doesn't need at runtime. |

Combined effect on a typical repeat-user request to OpenAI GPT-4o:

  • Baseline: ~15,000 tokens × $2.50/M = $0.0375/request
  • +Caching + base diet: ~14,000 tokens, ~13,800 cached at 50% = ~$0.018/request (-53%)
  • +Compact descriptions: ~9,000 tokens, ~8,800 cached = ~$0.012/request (-68%)
  • +Two-tier (cheap model handles 60% of turns at 1/10 price) = ~$0.005/request (-87%)

Numbers are illustrative of what the architecture makes possible; actual tenant costs vary by traffic mix and model choice.
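
The steps above can be recomputed from the stated token counts and price. This is a back-of-envelope sketch: it counts input tokens only, and it assumes the cheap model sees the same token and caching profile as the execution model at one tenth of the price. requestCost and blendedCost are illustrative helpers, and the results are approximate in the same way the bullets are.

```typescript
const PRICE_PER_TOKEN = 2.5 / 1_000_000; // $2.50 per million input tokens, from the text

// Cost of one request when `cachedTokens` of the input are billed at a discount.
function requestCost(totalTokens: number, cachedTokens: number, cacheDiscount = 0.5): number {
  const uncached = totalTokens - cachedTokens;
  return (uncached + cachedTokens * (1 - cacheDiscount)) * PRICE_PER_TOKEN;
}

// Average cost when a share of turns runs on a model priced at a ratio of the full rate.
function blendedCost(fullCost: number, cheapShare: number, cheapPriceRatio: number): number {
  return fullCost * (1 - cheapShare) + fullCost * cheapPriceRatio * cheapShare;
}

const baseline = requestCost(15_000, 0);        // 0.0375
const cached = requestCost(14_000, 13_800);     // 0.01775
const compact = requestCost(9_000, 8_800);      // 0.0115
const twoTier = blendedCost(compact, 0.6, 0.1); // 0.00529

// Fractional saving of each step relative to the baseline.
const saving = (cost: number): number => 1 - cost / baseline;

console.log(
  [cached, compact, twoTier].map((c) => `${c.toFixed(4)} (${(saving(c) * 100).toFixed(0)}% saved)`),
);
```

The largest single drop comes from caching, because almost the entire prompt is a stable prefix. The two-tier step then compounds on whatever the per-request cost already is.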

8. What's next

  • Any-API Phase 2 — invoke_hono_api. Phase 1 (read-only discovery) is live — the LLM can search the ~1,429-row GraphQL catalog and get back an HMAC-signed, single-use, 5-minute-TTL endpoint envelope. Phase 2 wires the actual invocation, opening the long tail of operations the curated tool set doesn't cover.
  • Anthropic caching on the execution layer. CopilotKit's built-in agent doesn't yet expose a message-level hook for cache_control. We'll either subclass the agent or contribute upstream.
  • Multimodal intents. Voice (phone-calling architecture already in place), document upload as input, on-screen context as input — same execution engine, different intent surface.
  • Third-party MCP marketplace. External systems already plug in. The next step is a curated catalog so any enterprise can participate in the conversation by exposing an MCP server, without bespoke integration work.

The interface is not the product. The execution layer is.

Everything in sections 1 through 7 is in production. Some of it is bleeding edge. All of it is open to feedback.

Building something at the intersection of AI and enterprise? Let's talk.
