Experience: 5+ years of production software engineering, including 2+ years working directly on LLM or agent systems in production.
Location: Remote
To fast-track screening, please submit your details through this form (if you haven’t already): https://airtable.com/appbtkr4odapnb5I6/pagqo91lKv3VJg3GT/form
We review form responses as part of the initial screening. Please complete and submit all details through the form to be considered for the next stage; applications sent outside the form may not be considered.
Why This Role Matters
Terrabase builds agent infrastructure that enterprise customers rely on daily for SQL generation, forecasting, data analysis, and artifact delivery. Our orchestration layer routes between specialized sub-agents, manages typed handoff contracts, runs structured eval suites, and enforces correctness across every turn.
This is not a research-prototype role. You will build and evolve agent architecture, but always in service of making the system observable, typed, evaluated, recoverable, and boringly reliable in production.
What You Will Do
Own the harness architecture and middleware stack. Our LangGraph orchestrator routes between sub-agents through a layered middleware stack: file upload handling, source resolution, local context, workspace sync, state hydration, aggregation barriers, and typed handoff contracts. You will extend this stack, enforce its contracts in code, and keep it operational as routing logic and agent surfaces evolve.
Maintain typed contracts and boundaries. Agent handoffs at Terrabase carry typed contracts with barrier conditions and retry predicates. You will design these contracts, enforce them with strict typing, manage backward compatibility when contracts change, and write the contract tests that prevent silent regressions.
Own the eval suites. We run structured eval suites across routing decisions, context-resolution accuracy, multi-turn coherence, visual reference alignment, and artifact correctness. You will extend coverage, write new evals where gaps exist, and build CI gates that block releases when regressions are detected. A routing change or prompt change with no eval coverage does not ship.
Triage production failures and close the loop. When an agent turn fails in production, you will trace it in LangSmith, identify the failure class, and convert it into a durable regression test. You will own the release gates, keep prompts and runtime contracts in sync, manage feature flag rollout risk, and remove dead paths as the system evolves.
Own SQL and artifact correctness. Our agents generate SQL over customer schemas and produce structured artifacts (reports, dashboards, data sheets) under a strict schema contract. You will own the correctness layer: source grounding, schema-aware validation, provenance surfaces, and the eval infrastructure that catches generated artifact failures before they reach customers.
Build and maintain HITL workflows. Human-in-the-loop checkpoints let users intervene, redirect, or approve mid-chain. You will design these workflows, enforce their resumable state contracts, and ensure they degrade gracefully when interrupted.
Instrument for traceability. You will extend LangSmith tracing coverage, add structured span annotations, and build the tooling that lets us diagnose a bad agent turn from production trace data alone, without requiring a local reproduction.
What We Are Looking For
- 5+ years of production software engineering, with strong Python fundamentals
- 2+ years of hands-on work with LLM-based systems: agent loops, tool use, context management, or inference pipelines
- Experience with LangGraph, LangChain, OpenAI/Anthropic tool-use systems, or equivalent multi-step agent/runtime orchestration
- Practical eval engineering: you have built or extended eval harnesses, written automated test cases for agent behavior, and treated evaluation as an ongoing engineering discipline
- Strong engineering hygiene: strict typing, small interfaces, contract tests, clear schema migrations, and CI discipline
- Ability to debug from production traces and artifacts, not only local reproductions
- Comfort working across prompts, Python runtime code, TypeScript product surfaces, data systems, and eval infrastructure
- Systems thinking: you design for observability, recovery, and state management, not just the happy path
- Maintenance ownership mindset: you triage, close loops, and leave systems more debuggable than you found them
- Pragmatic judgment: you can distinguish between reliability-critical infrastructure and speculative abstraction
Bonus Points
- HITL workflow design: checkpoints, approvals, mid-chain interrupts, resumable state
- Context engineering depth: chunking strategies, retrieval-augmented generation, semantic routing, re-ranking
- Experience with LangSmith, Weights & Biases, or similar trace and evaluation platforms
- Prior work shipping agent systems to enterprise customers where SQL or data correctness is a hard requirement
- Experience with mypy, Pydantic contracts, or strict typing disciplines in a production Python codebase
Life at Terrabase
We are a sharp, focused, fully remote team building agent infrastructure that enterprise customers trust with their data. You will work directly alongside the engineer who designed this harness, with broad ownership, generous compute budgets, and a culture that treats reliability as a product requirement, not a research topic.
Terrabase is an equal-opportunity employer. We celebrate diversity and are committed to building an inclusive environment for every team member.