Research Engineer
We route every run across models, and we need to know — measurably, repeatably — which model should get which step, when an agent's output is degrading, and how to evaluate agents that act rather than chat. That is applied research with a four-week loop to production, not a paper mill.
What you will do
Build the evaluation harness for agent runs — task success, side-effect safety, cost — and make it the gate every routing change passes through.
Own model routing research: when the small model wins, and how we know before the user does.
Detect drift and degradation in production runs automatically, before it becomes a support ticket.
Design experiments with the statistical hygiene to kill our own ideas, then write up what died and why.
Publish what we can — honestly, including negative results.
What we need
Strong empirical ML background — 3+ years in applied research or ML engineering with experiments that shipped.
You are fluent in the current LLM landscape: capabilities, costs, context behavior, and where the benchmarks lie.
Production-grade Python; you profile before you optimize.
Statistical rigor — you know what a fair baseline is and you flag your own confounds first.
You would rather be correct than impressive, every time.
Nice to have
Published work on evaluation, routing, or multi-step agent behavior.
Experience fine-tuning or distilling models for narrow tasks.
You have built an eval suite a team actually trusted.
Apply — we reply to everyone