
Navan

Senior AI Operations (AI Ops) Engineer at Navan

Palo Alto, CA · Full-time · Engineering · Posted 29 days ago

About the Role

<p>At Navan, we aren't building a single, generic chatbot. We are building a <strong>Composable AI Microservice Architecture</strong>, a swarm of hundreds of hyper-specialized AI services, each meticulously "programmed" to solve small, focused tasks with high precision. This fleet powers <strong>Ava</strong>, our AI support engine, and a suite of cutting-edge generative tools for travel and expense management.</p> <p>As a <strong>Senior AI Operations (AI Ops) Engineer</strong>, you are the architect of the platform that makes this scale possible. You will move beyond traditional MLOps to manage a "factory" of Language Models. Your challenge is one of orchestration and standardization, ensuring that every service in the swarm meets a rigorous bar for quality, reliability, and cost-efficiency.</p> <h3>What You’ll Do</h3> <ul> <li><strong>Orchestrate the AI Fleet:</strong> Build and own the runtime environment for 100+ specialized AI services. Manage model routing, context versioning, and standardized memory/history stores.</li> <li><strong>High-Density Inference Optimization:</strong> Design and implement <strong>SageMaker Multi-Model Endpoints (MME)</strong> and Inference Components to serve multiple tuned SLMs per GPU, maximizing hardware utilization while minimizing latency.</li> <li><strong>Deterministic Service Excellence:</strong> Treat reliability as a layered engineering problem. 
Build deterministic "shells" around probabilistic LM outputs, prioritizing data-layer validation and strict serialization.</li> <li><strong>Automated Evaluation &amp; Observability:</strong> Implement "LLM-as-a-judge" patterns and automated benchmarking to detect semantic drift and hallucinations across the fleet before they impact the user.</li> <li><strong>Standardize the Workflow:</strong> Obsess over building reusable patterns and Terraform-based infrastructure that eliminate "snowflake" configurations, allowing us to deploy new specialized AI tasks in minutes.</li> <li><strong>Agency Strategy:</strong> Partner with AI Researchers to find the "Goldilocks zone" for agentic autonomy—balancing the flexibility of LLM tool-use with the precision required for production stability.</li> </ul> <h3>What We’re Looking For</h3> <ul> <li><strong>Experience:</strong> 5+ years in SRE, Platform Engineering, or MLOps, with at least 2 years focused on deploying LLMs/SLMs in production environments.</li> <li><strong>SageMaker Mastery:</strong> Deep hands-on expertise with <strong>AWS SageMaker</strong>, specifically configuring Multi-Model Endpoints (MME), Inference Components, and GPU-backed instances (G5/P4).</li> <li><strong>SLM Expertise:</strong> Proven experience with Small Language Models (e.g., Mistral, Llama 3, Phi) and parameter-efficient fine-tuning (PEFT) deployment strategies like <strong>LoRA/QLoRA</strong>.</li> <li><strong>Technical Stack:</strong> <ul> <li><strong>Languages:</strong> Strong proficiency in Python and Terraform.</li> <li><strong>Orchestration:</strong> Experience with Docker, Kubernetes (EKS), or AWS ECS/Fargate.</li> <li><strong>Data:</strong> Familiarity with Snowflake and Vector Databases.</li> </ul> </li> <li><strong>The "AI Ops" Mindset:</strong> You understand that AI at scale is a statistical challenge. 
You are comfortable debugging issues at the data/serialization layer rather than defaulting to prompt tweaks.</li> <li><strong>CI/CD &amp; Automation:</strong> Experience building robust pipelines (Jenkins, GitHub Actions) for non-deterministic software, including automated "eval" stages.</li> <li><strong>Education:</strong> BS or MS in Computer Science, Engineering, Mathematics, or a related technical field.</li> </ul><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>The posted pay range represents the&nbsp;anticipated&nbsp;low and high end of the compensation for this position and is subject to change based on business need. To determine a successful candidate’s starting pay, we carefully consider a variety of factors, including primary work location, an evaluation of the candidate’s skills and experience, market demands, and internal parity.<br><br>For roles with on-target-earnings (OTE), the pay range includes both base salary and target incentive compensation. Target incentive compensation for some roles may include a ramping draw period. Compensation is higher for those who exceed targets. Candidates may receive more information from the recruiter.</p></div><div class="title">Pay Range</div><div class="pay-range"><span>$116,100</span><span class="divider">&mdash;</span><span>$258,000 USD</span></div></div></div>