
Senior AI Operations (AI Ops) Engineer at Navan
Palo Alto, CA | Full-time | Engineering | Posted 29 days ago
<h3>About the Role</h3>
<p>At Navan, we aren't building a single, generic chatbot. We are building a <strong>Composable AI Microservice Architecture</strong>, a swarm of hundreds of hyper-specialized AI services, each meticulously "programmed" to solve small, focused tasks with high precision. This fleet powers <strong>Ava</strong>, our AI support engine, and a suite of cutting-edge generative tools for travel and expense management.</p>
<p>As a <strong>Senior AI Operations (AI Ops) Engineer</strong>, you are the architect of the platform that makes this scale possible. You will move beyond traditional MLOps to manage a "factory" of Language Models. Your challenge is one of orchestration and standardization, ensuring that every service in the swarm meets a rigorous bar for quality, reliability, and cost-efficiency.</p>
<h3>What You’ll Do</h3>
<ul>
<li><strong>Orchestrate the AI Fleet:</strong> Build and own the runtime environment for 100+ specialized AI services. Manage model routing, context versioning, and standardized memory/history stores.</li>
<li><strong>High-Density Inference Optimization:</strong> Design and implement <strong>SageMaker Multi-Model Endpoints (MME)</strong> and Inference Components to serve multiple tuned SLMs per GPU, maximizing hardware utilization while minimizing latency.</li>
<li><strong>Deterministic Service Excellence:</strong> Treat reliability as a layered engineering problem. Build deterministic "shells" around probabilistic LM outputs, prioritizing data-layer validation and strict serialization.</li>
<li><strong>Automated Evaluation & Observability:</strong> Implement "LLM-as-a-judge" patterns and automated benchmarking to detect semantic drift and hallucinations across the fleet before they impact the user.</li>
<li><strong>Standardize the Workflow:</strong> Obsess over building reusable patterns and Terraform-based infrastructure that eliminate "snowflake" configurations, allowing us to deploy new specialized AI tasks in minutes.</li>
<li><strong>Agency Strategy:</strong> Partner with AI Researchers to find the "Goldilocks zone" for agentic autonomy—balancing the flexibility of LLM tool-use with the precision required for production stability.</li>
</ul>
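<p>As an illustration only, the deterministic "shell" idea from the list above (data-layer validation and strict serialization around a probabilistic model) might be sketched as follows. Every name here (<code>RoutedIntent</code>, <code>ALLOWED_INTENTS</code>, <code>call_with_shell</code>, the intent labels, and the fallback value) is a hypothetical placeholder and does not describe Navan's actual services.</p>

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical closed set of labels the downstream system accepts.
ALLOWED_INTENTS = {"rebook_flight", "refund_request", "expense_question"}


@dataclass(frozen=True)
class RoutedIntent:
    intent: str
    confidence: float


class ValidationError(ValueError):
    """Raised when model output does not match the strict schema."""


def parse_model_output(raw: str) -> RoutedIntent:
    """Accept only a JSON object with exactly the expected keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"intent", "confidence"}:
        raise ValidationError("expected an object with 'intent' and 'confidence'")
    intent, confidence = data["intent"], data["confidence"]
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValidationError(f"unknown intent: {intent!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValidationError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValidationError("confidence must be within [0, 1]")
    return RoutedIntent(intent, float(confidence))


def call_with_shell(
    model_call: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    fallback: RoutedIntent = RoutedIntent("escalate_to_human", 0.0),
) -> RoutedIntent:
    """Retry on invalid output, then return a fixed fallback instead of bad data."""
    for _ in range(max_attempts):
        try:
            return parse_model_output(model_call(prompt))
        except ValidationError:
            continue
    return fallback
```

<p>In this sketch the caller always receives a validated <code>RoutedIntent</code>: malformed output is retried a bounded number of times and then replaced by the fallback, so problems surface at the serialization layer and are not patched through prompt changes.</p>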
<h3>What We’re Looking For</h3>
<ul>
<li><strong>Experience:</strong> 5+ years in SRE, Platform Engineering, or MLOps, with at least 2 years focused on deploying LLMs/SLMs in production environments.</li>
<li><strong>SageMaker Mastery:</strong> Deep hands-on expertise with <strong>AWS SageMaker</strong>, specifically configuring Multi-Model Endpoints (MME), Inference Components, and GPU-backed instances (G5/P4).</li>
<li><strong>SLM Expertise:</strong> Proven experience with Small Language Models (e.g., Mistral, Llama 3, Phi) and parameter-efficient fine-tuning (PEFT) deployment strategies like <strong>LoRA/QLoRA</strong>.</li>
<li><strong>Technical Stack:</strong>
<ul>
<li><strong>Languages:</strong> Strong proficiency in Python and Terraform.</li>
<li><strong>Orchestration:</strong> Experience with Docker, Kubernetes (EKS), or AWS ECS/Fargate.</li>
<li><strong>Data:</strong> Familiarity with Snowflake and Vector Databases.</li>
</ul>
</li>
<li><strong>The "AI Ops" Mindset:</strong> You understand that AI at scale is a statistical challenge. You are comfortable debugging issues at the data/serialization layer rather than defaulting to prompt tweaks.</li>
<li><strong>CI/CD & Automation:</strong> Experience building robust pipelines (Jenkins, GitHub Actions) for non-deterministic software, including automated "eval" stages.</li>
<li><strong>Education:</strong> BS or MS in Computer Science, Engineering, Mathematics, or a related technical field.</li>
</ul><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>The posted pay range represents the anticipated low and high end of the compensation for this position and is subject to change based on business need. To determine a successful candidate’s starting pay, we carefully consider a variety of factors, including primary work location, an evaluation of the candidate’s skills and experience, market demands, and internal parity.<br><br>For roles with on-target-earnings (OTE), the pay range includes both base salary and target incentive compensation. Target incentive compensation for some roles may include a ramping draw period. Compensation is higher for those who exceed targets. Candidates may receive more information from the recruiter.</p></div><div class="title">Pay Range</div><div class="pay-range"><span>$116,100</span><span class="divider">—</span><span>$258,000 USD</span></div></div></div>