Navan

Manager, Site Reliability Engineering at Navan

Tel-Aviv, Israel | Full-time | Engineering | Posted 1 day ago

About the Role

<p>At Navan, we’re committed to creating the best experience for business travelers, ensuring that our systems are always reliable, scalable, and efficient. As we continue to grow, we’re looking for a <strong>Site Reliability Engineering (SRE) Manager</strong> to join our headquarters team based in Palo Alto, California. In this role, you will lead a team of SREs, drive innovation in infrastructure design and automation, and ensure our systems run seamlessly at scale, serving thousands of travelers every day.</p>
<h3><strong>What You’ll Do</strong></h3>
<ul>
<li><strong>Lead &amp; Mentor the SRE Team:</strong> Guide and develop a high-performing team of SREs, fostering a culture of collaboration, reliability, and continuous improvement.</li>
<li><strong>Drive Infrastructure Reliability &amp; Automation:</strong> Collaborate with Engineering and Product teams to design and implement scalable, fault-tolerant systems. Use IaC tools (e.g., Terraform, CloudFormation) and microservices architectures to automate and improve infrastructure.</li>
<li><strong>Incident Management:</strong> Improve incident response processes, reduce MTTR, and proactively mitigate risks. Apply resiliency patterns to keep systems fault-tolerant and highly available.</li>
<li><strong>Define &amp; Measure SLOs:</strong> Develop service-level objectives (SLOs) and KPIs to track and improve system reliability, using observability tools like New Relic or Datadog.</li>
<li><strong>24x7 Production Support:</strong> Ensure system availability in a 24x7 environment, applying expertise in AWS (e.g., ECS, Lambda, DynamoDB) and database management for optimal performance.</li>
<li><strong>Optimize CI/CD Pipelines:</strong> Automate and streamline deployment workflows using tools like Jenkins or GitHub Actions to make deployments faster and more reliable.</li>
<li><strong>Resource Management:</strong> Manage team resources, including capacity planning, hiring, and upskilling, to meet evolving business needs.</li>
</ul>
<h3><strong>What We’re Looking For</strong></h3>
<ul>
<li>8+ years in Site Reliability Engineering, DevOps, or Infrastructure roles, with at least 3 years in a leadership position.</li>
<li>Proven ability to lead and mentor teams, fostering a culture of collaboration and reliability.</li>
<li>Hands-on experience with AWS cloud technologies, Infrastructure as Code (Terraform/CloudFormation), microservices architectures, deployment automation (Jenkins/GitHub Actions), and observability tools (New Relic/Datadog).</li>
<li>Strong background in designing scalable, fault-tolerant systems, improving incident response, and driving operational improvements.</li>
<li>Excellent interpersonal and communication skills, with the ability to work effectively across cross-functional teams.</li>
</ul>