Member of Technical Staff, Cluster Management at Fireworks AI

San Mateo, CA | Full-time | Engineering | Posted 29 days ago

About the Role

<div class="content-intro"><h2><strong>About Us:</strong></h2> <p data-start="107" data-end="729">At Fireworks, we’re building the future of generative AI infrastructure. Our platform delivers the highest-quality models with the fastest and most scalable inference in the industry. We’ve been independently benchmarked as the leader in LLM inference speed and are driving cutting-edge innovation through projects like our own function-calling and multimodal models. Fireworks is a Series C company valued at $4 billion and backed by top investors including Benchmark, Sequoia, Lightspeed, Index, and Evantic. We’re an ambitious, collaborative team of builders, founded by veterans of Meta PyTorch and Google Vertex AI.</p></div><h2>The Role:</h2> <p>As a Member of Technical Staff, Cluster Management at Fireworks AI, you will play a critical role in making our world-scale virtual AI cloud reliable, performant, and efficient. You will apply your expertise in large-scale distributed systems, cloud infrastructure, and operational excellence. You will partner closely with world-class software engineers and AI experts to scale cutting-edge AI platforms to meet fast-growing demand and support ever-evolving application paradigms. This role is for someone passionate about operating highly robust, observable, and automated systems and enabling customer success.</p> <h2>Key Responsibilities:</h2> <ul> <li><strong>Ensuring System Reliability:</strong> Ensure systems are designed and implemented for high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure.</li> <li><strong>Incident Management & Response:</strong> Lead efforts in incident detection, response, and resolution for critical production issues. 
Drive post-mortems to identify root causes and implement preventative measures to improve system reliability.</li> <li><strong>Observability & Monitoring:</strong> Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance.</li> <li><strong>Automation & Toil Reduction:</strong> Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management.</li> <li><strong>Capacity Planning & Performance Tuning:</strong> Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization.</li> <li><strong>Reliability Best Practices:</strong> Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence.</li> <li><strong>On-call Rotation:</strong> Participate in a periodic on-call rotation to support our production environment and respond to critical alerts.</li> </ul> <h2>Minimum qualifications:</h2> <ul> <li>Bachelor's degree in Computer Science, related technical field, or equivalent practical experience.</li> <li>5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems.</li> <li>Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems.</li> <li>Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services.</li> <li>Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes).</li> <li>Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like 
Prometheus, Grafana, ELK stack, and distributed tracing.</li> <li>Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development.</li> <li>In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging.</li> <li>Proven ability to troubleshoot complex issues across the entire stack.</li> <li>Excellent communication, collaboration, and problem-solving skills.</li> <li>Willingness to participate in on-call rotations.</li> </ul> <h2>Preferred qualifications:</h2> <ul> <li>Experience managing data-center-grade GPU clusters, including monitoring, troubleshooting, and repairing GPUs and related components such as HBM and RDMA-enabled networking.</li> <li>Experience with machine learning infrastructure, model serving, or distributed AI frameworks.</li> <li>Hands-on experience in security and data protection.</li> </ul> <p> </p><div class="content-pay-transparency"><div class="pay-input"><div class="description"><p>Total compensation for this role also includes meaningful equity in a fast-growing startup, along with a competitive salary and comprehensive benefits package. Base salary is determined by a range of factors including individual qualifications, experience, skills, interview performance, market data, and work location. 
The listed salary range is intended as a guideline and may be adjusted.</p></div><div class="title">Base Pay Range (Plus Equity)</div><div class="pay-range"><span>$175,000</span><span class="divider">—</span><span>$220,000 USD</span></div></div></div><div class="content-conclusion"><h2><strong>Why Fireworks AI?</strong></h2> <ul> <li>Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure, from low-latency inference to scalable model serving.</li> <li>Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally.</li> <li>Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results.</li> <li>Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation.</li> </ul> <p><em>Fireworks AI is an equal-opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all innovators.</em></p></div>
