Cerebras Systems

AI Infrastructure Operations Engineer at Cerebras Systems

Sunnyvale, CA or Toronto, Canada | Full-time | Deployment | Posted about 2 months ago
About the Role

<div class="content-intro"><p><span data-contrast="none">Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.&nbsp;</span><span data-ccp-props="{&quot;134233117&quot;:false,&quot;134233118&quot;:false,&quot;201341983&quot;:0,&quot;335559685&quot;:0,&quot;335559737&quot;:240,&quot;335559738&quot;:240,&quot;335559739&quot;:240,&quot;335559740&quot;:279}">&nbsp;</span></p> <p>Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups.&nbsp;<a href="https://openai.com/index/cerebras-partnership/">OpenAI recently announced a multi-year partnership with Cerebras</a>, to deploy 750 megawatts of scale, transforming key workloads with ultra high-speed inference.&nbsp;</p> <p>Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.</p></div><p><strong><span data-contrast="auto">About The Role</span></strong></p> <p><span data-contrast="auto">We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. 
These clusters give you the opportunity to work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power.</span><span data-ccp-props="{}">&nbsp;</span></p> <p><span data-contrast="auto">You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives. This role requires a deep understanding of Linux-based systems and containerization technologies, as well as experience monitoring and troubleshooting complex distributed systems. The ideal candidate is a dependable, proactive problem-solver with expertise in large-scale compute infrastructure and a strong commitment to customer success.&nbsp;</span><span data-ccp-props="{}">&nbsp;</span></p> <h4><strong><span data-contrast="auto">Responsibilities</span></strong></h4> <ul> <li><span data-contrast="auto">Manage and operate multiple advanced AI compute infrastructure clusters.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Monitor and oversee cluster health, proactively identifying and resolving potential issues.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Maximize compute capacity through optimization and efficient resource allocation.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Deploy, configure, and debug container-based services using Docker.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Handle engineering escalations and collaborate with other teams to resolve complex technical challenges.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Contribute to the 
development and improvement of our monitoring and support processes.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Stay up to date with the latest advancements in AI compute infrastructure and related technologies.</span><span data-ccp-props="{}">&nbsp;</span></li> </ul> <h4><span data-contrast="auto">Skills And Requirements</span></h4> <ul> <li><span data-contrast="auto">6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Strong proficiency in Python scripting for automation and system administration.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Deep understanding of Linux-based compute systems and command-line tools.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Extensive knowledge of Docker containers and of container orchestration and workload management platforms such as Kubernetes (k8s) and Slurm.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Experience with monitoring and alerting systems.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Proven track record of owning challenges and driving them to completion.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Excellent communication and collaboration skills.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Ability to work effectively in a fast-paced environment.</span><span data-ccp-props="{}">&nbsp;</span></li> <li><span data-contrast="auto">Willingness to participate in a 24/7 on-call rotation.</span><span 
data-ccp-props="{}">&nbsp;</span></li> </ul> <p><strong><span data-contrast="auto">Preferred Skills And Requirements</span></strong></p> <ul> <li data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" data-aria-posinset="1" data-aria-level="1"><span data-contrast="auto">Operating large scale GPU clusters.</span></li> <li data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" data-aria-posinset="1" data-aria-level="1">Knowledge of technologies like Ethernet, RoCE, TCP/IP, etc. 
is desired.</li> <li data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" data-aria-posinset="1" data-aria-level="1">Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure).</li> <li data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" data-aria-posinset="1" data-aria-level="1">Familiarity with machine learning frameworks and tools.</li> <li data-leveltext="" data-font="Symbol" data-listid="3" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;multilevel&quot;}" data-aria-posinset="1" data-aria-level="1">Experience with cross-functional team projects.<span data-ccp-props="{}">&nbsp;</span></li> </ul> <h4><span data-contrast="auto">Location</span><span data-ccp-props="{&quot;201341983&quot;:0,&quot;335559739&quot;:0,&quot;335559740&quot;:259}">&nbsp;</span></h4> <ul> <li data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" data-aria-posinset="16" 
data-aria-level="1"><span data-contrast="auto">SF Bay Area.</span></li> <li data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" data-aria-posinset="16" data-aria-level="1"><span data-contrast="auto">Toronto, Canada.</span></li> <li data-leveltext="" data-font="Symbol" data-listid="1" data-list-defn-props="{&quot;335552541&quot;:1,&quot;335559684&quot;:-2,&quot;335559685&quot;:720,&quot;335559991&quot;:360,&quot;469769226&quot;:&quot;Symbol&quot;,&quot;469769242&quot;:[8226],&quot;469777803&quot;:&quot;left&quot;,&quot;469777804&quot;:&quot;&quot;,&quot;469777815&quot;:&quot;hybridMultilevel&quot;}" data-aria-posinset="16" data-aria-level="1"><span data-contrast="auto">Bangalore, India.</span></li> </ul><div class="content-conclusion"><h4><strong>Why Join Cerebras</strong></h4> <p>People who are serious about software make their own hardware. At Cerebras, we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. 
Members of our team tell us there are five main reasons they joined Cerebras:</p> <ol> <li>Build a breakthrough AI platform beyond the constraints of the GPU.</li> <li>Publish and open source their cutting-edge AI research.</li> <li>Work on one of the fastest AI supercomputers in the world.</li> <li>Enjoy job stability with startup vitality.</li> <li>Our simple, non-corporate work culture that respects individual beliefs.</li> </ol> <p>Read our blog:&nbsp;<a href="https://www.cerebras.net/blog/5-reasons-to-join-cerebras" target="_blank" data-auth="NotApplicable" data-linkindex="0">Five Reasons to Join Cerebras in 2026.</a></p> <h4>Apply today and become part of the forefront of groundbreaking advancements in AI!</h4> <hr> <p><em>Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer.&nbsp;</em><em>We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. </em><em>We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.</em></p> <hr> <p><em>This website or its third-party tools process personal data. For more details, click <a href="https://www.cerebras.net/privacy/" target="_blank">here</a> to review our CCPA disclosure notice.</em></p></div>