
Nebius

Senior Software Engineer (Data Platform, C++) at Nebius

Germany; Israel; Netherlands; Prague, Czech Republic; United Kingdom | Full-time | Backend | Posted 15 days ago

About the Role

<div class="content-intro"><p><strong>About Nebius:</strong></p> <p>Nebius is leading a new era in cloud infrastructure for the global AI economy. We are building a full-stack AI cloud platform that supports developers and enterprises from data and model training through to production deployment, without the cost and complexity of building large in-house AI/ML infrastructure.</p> <p>Built by engineers, for engineers. From large-scale GPU orchestration to inference optimization, we own the hard problems across compute, storage, networking and applied AI.</p> <p>Listed on Nasdaq (NBIS) and headquartered in Amsterdam, we have a global footprint with R&amp;D hubs across Europe, the UK, North America and Israel. Our team of 1,500+ includes hundreds of engineers with deep expertise across hardware, software and AI R&amp;D.</p></div><h4 id="The-role" data-local-id="c41794725a88" data-renderer-start-pos="1">The role</h4> <p data-renderer-start-pos="11" data-local-id="b628b6030a86">We’re looking for a <strong data-renderer-mark="true">Software Engineer with strong C++ expertise</strong> to join the team that builds and operates <strong data-renderer-mark="true">Nebius Data Platform</strong>, a distributed storage and processing platform that serves as the company’s “source of truth” and the backbone of many internal (and some external) products.</p> <p data-renderer-start-pos="292" data-local-id="de1e1b789ce4">Nebius Data Platform is a <strong data-renderer-mark="true">single multi-tenant ecosystem based on YTsaurus</strong>: instead of running separate HDFS/Kafka/HBase-style systems, we provide storage, compute, and analytics capabilities inside one platform.</p> <p data-renderer-start-pos="506" data-local-id="bcbb3787b5d5">We run and extend our own Nebius distribution of the open-source YTsaurus ecosystem and develop significant in-house functionality at both the core and platform levels. We can design, implement, and roll out features end-to-end on our clusters without waiting for upstream approval, and we contribute upstream when it makes sense.</p> <p data-renderer-start-pos="838" data-local-id="7544511b100f">Our largest production cluster currently spans <strong data-renderer-mark="true">~500 servers and ~20k CPU cores</strong> and stores <strong data-renderer-mark="true">~10 PB of compressed data</strong>, supporting workloads that range from business-critical pipelines and financial transactions to large-scale ML/LLM training datasets and compute.</p> <h4 id="What’s-inside-the-platform" data-local-id="2100fd8e190a" data-renderer-start-pos="1105">What’s inside the platform</h4> <p data-renderer-start-pos="1133" data-local-id="76f8ed17de21">You’ll work on a system that includes and ties together:</p> <ul> <li data-renderer-start-pos="1195" data-local-id="6ddeb26cab43"><strong data-renderer-mark="true">Distributed Storage (Cypress)</strong>: transactional semantics, tiered storage, erasure coding, replication, and strict reliability requirements.</li> <li data-renderer-start-pos="1336" data-local-id="13562646bfb3"><strong data-renderer-mark="true">Compute &amp; ETL</strong>: a cluster-wide job scheduler (tens of thousands of cores), MapReduce, <strong data-renderer-mark="true">YQL</strong> for SQL-like data processing, and <strong data-renderer-mark="true">SPYT (Spark over YTsaurus)</strong> for modern data engineering.</li> <li data-renderer-start-pos="1518" data-local-id="97897e1cbf8b"><strong data-renderer-mark="true">Interactive analytics (CHYT)</strong>: ClickHouse® instances spun up directly on compute nodes for fast SQL over data in place.</li> <li data-renderer-start-pos="1640" data-local-id="02d7848ef2e9"><strong data-renderer-mark="true">Dynamic Tables</strong>: low-latency NoSQL key-value storage with distributed ACID transactions for OLTP-style workloads and feature stores.</li> <li
data-renderer-start-pos="1760" data-local-id="bcaf763a8a8a"><strong data-renderer-mark="true">Orchestracto</strong>: workflow orchestration deeply integrated with the platform (Airflow-like, but platform-native).</li> </ul> <h4 id="What-you’ll-do" data-local-id="ef3970644491" data-renderer-start-pos="1873">What you’ll do</h4> <p data-renderer-start-pos="1889" data-local-id="76f1e15bff02">We’re looking for engineers who combine strong systems skills with <strong data-renderer-mark="true">product sense</strong>: understanding who uses the platform and why certain capabilities matter, and making pragmatic trade-offs to maximize impact. On our team, engineering work stays connected to real users and outcomes: you’ll regularly align with internal stakeholders, clarify requirements, and help drive prioritization.</p> <p data-renderer-start-pos="2285" data-local-id="55415fa782de">In this role, you will:</p> <ul> <li data-renderer-start-pos="2312" data-local-id="976b454233f2"><strong data-renderer-mark="true">Design and implement new functionality in the YTsaurus core</strong> (C++) with production reliability in mind.</li> <li data-renderer-start-pos="2414" data-local-id="2003d4e11122"><strong data-renderer-mark="true">Build and evolve platform-level capabilities:</strong> the platform architecture and operating model, including multi-cluster growth, shared primitives, and a consistent experience that scales with new teams and use cases.</li> <li data-renderer-start-pos="2616" data-local-id="fd06f5a1ee0e">Improve the <strong data-renderer-mark="true">end-to-end platform experience</strong> for internal (and external-facing) users: APIs, guardrails, debugging workflows, and automation.</li> <li data-renderer-start-pos="2755" data-local-id="c2f3a7ce1432">Own production quality: <strong data-renderer-mark="true">incident response and on-call rotation</strong>, root-cause analysis, and turning lessons learned into durable fixes.</li> </ul> <h4 id="Example-projects" data-local-id="e54b80d154d0"
data-renderer-start-pos="2883">Example projects</h4> <ul> <li data-renderer-start-pos="2903" data-local-id="67c11986cf10">Roll out sharded YTsaurus masters (including Kubernetes operator support) and build automatic balancing of metadata across master cells (consensus groups) to remove control-plane bottlenecks and <strong data-renderer-mark="true">unlock 10–100x cluster growth</strong>.</li> <li data-renderer-start-pos="3128" data-local-id="61b805a294c7"><strong data-renderer-mark="true">Make CHYT interactive SQL faster and more predictable under high load</strong> through performance work such as data-skipping (min/max-style) indexes and improved execution introspection.</li> <li data-renderer-start-pos="3300" data-local-id="8dcfe4b24784">Turn <strong data-renderer-mark="true">Orchestracto into a platform product</strong> by defining the building blocks, developer experience, and governance for how teams create and share workflows.</li> <li data-renderer-start-pos="3457" data-local-id="b8561c1450f0"><strong data-renderer-mark="true">Scale and harden Parquet-on-S3 for native YTsaurus workloads</strong> by tackling replication and data movement, consistent lifecycle semantics, and master-server metadata optimizations for performance and reliability.</li> <li data-renderer-start-pos="3661" data-local-id="55b0019ca4db">Design and ship <strong data-renderer-mark="true">complete, trustworthy audit trails for data changes</strong> (who/what/when) across heterogeneous storage and compute paths.</li> </ul> <h4 id="Tech-stack" data-local-id="5ea174565779" data-renderer-start-pos="3796">Tech stack</h4> <ul> <li data-renderer-start-pos="3810" data-local-id="7558ae4388dc">Core: <strong data-renderer-mark="true">modern C++</strong> (C++20, asynchronous and multithreaded primitives)</li> <li data-renderer-start-pos="3872" data-local-id="44f7dddeadcc">Services &amp; tooling: <strong data-renderer-mark="true">Go</strong> and <strong data-renderer-mark="true">Python</strong> (microservices, utilities, integration tests)</li> </ul> <h4 id="What-we-expect" data-local-id="47ea8a0da9c3" data-renderer-start-pos="3955">What we expect</h4> <ul> <li data-renderer-start-pos="3973" data-local-id="192febeef03b"><strong data-renderer-mark="true">5+ years</strong> of software engineering experience.</li> <li data-renderer-start-pos="4021" data-local-id="1b2ced524f22">Strong <strong data-renderer-mark="true">C++</strong> skills (you’ll write core code).</li> <li data-renderer-start-pos="4068" data-local-id="a4b885fc11b4">Working knowledge of <strong data-renderer-mark="true">Python and/or Go</strong> (you don’t need to be an expert, but you should be comfortable navigating code in them).</li> <li data-renderer-start-pos="4183" data-local-id="895f7b8fff8b">Experience developing and/or operating <strong data-renderer-mark="true">high-load, distributed services</strong>.</li> <li data-renderer-start-pos="4258" data-local-id="59c70dbd2e4a">Production mindset: ability to <strong data-renderer-mark="true">use SSH, read logs/metrics/traces</strong>, and debug the behavior of distributed systems.</li> <li data-renderer-start-pos="4367" data-local-id="5f999bb71d56">Solid CS fundamentals: algorithms, data structures, and concurrency basics.</li> </ul> <h4 id="Nice-to-have" data-local-id="e68d74a84d84" data-renderer-start-pos="4442">Nice to have</h4> <ul> <li data-renderer-start-pos="4458" data-local-id="5ed245f3385c">Experience with Big Data systems (YTsaurus, Hadoop, Spark, ClickHouse, Kafka, or similar ecosystems).</li> <li data-renderer-start-pos="4552" data-local-id="d3a1ef32bb22">Experience with multi-tenant platforms, schedulers, resource isolation, quotas, and reliability engineering.</li> <li data-renderer-start-pos="4664" data-local-id="9da3b43ec835">Strong performance engineering skills (profiling, lock contention, latency/throughput trade-offs).</li> </ul> <p><span data-ccp-props="{}"><em data-stringify-type="italic">We conduct coding interviews as part of the process.</em></span></p><div
class="content-conclusion"><p><strong>Benefits &amp; Perks:</strong></p> <ul> <li>Competitive compensation</li> <li>Career growth and learning opportunities</li> <li>Flexibility and work-life balance</li> <li>Collaborative and innovative culture</li> <li>Opportunity to work on impactful AI projects</li> <li>International environment and talented teams</li> </ul> <p><strong>What's it like to work at Nebius:</strong></p> <p>Fast moving&nbsp;- Bold thinking&nbsp;- Constant growth&nbsp;- Meaningful impact&nbsp;- Trust and real ownership&nbsp;- Opportunity to shape the future of AI&nbsp;</p> <p><strong>Equal Opportunity Statement:</strong></p> <p>Nebius is an equal opportunity employer. We are committed to fostering an inclusive and diverse workplace and to providing equal employment opportunities in all aspects of employment. We do not discriminate on the basis of race, color, religion, sex (including pregnancy), national origin, ancestry, age, disability, genetic information, marital status, veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by applicable law.</p> <p>Applicants must be authorized to work in the country in which they apply and will be required to provide proof of employment eligibility as a condition of hire.&nbsp;</p> <p>If you need accommodations during the application process, please let us know.</p></div>