SRE Lead – DBaaS Platform at Tessell

Hyderabad | Full-time | Customer Success | Posted 3 months ago

About the Role

Job Title: SRE Lead – DBaaS Platform

Role Overview

We are seeking an experienced Site Reliability Engineering (SRE) Lead to strengthen production reliability ownership for our Database-as-a-Service (DBaaS) platform. This role will bring hyperscaler-grade (RDS-level) operational expertise to drive deep product debugging, reliability engineering, and Dev collaboration across cloud-native database services. The SRE Lead will own platform stability, availability, performance, and incident excellence across Azure/AWS/GCP-hosted database workloads.

Location: Hyderabad
Department: Customer Success
Reporting: Senior Director Customer Success/SRE

Key Responsibilities

1. Production Reliability Ownership
• Own end-to-end reliability, availability, and performance of the DBaaS platform.
• Define and enforce SLIs, SLOs, and SLAs across all supported database engines.
• Lead production incident response (P1/P2), RCAs, and long-term resilience improvements.
• Drive error budget governance with Engineering and Product teams.

2. Hyperscaler-Level Operational Excellence
• Bring RDS/Cloud SQL/Azure SQL Managed Instance operational patterns into the platform.
• Implement automation-first operations (self-healing, auto-remediation, failover orchestration).
• Standardize HA/DR architectures across multi-region deployments.
• Improve backup reliability, replication integrity, and failover predictability.

3. Deep Product Debugging & Dev Collaboration
• Partner with Product Engineering for deep database engine-level debugging.
• Troubleshoot complex performance bottlenecks (IO, CPU, locking, replication lag).
• Support root cause analysis involving cloud infrastructure, storage, networking, and database internals.
• Influence platform architecture for operability and reliability.

4. Observability & Reliability Engineering
• Build unified observability across DBaaS (metrics, logs, traces).
• Define golden signals for database reliability.
• Improve proactive anomaly detection and capacity forecasting.
• Drive chaos testing and resilience validation practices.

5. Automation & Platform Hardening
• Lead reliability automation (runbooks → code).
• Improve provisioning, patching, upgrade, and scaling reliability.
• Standardize configuration management and drift detection.
• Enhance security posture aligned to enterprise compliance needs.

6. DevOps & Platform Governance
• Champion SRE best practices across engineering teams.
• Establish production readiness review frameworks.
• Define release reliability gates for DBaaS components.
• Mentor junior SREs and build a reliability-first culture.

Technical Requirements

Cloud Platforms (Mandatory – Multi-Cloud Preferred)
• Deep hands-on experience with:
  o AWS RDS / Aurora
  o Azure SQL MI / Azure Database Services
  o GCP Cloud SQL / AlloyDB
• Strong understanding of cloud networking, storage, IAM, HA architectures.

Database Expertise
• Strong operational knowledge of:
  o Oracle
  o PostgreSQL
  o MySQL
  o SQL Server
• Experience handling large-scale production databases (TB+ workloads).
• Performance tuning, replication troubleshooting, and backup recovery validation.

SRE & Platform Skills
• Strong scripting: Python / Bash / Go.
• Infrastructure as Code (Terraform / ARM / CloudFormation).
• CI/CD pipelines and release automation.
• Observability stack (Prometheus, Grafana, ELK, Datadog, etc.).
• Kubernetes exposure preferred.

Leadership Expectations
• 10+ years overall experience, 5+ in SRE/Platform roles.
• Prior experience in hyperscaler environments or cloud-native SaaS products.
• Strong incident leadership and executive communication skills.
• Ability to influence cross-functional stakeholders.
• Experience building and leading SRE teams preferred.

Success Metrics (First 12 Months)
• Reduction in P1/P2 incidents by X%.
• Improved MTTR by X%.
• Defined SLO framework implemented across all DBaaS services.
• Automation coverage >70% of repeat operational tasks.
• Zero critical audit non-compliance findings.

Why Join Us
• Opportunity to build hyperscaler-grade DBaaS reliability.
• Direct impact on mission-critical enterprise workloads.
• Multi-cloud platform engineering exposure.
• High visibility role working with Product, Engineering, and Leadership.
