Are you excited by the challenge of managing large-scale systems, automating infrastructure, and ensuring seamless service reliability? We’re seeking a Site Reliability Engineer (SRE) to play a key role in shaping the future of our global infrastructure.
Overseeing a global infrastructure of ~10,000 on-prem servers, you’ll tackle unique technical challenges, engineer scalable systems, and have a direct impact on the reliability and performance of our products.
Main Responsibilities
Deliver projects on time: Plan, delegate, execute, and oversee key projects;
Collaborate: Work closely with stakeholders and other teams. Mentor colleagues and lead knowledge transfer;
Ensure quality and reduce technical debt: Deliver solutions with solid design and address blockers, toil, and debt to keep systems healthy;
Drive engineering excellence: Aim for quality and choose the right solution for the problems we face;
Protect solution quality: Ensure designs are implemented to a high standard and with minimal tech debt;
Data‑backed decisions: Help teams and stakeholders navigate data and act on insights;
Build reliable infrastructure: Design and maintain highly available, scalable infrastructure with monitoring, alerting, and anomaly detection;
Automate everything: Create and optimize automation to streamline deployments, improve speed, and cut manual work;
Solve complex issues: Troubleshoot, debug, and resolve critical issues in complex systems;
Use AI: Integrate AI into workflows and processes to speed up delivery and reduce toil.
Core Requirements
Observability: Experience with monitoring tools and frameworks to ensure system observability (OpenSearch, VictoriaMetrics, Prometheus, Thanos, Mimir, OpenTelemetry, Nagios);
Databases and storage systems: Experience operating highly available SQL and NoSQL databases and object stores at scale (MySQL, Percona, PostgreSQL, Cassandra, ClickHouse, Timescale, Druid, MinIO);
Data visualization: Ability to build meaningful dashboards that show the right insights (Grafana, OpenSearch Dashboards);
Alerting and anomaly detection: Ability to build anomaly detection and alerting pipelines;
Programming: Proficiency in one or more programming languages for automation scripts and integrations (Python, Go, Rust, C);
Linux: Strong knowledge of Linux systems, especially Debian‑based distributions;
Workflow: Ability to use workflow automation frameworks (Airflow, Prefect, n8n);
Configuration management: Ability to design and develop configuration management codebases and deployment pipelines (SaltStack, Ansible, Rundeck);
Networking: Strong understanding of networking protocols and concepts (overlay networks, VPNs, proxies, DNS, HTTP, SSL/TLS, TCP, UDP);
Security: Ability to design secure systems, backed by working knowledge of security concepts and tools (Vault, PKI, mTLS).