Staff Site Reliability Engineer

Job Description

Position Overview

As a Staff Site Reliability Engineer (SRE), you will be responsible for developing and implementing highly reliable and scalable systems. You will work closely with cross-functional teams to create a stable, efficient, and scalable environment, and you will lead complex projects that require collaboration with multiple stakeholders.


Job Responsibilities
  • Define and enforce SRE best practices and standards.
  • Architect and implement highly reliable and scalable systems.
  • Lead complex post-incident reviews and implement systemic improvements.
  • Collaborate with product and engineering teams to set reliability targets.
  • Manage high-impact incidents and coordinate incident response.
  • Contribute to budget planning and resource allocation.
  • Lead efforts to establish disaster recovery strategies.
  • Provide technical leadership and mentorship to the SRE team.
  • Continuously track and improve metrics (for example, DORA) to optimize software delivery and operational performance.
  • Participate in on-call rotation.
  • Perform other duties as assigned.

Required Qualifications
  • 8-10 years of experience in a similar or related role
  • Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent experience)
  • In-depth knowledge of Cloud Ops technologies, including Amazon Web Services (AWS) and Terraform or other Infrastructure as Code (IaC) tools
  • Advanced knowledge of Linux operating systems and troubleshooting OS issues
  • Expertise in setting up and managing monitoring tools (such as Prometheus, Grafana, Datadog, Nagios, OpenTelemetry, ELK, or similar tools)
  • In-depth understanding of monitoring and alerting systems and of networking principles (such as load balancing, CDN, and disaster recovery)
  • Strong understanding of:
      • Incident management
      • Capacity planning
      • Disaster recovery
      • Observability practices (in tools such as OpenTelemetry and Jaeger)
  • Advanced experience with or knowledge of security measures and practices (for example, threat modeling, compliance, and secure coding practices)
  • Strong analytical and problem-solving skills
  • Knowledge of Linux systems and common system administration tasks
  • Strong understanding of programming/scripting languages (such as Python), including scripting skills in multiple languages to automate SRE operations
  • Excellent communication and teamwork skills
  • A willingness to learn and adapt in a fast-paced, dynamic environment

Preferred Qualifications
  • Familiarity with DevOps practices, Infrastructure as Code tools, and Agile methodologies