As a Staff Site Reliability Engineer (SRE), you will be responsible for developing and implementing highly reliable and scalable systems. You will work closely with cross-functional teams to create a stable, efficient, and scalable environment, and you will lead complex projects that require collaboration with multiple stakeholders.
Job Responsibilities
Define and enforce SRE best practices and standards.
Architect and implement highly reliable and scalable systems.
Lead complex post-incident reviews and implement systemic improvements.
Collaborate with product and engineering teams to set reliability targets.
Manage high-impact incidents and coordinate incident response.
Contribute to budget planning and resource allocation.
Lead efforts to establish disaster recovery strategies.
Provide technical leadership and mentorship to the SRE team.
Continuously track and improve metrics (for example, DORA) to optimize software delivery and operational performance.
Participate in on-call rotation.
Other duties as assigned.
Required Qualifications
8-10 years of experience in a similar or related role
Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent experience)
In-depth knowledge of Cloud Ops technologies, including Amazon Web Services (AWS) and Terraform or other Infrastructure as Code (IaC) tools
Advanced knowledge in Linux operating systems and troubleshooting OS issues
Expertise in setting up and managing monitoring tools (such as Prometheus, Grafana, Datadog, Nagios, OpenTelemetry, ELK, or similar tools)
In-depth understanding of monitoring and alerting systems and of networking principles (such as load balancing, CDN, and disaster recovery)
Strong understanding of:
Incident management
Capacity planning
Disaster recovery
Observability practices (in tools such as OpenTelemetry and Jaeger)
Advanced experience with or knowledge of security measures and practices (for example, threat modeling, compliance, and secure coding practices)
Strong analytical and problem-solving skills
Knowledge of Linux systems and common system administration tasks
Strong understanding of programming/scripting languages (such as Python), including scripting skills in multiple languages to automate SRE operations
Excellent communication and teamwork skills
A willingness to learn and adapt in a fast-paced, dynamic environment
Preferred Qualifications
Familiarity with DevOps practices, Infrastructure as Code (IaC) tools, and Agile methodologies is a plus