Job Description

Position Overview

As a Staff Site Reliability Engineer (SRE), you will be responsible for developing and implementing highly reliable and scalable systems. You will work closely with cross-functional teams to create a stable, efficient, and scalable environment, and you will lead complex projects that require collaboration with multiple stakeholders.

Job Responsibilities

Define and enforce SRE best practices and standards.

Architect and implement highly reliable and scalable systems.

Lead complex post-incident reviews and implement systemic improvements.

Collaborate with product and engineering teams to set reliability targets.

Manage high-impact incidents and coordinate incident response.

Contribute to budget planning and resource allocation.

Lead efforts to establish disaster recovery strategies.

Provide technical leadership and mentorship to the SRE team.

Continuously track and improve metrics (for example, DORA) to optimize software delivery and operational performance.

Participate in on-call rotation.

Other duties as assigned.

Required Qualifications

8-10 years of experience in a similar or related role

Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent experience)

In-depth knowledge of Cloud Ops technologies including Amazon Web Services (AWS) and Terraform or other Infrastructure as Code (IaC)

Advanced knowledge in Linux operating systems and troubleshooting OS issues

Expertise in setting up and managing monitoring tools (such as Prometheus, Grafana, Datadog, Nagios, OpenTelemetry, ELK, or similar tools)

In-depth understanding of monitoring and alerting systems and of networking principles (such as load balancing, CDN, and disaster recovery)

Strong understanding of:

Incident management

Capacity planning

Disaster recovery

Observability practices (in tools such as OpenTelemetry and Jaeger)

Advanced experience with or knowledge of security measures and practices (for example, threat modeling, compliance, and secure coding practices)

Strong analytical and problem-solving skills

Knowledge of Linux systems and common system administration tasks

Strong understanding of programming/scripting languages (such as Python), including additional scripting skills in multiple languages to automate SRE operations

Excellent communication and teamwork skills

A willingness to learn and adapt in a fast-paced, dynamic environment

Preferred Qualifications

Familiarity with DevOps practices, Infrastructure as Code tools, and Agile methodologies is a plus

Staff Site Reliability Engineer

USA Only

SRE Engineer

10 days ago