Job Description

Are you passionate about building resilient, scalable systems and driving operational excellence? Orion Health is looking for an experienced and proactive Site Reliability Engineer (SRE) to join our Technology team. In this role, you will be responsible for ensuring the reliability, availability, performance, and scalability of our cloud infrastructure and healthcare platforms that support millions of users worldwide.

As a Site Reliability Engineer, you will work at the intersection of software engineering and operations, applying automation, observability, and reliability engineering practices to improve platform stability, reduce operational toil, and enable development teams to deliver high-quality solutions with confidence.

What You'll Be Doing

You will play a critical role in maintaining and evolving Orion Health's cloud infrastructure and operational platforms, helping to define and implement reliability standards, improve system observability, automate operational processes, and lead efforts to enhance platform resilience.

You will:

Design, implement, and maintain reliable, scalable, and secure infrastructure that supports Orion Health's products and services.
Define and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to ensure platform reliability and customer satisfaction.
Build and maintain observability solutions, including monitoring, logging, alerting, and tracing capabilities across cloud environments.
Participate in incident response activities, including troubleshooting, root cause analysis, remediation planning, and post-incident reviews.
Lead initiatives to reduce operational toil through automation, Infrastructure as Code (IaC), and self-service capabilities.
Collaborate closely with software engineering teams to improve application reliability, performance, and operational readiness.
Identify and eliminate reliability bottlenecks through performance tuning, capacity planning, and system optimisation.
Support infrastructure and platform upgrades, ensuring minimal disruption and maintaining service availability.
Conduct capacity forecasting and scalability planning to meet future business and customer demands.
Develop operational runbooks, standards, and best practices that improve system resilience and operational efficiency.
Champion reliability engineering principles and foster a culture of continuous improvement across teams.
Contribute to disaster recovery, business continuity, and platform resilience initiatives.

What You'll Bring to the Role

A passion for reliability engineering, automation, and scalable cloud technologies.
Strong analytical and problem-solving skills with a focus on operational excellence.
A proactive approach to identifying risks and preventing incidents before they impact customers.
Excellent communication skills and the ability to collaborate effectively with engineering, product, and operational teams.
The ability to balance reliability, performance, security, and delivery priorities in a fast-paced environment.
A continuous improvement mindset and commitment to learning emerging technologies and industry best practices.

Experience

To succeed in this role, you will ideally have:

3+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, Cloud Operations, or Infrastructure Engineering roles.
Experience supporting and operating production cloud environments.
Strong experience with cloud platforms such as AWS, Azure, or Google Cloud Platform.
Experience implementing Infrastructure as Code (IaC) using tools such as Terraform, Bicep, ARM templates, or CloudFormation.
Experience with containerisation and orchestration technologies such as Docker and Kubernetes.
Experience building and maintaining monitoring, logging, and observability solutions.
Experience managing production incidents and conducting root cause analysis.
Knowledge of CI/CD pipelines and modern software delivery practices.
Experience with automation and scripting using tools such as PowerShell, Bash, Python, or similar.
Understanding of networking, security, high availability, and disaster recovery principles.
Experience supporting highly available, customer-facing applications and services.

Skills

Site Reliability Engineering (SRE) practices and principles.
Cloud infrastructure administration and optimisation.
Infrastructure as Code (IaC).
Monitoring, observability, and alerting.
Incident management and post-incident analysis.
Capacity planning and performance optimisation.
Automation and operational efficiency improvement.
Kubernetes and container platform management.
CI/CD and release automation.
Technical troubleshooting and root cause analysis.
Documentation and operational process improvement.
Stakeholder engagement and cross-functional collaboration.

Characteristics

Customer Focused - Understands customer needs and proactively works to improve platform reliability, availability, and overall user experience.

Professionalism - Builds trust through technical expertise, accountability, integrity, and effective communication with stakeholders.

Communicator - Communicates clearly and effectively with both technical and non-technical audiences, ensuring transparency during operational activities and incidents.

Learning Mindset - Actively seeks opportunities to learn, innovate, and improve systems, processes, and personal capability.

Achiever - Takes ownership of outcomes, prioritises effectively, and delivers high-quality solutions that improve reliability and operational performance.

Team Player - Collaborates across engineering, product, infrastructure, and customer-facing teams to achieve shared objectives and drive successful outcomes.

Qualifications

A Bachelor's degree in Computer Science, Software Engineering, Information Technology, or a related discipline is preferred.
Industry certifications in cloud platforms, Kubernetes, DevOps, or reliability engineering are advantageous.

A Little About Us

Orion Health's vision is to reimagine the healthcare experience for all. We empower individuals, unlock insights from data, and unify fragmented healthcare systems. With more than 30 years of innovation behind us, Orion Health delivers solutions that improve healthcare outcomes in over 12 countries, managing data for more than 100 million people globally.

We are at the forefront of digital health transformation, with AI and patient engagement driving our next phase of growth.
