Job Description

Arcadia is dedicated to happier, healthier days for all. We believe that there is a better healthcare world – one powered by data. Our platform transforms complex, diverse data into a unified foundation for health, helping organizations deliver better care, boost revenue, and lower costs.

We’re a team of fiercely driven individuals committed to making healthcare more sustainable—and we’re looking for passionate people to help us get there.

For more information, visit arcadia.io

Why This Role is Important to Arcadia

Love building reliable systems, and want to make a difference?

Arcadia’s customers rely on us to securely process and deliver high-value healthcare insights. Reliability, availability, performance, and security are foundational to trust—especially when systems support critical workflows and handle PHI. As a Principal Site Reliability Engineer, you’ll set reliability strategy across teams, drive cross-cutting platform improvements, and ensure we can scale delivery without scaling operational burden.

What Success Looks Like

In 3 months

Build deep context on Arcadia’s platform, production risks, and operational practices. Participate in on-call/incident response and quickly improve signal quality for at least one critical domain (dashboards, alerts, traces, runbooks). Identify a high-leverage reliability initiative and align stakeholders on scope, success metrics, and milestones.

In 6 months

Establish SLOs/error budgets for key customer journeys, drive operational readiness standards for launches, and lead remediation for recurring incidents with measurable reductions in customer impact and MTTR. Deliver major toil-reduction improvements via automation and self-service workflows.

In 12 months

Own and execute a reliability program with cross-org impact (e.g., GitOps delivery guardrails, observability platform evolution, resilience/DR improvements, or secure infrastructure controls). Influence architecture decisions, establish org-wide operational standards, and mentor Staff engineers—raising the reliability and security bar across Arcadia.

What You'll Be Doing

Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most

Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes

Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation

Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)

Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk

Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams

Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)

Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services

Champion infrastructure security best practices for environments that handle PHI (least privilege, secrets management, auditability, and defense-in-depth)

Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation; raise reliability standards across teams

What You'll Bring

8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale

Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations

Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails

Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows

Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls

Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems

Proficiency in Python for building automation, tooling, and reliability improvements

Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)

Excellent communication skills: can translate technical risk and reliability tradeoffs to engineering leadership, product, and stakeholders; produces high-quality docs/runbooks

Would Love For You To Have

Experience with ScyllaDB or similar distributed databases (e.g., Cassandra) and their reliability/performance characteristics

Experience with Spark or data processing platforms, including reliability and cost considerations for large-scale workloads

Familiarity with agentic coding practices and principles (safe automation, reviewable changes, guardrail-first workflows)

Strong infrastructure security knowledge: threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability

Principal Engineer Competencies

Customer Focus: champions customer impact; drives SLO definition with product partners; participates in incidents to limit customer impact; may engage customers to understand problems

Technical Leadership: serves as the lead technical representative across teams; negotiates interfaces; anticipates edge cases; designs telemetry for availability and reliability

Total Ownership: owns outcomes from requirements and design through production support; transitions complex changes with multi-phase rollouts and long-term ownership

Effective Communication: communicates to diverse audiences; finalizes key documentation (runbooks, guides, FAQs); synthesizes standards and best practices

Proactive Leadership: coaches senior/peer teams primarily through review; delegates appropriately; sets clear expectations (Definition of Done) and improves service processes/rotations

What You'll Get

Be part of a mission-driven company that is transforming the healthcare industry by changing the way patients receive care

A flexible, remote-friendly company with personality and heart

Employee-driven programs and initiatives for personal and professional development

Become a member of the talented, energized, diverse and purpose-driven Arcadian Community

Principal Site Reliability Engineer

Remote USA

SRE Engineer

9 days ago

Head Level
