Design, implement, and maintain infrastructure-as-code using Terraform; contribute to shared module libraries and enforce IaC standards across the team;
Manage and evolve Helm chart definitions and ArgoCD GitOps workflows for multi-region SaaS deployments;
Operate and maintain observability infrastructure, including Grafana, alerts, dashboards, and log pipelines; work to eliminate noise and surface signal;
Contribute to pipeline reliability: identify flaky stages, reduce build times, and improve developer experience across CI/CD pipelines;
Remediate security vulnerabilities (CVEs) in container images and infrastructure components; participate in compliance work including FedRAMP support activities;
Develop and maintain runbooks, change management procedures, and operational documentation;
Ensure alignment with internal policies and frameworks such as ISO 27001, SOC 2, and NIST;
Contribute to AI-assisted tooling and automation (e.g., Claude-based Terraform agents, automated triage tools) as part of the team's operational efficiency roadmap;
Participate in the on-call incident response rotation; lead or support incident command during active production incidents, including root cause analysis and post-incident review.
5+ years of industry experience with a trajectory that demonstrates growing depth in cloud infrastructure and SRE practices;
Managed production Kubernetes environments at scale: not just deployed workloads, but owned cluster health, upgrades, and failure modes;
Responded to production incidents in high-stakes environments where downtime has real consequences;
Written and maintained Terraform at the module level, not just as a consumer, with an understanding of state, dependencies, and the operational burden of drift;
Operated in a GitOps environment, with a good understanding of Helm chart organization, ArgoCD app-of-apps patterns, or equivalent;
Balanced reactive operational work with proactive roadmap delivery, protecting time for improvements while keeping production stable;
Worked with observability as a first-class discipline: built meaningful dashboards, eliminated alert fatigue, and used metrics to make operational decisions;
Contributed to security hardening in a regulated or compliance-adjacent environment: FedRAMP, SOC 2, or similar frameworks are a strong asset.