Remote – Americas
Full-time
We are recruiting on behalf of a fast-growing AI infrastructure company that builds a high-performance vector database powering semantic search, RAG pipelines, AI agents, and large-scale machine learning applications.
We are seeking a Senior Site Reliability Engineer (SRE) to join the Cloud Operations team and help ensure reliability, observability, and operational excellence across production cloud environments.
This role is highly operations-focused and ideal for engineers who enjoy owning system reliability, improving automation, and operating large-scale distributed systems in production.
As a Senior SRE, you will maintain and improve production infrastructure, reducing operational risk and raising system reliability at scale.
You will work closely with platform engineering and infrastructure teams to ensure systems remain secure, performant, and highly available as customer usage grows.
Remote – Americas (North, Central, or South America)
Candidates must be able to work primarily within time zones in the Americas
Cloud Infrastructure & Operations
Operate and maintain production cloud infrastructure at scale
Manage Kubernetes clusters, networking, and deployment pipelines
Improve reliability, performance, and security of production systems
Monitoring & Observability
Enhance monitoring, logging, and alerting systems
Improve operational visibility and incident detection
Incident Response & Reliability
Lead incident response and root cause analysis
Implement preventive measures and continuous reliability improvements
Participate in on-call rotations
Automation & Process Improvement
Reduce operational toil through automation and tooling
Maintain and improve runbooks and operational procedures
Collaboration
Work closely with platform engineering and infrastructure teams
Support scalable architecture and operational best practices
5+ years of experience in DevOps, SRE, or infrastructure operations
Strong hands-on experience running Kubernetes in production
Solid understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, GCP, or Azure)
Experience with monitoring, alerting, and incident management
Experience with infrastructure automation or infrastructure-as-code
Comfortable participating in on-call rotations
Strong communication and problem-solving skills
Experience with Terraform or similar IaC tools
Familiarity with Prometheus, Grafana, Loki, or OpenTelemetry
Scripting experience in Python, Bash, or Go
Experience in SaaS, cloud platforms, or data infrastructure environments
Exposure to security, compliance, or system hardening
Competitive compensation and benefits
Fully remote work environment
Flexible working hours
Opportunity to work on mission-critical cloud infrastructure
Collaborative, engineering-driven culture
If you are passionate about reliability engineering, cloud infrastructure, and large-scale distributed systems, we would love to hear from you.