Site Reliability Engineer - (Remote - Canada)

Job Description

Jobgether has ALL remote jobs globally. We match you to roles where you're most likely to succeed, and provide feedback on every application to help you learn. No more guesswork, application black holes, or recruiter ghosting in your job search.

For one of our clients, we are looking for a Site Reliability Engineer, working remotely from Canada.

As a Site Reliability Engineer (SRE), you will play a key role in designing, implementing, and maintaining scalable infrastructure while ensuring system reliability and efficiency. Your focus will be on automation, performance optimization, and cloud resource management. Collaborating with cross-functional teams, you will streamline CI/CD pipelines, enhance monitoring solutions, and support a highly available infrastructure. The position calls for a proactive approach to troubleshooting and continuous improvement, so that new services integrate smoothly and follow current SRE best practices.

Accountabilities

  • Design, build, and maintain highly scalable cloud infrastructure using Terraform and Terragrunt for automated resource provisioning.
  • Manage and optimize AWS cloud environments, ensuring security, cost efficiency, and high availability.
  • Oversee data streaming platforms using Confluent Cloud and Kafka, ensuring reliable data pipelines.
  • Deploy and manage Redis instances for caching and real-time data processing.
  • Implement and maintain monitoring and alerting solutions using Prometheus, Grafana, Alertmanager, and Opsgenie.
  • Enable feature flag management and controlled rollouts with LaunchDarkly.
  • Manage Kubernetes clusters, utilizing Helm, ArgoCD, Istio, and Kustomize for continuous deployment and infrastructure-as-code practices.
  • Collaborate with development teams to integrate new services into the infrastructure seamlessly.
  • Troubleshoot complex system issues to maintain high availability and performance.
  • Continuously improve automation tools, processes, and methodologies to enhance system scalability.

Requirements

  • 4+ years of experience in Site Reliability Engineering or a similar role with a strong focus on cloud infrastructure.
  • Expertise in Infrastructure as Code (IaC) using Terraform and Terragrunt.
  • Deep knowledge of AWS cloud services and best practices for scalable and secure architectures.
  • Hands-on experience with Confluent Cloud and Kafka for distributed data streaming.
  • Strong experience with Redis for caching and RDS for data storage.
  • Proficiency with OpenSearch/Elasticsearch/ChaosSearch for search and analytics.
  • Advanced knowledge of monitoring tools like Prometheus, Grafana, Alertmanager, and Opsgenie.
  • Experience with LaunchDarkly for feature flag management.
  • Extensive experience managing Kubernetes clusters, including Helm for package management, ArgoCD for deployments, and Istio for service mesh configurations.
  • Familiarity with Kustomize for Kubernetes resource configuration.
  • Strong problem-solving skills and ability to troubleshoot complex systems in production environments.
  • Excellent communication and collaboration skills within agile teams.

Nice to Have

  • Experience working in multi-cloud environments (AWS, GCP, Azure).
  • Familiarity with security best practices in cloud and containerized environments.
  • Knowledge of serverless architectures and CI/CD tools like Jenkins and GitHub Actions.
  • Some development experience with Node.js, Python, or Go.

Benefits

  • Competitive salary based on experience and qualifications.
  • Fully remote work flexibility, with a collaborative team environment.
  • Comprehensive healthcare coverage, including medical, dental, and vision plans.
  • Retirement savings plan with company matching.
  • Flexible paid time off (PTO) to support work-life balance.
  • Professional development opportunities, including training and certifications.
  • Access to cutting-edge technology and opportunities to work on innovative projects.
