Weekday Ai

Staff Engineer - DevOps

Job Description

This role is for one of Weekday's clients.

Min Experience: 9 years

Location: Remote (India)

Job Type: Full-time

As a Staff Engineer, you will architect and evolve our DevOps ecosystem, champion cloud cost governance, and implement best-in-class container orchestration practices. You will work cross-functionally with engineering, security, and finance teams to ensure operational excellence while proactively managing infrastructure spend.

Requirements

Key Responsibilities

DevOps Leadership & Architecture

  • Lead end-to-end DevOps strategy, including CI/CD pipelines, automation, infrastructure-as-code, and release engineering.
  • Design scalable, resilient cloud-native architectures aligned with business growth.
  • Establish DevOps best practices, reliability standards, and operational governance.

Kubernetes & Containerization

  • Architect and manage large-scale Kubernetes environments for production workloads.
  • Optimize workloads across clusters for performance, reliability, and cost efficiency.
  • Build and maintain containerized applications using Docker and Kubernetes, ensuring portability and scalability.
  • Drive multi-cluster, multi-region deployments where necessary.

Cost Savings & Cost Planning

  • Own infrastructure cost visibility and optimization initiatives.
  • Implement cloud cost-saving strategies including rightsizing, reserved capacity planning, auto-scaling optimization, and workload scheduling.
  • Partner with finance teams for budgeting, forecasting, and cost planning.
  • Create dashboards and reporting mechanisms to track infrastructure ROI and spend trends.
  • Continuously identify inefficiencies and implement measurable cost-reduction initiatives without compromising performance.

Monitoring & Observability

  • Design and implement comprehensive monitoring systems using Grafana and related observability tools.
  • Build real-time dashboards for system health, performance metrics, and cost insights.
  • Establish alerting frameworks to minimize downtime and improve incident response.
  • Drive improvements in system reliability through data-driven monitoring and post-incident analysis.

Automation & Reliability

  • Automate provisioning, deployments, scaling, and recovery processes.
  • Improve system resilience, availability, and disaster recovery strategies.
  • Lead root cause analysis for major incidents and implement preventive measures.

Required Qualifications

  • 9–15 years of experience in DevOps, Site Reliability Engineering, or Cloud Infrastructure roles.
  • Deep expertise in container orchestration, with production-grade Docker and Kubernetes implementations.
  • Strong hands-on experience with Grafana, monitoring systems, and observability frameworks.
  • Proven track record in cost-saving initiatives and infrastructure cost planning in cloud environments.
  • Experience designing highly available, scalable systems in AWS, Azure, or GCP.
  • Strong understanding of Infrastructure-as-Code (Terraform, CloudFormation, etc.).
  • Expertise in CI/CD automation and release management.
  • Solid knowledge of networking, security best practices, and cloud architecture patterns.

Preferred Attributes

  • Experience managing large-scale production environments with strict SLAs.
  • Strong analytical skills with the ability to translate technical metrics into financial impact.
  • Leadership mindset with experience mentoring engineers and influencing cross-functional teams.
  • Excellent communication and stakeholder management skills.