We are seeking an experienced Senior DevOps Engineer to join our dynamic and innovative team. As a DevOps Engineer, you will play a key role in ensuring the reliability, availability, and performance of our systems and services. You will work closely with cross-functional teams to build and maintain a robust and scalable infrastructure while championing best practices for reliability, automation, performance optimization, monitoring, and alerting.
You need advanced or fluent English to communicate with these teams and with clients during the workday.
Key Responsibilities:
# Incident Response: Lead and participate in incident response efforts, managing critical incidents to resolution, conducting post-incident analyses, and implementing preventive measures.
# Performance Optimization: Identify and address performance bottlenecks, and work with the team to optimize system performance to meet service-level objectives (SLOs).
# Capacity Planning: Collaborate on capacity planning efforts, ensuring that systems can handle current and future growth, and participate in capacity forecasting and resource allocation.
# Automation: Develop and maintain infrastructure as code (IaC) using tools like Terraform and automate routine operational tasks to improve efficiency and reduce manual intervention.
# Monitoring and Alerting: Implement and enhance monitoring, alerting, and logging systems to proactively detect issues, conduct root cause analysis, and ensure system health.
# Collaboration: Work with development, operations, and other teams to bridge the gap between development and production environments, and promote a collaborative culture that improves automation, efficiency, delivery, and software quality.
# Documentation: Maintain detailed documentation of systems, processes, and configurations, and contribute to knowledge sharing within the team.
Must-Have Skills:
- Excellent written and verbal communication skills in English;
- Proven experience in a similar DevOps or SRE role, with a strong focus on incident response, performance optimization, and automation;
- Proficiency in at least one programming language (e.g., Python, Go, Java) for scripting and automation tasks;
- Experience with cloud computing platforms (the client uses Azure) and containerization technologies (e.g., Docker, Kubernetes);
- In-depth knowledge of infrastructure as code (IaC) principles and tools;
- Strong expertise in implementing and managing monitoring and alerting solutions (e.g., Prometheus, Grafana, Datadog, ELK Stack);
- Excellent problem-solving and troubleshooting skills, with a deep understanding of system and network fundamentals;
- Experience with GitLab and/or Bitbucket and with continuous integration/continuous deployment (CI/CD) pipelines (Jenkins with Groovy).
Desirable Skills:
- Relevant certifications (e.g., Azure or AWS);
- Familiarity with microservices architecture and service mesh technologies;
- Experience with configuration management tools (e.g., Ansible, Puppet, Chef);
- Knowledge of database administration and optimization;
- Security best practices and experience with security tools and compliance;
- Strong communication skills and the ability to work collaboratively in a cross-functional environment;
- Prior experience mentoring or leading junior DevOps engineers or SRE team members.