Job Description

Are you excited by the challenge of managing large-scale systems, automating infrastructure, and ensuring seamless service reliability? We’re seeking a Site Reliability Engineer (SRE) to play a key role in shaping the future of our global infrastructure.

Overseeing a global infrastructure of ~10,000 on-prem servers, you’ll tackle unique technical challenges, engineer scalable systems, and have a direct impact on the reliability and performance of our products.

Main Responsibilities

Deliver projects on time: Plan, delegate, execute, and oversee key projects;

Collaborate: Work closely with stakeholders and other teams. Mentor colleagues and lead knowledge transfer;

Ensure quality and reduce technical debt: Deliver solutions with solid design and address blockers, toil, and debt to keep systems healthy;

Drive engineering excellence: Aim for quality and choose the right solution for the problems we face;

Protect solution quality: Ensure designs are implemented with proper quality and minimal tech debt;

Data‑backed decisions: Help teams and stakeholders navigate data and act on insights;

Build reliable infrastructure: Design and maintain highly available, scalable infrastructure with monitoring, alerting, and anomaly detection;

Automate everything: Create and optimize automation to streamline deployments, improve speed, and cut manual work;

Solve complex issues: Troubleshoot, debug, and resolve critical issues in complex systems;

Use AI: Integrate AI into workflows and processes to speed up delivery and reduce toil.

Core Requirements

Observability: Experience with monitoring tools and frameworks to ensure system observability (OpenSearch, VictoriaMetrics, Prometheus, Thanos, Mimir, OpenTelemetry, Nagios);

Databases and storage systems: Experience operating highly available SQL and NoSQL databases and object stores at scale (MySQL, Percona, PostgreSQL, Cassandra, ClickHouse, Timescale, Druid, MinIO);

Data visualization: Ability to build meaningful dashboards that show the right insights (Grafana, OpenSearch Dashboards);

Alerting and anomaly detection: Ability to build anomaly detection and alerting pipelines;

Programming: Proficiency in one or more programming languages for automation scripts and integrations (Python, Go, Rust, C);

Linux: Strong knowledge of Linux systems, especially Debian‑based distributions;

Workflow automation: Ability to use workflow automation frameworks (Airflow, Prefect, n8n);

Configuration management: Ability to design and develop configuration management codebases and deployment pipelines (SaltStack, Ansible, Rundeck);

Networking: Strong understanding of networking protocols and concepts (overlay networks, VPNs, proxies, DNS, HTTP, SSL/TLS, TCP, UDP);

Security: Ability to design secure systems, with working knowledge of security concepts and tools (Vault, PKI, mTLS).

Salary Range

Gross salary: 23,300 to 34,000 PLN per month
