Site Reliability Engineer (NOC)

Job Description

We have an opening for a Site Reliability Engineer to join our team. You will be responsible for maintaining the reliability and uptime of critical services, with a focus on Kubernetes administration, CentOS servers, Java application support, incident management, and change management.

The ideal candidate will have strong skills in ArgoCD for Kubernetes management and in Linux, basic scripting knowledge, and familiarity with modern monitoring, alerting, and automation tools. We are looking for someone who is self-motivated, has excellent oral and written communication skills, and can work both independently and collaboratively.

*The hours for this role are 1:00 PM to 9:00 PM ET.

What you’ll do:

  • Monitor, maintain, and manage applications on CentOS servers, ensuring high availability and performance
  • Conduct routine system and application maintenance tasks. Follow SOPs to correct and prevent issues
  • Respond to and manage active incidents, including running post-mortem meetings, performing root cause analysis, and ensuring timely resolution
  • Monitor production systems, applications, and overall performance
  • Use tools to detect abnormal software behavior and, more importantly, collect information that helps developers understand the cause of the problem
  • Conduct security checks
  • Run meetings with our business partners following in-place processes and procedures
  • Write, update, and maintain policy and procedure documents
  • Write scripts or code as necessary to develop tools and/or services that support the product
  • Learn from post-mortems and prevent new incidents from occurring
  • Perform admin work on various tools and applications such as JIRA and New Relic
  • Maintain service-level objectives (SLOs): specific, quantifiable goals for keeping our “Golden Metrics” within their set parameters

Who you are:

  • 5+ years of experience working in a SaaS and Cloud environment
  • Experience administering Kubernetes clusters, including managing applications with ArgoCD
  • Linux scripting skills to automate routine tasks and improve operational efficiency (required)
  • Experience with database systems such as MySQL and DB2 (required)
  • Experience as a Linux (CentOS / RHEL) administrator is a must
  • Understanding of change management procedures, experience running change management meetings, and the ability to enforce safe and compliant changes to production environments
  • Deep knowledge of on-call responsibilities and strong time management, including maintaining on-call management tools such as xMatters
  • Experience managing deployments with Jenkins
  • Prior experience with monitoring tools, including New Relic, Splunk, and Nagios
  • Experience with log aggregation and visualization tools such as Splunk, Loki, or Grafana
