We are seeking a Senior Software Engineer to maintain, optimize, and troubleshoot our existing systems built on Kubernetes, microservices architecture, Kafka, Go, Python, event-driven systems, data orchestration frameworks, and scalable datastores such as Elasticsearch and Apache Pinot. This role focuses on system stability, performance optimization, and issue resolution, rather than building new systems from scratch. You'll ensure our infrastructure and services remain reliable, secure, and efficient, while proactively identifying and addressing potential risks and bottlenecks. The ideal candidate is a self-sufficient engineer with strong troubleshooting skills, capable of making informed technical decisions and collaborating across teams when needed.
Key Domains
*System Maintenance & Reliability: Ensure uptime, performance, and smooth operation of critical systems.
*Incident Management: Troubleshoot, diagnose, and resolve production issues effectively.
*Event-Driven Systems: Maintain and optimize Kafka pipelines.
*Data Orchestration: Monitor and improve workflows with tools like Prefect or Flyte.
*Search and Analytics Datastores: Maintain and fine-tune Elasticsearch and Apache Pinot clusters.
*Infrastructure Management: Manage Kubernetes deployments, scaling, and operational health.
Responsibilities
*Monitor, troubleshoot, and resolve issues across Kubernetes-based microservices.
*Maintain and optimize Kafka-based event pipelines for reliability and performance.
*Manage and fine-tune Elasticsearch and Apache Pinot clusters for search and analytics workloads.
*Oversee and optimize data orchestration workflows (e.g., Prefect, Flyte).
*Perform root-cause analysis for incidents and implement preventative measures.
*Ensure infrastructure stability and scalability within Kubernetes environments.
*Collaborate with cross-functional teams to address technical debt and operational challenges.
*Review and improve CI/CD pipelines for deployment reliability.
*Document processes, operational runbooks, and troubleshooting guides.
*Proactively identify risks, inefficiencies, and areas for improvement.
Requirements
*5+ years of professional software engineering experience, with significant time spent in system maintenance or reliability-focused roles.
*Proficiency in Go and Python programming languages.
*Strong experience with Kubernetes for container orchestration and management.
*Hands-on experience with Kafka and event-driven architectures.
*Familiarity with Elasticsearch and Apache Pinot for search and analytics.
*Experience with data orchestration tools (e.g., Prefect, Flyte, Airflow).
*Strong understanding of distributed systems design principles and pub-sub patterns.
*Proven track record of troubleshooting complex production issues and implementing long-term fixes.
*Ability to work independently with minimal oversight and prioritize tasks effectively.
*Clear and concise communication skills, including documentation practices.
Bonus Skills
*Experience with observability tools (e.g., Prometheus, Grafana, Datadog).
*Familiarity with IaC tools (e.g., Terraform, Helm).
*Exposure to cloud platforms (AWS, GCP, Azure).
*Previous experience managing legacy systems and resolving technical debt.