Rackspace

Sr Systems Engineer HPC

Job Description

Job Summary: Rackspace is seeking a highly skilled and motivated HPC Systems Engineer to join our team. In this role you will work directly for one of our flagship clients, designing, implementing, maintaining, and optimizing their high-performance computing (HPC) infrastructure. You will collaborate closely with researchers, scientists, and other engineers to ensure the efficient and reliable operation of the HPC systems.

Work Location: 100% Remote. Because this role supports a customer in the Seattle area, we prefer to hire in either the PST or CST time zone.
 
Travel: There may be minimal travel to either San Antonio, TX or Seattle, WA.


Responsibilities:
  • Install, configure, and maintain HPC clusters, including hardware and software components.
  • Monitor system performance, identify bottlenecks, and implement solutions to optimize performance.
  • Manage user accounts, permissions, and resource allocation.
  • Perform regular system maintenance, updates, and patching.
  • Troubleshoot and resolve hardware and software issues in a timely manner.
  • Participate in the design and planning of HPC infrastructure upgrades and expansions.
  • Evaluate and recommend hardware and software solutions to meet evolving computational needs.
  • Implement and manage storage systems, networking infrastructure, and interconnects (e.g., InfiniBand).
  • Optimize system configurations and application performance for HPC workloads.
  • Profile and analyze application performance to identify areas for improvement.
  • Implement and utilize performance monitoring tools and techniques.
  • Provide technical support and training to HPC users.
  • Collaborate with researchers and scientists to understand their computational requirements.
  • Work closely with HPC architects and engineers to ensure that research needs are met.
  • Document system configurations, procedures, and best practices.
  • Assist HPC engineers and architects with day-to-day operations and ticket management.
  • Implement and maintain security measures to protect HPC infrastructure and data.
  • Ensure compliance with relevant security policies and regulations.
  • Manage data backups and disaster recovery procedures.

Qualifications:
  • Bachelor's degree in computer science, engineering, or a related field. Equivalent experience may substitute for the degree.
  • Minimum of 10 years of experience working with systems; 5 years specifically with HPC.
  • Strong knowledge of Linux operating systems (e.g., Rocky, Ubuntu).
  • Experience with cluster management tools (e.g., Slurm, PBS).
  • Familiarity with high-speed interconnects (e.g., InfiniBand, Ethernet).
  • Knowledge of parallel file systems (e.g., Lustre, Ceph, GPFS).
  • Proficiency in scripting languages (e.g., R, Python, Bash).
  • Understanding of HPC hardware architectures and technologies (e.g., CPUs, GPUs, memory).
  • Strong, demonstrated experience with a major configuration management tool (e.g., Terraform, Ansible), including application packaging and installation.
  • Must have strong knowledge of Linux security and Linux shell scripting.
  • Strong communication and interpersonal skills.
  • Knowledge of data transfer protocols and large-scale storage solutions.