HPC Engineer/Architect

  • Cloud
  • New York
  • Contract

About the job:
Title: HPC Engineer/Architect
Start Date: Immediate
Position Type: Contract
Location: New York, NY (Hybrid: 2 days a week on-site, 3 days remote)
  
You Will!

  • Support day-to-day operations of large-scale parallel file systems
  • Deploy and maintain Linux HPC infrastructure across multiple datacenters
  • Assist HPC engineers and architects with day-to-day operations and tickets
  
You Have!
  • Experience working in a large-scale, research-based HPC environment
  • Proven experience working with distributed file storage solutions (e.g., GPFS)
  • Experience deploying and troubleshooting Linux operating systems (RHEL/CentOS)
  • Experience with scripting and automation (Ansible, Python, shell scripting)
  • Solid understanding of job schedulers (LSF/Slurm)
  • Experience with GPU-based compute infrastructure (including CUDA)
  
Responsibilities:
  • Design, architect, and oversee implementation of Linux-based HPC clusters and storage
  • Deploy physical hardware using HPC deployment tools and configuration and orchestration tools (Ansible)
  • Parallel file system (GPFS) performance tuning, monitoring and troubleshooting
  • Perform systems benchmarking and develop automated tests for the HPC environment to ensure the reliability and efficiency of our computational infrastructure
  • InfiniBand network maintenance and troubleshooting
  • Automate and monitor the HPC user lifecycle process
  • Slurm installation, configuration, performance tuning and troubleshooting
  • Plan, design and implement a transition from the LSF scheduler to Slurm
  • Manage the Slurm scheduler and translate research policies into scheduler configurations
  • Consult with faculty and students to develop research pipelines for use on the HPC cluster
  • Develop and maintain a user lifecycle software suite in Python and implement a CI/CD pipeline
  • Test and automate upgrades of critical system applications using Ansible and shell scripts
  • Communicate effectively with clinicians, researchers, and other team members to develop technological solutions