HPC Engineer/Architect
- Cloud
- New York
- Contract
About the job:
Title: HPC Engineer/Architect
Start Date: Immediate
Position Type: Contract
Location: New York, NY (Hybrid: 2 days a week on-site, 3 days remote)
You Will!
- Support day-to-day operations of large-scale parallel file systems
- Deploy and maintain Linux HPC infrastructure across multiple datacenters
- Assist HPC engineers and architects with day-to-day operations and tickets
You Have!
- Experience working in a large-scale, research-based HPC environment
- Proven experience working with distributed file storage solutions (e.g., GPFS)
- Experience deploying and troubleshooting Linux operating systems (RHEL/CentOS)
- Experience with scripting and automation (Ansible, Python, shell scripting)
- Solid understanding of job schedulers (LSF/Slurm)
- Experience with GPU-based compute infrastructure (including CUDA)
Responsibilities:
- Design, architect and oversee implementation of Linux-based HPC clusters and storage
- Deploy physical hardware using HPC deployment tools and configuration and orchestration tools such as Ansible
- Parallel file system (GPFS) performance tuning, monitoring and troubleshooting
- Perform systems benchmarking and develop automated tests for the HPC environment to ensure the reliability and efficiency of the computational infrastructure
- InfiniBand network maintenance and troubleshooting
- Automate and monitor the HPC user lifecycle process
- Slurm installation, configuration, performance tuning and troubleshooting
- Plan, design and implement a transition from the LSF scheduler to Slurm
- Manage the Slurm scheduler and translate research policies into scheduler configurations
- Consult with faculty and students to develop research pipelines for use on the HPC cluster
- Develop and maintain the user lifecycle software suite in Python and implement its CI/CD pipeline
- Test and automate upgrades of critical system applications using Ansible and shell scripts
- Communicate effectively with clinicians, researchers, and other team members to develop technological solutions