Job Type: Contract
Contract Length: 6-7 Month Contract (with potential for extension)
Target Start Date: January
Work Location/Structure: Remote (local to the Northeast or Midwest preferred)
About the Opportunity:
Our client, a leader in Academic Research and Higher Education, is looking for a skilled HPC Engineer to join their team for a 6-7 month contract engagement. The project involves scaling and maintaining a critical High-Performance Computing (HPC) ecosystem used by university researchers for parallel processing, AI/ML applications, and large-scale data transfers. This is a high-impact role that requires a self-motivated, seasoned professional who can contribute immediately to the stability and efficiency of a complex, large-scale research computing environment.
Key Responsibilities & Deliverables:
This role is focused on the successful completion of specific tasks and deliverables. Your responsibilities will include:
- Maintain the entire HPC ecosystem, including system specification, provisioning, OS installation (Rocky Linux), and update/change management across approximately 200 Linux systems. This includes login/file-transfer nodes, compute nodes, job scheduling (Slurm), and virtualization (VMware).
- Apply configuration management and security best practices to maintain all systems using Ansible and the Warewulf cluster management system.
- Manage the Globus data transfer software and support the storage team with VAST and TrueNAS storage maintenance. Provide support for data indexing and query tools such as Starburst.
- Maintain and support user-facing HPC web gateways and research tools (e.g., Open OnDemand, Jupyter Notebook/Lab/Hub, FastX, Open XDMoD).
- Respond to outages and urgent system issues, and develop and document continual operational improvements to the HPC system administration service. Assist with vendor management as needed.
Required Skills & Experience:
We are looking for someone with a proven track record of successful contract engagements. The ideal candidate will have:
- 5+ years of experience in a similar role within a large-scale enterprise or research environment, operating at a senior, self-directed level of system administration.
- Deep expertise in Linux systems administration, Ansible, and HPC cluster management tools such as Warewulf and the Slurm job scheduler. This is not a learning role; you need to be a subject matter expert.
- Demonstrated ability to work autonomously and manage your own time effectively to meet project goals and handle critical system issues.
- Experience installing and maintaining common research computing frameworks and software, particularly AI/ML/DL libraries (TensorFlow, PyTorch) and container platforms.
- Familiarity with high-performance storage solutions such as VAST and TrueNAS, and experience with Globus or a strong willingness to learn it quickly.
- Strong communication skills, with the ability to provide clear, concise status updates to the project team and technical guidance on network, storage administration, and data center issues.
- Scripting proficiency in Shell or Python is a plus.