HPC Storage Engineer at RedLine Performance Solutions

RedLine Performance Solutions

📌 United States of America
🕑 January 25, 2021
🏷️ Other
🏷️ Engineering

HPC Storage Engineer

We are located in the Washington, DC area and are looking for an HPC Storage Engineer to join us for our NASA NACS High Performance Computing contract.
U.S. citizenship and the ability to obtain a Public Trust security clearance are mandatory requirements for this position. The position is located at a customer site in Greenbelt, MD. Local candidates are preferred, but the position can be remote for the right candidate. A candidate who works remotely may be required to travel to Greenbelt, MD on a quarterly basis. This position interacts with the Program Manager, Site Lead, customer, and site staff, and attends regularly scheduled customer meetings to keep the customer informed of activities and progress and to answer customer inquiries concerning all aspects of the program. An individual at this skill level should have demonstrated his or her problem-solving ability in the appropriate area of expertise through numerous technical publications and formal technical presentations, and should have some experience mentoring and leading others in small team environments.

Duties and Responsibilities

Design (architect), implement, and troubleshoot large-scale (tens of petabytes) storage systems. This includes developing technical drawings that show all required cables and connectivity to existing systems, and communicating with key stakeholders.
Serve as a GPFS SME for the Discover HPC team as well as other teams running GPFS both within and outside of the immediate organization.
Develop and execute test plans for filesystem upgrades and for resolving issues, potentially by working with vendors.
Resolve user-reported issues (e.g., filesystem, RDMA interconnect, kernel, operating system).
Evaluate and test proposed changes to the Discover supercomputer's production operating environment (e.g., OS patches, kernel parameter changes) and develop upgrade and potential backout plans.
Maintain the Discover Test and Development System (TDS), keeping it as close as reasonably possible to the production system configuration.
Provide 24x7 on-call support as required.

Requirements

Bachelor's degree in Computer Science, Management Information Systems, or other technical discipline plus 5 years of experience, or equivalent.
At least five years of experience as a High-Performance Computing parallel filesystem Storage Administrator, with experience with IBM Spectrum Scale (GPFS), Lustre, or equivalent.
Experience with optimizing for performance, reliability, and security.
In-depth knowledge of HPC parallel filesystems and the ability to troubleshoot complex problems.
Must be comfortable with monitoring and managing clustered filesystems, and be able to examine GPL driver code when required.
Experience with deploying parallel filesystem upgrades in a rolling fashion with no overall system downtime.
In-depth knowledge of the Linux NFS server/client implementation and ability to troubleshoot NFS issues.
In-depth knowledge of SAN technologies (e.g., FC, FCoE, RoCE, NVMe-oF, iSER, SRP) and awareness of high-level protocol function, management approaches, and performance tuning.
Knowledge of Ethernet networking (VLANs, etc.).
Good organizational skills to balance and prioritize work, and ability to multitask.
Good communication skills for working with support personnel, customers, and managers.

Preferred Skills

Experience with debugging issues with the Linux kernel. Ability to produce patches to fix issues, as required.
Experience with applying patches and building custom kernels as required to implement functionality or address security concerns.
Experience deploying and managing large HPC clusters using an image-based cluster management tool such as xCAT.
Knowledge of configuration management tools (e.g., Puppet, CFEngine).
Working knowledge of scripting and programming languages such as bash, csh, tcsh, Perl, Python, Ruby, Fortran, C, C++.
Experience with InfiniBand or OmniPath high-speed fabrics, including subnet management, IPoIB and/or IPoOPA mechanisms, fabric topology and health monitoring, and integration with MPI.
Knowledge of MPI implementations (Intel MPI, MVAPICH2, OpenMPI, HPE/SGI MPT) and troubleshooting MPI application stability and performance problems.
Experience in building, installing, and debugging scientific applications (e.g., MPI, NetCDF, HDF, WRF).
Experience in submitting job scripts to a batch scheduler (ideally Slurm).
Experience with cloud technologies (AWS, Azure, Google Cloud Platform), OpenStack, and Kubernetes.
Experience with GPFS Cluster Export Services, Clustered NFS, GPFS Multi-cluster.
Broad knowledge of distributed file systems and object stores such as Lustre, HDFS, BeeGFS, LizardFS, Gluster, Ceph, Swift.
Experience with revision control via Git.
Familiarity with time-series databases and associated tools (such as InfluxDB, Graphite, Grafana, Elasticsearch, Kibana).
Knowledge of virtualization technologies (particularly QEMU/KVM) and managing large numbers of virtual machines.

Regarding RedLine, we've been in the solutions engineering and development services business for over 21 years and are consistently determined to keep the "bar of excellence" quite high for new hires. This enables us to accomplish what other firms cannot and promotes a high level of staff retention.

Job Expires: February 24, 2021
