GPU Engineer - Contract - Remote
Job Description
Our client, part of the UN, is looking to hire a contract GPU Engineer, initially until the end of 2025.
It is a fully remote position.
About the Role
You will conduct comprehensive performance benchmarking, profiling, and tuning of GPU workloads to provide evidence-based recommendations on suitable GPU sharing techniques.
Responsibilities
- Optimize the performance of existing and new applications by leveraging GPU parallelization, identifying bottlenecks, and deploying code- and framework-level improvements.
- Perform a thorough analysis of deployment methods for GPU-accelerated serving frameworks available on the market, delivering reference implementations and best-practice recommendations for large-scale serving solutions (e.g., NVIDIA Triton Inference Server, TensorRT, ONNX Runtime).
- Develop repeatable and automated configuration templates for GPU resources.
- Implement active GPU monitoring, including review and analysis of all relevant metrics (utilization, memory bandwidth, power, temperature, etc.), and establish dashboards and alerts for proactive performance and health management.
- Integrate GPU resource provisioning and configuration into CI/CD pipelines using Infrastructure as Code (IaC) tools (e.g., Terraform, Helm charts), and document workflows for seamless deployment and rollback.
- Document all configurations, testing results, benchmarking analyses, and deployment procedures to ensure transparency and reproducibility.
- Establish active GPU monitoring protocols, including the identification and evaluation of available metrics, to select the most relevant indicators for ongoing performance management.
- Support self-service deployment of Large Language Models (LLMs) on GPU resources, enabling application owners with varying technical expertise to access and utilize GPU capabilities seamlessly.
Qualifications
- Minimum 2 years of hands-on experience in GPU engineering or cloud-based GPU workload optimization, ideally within enterprise or large-scale environments.
- NVIDIA certification (preferred).
Required Skills
- Direct experience with GPU services, including resource provisioning, scaling, and optimization.
- Demonstrable expertise in GPU-accelerated software development (CUDA, OpenCL, TensorRT, PyTorch, TensorFlow, ONNX, etc.).
- Strong background in performance benchmarking, profiling (Nsight, nvprof, or similar tools), and workload tuning.
- Experience with Infrastructure as Code (Terraform, Helm charts, or equivalent) for automated cloud resource management.
- Proven experience designing and implementing CI/CD pipelines for GPU-enabled applications, using tools such as GitHub Actions (preferred) or similar.
- Working knowledge of Kubernetes and GPU scheduling, including setup of GPU-enabled clusters and deployment of GPU workloads in Kubernetes.
- Familiarity with GPU monitoring and observability, using tools such as Prometheus, Grafana, NVIDIA Data Center GPU Manager (DCGM), or custom scripts.
- Proven ability to analyze deployment approaches for GPU-accelerated serving frameworks and deliver reference implementations.
- Experience implementing software quality engineering practices (unit testing, code review, test automation, reproducibility).
- Strong scripting skills in Python, Bash, or PowerShell for automation and monitoring purposes.
- Excellent analytical, problem-solving, and troubleshooting abilities.
- Quick learner, adaptable to evolving requirements and emerging GPU/cloud technologies.
- Positive and collaborative attitude in Agile environments.