About the role

  • Own pre- and post-launch performance: plan, execute, and sustain performance validation, debugging, and optimization for adapters, switches, and fabric software—first in lab, then at scale in production
  • Lead performance work for post-silicon bring-up and validation of networking ASICs and end products (adapters, switches, etc.), driving optimization and characterization against networking metrics and application performance
  • Deliver white-glove customer support at scale: reproduce field issues, co-debug in shared/onsite labs, land mitigations and durable fixes, and publish per-customer tuning guides
  • Pathfind and optimize forward-looking workloads: drive research and enablement for AI inference (QPS, P99/P99.9, cost/throughput), distributed AI training (NCCL/RCCL collectives), and traditional HPC (manufacturing, life sciences, climate)
  • Multi-fabric research & enablement: evaluate and tune Cornelis/Omni-Path, Ethernet/RoCEv2, and InfiniBand across topologies (Clos/fat-tree/dragonfly), routing (ECMP/adaptive), and congestion control (credit, PFC/ECN/DCQCN)
  • Explore platform designs & tunings end-to-end: CPU/GPU NUMA placement, PCIe/GPU-Direct, BIOS/firmware, PTP/1588, switch/NIC QoS & scheduling, queue depths, microburst tolerance, ECN mark rates, retransmits, fairness
  • Design credible experiments: synthesize representative traffic, replay workload traces, and run on-cluster A/B tests with statistically sound comparisons (P50/P90/P99)

Requirements

  • 10+ years in performance engineering, post-silicon/perf validation, or systems performance for high-speed networking or HPC/AI products
  • Post-silicon expertise: hands-on bring-up and performance validation of networking ASICs/systems (adapters, switches)
  • Demonstrated depth in networking hardware (switch/silicon) and software debug for performance tuning and issue resolution across production-scale deployments
  • Hands-on multi-fabric experience: Cornelis/Omni-Path, Ethernet/RoCEv2, and/or InfiniBand
  • Strong grasp of PCIe/GPU-Direct, queueing/QoS, and congestion control (credit, PFC, ECN, DCQCN)
  • AI/HPC workload fluency: NCCL/RCCL collectives, UCX/libfabric/MPI
  • Ability to optimize end-to-end training and inference (throughput, QPS, tail latency, efficiency) on real clusters
  • Experimentation & analysis: workload modeling, on-cluster A/B tests, tail-latency analysis (P50/P90/P99)
  • Automation: Python + Linux; data pipelines, dashboards, and CI hooks to prevent performance regressions
  • Excellent cross-functional communication; leads without authority and drives fixes across architecture, firmware, driver, and fabric software teams
  • BS/MS in CE/EE/CS (or equivalent experience)
  • Preferred: Experience supporting customer-facing performance optimization or field application engineering
  • Preferred: Built or led aspects of a white-glove performance support program; mentored engineers and scaled best practices via playbooks and labs
  • Preferred: Inference-stack familiarity (e.g., NVIDIA Triton, TensorRT-LLM, vLLM) incl. batching, KV-cache, and MIG/MPS trade-offs
  • Preferred: Benchmarking background: MLPerf exposure; HPC app tuning (e.g., LS-Dyna, Fluent, OpenFOAM, GROMACS) and OSU/MPI microbenchmarks
  • Preferred: Contributions to UCX, libfabric, NCCL/RCCL, or kernel networking; comfort with eBPF/perf/tcpdump and detailed switch/NIC telemetry
  • Preferred: Deep understanding of networking and memory data flows, including technologies such as DPDK, RDMA, or similar high-performance I/O frameworks

Benefits

  • health and retirement benefits
  • medical, dental, and vision coverage
  • disability and life insurance
  • dependent care flexible spending account
  • accidental injury insurance
  • pet insurance
  • generous paid holidays
  • 401(k) with company match
  • Open Time Off (OTO) for regular full-time exempt employees
  • sick time, bonding leave, and pregnancy disability leave

Job title

Principal Performance Engineer

Job type

Experience level

Lead

Salary

Not specified

Degree requirement

Bachelor's Degree

Tech skills

Location requirements
