Hybrid Senior Performance and Development Engineer

Posted 2 weeks ago

About the role

  • Build AI models, tools, and frameworks that provide real-time application performance metrics that can be correlated with system metrics.
  • Develop automation frameworks that enable applications to predict and recover from system and infrastructure failures, ensuring fault tolerance.
  • Collaborate with software teams to pinpoint performance bottlenecks.
  • Design, prototype, and integrate solutions that deliver demonstrable performance gains in production environments.
  • Adapt and enhance communication libraries to seamlessly support innovative network topologies and system architectures.
  • Design or adapt optimized storage solutions to boost Deep Learning efficiency, resilience, and developer productivity.

Requirements

  • BS/MS/PhD (or equivalent experience) in Computer Science, Electrical Engineering or a related field.
  • 12+ years of proven experience analyzing and improving the performance of training applications using PyTorch or a similar framework.
  • Experience building distributed software applications using collective communication libraries such as MPI, NCCL, or UCC.
  • Experience constructing storage solutions for Deep Learning applications.
  • Experience building automated, fault-tolerant distributed applications.
  • Experience building tools for bottleneck analysis and automation of fault tolerance in distributed environments.
  • Strong background in parallel programming and distributed systems.
  • Experience analyzing and optimizing large-scale distributed applications.
  • Excellent verbal and written communication skills.

Benefits

  • Equity and benefits

Job title

Senior Performance and Development Engineer

Job type

Experience level

Senior

Salary

$224,000 - $356,500 per year

Degree requirement

Postgraduate Degree

Location requirements
