Hybrid Post-Training Research Engineer

Posted last week

About the role

  • As a Post-Training Research Engineer at Baseten, you will develop tooling for efficient AI model training, collaborating across diverse model architectures and systems-level concepts to improve performance in AI applications.

Responsibilities

  • As a research engineer, you will build the in-house tooling that supports our model training.
  • We care about training a wide spectrum of different model architectures with a variety of techniques efficiently and at scale.
  • At times this involves zooming deep into a particular technical topic, but more often it involves working across the stack as a whole: systems-level concepts like Kubernetes, cgroups, storage systems, and networking topologies, as well as PyTorch distributed tensor computation and GPU kernels.

Requirements

  • A deep understanding of modern ML techniques and tools for training transformers
  • Advanced experience in a tensor/array computation library like PyTorch, TensorFlow, JAX, or similar
  • A detailed understanding of transformer training parallelism strategies like data parallelism, sharded data parallelism, tensor parallelism, pipeline parallelism, and context parallelism
  • The experience and knowledge to profile and improve the performance of a distributed GPU program in PyTorch or a similar library
  • The ability to perform roofline analysis on a transformer training setup
  • A willingness to dive into messy problems, work with researchers, derive specifications by asking important questions, and execute
  • Familiarity with HPC and distributed computing platforms like Slurm, Ray, Kubernetes, and Dask
  • Familiarity with cluster networking technology like InfiniBand, RoCE, and GPUDirect
  • Solid fundamentals in operating systems concepts like processes, files, kernel drivers, containerisation, and networking protocols
  • Creativity and a willingness to ask difficult questions about our approach, assumptions, and tooling choices

Benefits

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents
  • Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
  • Paid parental leave
  • Company-facilitated 401(k)
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Job title

Post-Training Research Engineer

Job type

Experience level

Mid level, Senior

Salary

$200,000 - $275,000 per year

Degree requirement

Bachelor's Degree

Location requirements
