Post-Training Research Engineer at Baseten developing tooling for efficient AI model training. Collaborating on diverse architectures and systems-level concepts to enhance performance in AI applications.
Responsibilities
Your role as a research engineer is to build the in-house tooling to support all of this.
We care about training a wide spectrum of different model architectures with a variety of techniques efficiently and at scale.
At times this involves zooming deep into a particular technical topic, but more often it involves working across the stack as a whole - systems-level concepts like Kubernetes, cgroups, storage systems, and networking topologies, as well as PyTorch distributed tensor computation and GPU kernels.
Requirements
A deep understanding of modern ML techniques and tools for training transformers
Advanced experience in a tensor/array computation library like PyTorch, TensorFlow, JAX, or similar
A detailed understanding of transformer training parallelism strategies like data parallelism, sharded data parallelism, tensor parallelism, pipeline parallelism, and context parallelism
The experience and knowledge to profile and improve the performance of a distributed GPU program in PyTorch or a similar library
The ability to perform roofline analysis on a transformer training setup
A willingness to dive into messy problems, work with researchers, derive specifications by asking important questions, and execute
Familiarity with HPC and distributed computing platforms like Slurm, Ray, Kubernetes, and Dask
Familiarity with cluster networking technology like InfiniBand, RoCE, and GPUDirect
Solid fundamentals in operating systems concepts like processes, files, kernel drivers, containerisation, and networking protocols
A sense of creativity and willingness to ask difficult questions about our approach, assumptions, and tooling choices.
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Research Engineer developing agentic systems at Anthropic focused on LLMs and AI applications. Collaborating with researchers to enhance agent performance and tackle complex tasks.
System Modelling Innovation Engineer at Electrolux developing advanced product development system models. Enhancing modeling techniques and optimizing product development for better consumer experiences.
R&D Engineer developing estimation and control strategies for Electrolux appliances. Collaborating with global teams to innovate product features and drive sustainability in consumer electronics.
Principal Research Engineer leading engineering activities in behavior autonomy for Scientific Systems. Overseeing critical technology deliverables, team management, and proposal efforts.
Staff Research Engineer involved in creating a neurosymbolic AI agent at Onton. Focused on optimal decision-making processes and addressing challenges in current AI systems.
Research Engineer focusing on decentralized AI training stack for Prime Intellect. Engaging in novel research, optimizing workloads, and contributing to open-source frameworks.
AI Data Innovation Engineer developing and validating AI capabilities tied to governed enterprise data products at U.S. Bank. Collaborating on AI readiness efforts and supporting data product initiatives.
Research Engineer at Yooz, specializing in AI-driven document automation. Collaborating with R&D to develop innovative technologies and enhance document management solutions.
System Test & Research Engineer developing testing protocols and supporting improvements in Precision Agriculture solutions at Topcon. Collaborating with teams to ensure product quality and performance.
Senior Research Engineer developing mechanical designs for engine demonstrators at GKN Aerospace. Leading technology integration and collaborating across engineering disciplines in aeronautics.