Research Engineer developing infrastructure for Aldea's multi-modal AI research team. Building systems that support rapid experimentation at billion-parameter scale in language and speech domains.
Responsibilities
Build and maintain distributed training infrastructure that supports researchers training billion-plus-parameter models in the language and speech domains.
Optimize training and inference performance across the stack, delivering significant speedups through framework optimization, custom kernels, and system-level improvements.
Design experiment infrastructure including automated evaluation pipelines, experiment tracking, and monitoring systems that enable rapid iteration.
Scale infrastructure from single-node to multi-node distributed training and deploy production inference systems for real-time applications.
Support researchers with fast turnaround on infrastructure issues and maintain high reliability across all systems.
Collaborate with research scientists, data engineers, and leadership to define technical priorities and infrastructure roadmap.
Requirements
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience.
3+ years of experience with PyTorch and distributed training frameworks (DDP, FSDP, DeepSpeed, or similar).
Experience training large-scale deep learning models with 1B+ parameters.
Deep understanding of training optimization techniques including mixed precision, gradient checkpointing, and memory management.
Proven ability to build production-grade ML infrastructure with high reliability.
Track record of delivering significant performance optimizations in ML training or inference systems.
Benefits
Competitive base salary
Performance-based bonus aligned with research and model milestones