Data Engineer building data infrastructure at Aldea, a multi-modal AI company. Designing and scaling data pipelines for language and speech domains at trillion+ token scale.
Responsibilities
Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains
Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
Generate synthetic data for model training and evaluation across diverse tasks and domains
Design efficient data loading systems achieving high throughput across multi-node training clusters
Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
Collaborate with ML engineers and researchers to optimize pipelines and improve data quality
Requirements
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
Experience with data quality techniques including deduplication, filtering, and validation at scale
Proven ability to optimize data pipelines for performance and throughput in distributed systems
Experience working with large datasets (100GB-10TB+) and understanding of storage systems and data formats
Benefits
Competitive base salary
Performance-based bonus aligned with research and model milestones
Data Engineer building modern Data Lake architecture on AWS and implementing scalable ETL/ELT pipelines. Collaborating across teams for analytics and reporting on gaming platforms.
Chief Data Engineer leading Scania’s Commercial Data Engineering team for growing sustainable transport solutions. Focused on data products and pipelines for BI, analytics, and AI.
Data Engineer designing and building scalable ETL/ELT pipelines for enterprise-grade analytics solutions. Collaborating with product teams to deliver high-quality, secure, and discoverable data.
Entry-Level Data Engineer at GM, focusing on building large-scale data platforms in cloud environments. Collaborating with data engineers and scientists while migrating systems to cloud solutions.
Data Engineer responsible for data integrations with the AWS technology stack for Adobe's Digital Experience. Collaborating with multiple teams to conceptualize solutions and improve the data ecosystem.
People Data Architect designing and managing people data analytics for Gen, delivering actionable insights for HR. Collaborating across teams to enhance data-driven decision-making.
Data Engineer role focused on shaping future connectivity for customers at Vodafone. Involves solving complex challenges in a diverse and inclusive environment.
VP, Senior Data Engineer responsible for designing and developing cloud data solutions for insider risk in Information Security at SMBC. Collaborating with multiple teams to enhance the cybersecurity data platform.
Data Engineer responsible for architecting, developing, and maintaining Allegiant’s enterprise data infrastructure. Overseeing transition to a cloud-hosted data warehouse and developing next-generation data tools.
Senior Data Engineer developing Azure-based data solutions for clients in the Data & AI department. Collaborating with architects and consultants to enhance automated decision-making.