Data Engineer building data infrastructure at Aldea, a multi-modal AI company. Designing and scaling data pipelines for language and speech domains at trillion+ token scale.
Responsibilities
Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains
Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
Generate synthetic data for model training and evaluation across diverse tasks and domains
Design efficient data loading systems achieving high throughput across multi-node training clusters
Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
Collaborate with ML engineers and researchers to optimize pipelines and improve data quality
Requirements
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
Experience with data quality techniques including deduplication, filtering, and validation at scale
Proven ability to optimize data pipelines for performance and throughput in distributed systems
Experience working with large datasets (100GB-10TB+) and understanding of storage systems and data formats
Benefits
Competitive base salary
Performance-based bonus aligned with research and model milestones
Senior Finance Data Architect responsible for shaping finance data strategy at Standard Life. Leading enterprise-level data architecture for regulatory reporting and strategic insights.
Cloud Data Engineer at SEB focusing on Customer Relationship Management. Joining the Cloud Data Engineering Team to optimize data pipelines and improve customer insights.
Data Architect designing scalable data architectures for analytics and reporting at XTEL. Collaborating with international teams to ensure data quality and infrastructure improvements.
Lead Enterprise Data Architect building and owning foundational data management capabilities at a technology-driven company. Enhancing data architecture for AI and operational use with strategic leadership and technical expertise.
Associate Data Engineer supporting data engineering projects at The Hartford in Hartford, CT and Charlotte, NC. Engaging in projects that involve data analysis and developing data assets using various technologies.
Senior Data Engineer / Snowflake Architect leading the design and optimization of data solutions. Working closely with clients and internal teams to build scalable architectures in a hybrid environment.
Data Engineer II developing ETL/ELT solutions for a higher education data warehouse. Ensuring reliable institutional data for strategic decision-making by university leaders.
Associate Data Engineer building and maintaining data systems at Incedo. Transforming raw information into accessible datasets for decision-making and collaborating with analysts and data scientists.
Intern Data Engineer at Flutter Studios learning to create applications in the gaming industry. Joining a cross-functional team to implement a basic application in an Agile environment.
AI/Data Engineer at Comcast developing data pipelines and AI solutions for audit processes. Leading team efforts to ensure data quality and compliance with audit objectives across business units.