Data Engineer building data infrastructure at Aldea, a multi-modal AI company. Designing and scaling data pipelines for language and speech domains at trillion+ token scale.
Responsibilities
Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains
Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
Generate synthetic data for model training and evaluation across diverse tasks and domains
Design efficient data loading systems achieving high throughput across multi-node training clusters
Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
Collaborate with ML engineers and researchers to optimize pipelines and improve data quality
Requirements
Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
Experience with data quality techniques including deduplication, filtering, and validation at scale
Proven ability to optimize data pipelines for performance and throughput in distributed systems
Experience working with large datasets (100 GB to 10 TB+) and an understanding of storage systems and data formats
Benefits
Competitive base salary
Performance-based bonus aligned with research and model milestones
Senior Data Engineer driving data intelligence requirements and scalable data solutions for a global consulting firm. Collaborating across functions to enhance Microsoft architecture and analytics capabilities.
Experienced AI Engineer designing and building production-grade agentic AI systems using generative AI and large language models. Collaborating with data engineers and data scientists in a tech-driven company.
Intermediate Data Engineer designing and building data pipelines for travel industry data management. Collaborating across teams to ensure reliable data for analytics and reporting.
Data Engineer managing and organizing datasets for AI models at Walaris, developing AI-driven autonomous systems for defense and security applications.
Data Engineer designing and maintaining data pipelines at Black Semiconductor. Collaborating with process, equipment, and IT teams to support manufacturing analytics and decision-making.
Junior Data Engineer role focusing on Business Intelligence and Big Data at Avanade. Collaborating on data analysis and SQL queries in a supportive learning environment.
GCP Data Engineer designing and developing data processing modules for Ki, an algorithmic insurance carrier. Working closely with multiple teams to optimize data pipelines and reporting.
Data Engineer at Securian Financial optimizing scalable data pipelines for AI and advanced analytics. Collaborating with teams to deliver secure and accessible data solutions.
IT Data Engineering Co‑Op at BlueRock Therapeutics supports development of scientific data systems. Collaboration on data workflows and foundational AWS data engineering tasks.
Data Engineer I building and operationalizing complex data solutions for Travelers' analytics using Databricks. Collaborating within teams to educate end users and support data governance.