Senior AI Engineer responsible for data preparation in foundation model pre-training, serving various German-speaking industries. Collaborating on data quality and processing to enhance model capabilities.
Responsibilities
Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.
Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
Establish data-to-performance signal: Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.
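To make the deduplication step in the responsibilities above concrete, here is a minimal sketch of hash-based exact deduplication over a corpus. The `normalize` rules and the SHA-256 keying are illustrative assumptions, not this team's actual pipeline (which would typically add near-duplicate detection such as MinHash on top):

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each normalized document, hash-keyed."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

At corpus scale the same idea runs distributed, with the digest as the shuffle key; the logic per document is unchanged.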
Requirements
Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.
Ownership mentality: you see problems through from diagnosis to solution to deployment.
Willingness to relocate to Heidelberg or travel at least fortnightly.
Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.
Experience with web-scale data sourcing and crawl processing (e.g., Common Crawl, WARC pipelines).
Rust proficiency (parts of our data pipeline are performance-critical).
Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.
PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).
Bonus, but not required: German language proficiency is helpful for curating and assessing German-language data.
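As a rough illustration of the heuristic scoring mentioned among the data quality methods, the sketch below scores a document on two simple surface signals. The rules and thresholds are assumptions for illustration only; production filters combine many more signals (language ID, boilerplate detection, perplexity, decontamination checks):

```python
def heuristic_quality_score(text: str) -> float:
    """Score a document in [0, 1] using simple surface heuristics.

    Illustrative rules only: word count and alphabetic-character ratio.
    """
    words = text.split()
    if not words:
        return 0.0
    # Penalise very short documents.
    length_ok = 1.0 if len(words) >= 5 else len(words) / 5
    # Low alphabetic ratio suggests markup, tables, or encoding noise.
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return length_ok * alpha_ratio

def passes_filter(text: str, threshold: float = 0.5) -> bool:
    """Binary keep/drop decision for a filtering pipeline stage."""
    return heuristic_quality_score(text) >= threshold
```

In practice such scores are computed per document in a distributed job, and the threshold is tuned per quality tier via ablation runs like those described above.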
Benefits
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Mental health support through nilo.health
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours for better work-life balance and hybrid working model