AI/ML Data Engineer designing and building scalable data pipelines for personalized experiences at College Board. Collaborating cross-functionally to drive impactful student engagement solutions.
Responsibilities
Design, build, and own batch and streaming ETL (e.g., Kinesis/Kafka → Spark/Glue → Step Functions/Airflow) for training, evaluation, and inference use cases (a minimal orchestration sketch follows this list)
Stand up and maintain offline/online feature stores and embedding pipelines (e.g., S3/Parquet/Iceberg + vector index) with reproducible backfills
Implement data contracts & validation (e.g., Great Expectations/Deequ), schema evolution, and metadata/lineage capture (e.g., OpenLineage/DataHub/Amundsen); a minimal contract check is sketched after this list
Optimize lakehouse/warehouse layouts and partitioning (e.g., Redshift/Athena/Iceberg) for scalable ML and analytics
Productionize training and evaluation datasets with versioning (e.g., DVC/LakeFS) and experiment tracking (e.g., MLflow); see the tracking sketch after this list
Build RAG foundations: document ingestion, chunking, embeddings, retrieval indexing, and quality evaluation (precision@k, faithfulness, latency, and cost); a self-contained sketch follows this list
Collaborate with data scientists to ship models to serving (e.g., SageMaker/EKS/ECS), automate feature backfills, and capture inference data for continuous improvement
Define SLOs and instrument observability across data and model services (freshness, drift/skew, lineage, cost, and performance)
Embed security & privacy by design (PII minimization/redaction, secrets management, access controls), aligning with College Board standards and FERPA
Build CI/CD for data and models with automated testing, quality gates, and safe rollouts (shadow/canary)
Maintain docs-as-code for pipelines, contracts, and runbooks; create internal guides and deliver tech talks
Mentor peers through design reviews, pair/mob sessions, and post-incident learning.
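
By way of illustration, a minimal skeleton of the kind of batch ETL orchestration described in the first bullet, written against the Airflow 2.x API (the schedule argument assumes Airflow 2.4+). The DAG name, paths, and task bodies are hypothetical placeholders, not an actual College Board pipeline.

# Hypothetical batch ETL DAG: extract -> transform -> load
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # e.g., locate raw events that Kinesis Firehose landed in S3
    return "s3://example-bucket/raw/events/"  # hypothetical path

def transform(**context):
    # e.g., trigger a Spark/Glue job that cleans and featurizes the raw events
    pass

def load(**context):
    # e.g., publish curated Parquet/Iceberg partitions for training jobs
    pass

with DAG(
    dag_id="training_data_etl",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> transform_task >> load_task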
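A minimal sketch of the data-contract check referenced above, using the classic (pre-1.0) great_expectations pandas API; the file, columns, and score range are illustrative assumptions.

# Hypothetical contract check; fails the pipeline run on violation.
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.read_parquet("curated_scores.parquet"))  # illustrative file

df.expect_column_values_to_not_be_null("student_id")
df.expect_column_values_to_be_between("scaled_score", min_value=200, max_value=800)
df.expect_column_values_to_be_unique("event_id")

result = df.validate()
if not result.success:
    raise ValueError("data contract violated; aborting downstream tasks")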
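A small sketch of pairing dataset versioning with experiment tracking in MLflow; the experiment name, dataset tag, and metric values are invented for illustration.

# Hypothetical run: record which dataset/feature versions produced which metrics.
import mlflow

mlflow.set_experiment("engagement-model-training")  # illustrative experiment

with mlflow.start_run():
    mlflow.log_param("dataset_version", "v2024.06.01")   # e.g., a DVC/LakeFS tag
    mlflow.log_param("feature_set", "engagement_v3")     # hypothetical feature set
    # ... train and evaluate the model here ...
    mlflow.log_metric("val_auc", 0.87)                   # placeholder metric
    mlflow.log_artifact("training_manifest.json")        # manifest of exact inputs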
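A self-contained sketch of the RAG plumbing named above: fixed-size chunking, embeddings, cosine-similarity retrieval, and precision@k. The embed() function is a stub standing in for a real embedding model (e.g., a Bedrock or SentenceTransformers call); the chunk size, dimensions, and relevance labels are illustrative.

# Chunk -> embed -> retrieve -> evaluate, with a stubbed embedding model.
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(texts: list[str]) -> np.ndarray:
    # Stub: deterministic pseudo-embeddings so the sketch runs end to end.
    seeds = [np.random.default_rng(abs(hash(t)) % 2**32) for t in texts]
    vecs = np.stack([rng.standard_normal(384) for rng in seeds])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors

def retrieve(query: str, index: np.ndarray, k: int = 5) -> list[int]:
    scores = index @ embed([query])[0]          # cosine similarity on unit vectors
    return list(np.argsort(scores)[::-1][:k])   # ids of the top-k chunks

def precision_at_k(retrieved: list[int], relevant: set[int]) -> float:
    return sum(i in relevant for i in retrieved) / len(retrieved)

docs = chunk("Long source document text goes here... " * 100)
index = embed(docs)
hits = retrieve("how do score reports work?", index, k=5)
print(precision_at_k(hits, relevant={0, 3, 7}))  # relevance labels are hypothetical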
Requirements
4+ years in data engineering (or 3+ with substantial ML productionization)
Strong Python and distributed compute (Spark/Glue/Dask) skills
Proven experience shipping ML data systems (training/eval datasets, feature or embedding pipelines, artifact/version management, experiment tracking)
MLOps / LLMOps: orchestration (Airflow/Step Functions), containerization (Docker), and deployment (SageMaker/EKS/ECS); CI/CD for data & models
Expert SQL and data modeling for lakehouse/warehouse (Redshift/Athena/Iceberg), with performance tuning for large datasets (a partitioning sketch follows this list)
Data quality & contracts (Great Expectations/Deequ), lineage/metadata (OpenLineage/DataHub/Amundsen), and drift/skew monitoring (a PSI sketch follows this list)
Cloud experience, preferably with AWS services such as S3, Glue, Lambda, Athena, Bedrock, OpenSearch, API Gateway, DynamoDB, SageMaker, Step Functions, Redshift, and Kinesis
Experience with BI tools such as Tableau, QuickSight, or Looker for real-time analytics and dashboards
Security and privacy mindset; ability to design compliant pipelines handling sensitive student data
Ability to judiciously evaluate the feasibility, fairness, and effectiveness of AI solutions, and to articulate the considerations and concerns around deploying models in specific business applications
Excellent communication, collaboration, and documentation habits.
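
To make the lakehouse-layout requirement concrete, a small pyarrow sketch of a partitioned Parquet layout; the table, columns, and partition keys are hypothetical. Partitioning on low-cardinality query predicates lets engines like Athena or Spark prune files rather than scan the whole table.

# Hypothetical partitioned write: produces school_year=.../region=.../part-*.parquet
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "school_year": [2023, 2023, 2024, 2024],
    "region": ["NE", "SW", "NE", "SW"],
    "score": [1210, 1390, 1180, 1450],
})

pq.write_to_dataset(
    pa.Table.from_pandas(df),
    root_path="warehouse/scores",              # local stand-in for an S3 prefix
    partition_cols=["school_year", "region"],  # prune on these predicates
)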
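And for the drift/skew monitoring requirement, a minimal population stability index (PSI) check between a training and a serving distribution; the ten quantile bins and the 0.2 alert threshold are common rules of thumb, not a mandated standard.

# PSI: compares a serving (actual) feature distribution to its training
# (expected) baseline; larger values mean more drift.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch out-of-range serving values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)          # avoid log(0) in empty bins
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Illustrative usage with synthetic score distributions:
train = np.random.default_rng(0).normal(500, 100, 10_000)
live = np.random.default_rng(1).normal(540, 110, 10_000)
if psi(train, live) > 0.2:
    print("drift alert: investigate feature skew before retraining")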
Data Architect designing end-to-end Snowflake data solutions and collaborating with technical stakeholders at Emerson. Supporting the realization of the Data and Digitalization Strategy.
Manager of Data Engineering leading data assets and infrastructure initiatives at CLA. Collaborating with teams to enforce data quality standards and drive integration efforts.
Data Engineer building modern Data Lake architecture on AWS and implementing scalable ETL/ELT pipelines. Collaborating across teams for analytics and reporting on gaming platforms.
Chief Data Engineer leading Scania’s Commercial Data Engineering team for growing sustainable transport solutions. Focused on data products and pipelines for BI, analytics, and AI.
Entry-Level Data Engineer at GM, focusing on building large-scale data platforms in cloud environments. Collaborating with data engineers and scientists while migrating systems to cloud solutions.
Data Engineer designing and building scalable ETL/ELT pipelines for enterprise-grade analytics solutions. Collaborating with product teams to deliver high-quality, secure, and discoverable data.
Data Engineer responsible for data integrations with AWS technology stack for Adobe's Digital Experience. Collaborating with multiple teams to conceptualize solutions and improve data ecosystem.
People Data Architect designing and managing people data analytics for Gen, delivering actionable insights for HR. Collaborating across teams to enhance data-driven decision-making.
Data Engineer role focused on shaping future connectivity for customers at Vodafone. Involves solving complex challenges in a diverse and inclusive environment.