About the role

As a Staff Research Engineer at the Chan Zuckerberg Initiative, you will apply and optimize AI/ML models to solve biomedical problems, collaborating with teams to develop and deploy ML models for research and insights.

Responsibilities

Define and execute the long-term vision and roadmap for AI, data, cloud, and security infrastructure, with clear metrics to measure progress and outcomes.
Oversee the design and operation of hybrid GPU compute clusters and ML platforms to support training, fine-tuning, and inference workloads.
Ensure robust, scalable, and efficient data infrastructure and cloud operations to power analytics, ML pipelines, and product needs.
Drive reliability, observability, and cost optimization across GPU-based workloads for development, training, and inference.
Implement modern AI/ML Ops practices (orchestration of model training workloads, reproducibility, automated monitoring) to accelerate research and production workflows, with a focus on continuous delivery and improvement.
Build, mentor, and scale high-performing, multi-disciplinary engineering teams.
Partner with product, research, and executive leadership to align infrastructure with organizational priorities, ensuring delivery is measured against agreed objectives and key results.
Establish policies for infrastructure usage, prioritization, and compliance with regulatory requirements.
Stay ahead of emerging technologies in AI infrastructure, cloud, and security; drive their strategic adoption.

Requirements

15+ years in engineering, including at least 7 years in senior leadership roles managing multi-disciplinary teams and organizations of 30+ employees, with experience leading and developing managers.
Strong knowledge of AI/ML frameworks (e.g., PyTorch) and MLOps tools (e.g., Kubeflow, MLflow, Ray).
Experience managing both traditional cloud platforms (AWS, GCP, Azure) and AI cloud (HPC/GPU clusters).
Deep experience with large-scale data systems, pipelines, and storage technologies.
Track record of improving reliability, observability, and cost efficiency in large-scale systems.
Proven ability to define multi-year infrastructure strategies while delivering on immediate priorities.
Exceptional written and verbal communication skills, capable of engaging technical and non-technical audiences.
Ability to provide clear leadership and momentum in an ambiguous environment: setting direction, aligning teams, and turning uncertainty into forward progress.

Benefits

CZI provides a generous employer match on employee 401(k) contributions to support planning for the future.
An annual benefit that employees can use in whatever way is most meaningful for them and their families, such as housing, student loan repayment, childcare, commuter costs, or other life needs.
CZI Life of Service Gifts are awarded to employees to “live the mission” and support the causes closest to them.
Paid time off to volunteer at an organization of your choice.
Funding for select family-forming benefits.
Relocation support for employees who need assistance moving to the Bay Area.
And more!