Senior AI/ML Ops Engineer at Smartsheet responsible for building scalable AI/ML platforms. Collaborating with cross-functional teams to enhance data infrastructure and operational efficiency.
Responsibilities
Design, develop, and oversee the strategy and architecture of scalable and reliable AI/ML Ops platforms and pipelines
Model Deployment: Package and deploy AI/ML services to production, ensuring they are reproducible and interpretable
CI/CD Pipeline Development: Design and implement automated CI/CD (Continuous Integration/Continuous Deployment) pipelines to accelerate model deployment
Infrastructure Management: Provision and optimize infrastructure for training and serving, utilizing Docker, Kubernetes, or serverless platforms
Monitoring & Observability: Implement post-deployment monitoring for model performance, data drift, and latency. Experience with Monte Carlo is preferred
Automation: Automate retraining and data pipeline workflows to ensure models stay accurate over time.
Manage the deployment of foundation models, fine-tuning workflows, and Retrieval-Augmented Generation (RAG) stacks (vector databases, knowledge graphs). Experience with AWS Bedrock is preferred
Resource Optimization: Manage GPU/CPU utilization to minimize cloud costs while maintaining low-latency inference for users
Collaboration: Work closely with data scientists, data engineers, and software engineers to bridge the gap between model development and production.
Version Control & Governance: Manage versioning for data, code, and models using tools like MLflow.
Security & Compliance: Implement data security measures, ensure compliance with data governance policies, and protect sensitive data
Technology Evaluation and Innovation: Stay abreast of emerging data technologies and explore opportunities for innovation to improve the organisation’s data infrastructure
Troubleshooting and Problem Solving: Diagnose and resolve complex data-related issues, ensuring the stability and reliability of the data platform
Perform other duties as assigned
Requirements
Experience with enterprise SaaS software solutions with high availability and scalability
Experience with solutions handling large-scale structured and unstructured data from varied data sources
Experience in building and maintaining AI/ML Ops platform systems ensuring scalability, reliability, efficiency and security
Experience working with the product engineering team to influence designs with data, AI, and analytics use cases in mind
In-depth experience in system design and in AI/ML frameworks and tools involving petabyte-scale data in the Databricks Lakehouse ecosystem
AI/MLOps workflows on Databricks: MLflow, Mosaic AI Agent Framework, Unity Catalog, Vector Search, Knowledge Graph
Knowledge of AI/ML frameworks such as LangChain and LangGraph for AI/ML Ops pipeline integration
Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, or GCP). Experience with an AWS-hosted data platform is preferred
Programming languages such as Python and SQL
Modern software engineering practices and tools such as Kubernetes, CI/CD, IaC (preferably Terraform), observability, monitoring, and alerting
Solution cost optimisation and design-to-cost
Legally eligible to work in India on an ongoing basis.