DevOps Engineer developing and enhancing machine learning infrastructure. Collaborating with AI teams to support ML projects at an enterprise SaaS startup for contact centers.
Responsibilities
Design, build, and enhance the core components of state-of-the-art machine learning infrastructure (cloud and on-premise), and architect platforms to create, train, and deploy ML models.
Build operational dashboards and charts to track system errors and performance and to enable root cause analysis.
Identify gaps and evaluate relevant tools and technologies as needed to improve processes and systems, leveraging open-source and cloud computing technologies to build effective solutions.
Collaborate with the AI team to drive ML projects from conception to completion and production monitoring.
Requirements
Bachelor's degree or higher with a good academic background.
2-4 years of meaningful work experience in DevOps handling complex services.
Strong troubleshooting skills to keep our services highly available.
Strong expertise and experience with Google Cloud Platform (GCP), Docker, Kubernetes, CI/CD, and Jenkins.
Extensive experience in designing, implementing, and maintaining infrastructure as code, preferably using Terraform.
Experience creating and maintaining deployment manifest files for microservices using Helm.
Having LLMOps or MLOps experience is a bonus.
Strong expertise in deploying at scale on Kubernetes clusters using the Horizontal Pod Autoscaler (HPA).
Broad technical background and experience with the architecture, design, and operations of cloud solutions, including how to meet security compliance requirements.
Experience monitoring system health and ensuring security, scalability, and reliability.
Experience designing, implementing, and maintaining observability, monitoring, logging, and alerting using tools like Prometheus, Grafana, Promtail, Loki, and Datadog.
Benefits
Market-leading compensation, based on the candidate's skills and aptitude.
Senior DevOps Engineer supporting enterprise-grade Kubernetes infrastructure and CI/CD automation for U.S. Army projects. Engaging in critical system designs and automation processes with a focus on cloud-based platforms.
Reliability Engineer focusing on mechanical systems in a long-standing Australian FMCG company. Ensure ongoing reliability improvements and support plant operations for iconic cereal production.
Software Engineer 2 developing full-stack solutions for U.S. Bank. Collaborating with teams to design and maintain best-in-class software experiences.
Principal Software Engineer at FIS driving reliability and performance in fintech environments. Collaborating across teams for high-scale, high-reliability solutions in the finance sector.
Senior Software Development Engineer involved in automation testing at CVS Health. Designing, developing, and implementing automated testing solutions in a collaborative environment.
Senior Site Reliability Engineer focusing on reliability and operational excellence of workflow orchestration platforms like Apache Airflow. Engaging in operations and engineering projects in a hybrid setup.
Senior Site Reliability Engineer for observability platforms at Dimensional, ensuring reliability and scaling the infrastructure. Collaborating with teams on operations and engineering projects.
Senior Staff Reliability Engineer for the humanoid robotics team ensuring performance and safety standards. Leading reliability engineering initiatives and mentoring within the engineering team.
Reliability Engineer at Air Liquide optimizing maintenance strategies, ensuring equipment uptime across multiple sites in the United States. Collaborating with teams for continuous improvement and operational excellence.
Senior Azure Engineer at Capgemini responsible for building, operating, and optimizing cloud-native platforms. Collaborating with teams to ensure reliability, performance, and security for critical workloads.