DevOps Engineer developing and enhancing machine learning infrastructure. Collaborating with AI teams to support ML projects at an enterprise SaaS startup for contact centers.
Responsibilities
Design, build, and enhance the core components of state-of-the-art machine learning infrastructure (cloud and on-premises), and architect platforms to create, train, and deploy ML models.
Build operational dashboards and charts to track system errors and performance and to enable root cause analysis.
Identify gaps and evaluate relevant tools and technologies as needed to improve processes and systems, leveraging open-source and cloud computing technologies to build effective solutions.
Collaborate with the AI team to drive ML projects from conception to completion and production monitoring.
Requirements
Bachelor's degree or higher with a good academic background.
2-4 years of meaningful work experience in DevOps handling complex services.
Strong troubleshooting skills to keep our services highly available.
Strong expertise and experience with Google Cloud Platform (GCP), Docker, Kubernetes, CI/CD, and Jenkins.
Extensive experience in designing, implementing, and maintaining infrastructure as code, preferably using Terraform.
Experience creating and maintaining deployment manifest files for microservices using Helm.
Having LLMOps or MLOps experience is a bonus.
Strong expertise in deploying at scale on Kubernetes clusters using the Horizontal Pod Autoscaler (HPA).
Broad technical background and experience with the architecture, design, and operations of cloud solutions, including how to meet security compliance requirements.
Experience monitoring system health and ensuring security, scalability, and reliability.
Design, implement, and maintain observability, monitoring, logging, and alerting using tools like Prometheus, Grafana, Promtail, Loki, and Datadog.
Benefits
Market-leading compensation, based on the candidate's skills and aptitude.