Design, build, and enhance the core components of state-of-the-art machine learning infrastructure (cloud and on-premise), and architect platforms to create, train, and deploy ML models.
Build operational dashboards and charts to track system errors and performance and to enable root cause analysis.
Identify gaps and evaluate relevant tools and technologies as needed to improve processes and systems, leveraging open-source and cloud computing technologies to build effective solutions.
Collaborate with the AI team to drive ML projects from conception to completion and production monitoring.
Requirements
Bachelor's degree or above, with a good academic background.
2-4 years of meaningful work experience in DevOps handling complex services.
Strong troubleshooting skills to keep our services highly available.
Strong expertise and experience with Google Cloud Platform (GCP), Docker, Kubernetes, CI/CD, and Jenkins.
Extensive experience in designing, implementing, and maintaining infrastructure as code, preferably using Terraform.
Experience creating and maintaining deployment manifest files for microservices using Helm.
Having LLMOps or MLOps experience is a bonus.
Strong expertise in deploying at scale on Kubernetes clusters using the Horizontal Pod Autoscaler (HPA).
Broad technical background and experience with the architecture, design, and operation of cloud solutions, including meeting security compliance requirements.
Experience monitoring system health and ensuring security, scalability, and reliability.
Experience designing, implementing, and maintaining observability, monitoring, logging, and alerting using tools like Prometheus, Grafana, Promtail, Loki, and Datadog.
Benefits
Market-leading compensation, based on the candidate's skills and aptitude.