Deliver projects on time: Plan, delegate, execute, and oversee key projects;
Collaborate: Work closely with stakeholders and other teams. Mentor colleagues and lead knowledge transfer;
Ensure quality and reduce technical debt: Deliver solutions with solid design and address blockers, toil, and debt to keep systems healthy;
Drive engineering excellence: Aim for quality and choose the right solution for the problems we face;
Protect solution quality: Ensure designs are implemented with proper quality and minimal tech debt;
Data‑backed decisions: Help teams and stakeholders navigate data and act on insights;
Design and maintain highly available, scalable infrastructure with monitoring, alerting, and anomaly detection;
Automate everything: Create and optimize automation to streamline deployments, improve speed, and cut manual work;
Solve complex issues: Troubleshoot, debug, and resolve critical issues in complex systems;
Use AI: Integrate AI into workflows and processes to speed up delivery and reduce toil.
Requirements
Observability: Experience with monitoring tools and frameworks to ensure system observability (OpenSearch, VictoriaMetrics, Prometheus, Thanos, Mimir, OpenTelemetry, Nagios);
Databases and storage systems: Experience operating highly available SQL and NoSQL databases and object stores at scale (MySQL, Percona, PostgreSQL, Cassandra, ClickHouse, Timescale, Druid, MinIO);
Data visualization: Ability to build meaningful dashboards that show the right insights (Grafana, OpenSearch Dashboards);
Alerting and anomaly detection: Ability to build anomaly detection and alerting pipelines;
Programming: Proficiency in one or more programming languages for automation scripts and integrations (Python, Go, Rust, C);
Linux: Strong knowledge of Linux systems, especially Debian‑based distributions;
Workflow automation: Ability to use workflow automation frameworks (Airflow, Prefect, n8n);
Configuration management: Ability to design and develop configuration management codebases and deployment pipelines (SaltStack, Ansible, Rundeck);
Networking: Strong understanding of networking protocols and concepts (overlay networks, VPNs, proxies, DNS, HTTP, SSL/TLS, TCP, UDP);
Security: Ability to design secure systems, with working knowledge of security concepts and tools (Vault, PKI, mTLS).