Senior Site Reliability Engineer leading AI-native platform operations at a growing B2B generative AI startup, ensuring infrastructure reliability and scalability for its services.
Responsibilities
Design cloud and on‑prem infrastructure, and lead Docker/Kubernetes operations (optimizing autoscaling, rollouts, security).
Develop reliable pipelines (Git/gates/automation) and implement end-to-end observability (SLOs/SLIs/SLAs, logs/metrics/tracing).
Operate microservices (service mesh, resilience patterns) and manage critical data (PostgreSQL HA/tuning).
Manage secrets, access policies, supply chain security and system hardening.
Implement Infrastructure as Code and GitOps (Terraform/Helm/ArgoCD).
Lead incident response and postmortems with data- and AI-driven continuous improvement.
Align with Engineering, Product, Data and ML teams.
Requirements
6+ years in SRE/DevOps/Platform engineering at high scale.
Strong expertise in Kubernetes, Docker, CI/CD, observability (SLOs), PostgreSQL, microservices architecture, security, and experience with IaC and GitOps.
Passion for applying LLMs/AI to operations.
Experience with Node.js/Python, NestJS/React, Git/Cursor, GCP (other clouds a plus), PostgreSQL, Docker/Kubernetes, Terraform/Helm/ArgoCD.
Experience with AI SDKs/LLMs, operational automations (n8n/Crew.ai), vector databases (RAG/pgvector), Kafka/RabbitMQ, FinOps, chaos engineering, SAST/DAST.
Benefits
True autonomy and a highly collaborative environment;
Direct influence on product and team development;
Opportunity to grow with the business from the ground up;
Fixed salary of R$28,000/month (PJ contract) plus a real possibility of stock options.