Senior Site Reliability Engineer leading AI-native platform operations at a growing B2B generative AI startup, ensuring infrastructure reliability and scalability for its services.
Responsibilities
Design cloud and on‑prem infrastructure, and lead Docker/Kubernetes operations (optimizing autoscaling, rollouts, security).
Develop reliable pipelines (Git/gates/automation) and implement end-to-end observability (SLOs/SLIs/SLAs, logs/metrics/tracing).
Operate microservices (service mesh, resilience patterns) and manage critical data (PostgreSQL HA/tuning).
Manage secrets, access policies, supply chain security and system hardening.
Implement Infrastructure as Code and GitOps (Terraform/Helm/ArgoCD).
Lead incident response and postmortems with data- and AI-driven continuous improvement.
Align with Engineering, Product, Data and ML teams.
Requirements
6+ years in SRE/DevOps/Platform engineering at high scale.
Strong expertise in Kubernetes, Docker, CI/CD, observability (SLOs), PostgreSQL, microservices architecture, security, and experience with IaC and GitOps.
Passion for applying LLMs/AI to operations.
Experience with Node.js/Python, NestJS/React, Git/Cursor, GCP (other clouds a plus), PostgreSQL, Docker/Kubernetes, Terraform/Helm/ArgoCD.
Experience with AI SDKs/LLMs, operational automations (n8n/Crew.ai), vector databases (RAG/pgvector), Kafka/RabbitMQ, FinOps, chaos engineering, SAST/DAST.
Benefits
True autonomy and a highly collaborative environment;
Direct influence on product and team development;
Opportunity to grow with the business from the ground up;
Fixed salary of R$28,000/month (PJ contract), plus a real possibility of stock options.