Senior SRE Engineer ensuring the reliability and performance of AI products at Plaud, designing scalable systems and leading incident response to improve operational maturity.
Responsibilities
Ensure reliability and performance of Plaud.ai’s AI products at scale
Design and operate highly available, scalable cloud-native systems for AI workloads
Own production reliability, incident response, and on-call practices
Build observability (metrics, logs, tracing) and reliability automation
Define and manage SLOs, SLIs, and error budgets with engineering teams
Drive postmortems and reliability improvements across the platform
Lead incident response and continuous reliability improvement
Partner with product and engineering teams on reliability design
Improve observability and operational maturity
Requirements
8+ years in SRE, Infrastructure, or Platform Engineering roles
Strong experience with cloud platforms (AWS/GCP/Azure)
Hands-on experience with Kubernetes and distributed systems
Experience with on-call rotations and incident management
Proficient in at least one programming language (Go, Python, Java)
Benefits
Meaningful Ownership: An Employee Stock Ownership Plan (ESOP) that gives a real stake in Plaud’s long-term success.
High-Impact Environment: Work in a fast-moving, product-driven environment where your ideas directly shape the future of AI productivity.
Comprehensive Health & Retirement Benefits: Top-tier medical, dental, and vision insurance for employees and dependents, supported by a generous employer subsidy, plus a 401(k) retirement plan with company matching for full-time employees.
Time Off & Workplace Benefits: Unlimited PTO, plus 13 paid holidays, 12 weeks of fully paid parental leave for all parents, a hybrid work model with a minimum of three in-office days per week, and access to high-quality office snacks, drinks, and equipment.
Cutting-Edge AI Tools for Productivity: Access to best-in-class AI tools, including Cursor, GPT models, Gemini, Claude, and other frontier AI systems to maximize engineering and execution efficiency.
Best-in-Class Equipment: Choice of top-spec laptops, high-performance workstation setups, and cutting-edge Plaud devices for all new hires.
Team & Culture: Annual company offsites, team events, and a culture that values craftsmanship, ownership, and velocity.