Design and maintain monitoring and alerting solutions for infrastructure, application performance, and user experience.
Implement automation tools and processes for routine tasks, scalable infrastructure, and seamless deployments.
Ensure reliability, availability, and performance of applications and services, minimizing downtime and optimizing response times.
Lead incident response, including identification, triage, resolution, and post-incident analysis.
Conduct capacity planning, performance tuning, and resource optimization in collaboration with development and operations.
Collaborate with security teams to implement best practices, perform vulnerability assessments, and ensure compliance.
Manage deployment pipelines, release processes, and configuration management for consistent, reliable deployments.
Identify and drive improvements in reliability, performance, and efficiency through data and root cause analysis.
Create and maintain documentation, runbooks, and knowledge base articles, promoting knowledge sharing.
Develop and test disaster recovery plans, backup strategies, and failover mechanisms.
Collaborate with development, QA, DevOps, and product teams to align on reliability goals and incident response.
Participate in on-call rotations, providing 24/7 support for critical incidents and coordinating resolution and follow-up.
Requirements
4+ years of hands-on experience with Java, Spring Boot, Hibernate (ORM), JDBC, and Angular.
3+ years working on large-scale, client-facing, enterprise production software.
Proficiency in modern development architectures (web, API), cloud platforms (AWS, Azure, Google Cloud), and infrastructure as code (Terraform, Ansible).
Experience with monitoring and logging tools (Prometheus, Grafana, Datadog, New Relic, Splunk, Sumo Logic, ELK Stack), including building dashboards and alerts.
Skilled in incident management (response, triage, RCA, post-mortem) and troubleshooting complex technical issues.
Proficiency in scripting languages (Python, Bash) and automation tools.
Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
Familiarity with Application Performance Monitoring (APM) and Real User Monitoring (RUM) tools.
Commitment to continuous learning, adaptability, and operational excellence.