Site Reliability Engineer at UFG responsible for the reliability, performance, and scalability of production systems. Leading improvements through automation and collaboration across technology teams.
Responsibilities
Implement tooling to monitor system health, capacity, and performance at all levels, from hardware through the VMs and all the way to the end-user interface.
Work with the production management team to troubleshoot incidents, restore service, and identify root causes.
Recommend architectural and implementation changes to products delivered by development teams based on their performance in test, performance, and production environments.
Support continuous improvement of ITIL processes through automation, data-driven insights, and proactive problem identification.
Document and integrate SRE practices into the ITIL framework, including incident, change, and problem management workflows.
Develop automation for system provisioning, monitoring, deployment, and recovery to reduce manual effort and human error.
Develop and maintain comprehensive runbooks, standard operating procedures (SOPs), and knowledge base articles for recurring operational tasks and incident response actions.
Collaborate with development teams to design resilient architecture and implement best practices for reliability and observability.
Enhance observability by developing and maintaining dashboards, alerts, and performance analytics.
Contribute to capacity planning, performance tuning, and resilience testing to ensure system health.
Develop and update problem management documentation, ensuring known errors and workarounds are captured within the ITSM system.
Manage incident response and participate in on-call rotations to ensure service reliability.
Define, document, and track key reliability metrics (SLIs, SLOs, SLAs) and implement continuous improvement initiatives.
Drive post-incident reviews (PIRs) and develop actionable insights to prevent future occurrences.
Partner with security teams to ensure systems meet compliance, security, and governance standards.
Evaluate and recommend new tools, technologies, and frameworks to improve operational efficiency.
Monitor network systems, servers, and applications.
Use all necessary tools to investigate performance and reliability of systems in testing environments.
Provide detailed and specific guidance on ways to eliminate bottlenecks, improve resilience, and optimize speed and reliability.
Provide mentorship and technical support to other members of Production Management.
Requirements
Bachelor’s degree in Information Technology, Computer Science, or a related field, or equivalent experience
10+ years of experience in progressively more demanding enterprise-scale technology roles
3+ years of experience as a Site Reliability Engineer or Senior DevOps Engineer
3+ years in software development, architecture, or related engineering discipline
Advanced experience with multiple enterprise monitoring and observability tools, including Dynatrace, PRTG, DTrace, SolarWinds, and similar
Complete Windows fluency mandatory; similar strengths in Linux and Unisys Mainframe environments helpful
Excellent problem-solving and communication skills, with the ability to collaborate across cross-functional teams.
Unparalleled understanding of: advanced networking concepts and complete expertise in the entire TCP/IP stack
VM (VMware and Hyper-V) and physical compute performance and tuning, including networking and storage performance
VM (Java, Python, Browser, and similar VM environments) threading, garbage collection, and general performance
SQL Server expertise, including troubleshooting queries, indexes, and general performance
Experience with unstructured database performance
General understanding of LLM/SLM implementations and GPU implementations
Proficiency in automation and scripting languages
Good understanding of ITIL processes (Incident, Change, Problem, and Service Level Management).