Site Reliability Engineer improving reliability and availability of Forcepoint products through automation and operational efficiency. Engaging in incident response and collaborating with development teams.
Responsibilities
Monitor, measure and improve the reliability, availability and scalability of Forcepoint products and infrastructure
Engage in incident response and participate in post-mortem analysis to investigate root causes and capture contributing factors for remediation
Perform analytics on previous incidents and trend/usage patterns to better predict issues and take proactive actions
Design and build custom tools as needed to support process optimization, challenge the status quo, and improve operational efficiency
Participate in 24x7 rotational shifts and on-call duty to handle production operations issues
Identify manual routine operational practices and build robust automation capabilities using code and modern tools
Review and create dashboards and reports on application telemetry and infrastructure health to proactively identify performance constraints and bottlenecks
Monitor product performance and availability, and provide feedback to develop, test, and implement robust monitoring, alerting, and logging solutions
Work collaboratively with software developers to promote best practices in reliability and operability, including code reviews and architectural discussions
Work with stakeholders to monitor our products and ensure they meet architecture and observability design requirements
Requirements
Strong understanding of cloud-based architecture and operations
Hands-on experience with Amazon Web Services is preferred
Experience in administration/build/management of Linux systems
Foundational understanding of Infrastructure and Platform Technology stacks
Strong understanding of networking concepts and theories, such as protocols (TCP/IP, UDP, routing protocols, etc.), VLAN configuration, DNS, OSI layers, and load balancing
Understanding of security architecture and certificate management
Working knowledge of infrastructure and application monitoring platforms such as Grafana Cloud, Xymon, LibreNMS, etc.
Working knowledge of incident response and alerting platforms such as PagerDuty, Opsgenie, xMatters, etc.
Understanding of core DevOps practices (CI/CD pipelines, release management, etc.)
Ability to write code in at least one modern programming language (Python, JavaScript, Ruby, etc.)
Additional scripting skills are preferred
Configuration management platform understanding and experience (Chef/Puppet/Ansible)
Prior experience with cloud management automation tools (Terraform, CloudFormation, etc.)
Experience with source code management software and API automation is crucial
Cloud certifications or equivalent experience are highly regarded
Service-availability-oriented mindset with a proactive approach to problem solving
Ability and willingness to challenge the status quo and optimize current procedures and processes
Strong sense of ownership and an ability to drive cross-functional process improvement
Excellent interpersonal, written, and verbal communication skills
Analytical and logical approach to problem solving and a willingness to automate repetitive tasks and reduce manual and reactive workload
Ability and willingness to coach and mentor team members and colleagues
DevOps Engineer responsible for internal tooling and API development to enhance deployment and operational efficiency at Genesys Cloud. Build automation to improve service health and scalability.
Site Reliability Engineer focused on designing and maintaining observability solutions for a fintech company. Collaborating across teams and automating infrastructure for global payment processing.
Azure Security Engineer working on cloud-based security strategies and implementations for Global Payments. Collaborating with teams to enhance the security posture and mitigate risks.
Release Engineer at Air Apps responsible for optimizing release processes and collaborating with cross-functional teams. Focused on smooth, reliable, and efficient application delivery.
DevOps Engineer responsible for maintaining and optimizing infrastructure at Tenet3. Focused on security, automation, and technical operations within a collaborative team environment.
Site Reliability Engineer II at LexisNexis Risk Solutions building Terraform modules and CI/CD pipelines. Responsible for developing cloud infrastructure and ensuring reliability, security, and observability.
DevOps Engineer supporting cloud modernization for the Department of the Air Force on the Cloud One contract. Involved in systems analysis, security practices, and collaboration with engineering teams.
Journeyman Cloud Operations Engineer maintaining cloud infrastructure across DoD organizations. Supporting DevSecOps and ensuring compliance with security requirements in a high-visibility program.
DevOps Engineer managing cloud-native platforms for Capgemini. Collaborating with development, data/ML, and security teams to deliver scalable solutions on Azure.
Head of IT & DevSecOps at JamLoop, managing internal technology and security improvements. Leading strategy and implementation of cloud infrastructure for efficiency and reliability.