Director for SRE supporting Fidelity’s growing public cloud presence and delivering reliable runtimes for business critical workloads. Leading diverse technical teams to enhance cloud management capabilities and customer value.
Responsibilities
The Fidelity Enterprise Infrastructure (EI) Production Support team is seeking a Director to help scale our growing public cloud presence.
Fidelity’s Site Reliability Engineers work with our cloud platform teams to deliver reliable runtimes for Fidelity’s business critical workloads.
This team is responsible for cross-cutting cloud management capabilities and is the expert on the state of Fidelity’s cloud platforms at any given moment.
The team comes from diverse technical backgrounds, and the role offers a variety of challenges spanning both software and systems engineering.
Ideal candidates will have a background in either software engineering or systems engineering with a desire to learn the other, or previous experience as an SRE.
The Director for SRE will provide engineering and systems operational support for Business Unit-aligned functions, including Application Support, Cloud Enablement, Helpdesk, Environment Management, Mid-tier & Web Operations, and Platform Engineering.
By demonstrating and promoting Fidelity and agile leadership behaviors, you will evolve and sustain an innovative agile culture.
Our ever-evolving technology stack ensures a phenomenal learning culture in the team.
We are always exploring new technologies and new ways to continually provide value to our customers.
This team has a direct and positive impact on Fidelity’s customers.
Requirements
Ability to automate with various scripting languages (Python, Shell scripting, etc.)
Experience managing systems using infrastructure-as-code tools (IAM, ARM, Terraform, Chef)
Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
Hands-on Kubernetes skills and knowledge.
Hands-on experience with cloud services on AWS and Azure
Experience building resiliency with Chaos Engineering practices
Hands-on experience with one or more observability tools (Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, etc.)
Experience with instrumentation, and systems skills in building and operating monitoring, logging, and alerting services for distributed systems at scale.
Proven experience in maintaining the scalability and resiliency of complex environments.
Proven experience in implementing advanced observability practices and techniques at scale.
Demonstrated ability to utilize modern monitoring tools (Datadog, Prometheus, Splunk)
Ability to triage, execute root cause analysis, and be decisive under pressure.
Experience managing and interpreting large datasets using query languages and visualization tools.
Proficient communication skills with an ability to reach both technical and non-technical audiences.
Ability to learn new software, methods, and practices and bring them to our developers.
Ability to work with a variety of individuals and groups, both in person and virtually, in a constructive and collaborative manner and build and maintain effective relationships.
Bridges the gap between lofty architecture ideas and development of feasible solutions.
Facilitates discussions among component owners to improve end-to-end understanding of transaction paths.
Provides consulting to architects and developers on common patterns and tactical, reusable solutions.
Influences adoption of stability principles by presenting facts and data.
Drives operational readiness discussions and reviews of new solutions and products.
Develops frameworks for self-assessment of applications on various stability and dependability pillars.
Participates, even unsolicited, in discussions and decisions that impact customer experience.
Selectively preserves and shares the collective memory and successes of the past.
Mindset of continuous learning and experimentation.
Instinctive urge to improve current state by finding problems and recommending feasible solutions.
Benefits
Most roles at Fidelity are Hybrid, requiring associates to work onsite every other week (all business days, M-F) in a Fidelity office. This does not apply to Remote or fully Onsite roles.