SRE Metrics Analyst Intern improving system reliability through data collection and analysis. Works with engineering teams to shape metrics strategies for operational excellence.
Responsibilities
Design and implement a comprehensive metrics collection framework that captures key performance indicators (KPIs) related to system reliability and operational efficiency.
Identify relevant metrics and establish methods for collecting, aggregating, and storing data from various sources, including monitoring tools, logs, and databases.
Analyze collected metrics to identify trends, patterns, and anomalies that impact system reliability and performance.
Develop dashboards and visualizations to present data in a clear and actionable manner using tools such as Grafana, Kibana, or Tableau.
Create regular reports on system performance, reliability, incident response times, and other critical metrics for various stakeholders, including technical teams and management.
Provide insights and recommendations based on data analysis to drive continuous improvement initiatives.
Work closely with SRE teams to identify their metric needs and ensure alignment with operational goals.
Collaborate with engineering and operations teams to ensure that metric collection is integrated into development and deployment processes.
Requirements
Enrolled in a degree program in a related major, with a GPA of 3.0 or better
US citizenship required
Ability to obtain and maintain a DoD security clearance
Experience in metrics collection, data analysis, or reporting, preferably in a Site Reliability Engineering or DevOps environment.
Proven experience working with monitoring and observability tools (e.g., Prometheus, Datadog, New Relic).
Strong understanding of key metrics used in site reliability engineering, including SLIs, SLOs, and SLAs.
Proficiency in data analysis tools and languages (e.g., SQL, Python, R) for data manipulation and reporting.
Experience with data visualization tools (e.g., Grafana, Kibana, Tableau) to create dashboards and reports.