SRE Metrics Analyst Intern improving system reliability through data collection and analysis. Engaging with engineering teams to shape metrics strategies for operational excellence.
Responsibilities
Design and implement a comprehensive metrics collection framework that captures key performance indicators (KPIs) related to system reliability and operational efficiency.
Identify relevant metrics and establish methods for collecting, aggregating, and storing data from various sources, including monitoring tools, logs, and databases.
Analyze collected metrics to identify trends, patterns, and anomalies that impact system reliability and performance.
Develop dashboards and visualizations to present data in a clear and actionable manner using tools such as Grafana, Kibana, or Tableau.
Create regular reports on system performance, reliability, incident response times, and other critical metrics for various stakeholders, including technical teams and management.
Provide insights and recommendations based on data analysis to drive continuous improvement initiatives.
Work closely with SRE teams to identify their metric needs and ensure alignment with operational goals.
Collaborate with engineering and operations teams to ensure that metric collection is integrated into development and deployment processes.
Requirements
Enrolled in a degree program in a related major, with a GPA of 3.0 or better.
US citizenship required.
Ability to obtain and maintain a DoD security clearance.
Experience in metrics collection, data analysis, or reporting, preferably in a Site Reliability Engineering or DevOps environment.
Proven experience working with monitoring and observability tools (e.g., Prometheus, Datadog, New Relic).
Strong understanding of key metrics used in site reliability engineering, including SLIs, SLOs, and SLAs.
Proficiency in data analysis tools and languages (e.g., SQL, Python, R) for data manipulation and reporting.
Experience with data visualization tools (e.g., Grafana, Kibana, Tableau) to create dashboards and reports.
DevOps Engineer at DATAGROUP in Rostock managing IT applications and cloud technologies. Collaborating with teams to support client IT transformations in a flexible work environment.
SRE Technical Manager leading reliability engineering teams ensuring performance for Navy IT services. Manage teams, collaborate on automation, and drive continuous improvement in a critical systems environment.
DevOps Engineer responsible for optimizing and securing cloud deployment processes at Axi. Collaborating across technology teams to promote best practices in DevOps methodologies.
Azure Cloud Engineer ensuring a safe and scalable cloud environment at Schoologica while contributing to innovative educational solutions with modern cloud technologies.
DevSecOps Engineer responsible for enhancing Thales' secure hosting platforms in public and private clouds. Collaborating with teams to apply modern practices and build resilient infrastructures.
DevOps Engineer specializing in AWS Cloud Infrastructure in a hybrid position. Collaborating within a supportive team to build modern infrastructure for VM-based applications.
Develops high-automation services in Golang or Java within AWS, Kubernetes, and Azure. Supports teams in building secure applications while working in a hybrid environment.
Leading DevOps platform strategy for KIPMI Software's next-generation digital trust products. Collaborating with teams to implement scalable infrastructure and DevSecOps practices.
Join our DevOps team to build and manage GitHub pipelines and cloud-native Azure solutions. Collaborate with teams to drive DevOps best practices and optimize deployments.