Principal AI Infrastructure Solution Architect (Hybrid)

Posted 2 weeks ago

About the role

  • As a Solution Architect at d-Matrix, you will develop comprehensive AI infrastructure solutions for deployment and collaborate with clients to enable successful integration of d-Matrix-based solutions.

Responsibilities

  • Develop end-to-end AI infrastructure reference solutions optimized for d-Matrix servers including compute, networking, storage, and orchestration layers, in collaboration with various internal teams.
  • Create reference blueprints that integrate smoothly into cloud-native and on-prem environments.
  • Develop infrastructure-as-code templates and examples using Ansible, Terraform, and Helm for provisioning d-Matrix-based nodes and clusters.
  • Integrate with Kubernetes-based systems to enable model deployment, auto-scaling, and fault-tolerant execution.
  • Design and deploy telemetry and monitoring frameworks to support real-time visibility into d-Matrix cluster health, job status, and system performance.
  • Integrate with industry-standard observability stacks (e.g., Prometheus, Grafana, OpenTelemetry) for data collection, visualization, and alerting.
  • Develop dashboards, health check systems, and metric pipelines that track performance, availability, and operational KPIs.
  • Collaborate with performance and software teams to validate infrastructure using real-world workloads and benchmarks.
  • Incorporate telemetry hooks for benchmark reporting and feedback-driven tuning.
  • Create and publish detailed infrastructure deployment guides, monitoring configuration templates, and operational best practices.
  • Collaborate with customers and the OEM/ISV ecosystem, enabling them to adopt and customize reference solutions for their specific datacenter environments and/or software stacks.

Requirements

  • Bachelor's or Master's degree in Computer Science or a related technical field.
  • 10+ years of experience in infrastructure solution architecture, systems management, DevOps, or platform engineering roles.
  • Experience working with GPUs, custom AI accelerators, or heterogeneous compute environments.
  • Proven expertise in building, managing, and monitoring full-stack AI infrastructure at scale.
  • Strong scripting/automation skills: Python, Bash, Ansible, Terraform, Helm, Docker/Kubernetes.
  • Deep understanding of orchestration technologies (e.g., Kubernetes, Ray, KServe), containerization, server clusters, and multi-tenant serving.
  • Experience with observability stacks (e.g., Prometheus, Grafana, OpenTelemetry).
  • Familiarity with model serving and orchestration platforms (e.g., Triton Inference Server, Ray Serve, Kubeflow).
  • Strong system debugging and incident response skills.
  • Outstanding collaboration and communication skills.

Benefits

  • Offers Equity
  • Offers Bonus
  • Medical/Dental/Vision/401k
  • Comprehensive benefits centered around employee wellbeing

Job title

Principal AI Infrastructure Solution Architect

Experience level

Lead

Salary

$175,000 - $260,000 per year

Degree requirement

Bachelor's Degree
