Hybrid LLM Ops Engineer

Posted 3 hours ago

About the role

  • As an AIOps/LLMOps Engineer at Parspec, you will design and manage the AI infrastructure behind our document AI platform, helping transform the construction materials supply chain.

Responsibilities

  • Design and build document AI platforms powered by generative AI, leveraging asynchronous architectures for scalable inference.
  • Implement event-driven and queue-based systems to support elastic scaling and non-blocking AI workflows.
  • Architect and maintain self-hosted LLM infrastructure using tools such as vLLM or Ollama on Kubernetes or EC2 with GPU orchestration.
  • Manage production systems for LLM serving, inference pipelines, and AI workflow orchestration.
  • Implement LLM gateways and routing systems (e.g., LiteLLM, Portkey) to ensure proper model usage and governance.
  • Develop guardrails and monitoring systems to reduce hallucinations, misuse, and unsafe outputs in generative AI systems.
  • Implement end-to-end observability for AI/ML pipelines using distributed tracing and monitoring tools.
  • Monitor AI system health using platforms such as OpenTelemetry, AWS X-Ray, Prometheus, and Grafana.
  • Track performance metrics including latency, token usage, inference quality, and model drift.
  • Manage machine learning workflows using tools such as MLflow, Kubeflow, or SageMaker.
  • Enable experiment tracking, model versioning, and deployment pipelines for production AI systems.
  • Work closely with engineering teams to integrate AI workflows into scalable backend systems.
  • Implement AI platform security controls including Bedrock Guardrails, KMS encryption, IAM least-privilege policies, VPC endpoints, and CloudTrail auditing.
  • Optimize AWS infrastructure—including Bedrock, SageMaker, and EKS—for cost efficiency, performance, and reliability.
  • Ensure production AI systems maintain high availability and security standards.

Requirements

  • Strong experience with AWS cloud infrastructure including services such as EC2, Lambda, S3, EKS, Bedrock, Step Functions, API Gateway, EventBridge, and SQS/SNS.
  • Experience building ML infrastructure using Infrastructure-as-Code tools such as Terraform or CloudFormation.
  • Hands-on experience deploying and operating LLM serving infrastructure using platforms such as vLLM or Text Generation Inference.
  • Experience managing vector databases and retrieval systems such as Pinecone, PGVector, or Weaviate.
  • Strong experience designing event-driven or asynchronous systems using queues (SQS, Kafka) and micro-batching patterns.
  • Experience implementing observability and monitoring for distributed AI systems using tools such as ELK, Prometheus, Grafana, and OpenTelemetry.
  • Strong programming experience in Python, including frameworks such as FastAPI and asynchronous programming patterns (asyncio).
  • Experience with Docker, Kubernetes, and CI/CD pipelines using tools such as GitHub Actions or ArgoCD.
  • 5+ years of experience in MLOps, LLMOps, AIOps, or DevOps supporting machine learning or AI systems.
  • Proven track record building production generative AI systems with high availability and scalability.
  • Experience deploying self-hosted LLMs on AWS infrastructure and building production-grade document AI platforms.
  • Experience operating AI systems with >99.9% uptime and cost-efficient infrastructure management.

Benefits

  • Competitive salary and benefits, including family insurance coverage
  • Free health teleconsultations
  • Learning/upskilling budgets
  • Equity in the company
  • Flexible hours and a hybrid work setup
  • Unlimited PTO
  • Opportunity to grow with a fast-scaling company transforming a large market

Job title

LLM Ops Engineer

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

Bachelor's Degree
