AIOps/LLMOps Engineer at Parspec designing and managing AI infrastructure. Helping transform the construction materials supply chain by building AI-powered infrastructure.
Responsibilities
Design and build document AI platforms powered by generative AI, leveraging asynchronous architectures for scalable inference.
Implement event-driven and queue-based systems to support elastic scaling and non-blocking AI workflows.
Architect and maintain self-hosted LLM infrastructure using tools such as vLLM or Ollama on Kubernetes or EC2 with GPU orchestration.
Manage production systems for LLM serving, inference pipelines, and AI workflow orchestration.
Implement LLM gateways and routing systems (e.g., LiteLLM, Portkey) to ensure proper model usage and governance.
Develop guardrails and monitoring systems to reduce hallucinations, misuse, and unsafe outputs in generative AI systems.
Implement end-to-end observability for AI/ML pipelines using distributed tracing and monitoring tools.
Monitor AI system health using platforms such as OpenTelemetry, AWS X-Ray, Prometheus, and Grafana.
Track performance metrics including latency, token usage, inference quality, and model drift.
Manage machine learning workflows using tools such as MLflow, Kubeflow, or SageMaker-managed MLflow.
Enable experiment tracking, model versioning, and deployment pipelines for production AI systems.
Work closely with engineering teams to integrate AI workflows into scalable backend systems.
Implement AI platform security controls including Bedrock Guardrails, KMS encryption, IAM least-privilege policies, VPC endpoints, and CloudTrail auditing.
Optimize AWS infrastructure—including Bedrock, SageMaker, and EKS—for cost efficiency, performance, and reliability.
Ensure production AI systems maintain high availability and security standards.
Requirements
Strong experience with AWS cloud infrastructure including services such as EC2, Lambda, S3, EKS, Bedrock, Step Functions, API Gateway, EventBridge, and SQS/SNS.
Experience building ML infrastructure using Infrastructure-as-Code tools such as Terraform or CloudFormation.
Hands-on experience deploying and operating LLM serving infrastructure using platforms such as vLLM or Text Generation Inference.
Experience managing vector databases and retrieval systems such as Pinecone, pgvector, or Weaviate.
Strong experience designing event-driven or asynchronous systems using queues (SQS, Kafka) and micro-batching patterns.
Experience implementing observability and monitoring for distributed AI systems using tools such as ELK, Prometheus, Grafana, and OpenTelemetry.
Strong programming experience in Python, including frameworks such as FastAPI and asynchronous programming patterns (asyncio).
Experience with Docker, Kubernetes, and CI/CD pipelines using tools such as GitHub Actions or ArgoCD.
5+ years of experience in MLOps, LLMOps, AIOps, or DevOps supporting machine learning or AI systems.
Proven track record building production generative AI systems with high availability and scalability.
Experience deploying self-hosted LLMs on AWS infrastructure and building production-grade document AI platforms.
Experience operating AI systems with >99.9% uptime and cost-efficient infrastructure management.
Benefits
Competitive salary and benefits, including family insurance coverage
Free health teleconsultations
Learning/upskilling budgets
Equity in the company
Flexible hours and a hybrid work setup
Unlimited PTO
Opportunity to grow with a fast-scaling company transforming a large market
Manager overseeing building operations and multiple facilities at Emory University. Interacting with leadership, staff, and vendors for facility management and operations effectiveness.
General Manager overseeing IT operations at Supermicro, leading end-to-end service delivery and operational excellence. Ensuring compliance and managing IT service delivery across the organization.
Mission Operations Lead overseeing complex technical operations across DoD and U.S. Government programs. Providing strategic leadership and operational oversight with a focus on risk management and process optimization.
OpEx Engineer at Morgan Advanced Materials focusing on Lean and Continuous Improvement initiatives. Collaborating across functions to enhance efficiency and reduce costs in manufacturing operations.
Operations Coordinator at Air Apps coordinating daily operations and managing task tracking systems with Jira. Collaborating closely with cross-functional teams to streamline processes in a fast-paced environment.
Manage customer service teams at Humânia, a leader in patient support programs. Ensure operational efficiency and improve customer relations within the healthcare sector.
Executive Assistant - IT Operations role supporting day-to-day operational execution across people and technology functions. Focus on employee lifecycle processes and core IT tasks.
Junior Administrative Analyst at Humânia focusing on operational excellence and data insights for health solutions. Responsible for maintaining global indicators and strategic reporting.
Senior Analytics Partner leading analytics for the operations department at Wemolo, focusing on insights and KPI governance. Collaborating with stakeholders and driving data-driven decisions.