Software Engineer developing scalable tools and infrastructure for distributed AI applications while optimizing cloud performance. Collaborate with experts to enhance the Anyscale platform.
Responsibilities
Design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments, supporting both VM-based and Kubernetes-based deployments
Optimize control plane components for large-scale, distributed AI/ML workloads
Build intelligent scheduling and resource management systems for heterogeneous compute clusters
Develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads
Support and optimize accelerator integration (e.g., GPUs, TPUs)
Handle container image management and dependency resolution for distributed workloads
Participate in code reviews and in design and architecture discussions
Provide on-call support, working closely with customer and field teams to troubleshoot infrastructure issues
Collaborate with leading distributed systems and machine learning experts to push the boundaries of AI infrastructure
Requirements
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
3+ years of experience writing high-quality production code
Hands-on experience building and maintaining highly available, scalable, and performant distributed systems
Expertise in cloud-native technologies (AWS, Azure, GCP) and Kubernetes-based deployments
Deep understanding of networking, security, and authentication mechanisms in cloud environments
Familiarity with observability stacks (e.g., Prometheus, Grafana)
Proficiency in Go and Python
Knowledge of low-level operating system foundations (Linux kernel, file systems, containers)