Hybrid Machine Learning Intern – Dynamic KV-Cache Modeling for Efficient LLM Inference

About the role

  • As a Machine Learning Intern at D-Matrix, you will develop a dynamic KV-Cache for efficient LLM inference, working with advanced memory-management techniques.

Responsibilities

  • Research and analyze existing KV-Cache implementations used in LLM inference, particularly those that store past key/value states as lists of PyTorch tensors (see the first sketch after this list).
  • Investigate “Paged Attention” mechanisms that use dedicated CUDA data structures to manage memory efficiently for variable sequence lengths (a toy illustration of the block-table idea follows this list).
  • Design and implement a torch-native dynamic KV-Cache that integrates seamlessly with existing PyTorch models (one possible shape is shown in the third sketch below).
  • Model KV-Cache behavior within the PyTorch compute graph to improve compatibility with torch.compile and to facilitate graph export.
  • Conduct experiments to evaluate memory utilization and inference efficiency on D-Matrix hardware.
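
For context, here is a minimal sketch of the list-of-tensors pattern referenced above (illustrative only; the names and shapes are assumptions, not D-Matrix code). Past keys and values are held as a Python list of per-layer tensor pairs and grown by concatenation at every decode step:

```python
import torch

num_layers, batch, heads, head_dim = 2, 1, 4, 8

# past_key_values[layer] == (keys, values), each of shape
# (batch, heads, seq_len_so_far, head_dim)
past_key_values = [
    (torch.empty(batch, heads, 0, head_dim),
     torch.empty(batch, heads, 0, head_dim))
    for _ in range(num_layers)
]

def append_step(layer: int, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
    """Grow one layer's cache by a single decode step via concatenation.

    Every call reallocates the key/value tensors; this is exactly the
    memory behavior a dynamic, torch-native cache aims to avoid."""
    k, v = past_key_values[layer]
    past_key_values[layer] = (torch.cat([k, k_new], dim=2),
                              torch.cat([v, v_new], dim=2))

for step in range(3):  # simulate three decode steps
    for layer in range(num_layers):
        append_step(layer,
                    torch.randn(batch, heads, 1, head_dim),
                    torch.randn(batch, heads, 1, head_dim))

print(past_key_values[0][0].shape)  # torch.Size([1, 4, 3, 8])
```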
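
Next, a toy illustration of the block-table idea behind Paged Attention (assumptions throughout: fixed block size, a shared physical pool, no eviction; production systems implement this with dedicated CUDA kernels and data structures). The logical KV sequence is split into fixed-size pages, and a per-sequence block table maps logical page indices to physical blocks in the pool:

```python
import torch

BLOCK, HEADS, HEAD_DIM, NUM_BLOCKS = 4, 4, 8, 32

# Shared physical pool: NUM_BLOCKS pages, each holding K and V for BLOCK tokens.
kv_pool = torch.zeros(NUM_BLOCKS, 2, HEADS, BLOCK, HEAD_DIM)
free_blocks = list(range(NUM_BLOCKS))
block_table: list[int] = []  # logical page index -> physical block index

def write_token(pos: int, k: torch.Tensor, v: torch.Tensor) -> None:
    """Store one token's K/V at logical position pos, allocating a fresh
    physical block the first time a logical page is touched."""
    page, offset = divmod(pos, BLOCK)
    if page == len(block_table):      # first token of a new logical page
        block_table.append(free_blocks.pop())
    phys = block_table[page]
    kv_pool[phys, 0, :, offset] = k   # keys
    kv_pool[phys, 1, :, offset] = v   # values

for pos in range(6):  # six tokens span two 4-token pages
    write_token(pos, torch.randn(HEADS, HEAD_DIM), torch.randn(HEADS, HEAD_DIM))

print(len(block_table))  # 2 physical blocks allocated for 6 tokens
```

Because pages are allocated on demand, memory scales with the tokens actually generated rather than with a padded maximum length, which is what makes the scheme attractive for variable sequence lengths.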
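
Finally, one possible shape of a torch-native, graph-friendly cache (an assumption for illustration, not the role's actual design): storage is preallocated to a maximum length and updated in place with index_copy_, so the update is an ordinary tensor op that torch.compile can trace, and shapes stay fixed across decode steps:

```python
import torch

class StaticKVCache(torch.nn.Module):
    """Preallocated K/V storage for one layer, updated in place."""

    def __init__(self, batch: int, heads: int, max_seq_len: int, head_dim: int):
        super().__init__()
        self.register_buffer("k", torch.zeros(batch, heads, max_seq_len, head_dim))
        self.register_buffer("v", torch.zeros(batch, heads, max_seq_len, head_dim))

    def forward(self, pos: torch.Tensor, k_new: torch.Tensor, v_new: torch.Tensor):
        # In-place scatter along the sequence dimension: no reallocation and
        # no shape change, so the compute graph is identical every step.
        self.k.index_copy_(2, pos, k_new)
        self.v.index_copy_(2, pos, v_new)
        return self.k, self.v

cache = StaticKVCache(batch=1, heads=4, max_seq_len=16, head_dim=8)
step = torch.compile(cache)  # the cache update compiles as part of the graph

for t in range(3):
    k, v = step(torch.tensor([t]),
                torch.randn(1, 4, 1, 8),
                torch.randn(1, 4, 1, 8))

print(k.shape)  # torch.Size([1, 4, 16, 8])
```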

Requirements

  • Currently pursuing a degree in Computer Science, Electrical Engineering, Machine Learning, or a related field.
  • Familiarity with PyTorch and deep learning concepts, particularly regarding model optimization and memory management.
  • Understanding of hardware-accelerated computation; hands-on CUDA programming experience is a plus.
  • Strong Python programming skills.
  • Analytical mindset with the ability to approach problems creatively.

Benefits

  • Medical/Dental/Vision/401k
  • Inclusive rewards plan
  • Professional development opportunities

Job title

Machine Learning Intern – Dynamic KV-Cache Modeling for Efficient LLM Inference

Job type

Internship

Experience level

Entry level

Salary

$30 - $59 per hour

Degree requirement

Bachelor's Degree

Tech skills

Python, PyTorch, CUDA

Location requirements

Hybrid
