Machine Learning Engineer

  • Full Time
  • Anywhere

Bagel Labs

We are Bagel – a frontier research collective engineering the backbone of a decentralized, open-source AI economy.

Role Overview

You will architect and optimize distributed inference systems for large language models. Your focus is on building scalable, fault-tolerant infrastructure that serves open-source models such as Llama and DeepSeek across multiple nodes and regions, with efficient LoRA adaptation support.
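
For a rough flavor of the serving stack involved, here is a minimal sketch of sharded inference with vLLM's offline Python API; the model name and parallelism degree are illustrative assumptions, not a prescribed setup.

```python
# Minimal sketch: tensor-parallel inference of an open-source model with
# vLLM. The model name and parallelism degree are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # shard the weights across 4 GPUs on one node
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain LoRA in two sentences."], params)
print(outputs[0].outputs[0].text)
```

Multi-node, multi-region serving then layers scheduling, routing, and failover on top of building blocks like this.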

Key Responsibilities

– Design and implement distributed inference systems using vLLM across multiple nodes and regions.

– Architect high-availability clusters with automatic failover and load balancing.

– Build monitoring and observability systems for distributed inference (latency, throughput, GPU utilization); see the sketch after this list.

– Integrate open-source models (e.g., DeepSeek) with serving frameworks (e.g., Text Generation Inference) in a distributed setting.

– Design and optimize LoRA adaptation pipelines for efficient model fine-tuning and serving.

– Document designs, review code, and post clear write-ups on blog.bagel.net.
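
As a flavor of the observability work above, here is a minimal sketch of exporting inference metrics with the prometheus_client library; the metric names and label values are illustrative assumptions rather than an existing Bagel schema.

```python
# Minimal sketch: exposing latency/throughput/GPU metrics from an inference
# worker via prometheus_client. Metric names are illustrative assumptions.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "inference_request_latency_seconds", "End-to-end request latency",
    ["model", "region"],
)
TOKENS_GENERATED = Counter(
    "inference_tokens_generated_total", "Total tokens generated",
    ["model", "region"],
)
GPU_UTILIZATION = Gauge(
    "gpu_utilization_ratio", "Fraction of time the GPU was busy", ["gpu"],
)

def record_request(model: str, region: str, latency_s: float, n_tokens: int) -> None:
    # Record one completed inference request.
    REQUEST_LATENCY.labels(model=model, region=region).observe(latency_s)
    TOKENS_GENERATED.labels(model=model, region=region).inc(n_tokens)

if __name__ == "__main__":
    start_http_server(9090)  # scrape target for Prometheus
    while True:  # placeholder loop; a real worker reports per-request data
        record_request("llama-3-70b", "us-east", latency_s=0.42, n_tokens=128)
        GPU_UTILIZATION.labels(gpu="0").set(0.87)
        time.sleep(5)
```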

Who You Might Be

You are extremely curious.

You have a deep understanding of distributed systems and transformer inference. You enjoy architecting scalable infrastructure and optimizing every layer of the serving stack.

You’re excited about making open-source models production-ready at scale and love diving into the internals of distributed model serving frameworks and efficient adaptation techniques.

Required Skills

– At least 5 years of experience with distributed systems and production model serving.

– Hands-on experience with distributed vLLM, Text Generation Inference, or similar frameworks.

– Deep understanding of distributed systems concepts (consistency, availability, partitioning).

– Experience with container orchestration (Kubernetes) and service mesh technologies.

– Proven track record of optimizing distributed inference latency and throughput.

– Experience with GPU profiling and optimization in a distributed setting.

– Strong understanding of LoRA and efficient fine-tuning techniques (a sketch of a typical setup follows this list).
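
To ground the LoRA requirement above, here is a minimal sketch of attaching LoRA adapters to a causal LM with Hugging Face PEFT; the base model, rank, and target modules are illustrative assumptions.

```python
# Minimal sketch: wrapping a causal LM with LoRA adapters via Hugging Face
# PEFT. Base model, rank, and target modules are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base weights
```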

Bonus Skills

– Contributions to open-source distributed model serving frameworks.

– Experience with multi-region deployment and global load balancing.

– Knowledge of distributed model quantization and sharding techniques.

– Experience with dynamic LoRA switching and multi-adapter serving (see the sketch after this list).

– Talks or posts that explain distributed inference optimization in plain language.
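
On the multi-adapter point above, here is a minimal sketch of per-request LoRA routing with vLLM's offline API; the adapter names, ids, and paths are placeholders.

```python
# Minimal sketch: serving several LoRA adapters over one base model with
# vLLM, switching adapters per request. Names and paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True, max_loras=4)

# Each adapter gets a stable integer id; vLLM loads and caches it on demand.
ADAPTERS = {
    "sql": LoRARequest("sql", 1, "/adapters/sql"),
    "chat": LoRARequest("chat", 2, "/adapters/chat"),
}

def generate(prompt: str, adapter: str) -> str:
    outputs = llm.generate(
        [prompt],
        SamplingParams(max_tokens=128),
        lora_request=ADAPTERS[adapter],  # route this request to one adapter
    )
    return outputs[0].outputs[0].text

print(generate("List three failure modes of naive LoRA serving.", adapter="chat"))
```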

What We Offer

– A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.

– Full remote flexibility within North American time zones.

– Ownership of work that can set the direction for decentralized AI.

– Paid travel opportunities to the top ML conferences around the world.

To apply, please visit the following URL: