
Bagel Labs
We are Bagel – a frontier research collective engineering the backbone of a decentralized, open-source AI economy.
Role Overview
You will architect and optimize distributed inference systems for large language models. Your focus is on building scalable, fault-tolerant infrastructure that can serve open-source models such as Llama and DeepSeek across multiple nodes and regions, with efficient LoRA adaptation support.
Key Responsibilities
– Design and implement distributed inference systems using vLLM across multiple nodes and regions (a minimal sketch follows this list).
– Architect high-availability clusters with automatic failover and load balancing.
– Build monitoring and observability systems for distributed inference (latency, throughput, GPU utilization).
– Integrate open-source models and serving frameworks (e.g., DeepSeek models, Text Generation Inference) in a distributed setting.
– Design and optimize LoRA adaptation pipelines for efficient model fine-tuning and serving.
– Document designs, review code, and post clear write-ups on blog.bagel.net.
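To give a concrete flavor of the work, here is a minimal sketch of multi-adapter inference with vLLM. It is an illustration under assumptions, not our production stack: the model name, adapter name, adapter path, and parallelism degree are placeholders, and exact APIs can shift between vLLM releases.

```python
# Minimal sketch: tensor-parallel inference with per-request LoRA adapters in vLLM.
# Model, adapter name, and path below are hypothetical placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Shard the base model across 4 GPUs on one node; multi-node deployments
# additionally use pipeline parallelism over a Ray cluster.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any open-weight model vLLM supports
    tensor_parallel_size=4,
    enable_lora=True,  # accept per-request LoRA adapters
    max_loras=4,       # adapters kept resident at once
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Route this request through one specific adapter; each adapter needs a
# unique integer ID alongside its name and on-disk path.
outputs = llm.generate(
    ["Explain tensor parallelism in two sentences."],
    params,
    lora_request=LoRARequest("example-adapter", 1, "/adapters/example"),
)
print(outputs[0].outputs[0].text)
```

The interesting engineering starts where this sketch ends: running the same model behind vLLM's OpenAI-compatible server, adding failover and load balancing across replicas and regions, and keeping adapter switching cheap under real traffic.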
Who You Might Be
You are extremely curious.
You have a deep understanding of distributed systems and transformer inference. You enjoy architecting scalable infrastructure and optimizing every layer of the serving stack.
You’re excited about making open-source models production-ready at scale and love diving into the internals of distributed model serving frameworks and efficient adaptation techniques.
Required Skills
– At least 5 years of experience with distributed systems and production model serving.
– Hands-on experience with distributed vLLM, Text Generation Inference, or similar frameworks.
– Deep understanding of distributed systems concepts (consistency, availability, partition tolerance).
– Experience with container orchestration (Kubernetes) and service mesh technologies.
– Proven record of optimizing distributed inference latency and throughput.
– Experience with GPU profiling and optimization in a distributed setting.
– Strong understanding of LoRA and efficient fine-tuning techniques.
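For context on the LoRA line above: the core idea is to freeze the pretrained weight matrix and train only a low-rank update, which is what makes per-adapter fine-tuning and serving cheap. In the standard formulation:

```latex
% LoRA: freeze W_0 and learn a rank-r update, with r much smaller than d or k.
W = W_0 + \Delta W = W_0 + B A,
\qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)
```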
Bonus Skills
– Contributions to open-source distributed model serving frameworks.
– Experience with multi-region deployment and global load balancing.
– Knowledge of distributed model quantization and sharding techniques.
– Experience with dynamic LoRA switching and multi-adapter serving.
– Talks or posts that explain distributed inference optimization in plain language.
What We Offer
– A deeply technical culture where bold, frontier ideas are debated, stress-tested, and built.
– Full remote flexibility within North American time zones.
– Ownership of work that can set the direction for decentralized AI.
– Paid travel opportunities to the top ML conferences around the world.