Why AI needs a new kind of supercomputer network — the OpenAI Podcast Ep. 18

OpenAI · Beginner · 🧠 Large Language Models · 1w ago
Training frontier models isn't as simple as adding more GPUs: one small problem can bring the whole coordinated dance to a halt. OpenAI's Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company's latest models, keeps the whole system moving in lockstep even with record numbers of GPUs. They break down Multipath Reliable Connection (MRC), a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and explain why they're making it available for the whole industry to use.
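To see why "one small problem" stalls everything, note that synchronous data-parallel training combines gradients from every GPU before any of them can proceed, so each step takes as long as the slowest worker. The sketch below is an illustration only, with made-up numbers, not anything from the episode:

```python
import random

def step_time(num_gpus: int, slow_gpu_factor: float = 1.0) -> float:
    """Time for one synchronized step.

    Every worker must finish before gradients are combined, so the
    step time is the max over all workers' individual step times.
    (Toy model: per-GPU times drawn uniformly; one GPU optionally slowed.)
    """
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    times = [rng.uniform(0.9, 1.1) for _ in range(num_gpus)]
    times[0] *= slow_gpu_factor  # degrade a single worker
    return max(times)

healthy = step_time(1024)                        # everyone near 1.0s
degraded = step_time(1024, slow_gpu_factor=5.0)  # one straggler
print(f"healthy step: {healthy:.2f}s, with one slow GPU: {degraded:.2f}s")
```

The point of the toy model: slowing 1 GPU out of 1024 still multiplies the whole job's step time, which is why the network has to route around failures rather than wait them out.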
Watch on YouTube ↗


Chapters (8)

0:00 Intro
0:39 Greg and Mark's paths to OpenAI
4:34 Why training AI stresses networks differently
10:05 Bottlenecks, failures, and the cost of waiting
15:19 How Multipath Reliable Connection works
18:59 A protocol to route around failures
25:05 Why OpenAI is making MRC an open standard
35:09 Could AI compute move to space?