The Core Logic: A Unified Path to Generalist Robots

The presentation by Physical Intelligence outlines a vision for general-purpose robot foundation models. The approach aims to overcome the fragmented state of robotics today, where each application requires custom hardware and software built from the ground up.

Drawing on the evolution of language models, the proposed path combines diverse data sources with a deliberate training strategy to build robots that generalize across tasks and environments.

Original Video: Chelsea Finn: Building Robots That Can Do Anything

The Recipe for Physical Intelligence

Large-Scale Real Data + Curated High-Quality Demos + Synthetic Data from LLMs -> Generalist Robot Foundation Model

From Fragmentation to Foundation

The Problem: Fragmented Silos

Currently, every robot application requires building an entire company around it, which means starting from scratch for each specific use case.

One Company Per Task

Separate companies for logistics, lab automation, kitchen robots, etc.

Reinventing the Wheel

Each must build new hardware, write custom software, design its own action primitives, and handle edge cases.

Result: Many robot companies fail to bring robots into our daily lives.

The Vision: A General Purpose Model

Physical Intelligence aims to build one general-purpose model that allows any robot to perform any task in any environment.

The Foundation Model Analogy

As with LLMs, a generalist model trained on vast data is more effective than specialized ones.

Intelligence in the Physical World

The goal is to bring this intelligence beyond the digital realm and into our physical reality.

Goal: A single, adaptable model powering a universe of robotic applications.

The Data Dilemma: Scale is Not Enough

While scale is crucial for foundation models, the type and quality of data are equally important in robotics. Each data source has its own strengths and critical limitations.

Industrial Automation Data

Massive scale from repetitive tasks.

Lacks behavioral diversity.

YouTube Data

Vast source of diverse human actions.

Challenging to use because of the embodiment gap between humans and robots.

Simulation Data

Can generate large-scale datasets.

Lacks realism (the reality gap).

Conclusion: Scale is necessary, but not sufficient.

The Laundry Challenge: A Timeline of Breakthroughs

March 2024

Starting Simple

The team began with the simplest task: folding a single, standard-sized T-shirt. The initial model had 100 million parameters and operated at 50 Hz.

June 2024

Increasing Difficulty

The task evolved to folding crumpled shirts, which proved much harder. Early success rates were often 0%. It took until late June for the robot to show initial, slow progress.

For the next 2-3 months, progress stalled as the team introduced more variables, such as taking clothes from a basket and handling different types of garments.

The Key Breakthrough: A New Training Recipe

Inspired by language modeling, the team developed a two-stage training process that unlocked performance.

Pre-train on ALL data -> Fine-tune on CURATED data -> High Performance
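
The ordering of the two stages can be illustrated with a toy example. This is a minimal sketch, not the team's training code: it assumes a one-parameter regression model as a stand-in for a robot policy, and the data, learning rates, and the names `sgd`, `mse`, `all_data`, and `curated` are all hypothetical. It shows only the sequence: broad pre-training first, then a short fine-tuning pass on a small, consistent subset.

```python
import random

def sgd(w, data, lr, epochs):
    """Run plain SGD on squared error for the 1-D model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return w

def mse(w, data):
    """Mean squared error of the model y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Stage 1 data (hypothetical): large and mixed-quality, slopes scattered around 1.5.
all_data = []
for _ in range(500):
    x = rng.uniform(-1.0, 1.0)
    all_data.append((x, (1.5 + rng.gauss(0.0, 0.5)) * x))

# Stage 2 data (hypothetical): small and consistent, every example follows slope 2.0.
curated = [(i / 10, 2.0 * (i / 10)) for i in range(-10, 11)]

w_pre = sgd(0.0, all_data, lr=0.05, epochs=5)  # pre-train on ALL data
w_ft = sgd(w_pre, curated, lr=0.1, epochs=20)  # fine-tune on CURATED data

print(f"curated MSE after pre-training: {mse(w_pre, curated):.4f}")
print(f"curated MSE after fine-tuning:  {mse(w_ft, curated):.6f}")
```

In this toy setting the pre-trained weight lands near the average behavior of the broad data, and the fine-tuning pass pulls it onto the consistent strategy in the curated set.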

September 2024

Scaling Up with PaliGemma

They integrated PaliGemma, an open-source 3 billion parameter Vision Language Model, more than an order of magnitude larger than the initial 100 million parameter model. This model, combined with the new training recipe, significantly boosted performance and consistency.

[Charts: Model Size Comparison; Performance Validation]

Final Results & Generalization

The robot could now fold 5 items in 9 minutes, down from 20 minutes. More importantly, it generalized well to unseen items and could recover from human interruptions.
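
As a quick check on what the timing improvement means per garment, the arithmetic can be worked through as below. This sketch assumes the earlier figure of 20 refers to minutes for the same five items; the variable names are illustrative only.

```python
ITEMS = 5
minutes_before, minutes_after = 20, 9  # assumes both runs fold the same 5 items

seconds_per_item_before = minutes_before * 60 / ITEMS  # 240.0
seconds_per_item_after = minutes_after * 60 / ITEMS    # 108.0
speedup = minutes_before / minutes_after               # about 2.2x

print(seconds_per_item_before, seconds_per_item_after, round(speedup, 2))
```

Under that assumption, the time per item drops from four minutes to under two.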

Key Insight: This pre-training/post-training recipe proved to be a general solution: it was successfully applied to other tasks (clearing tables, making coffee) and even to other companies' robots.

Generalization: From the Lab to the Real World

A key challenge is making robots work in new, unseen environments. The solution lies in collecting highly diverse data.

Diverse Data Collection

Data was gathered in over 100 unique rooms, spanning different homes as well as simulated kitchens and bedrooms.

Improving Language Following

Early models often ignored language commands. Training the model to predict tokenized actions while preventing gradient flow to the VLM backbone sharply improved language following.

Language following rate increased from 20% to 80%.
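
To make "tokenized actions" concrete, here is a minimal sketch of one common way to turn continuous robot actions into discrete tokens. It assumes uniform binning over a normalized range of [-1, 1] with 256 bins per dimension; the talk summary does not say which tokenizer the team uses, so `NUM_BINS`, `tokenize`, and `detokenize` are hypothetical. Blocking gradients into the backbone is framework-specific and is not shown here.

```python
NUM_BINS = 256  # assumed vocabulary size per action dimension

def tokenize(action, low=-1.0, high=1.0, bins=NUM_BINS):
    """Map each continuous action value to an integer bin index in [0, bins - 1]."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)       # clip to the valid range
        frac = (a - low) / (high - low)  # position within the range, 0..1
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens

def detokenize(tokens, low=-1.0, high=1.0, bins=NUM_BINS):
    """Map bin indices back to the center value of each bin."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]

print(tokenize([-1.0, 0.0, 1.0]))  # [0, 128, 255]
```

With this scheme the round-trip error is at most half a bin width, so the discrete tokens can be predicted like text tokens while still recovering a usable continuous command.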

Testing in the Wild

The robot was tested in 3 completely new Airbnb rentals, demonstrating successful generalization.

Closing cabinets
Putting away dishes
Cleaning spills
Making the bed

Current Failure Modes

Success rate is ~80%. Failures include getting stuck, struggling to pick up thin objects, and misidentifying objects (e.g., mistaking an oven for a drawer).

"Hey Robot...": Responding to Open-Ended Prompts

The VLAH Model Architecture

A hierarchical vision-language-action model (VLAH) allows the robot to break down complex prompts into simpler, actionable steps.

User Prompt

"Can you make me a vegan sandwich?"

High-Level Policy

Decomposes the task into atomic commands and verbal responses.

Low-Level Model

Executes atomic commands by predicting joint angles.
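
The division of labor between the two levels can be sketched as follows. This is an illustrative outline only, not the actual architecture: both levels are learned models in the system described, while here `high_level_policy` is a fixed lookup and `low_level_model` returns placeholder values. The `Step` type, the specific plan steps, and the 7-joint arm are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    command: str   # atomic command passed to the low-level model
    say: str = ""  # optional verbal response to the user

def high_level_policy(prompt: str) -> List[Step]:
    """Stand-in for the high-level policy: a fixed plan instead of a learned model."""
    if "vegan sandwich" in prompt.lower():
        return [
            Step("pick up a slice of bread", say="Sure, I'll make a vegan sandwich."),
            Step("put lettuce on the bread"),
            Step("put tomato on the bread"),
            Step("put a slice of bread on top", say="Your sandwich is ready."),
        ]
    return [Step("wait", say="Sorry, I don't know how to do that yet.")]

def low_level_model(command: str, num_joints: int = 7) -> List[float]:
    """Stand-in for the low-level model: returns placeholder joint angles (radians)."""
    return [0.0] * num_joints  # a real model would condition on images and the command

def run(prompt: str) -> List[Tuple[str, List[float]]]:
    """Decompose the prompt, then execute each atomic command in order."""
    trace = []
    for step in high_level_policy(prompt):
        if step.say:
            print(f"robot says: {step.say}")
        trace.append((step.command, low_level_model(step.command)))
    return trace

trace = run("Can you make me a vegan sandwich?")
```

Keeping the interface between the levels as plain-language atomic commands means the high-level policy can be changed or corrected by the user without retraining the low-level controller.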

Solution: Synthetic Data Generation

Because collecting human-robot interaction data at scale is difficult, the team uses LLMs to generate synthetic prompts by "reverse-engineering" existing robot action videos, that is, producing user requests that would plausibly lead to the recorded behavior.
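
One way to picture this reverse-engineering step is sketched below. It is a hypothetical outline rather than the team's pipeline: the prompt `TEMPLATE`, the `make_synthetic_example` helper, and the output field names are invented for illustration, and `fake_llm` returns a canned answer in place of a real model call.

```python
from typing import Callable, Dict, List

TEMPLATE = (
    "A robot video shows this step: '{command}'.\n"
    "Steps already completed: {history}.\n"
    "Write a user request that could have led to this step, "
    "and a short spoken reply from the robot."
)

def make_synthetic_example(
    command: str,
    history: List[str],
    llm: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Pair a recorded atomic command with an LLM-written user prompt and reply."""
    query = TEMPLATE.format(command=command, history=", ".join(history) or "none")
    generated = llm(query)
    return {
        "user_prompt": generated["user_prompt"],
        "robot_reply": generated["robot_reply"],
        "target_command": command,  # label taken from the existing robot data
    }

def fake_llm(query: str) -> Dict[str, str]:
    """Canned response used in place of a real LLM call."""
    return {
        "user_prompt": "Make me a vegan sandwich, but I don't like pickles.",
        "robot_reply": "Okay, adding lettuce next.",
    }

example = make_synthetic_example(
    command="put lettuce on the bread",
    history=["pick up a slice of bread"],
    llm=fake_llm,
)
print(example)
```

The robot behavior itself stays real; only the language context around it is synthesized, which is what lets existing recordings be reused for open-ended prompts.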

Example Interactions

Complex Reasoning

"Make me a vegan sandwich, but I don't like pickles." -> Robot correctly uses lettuce & tomato, omitting cheese, meat, and pickles.

Interruption & Correction

User interrupts: "Get me a sweet that's NOT in the basket." -> Robot pivots from the Kit Kat it was grabbing and gets Skittles instead.

Result: The system vastly outperforms other leading foundation models in instruction following and task progress on robotics tasks.

Industry Outlook & Q&A Highlights

Role of Reinforcement Learning

RL can play a huge role in post-training, using online data from the robot to significantly boost success rates and efficiency.

Funding & Market

Funding is not a challenge. As the technology matures and starts working in the real world, it is attracting significant capital.

Infrastructure Needs

Key needs are real-time systems for low-latency control and large-scale ML infrastructure for training giant, multi-modal models.

Future of Synthetic Data

Real data is irreplaceable and a necessary component. Synthetic/simulated data is especially useful for evaluation in diverse new environments.

Opportunities for Community

Huge opportunities exist in improving robot infrastructure, collecting data, open-sourcing models, and exploring new fine-tuning recipes.

Academia vs. Industry

Academia excels at solving algorithmic problems with limited resources, while industry is suited to large-scale data and model research. Both are vital.