Comparing 5 Pioneering Robotics Foundation Models for ML-Based Control

--

Robotics Foundation Models have rapidly advanced in recent years, dramatically improving robots’ perception capabilities and ability to handle long-horizon tasks. Where traditional robot control often required extensive hand-crafted programming — or was simply too complex to be feasible — emerging techniques like Vision-Language-Action (VLA) models and diffusion-based action generation are now enabling robots to demonstrate feats that were previously out of reach.

In this article, I’ll compare 5 frameworks that I believe capture these ideas in distinct ways:

  1. π0 (PI0)
  2. OpenVLA
  3. Octo
  4. RT-2
  5. Diffusion Policy (DP)

All of these are influential machine-learning–based robotics control methods, yet each relies on a different underlying model and action-generation scheme. This article offers a concise reference on their key characteristics, helping us see where robotics foundation models may be headed.

Why These 5 Models?

I chose these particular methods because each one represents a unique intersection of large-scale data, transformer-based architectures, and either discrete or continuous action generation:

  1. π0 (PI0)
    - Key Idea: Combines “PaliGemma” (a pre-trained Vision-Language Model) with a flow-matching technique (closely related to a diffusion process) to output smooth, continuous actions, aligning the data distribution’s “flow” over time.
    - Why It Stands Out: Known for notable zero-shot performance, especially on long-horizon tasks such as folding laundry or making coffee, which has drawn significant attention.
    - Reference: π0 Blog
  2. OpenVLA
    - Key Idea: Builds on LLaMA 2 for language + vision and outputs discrete tokens for robot actions.
    - Why It Stands Out: Excels at multi-task manipulation and can be fine-tuned quickly (e.g., using LoRA) even on consumer GPUs.
    - Reference: OpenVLA Website
  3. Octo
    - Key Idea: Combines a transformer backbone with a diffusion head to generate continuous actions, trained on roughly 800k multi-robot demonstrations.
    - Why It Stands Out: Runs at ~10–15 Hz in real time, handles multi-task scenarios, and supports text or goal-image inputs.
    - Reference: Octo Models
  4. RT-2
    - Key Idea: Adapts large proprietary VLMs (PaLM-E, PaLI-X) to interpret text + images and emit discrete action tokens.
    - Why It Stands Out: Excels at semantic commands (“pick up the extinct animal”).
    - Reference: RT-2 Website
  5. Diffusion Policy (DP)
    - Key Idea: Uses a diffusion-based process to produce continuous action trajectories, generally with one model per task.
    - Why It Stands Out: Strong on precise tasks, often in data-limited settings (e.g., flipping mugs, pouring sauce).
    - Reference: Diffusion Policy
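To make the discrete-token idea behind OpenVLA and RT-2 concrete, here is a toy NumPy sketch (not taken from any of these codebases) of how a continuous action can be quantized into per-dimension tokens a language model could emit, and decoded back. The bin count and the normalized action range are illustrative assumptions, not values from the papers.

```python
import numpy as np

# Assumed setup: actions normalized to [-1, 1], 256 uniform bins per dimension.
N_BINS = 256
LOW, HIGH = -1.0, 1.0

def actions_to_tokens(actions):
    """Quantize continuous actions in [LOW, HIGH] to integer tokens 0..N_BINS-1."""
    clipped = np.clip(actions, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)              # map to [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def tokens_to_actions(tokens):
    """Decode tokens back to bin centers (sub-bin precision is lost)."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

action = np.array([0.13, -0.52, 0.98])   # e.g. a 3-DoF end-effector delta
tokens = actions_to_tokens(action)
recovered = tokens_to_actions(tokens)
# recovered differs from action by at most one bin width (2/256 ≈ 0.0078)
```

The round trip shows the trade-off the discrete approach makes: actions become ordinary vocabulary tokens a VLM can predict, at the cost of quantization error bounded by the bin width.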

Comparison Table

Let me share a short table that outlines how these methods differ in action outputs, command handling, architecture, and control rate.

| Model | Action Output | Command Input | Architecture | Control Rate |
| --- | --- | --- | --- | --- |
| π0 | Continuous (flow matching) | Text + images | PaliGemma VLM + flow matching | ~50 Hz |
| OpenVLA | Discrete action tokens | Text + images | LLaMA 2–based VLA | ~5–15 Hz |
| Octo | Continuous (diffusion) | Text or goal image | Transformer + diffusion head | ~10–15 Hz |
| RT-2 | Discrete action tokens | Text + images | PaLM-E / PaLI-X VLM | ~1 Hz |
| Diffusion Policy | Continuous trajectories | None (task fixed at training) | Diffusion model (per task) | ~5–15 Hz |

Key Observations

  1. Continuous vs. Discrete
    - π0, Octo, DP → continuous trajectories.
    - OpenVLA, RT-2 → discrete tokens for each action step.
  2. Command & Language Input
    - π0, OpenVLA, RT-2 all accept text plus visual context, thanks to large language models.
    - Octo allows text or goal images but doesn’t rely on a big pre-trained LLM.
    - DP typically doesn’t take text commands; it encodes tasks during training.
  3. Real-Time Frequency
    - π0 can reach about 50 Hz, excellent for high-speed manipulation.
    - OpenVLA, Octo, DP average around 5–15 Hz.
    - RT-2 runs at ~1 Hz, mainly due to the massive size of its backbone models.
  4. Long-Horizon Tasks
    - π0 and Octo have demonstrated multi-step chores (laundry, making coffee).
    - OpenVLA, RT-2 mostly handle single-instruction tasks (stacking cups, picking objects).
    - DP can do multiple steps within a single domain but doesn’t chain tasks by text.
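For the continuous side, here is a minimal, self-contained sketch of flow-matching-style sampling in the spirit of π0’s action head. This is not π0’s actual implementation: in the real model the velocity field is a neural network conditioned on observations, whereas here it is a hand-written analytic field (for a straight-line flow toward a stand-in target chunk) so the Euler integration loop is runnable. The action dimension, horizon, and step count are made-up examples.

```python
import numpy as np

ACTION_DIM, HORIZON = 7, 16               # e.g. a 7-DoF arm, 16-step action chunk
target = np.zeros((HORIZON, ACTION_DIM))  # stand-in for the network's prediction

def velocity(x, t, target):
    # For the straight-line flow x_t = (1 - t)*noise + t*target, the
    # conditional velocity is (target - noise) = (target - x_t) / (1 - t).
    # A trained model would regress this field from data instead.
    return (target - x) / max(1.0 - t, 1e-3)

def sample_action_chunk(steps=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(HORIZON, ACTION_DIM))  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t, target)     # Euler step along the flow
    return x

chunk = sample_action_chunk()
# After integrating from t=0 to t≈1, the chunk lies essentially on the target.
```

The same loop structure underlies diffusion-based heads like Octo’s and Diffusion Policy’s: start from noise and iteratively refine a whole trajectory, rather than emitting one token per action dimension.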

Closing Thoughts

Hopefully, this comparison clarifies some of the key trends shaping ML-based robotics. Which approach stands out to you, and why? Feel free to share your perspective or experiences in the comments below!

Telexistence Inc. is a Japan-based robotics startup blending AI-based automation with real-time remote operation, now focusing on robotics AI foundation models. If you’re interested in collaborating or joining us on this journey, reach out to me on LinkedIn.

If you found this article helpful, please share it with colleagues or communities looking to push the boundaries of advanced robotics control!


Genki Sano (Co-founder & CTO, Telexistence Inc.)

Co-founder & CTO of Telexistence Inc. Roboticist building end-to-end systems for AI x Teleoperation x Robotics. https://www.linkedin.com/in/genkisano/
