RoboFinals Industrial Benchmark
How Early Adopters Are Scaling Model Evaluation for Physical AI
Evaluation Is Becoming the Bottleneck of Physical AI
Robotics foundation models are advancing rapidly. Teams are training increasingly capable policies across diverse tasks, robots, and environments. Yet as these systems scale, one challenge is becoming clear: evaluation is now the primary bottleneck in robotics development.
Many robotics labs are encountering the same pattern: their models have outgrown nearly all existing academic simulation benchmarks. Policies saturate these benchmarks with ease, yet teams still lack reliable ways to measure capability, track progress, or compare approaches at the frontier.
In response, teams often fall back to real-world testing. But unlike autonomous driving, robotics has no “shadow mode” equivalent. Meaningful evaluation requires hundreds of physical setups, continuous maintenance, and strict safety procedures. As models scale, this approach quickly becomes impractical.
Evaluation is no longer a downstream validation step.
Training builds capability. Evaluation defines progress.
For Physical AI to advance systematically, robotics teams need scalable infrastructure to measure real improvements across tasks, robots, and environments.
That is the goal of RoboFinals.
Early Adopters Are Already Using RoboFinals
Across foundation models, humanoid robotics, and industrial automation, RoboFinals is emerging as a shared infrastructure for systematically measuring robot capability.
Examples include:
- Qwen, which uses RoboFinals to run large-scale evaluations of embodied foundation models across diverse tasks and environments.
- Fourier, which evaluates humanoid robot policies under complex interaction scenarios.
- RoboForce, which stress-tests industrial robotic policies before deployment.
- Peritas, which applies RoboFinals for safety-critical validation in medical robotics systems.
For these teams, RoboFinals provides something robotics has long lacked: a scalable and repeatable way to evaluate systems before they reach the real world.

RoboFinals-100: An Industrial Benchmark for Embodied AI
Unlike traditional academic benchmarks that emphasize narrow tasks or simplified environments, RoboFinals-100 focuses on:
- progressive difficulty
- high task diversity
- industry-aligned realism
The benchmark spans multiple real-world domains, including:
- household tasks such as cleaning and organizing
- factory tasks involving part handling and assembly
- retail scenarios including sorting and restocking
RoboFinals-100 also provides comprehensive interaction coverage, including:
- rigid objects
- articulated systems such as cabinets and appliances
- deformable materials including cables, cloth, and liquids
These environments are built on Lightwheel’s SimReady asset ecosystem, ensuring consistent physical behavior across tasks.
The benchmark also supports cross-robot evaluation, enabling models to be tested across:
- tabletop manipulators
- mobile manipulators
- full loco-manipulation systems
This allows robotics teams to evaluate models under conditions closer to real-world deployment.
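As a rough illustration of how this coverage might be expressed programmatically, the sketch below models a benchmark entry as a small Python data structure and filters the suite by domain, embodiment, and difficulty. The `TaskSpec` class, its field names, and the sample entries are hypothetical and do not reflect RoboFinals-100's actual schema or task list.

```python
from dataclasses import dataclass

# Hypothetical task descriptor; RoboFinals-100's real schema may differ.
@dataclass(frozen=True)
class TaskSpec:
    name: str
    domain: str        # "household", "factory", or "retail"
    interaction: str   # "rigid", "articulated", or "deformable"
    embodiment: str    # "tabletop", "mobile", or "loco-manipulation"
    difficulty: int    # progressive difficulty tier, e.g. 1 (easy) to 5 (hard)

# Illustrative entries only; the actual 100-task suite is not reproduced here.
TASKS = [
    TaskSpec("wipe_table", "household", "deformable", "tabletop", 1),
    TaskSpec("open_cabinet_store_item", "household", "articulated", "mobile", 3),
    TaskSpec("insert_part_into_fixture", "factory", "rigid", "tabletop", 4),
    TaskSpec("restock_shelf", "retail", "rigid", "loco-manipulation", 5),
]

def select_tasks(domain=None, embodiment=None, max_difficulty=None):
    """Filter the suite for a targeted evaluation run."""
    return [
        t for t in TASKS
        if (domain is None or t.domain == domain)
        and (embodiment is None or t.embodiment == embodiment)
        and (max_difficulty is None or t.difficulty <= max_difficulty)
    ]

# Example: evaluate a mobile manipulator on household tasks up to tier 3.
print(select_tasks(domain="household", embodiment="mobile", max_difficulty=3))
```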
AutoDataGen: Automated Synthetic Data Generation
To support this benchmarking workflow, Lightwheel developed AutoDataGen, an automated simulation data generation pipeline built on top of NVIDIA Isaac Lab.
Its key capabilities include:
- LLM-driven task decomposition: Decomposes high-level tasks into atomic skills starting from task code, scene configurations, or natural-language descriptions
- Isaac Lab integration: Built as an additional package on top of Isaac Lab, requiring minimal changes to existing projects
- Unified abstraction: Provides a consistent interface for the entire workflow, making it easy to extend and reuse
- Pluggable modules: Custom decomposers, skills, and action adapters can be swapped in through the registration system, as sketched below
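A minimal sketch of how such a registration system might look in Python follows; the `register` decorator, the decomposer classes, and every name here are hypothetical illustrations of the plug-in pattern, not AutoDataGen's actual API.

```python
# Hypothetical plug-in registry illustrating how custom decomposers,
# skills, or action adapters could be swapped in; not AutoDataGen's
# actual interface.
_REGISTRY: dict[str, type] = {}

def register(name: str):
    """Class decorator that records a component under a lookup key."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

@register("llm_decomposer")
class LLMDecomposer:
    def decompose(self, task_description: str) -> list[str]:
        # A real pipeline would call an LLM here; this stub returns
        # a fixed skill sequence for illustration.
        return ["approach_object", "grasp", "place"]

@register("scripted_decomposer")
class ScriptedDecomposer:
    def decompose(self, task_description: str) -> list[str]:
        return ["reach", "grasp", "retract"]

def build(name: str):
    """Instantiate a registered component by name, e.g. from a config file."""
    return _REGISTRY[name]()

# Swapping implementations becomes a one-line configuration change:
print(build("llm_decomposer").decompose("put the mug in the cabinet"))
```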
AutoDataGen also integrates with LW-BenchHub, Lightwheel’s scenario and task library, enabling automatic decomposition and execution of benchmark tasks across diverse environments.
Several robotics teams are already using AutoDataGen to automatically generate action data for benchmark tasks before running RoboFinals evaluations.
Together, these capabilities make AutoDataGen the automated data generation layer that supports large-scale robotics benchmarking workflows.
Infrastructure That Scales: The RoboFinals Evaluation Stack
RoboFinals is built on NVIDIA Isaac Lab-Arena, an open-source framework available on GitHub that provides a collaborative system for large-scale robot policy evaluation and benchmarking in simulation; its evaluation and task layers were designed in close collaboration with Lightwheel. Isaac Lab-Arena decouples three core components of evaluation:
- environments
- robots
- tasks
This modular LEGO-like architecture enables consistent evaluation across diverse robots and simulation setups.
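To make that decoupling concrete, here is a schematic Python sketch in which environments, robots, and tasks are defined independently and paired freely; the class names and `compose` helper are hypothetical and do not mirror Isaac Lab-Arena's actual API.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the three decoupled components;
# Isaac Lab-Arena's real classes and signatures differ.
@dataclass
class Environment:
    name: str  # e.g. "kitchen" or "factory_cell"

@dataclass
class Robot:
    name: str  # e.g. "franka_tabletop" or "humanoid"

@dataclass
class Task:
    name: str  # e.g. "open_drawer"

    def success(self, state: dict) -> bool:
        # Task-specific success check, independent of robot and scene.
        return state.get("drawer_open", False)

def compose(env: Environment, robot: Robot, task: Task) -> str:
    """Pair any environment, robot, and task into one evaluation unit."""
    return f"{task.name} | {robot.name} @ {env.name}"

# The same task definition is reused across embodiments unchanged:
kitchen, task = Environment("kitchen"), Task("open_drawer")
for robot in (Robot("franka_tabletop"), Robot("mobile_manipulator")):
    print(compose(kitchen, robot, task))
```

Because each axis varies independently, adding one new robot immediately yields evaluations against every existing environment and task.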
Lightwheel extends Arena to support:
- complex task logic
- long-horizon workflows
- generalized evaluation protocols for robotics foundation models
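One way to picture long-horizon evaluation is as an ordered chain of subgoal checks, scored by how far a rollout progresses through the chain; the sketch below is a hypothetical illustration of that idea under assumed state dictionaries, not Lightwheel's actual extension code.

```python
# Hypothetical long-horizon scoring: subgoals must be completed in order,
# and partial credit reflects how far along the chain the policy got.
def long_horizon_score(subgoals, state_trace):
    """Return the fraction of ordered subgoals completed in one rollout."""
    completed = 0
    for state in state_trace:
        # Advance past every subgoal this state satisfies, in order.
        while completed < len(subgoals) and subgoals[completed][1](state):
            completed += 1
    return completed / len(subgoals)

subgoals = [
    ("open_cabinet", lambda s: s["cabinet_open"]),
    ("grasp_mug",    lambda s: s["mug_grasped"]),
    ("place_mug",    lambda s: s["mug_in_cabinet"]),
]

# A rollout that opens the cabinet and grasps the mug but never places it:
trace = [
    {"cabinet_open": True, "mug_grasped": False, "mug_in_cabinet": False},
    {"cabinet_open": True, "mug_grasped": True,  "mug_in_cabinet": False},
]
print(long_horizon_score(subgoals, trace))  # 2 of 3 subgoals -> ~0.67
```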
To scale evaluation further, RoboFinals integrates with NVIDIA OSMO, NVIDIA’s orchestration platform for distributed AI workloads.
OSMO automates evaluation workflows by managing experiment execution, task scheduling, and distributed policy rollouts across compute clusters.
Combined with scalable cloud GPU environments, including deployments on Nebius GPU clusters, RoboFinals enables thousands of benchmark episodes to run in parallel.
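OSMO's own workflow definitions are not reproduced here; instead, the sketch below uses Python's standard concurrent.futures to illustrate the fan-out-and-aggregate pattern an orchestrator applies when distributing rollouts, with a stubbed `run_episode` standing in for a real GPU-backed simulation.

```python
import random
from concurrent.futures import ProcessPoolExecutor

# Stand-in for one benchmark episode; in the real stack this would be an
# Isaac Lab rollout scheduled onto a cluster worker by OSMO.
def run_episode(job):
    task, seed = job
    random.seed(seed)
    return task, random.random() > 0.5  # stubbed success/failure outcome

def evaluate(tasks, episodes_per_task=100, workers=8):
    """Fan episodes out across workers and aggregate per-task success rates."""
    jobs = [(task, seed) for task in tasks for seed in range(episodes_per_task)]
    successes = {task: 0 for task in tasks}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for task, success in pool.map(run_episode, jobs):
            successes[task] += success
    return {task: n / episodes_per_task for task, n in successes.items()}

if __name__ == "__main__":
    print(evaluate(["open_drawer", "sort_parts"]))
```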
Together, these components form the RoboFinals Evaluation Stack.
This architecture transforms evaluation from a slow, manual process into automated infrastructure for robotics development.
Toward the ImageNet of Robotics Evaluation
Datasets such as ImageNet provided a common evaluation framework that allowed researchers to measure progress, compare approaches, and identify real breakthroughs.
Physical AI now faces a similar inflection point.
As robotics foundation models scale, the field needs a shared benchmark to measure capability across tasks, robots, and environments. Without this infrastructure, progress becomes difficult to measure and comparisons remain ambiguous.
RoboFinals is designed to fill this gap.
By combining:
- SimReady simulation environments
- RoboFinals-100 benchmark tasks
- NVIDIA Isaac Lab-Arena evaluation framework
- NVIDIA OSMO-powered orchestration
RoboFinals enables systematic, reproducible evaluation at foundation-model scale.
Our ambition is clear:
RoboFinals aims to become the ImageNet of robotics evaluation: a shared infrastructure where robotics teams can measure progress, compare systems, and accelerate the development of Physical AI.
When evaluation becomes standardized, innovation accelerates.