AMD Open Robotics Hackathon

December 20, 2025

In mid-December 2025, I spent 72 hours immersed at Station F for the "AMD Open Robotics Hackathon". The timing couldn't have been better: we are right at the inflection point of "physical intelligence" applied to robotics.

The goal was to prototype a real use case in record time, leveraging open-hardware robot arms and AMD's computing power. We finished 3rd, but beyond the podium, what I mainly take away is a deeper understanding of what is currently at stake in the field.

The Setup: Leader, Follower and LeRobot

The hackathon provided an ideal environment for imitation learning with Hugging Face's LeRobot library. Each team had two arms: a "leader" driven by human teleoperation and a "follower" that replicates its movements, all monitored by a top-view camera and a second camera mounted on the gripper.
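
To make the setup concrete, here is a minimal sketch of what such a leader/follower control loop looks like, assuming placeholder drivers for the arms and cameras. The class and method names below are hypothetical, not LeRobot's actual API.

```python
# Minimal sketch of a leader/follower teleoperation loop (illustrative only;
# the arm, camera and recorder objects are hypothetical placeholders).
import time

CONTROL_HZ = 30  # assumed control frequency

def teleoperate(leader_arm, follower_arm, cameras, recorder=None):
    """Read joint positions from the leader arm and mirror them on the follower."""
    period = 1.0 / CONTROL_HZ
    while True:
        t0 = time.perf_counter()

        # Joint positions measured on the human-driven leader arm.
        target_joints = leader_arm.read_joint_positions()

        # The follower replicates the leader's pose.
        follower_arm.write_joint_goals(target_joints)

        # Capture both cameras (top view + gripper-mounted) in sync,
        # so the demonstration can later be used for imitation learning.
        frames = {name: cam.capture() for name, cam in cameras.items()}

        if recorder is not None:
            recorder.add_step(observation=frames, action=target_joints)

        # Keep a fixed control rate.
        time.sleep(max(0.0, period - (time.perf_counter() - t0)))
```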

To validate the full pipeline (data collection → training → inference), we started with a robotics "Hello World": grabbing an object and placing it in a cup. After only 50 demonstrations and a quick fine-tuning of an ACT policy on AMD's cloud, the robot performed the task autonomously. Seeing the machine reproduce human fluidity so quickly is an immediate confirmation that the technical stack holds up.
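
For intuition, here is what that fine-tuning step boils down to as a sketch: plain behavior cloning that regresses the demonstrated joint targets from the camera frames. The `dataset` and `policy` objects are hypothetical stand-ins, and the real ACT training adds action chunking and a CVAE objective on top of this.

```python
# Sketch of the imitation-learning fine-tuning step (not the actual LeRobot/ACT
# training script; `dataset` and `policy` are hypothetical stand-ins).
import torch
from torch.utils.data import DataLoader

def finetune(policy: torch.nn.Module, dataset, epochs: int = 100, lr: float = 1e-4):
    """Behavior cloning: regress the demonstrated actions from camera observations."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)

    policy.train()
    for _ in range(epochs):
        for batch in loader:
            # batch["images"]: camera frames, batch["actions"]: teleoperated joint targets
            predicted = policy(batch["images"])
            loss = torch.nn.functional.l1_loss(predicted, batch["actions"])

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```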

The Project: The Surgical Assistant and the VLA Choice

Once the workflow was mastered, we aimed higher with a team of four students (Fabien, Victor, Lucas and myself). Our idea: a surgical assistant capable of identifying and handing over specific tools on demand.

This is where the technical strategy was decisive. We evaluated several architectures: the PI0 model was promising but too heavy for local inference on our hardware, while at the other end of the spectrum a classic ACT policy lacked flexibility. Our choice therefore fell on SmolVLA.

Why does this choice matter so much? SmolVLA is a Vision-Language-Action (VLA) model. Unlike classic approaches that simply map pixels to motors, a VLA integrates a language model (LLM) into the decision loop. This changes everything: the robot no longer just executes a memorized movement; it understands a semantic instruction grounded in what it sees.
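
To make the difference tangible, here is a rough interface-level sketch contrasting a classic visuomotor policy with a VLA-style one. The types and signatures are illustrative, not SmolVLA's actual API.

```python
# Illustrative contrast between a classic visuomotor policy and a VLA-style
# policy (interfaces are hypothetical, not SmolVLA's actual API).
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    top_view: np.ndarray      # frame from the overhead camera
    gripper_view: np.ndarray  # frame from the wrist-mounted camera

class ClassicPolicy:
    """Pixels -> motors: replays one memorized behavior, no notion of instructions."""
    def act(self, obs: Observation) -> np.ndarray:
        ...

class VLAPolicy:
    """Pixels + language -> motors: the instruction conditions the action."""
    def act(self, obs: Observation, instruction: str) -> np.ndarray:
        # The same weights handle "hand over the scalpel" and "hand over the
        # forceps"; only the text input changes.
        ...
```

The key design point is that nothing else changes between tasks: the weights, the cameras and the arm stay the same; only the text input varies.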

This is the future of robotics for one simple reason: generalization. Instead of training one rigid model for "grab the scalpel" and another for "grab the forceps", we train a single brain capable of handling multiple tasks. By labeling the data ("this part of the dataset corresponds to the instruction 'hand over the scalpel'"), the model learns to link language to action. It thus becomes able to generalize and adapt, which makes adding new tasks much faster and more natural.
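
Concretely, the multi-task data simply pairs each recorded demonstration with its instruction, roughly like the sketch below. The schema and the `load_steps` helper are hypothetical, not LeRobot's exact dataset format.

```python
# Sketch of a multi-task dataset that ties each demonstration to a language
# instruction (the schema below is illustrative).
def load_steps(path: str):
    """Placeholder loader: would return the recorded (images, action) steps of one demo."""
    return []  # hypothetical -- real data would come from the teleoperation recordings

episodes = [
    {"instruction": "hand over the scalpel", "steps": load_steps("demos/scalpel_01")},
    {"instruction": "hand over the forceps", "steps": load_steps("demos/forceps_01")},
]

def training_samples(episodes):
    """Yield (images, instruction, action) triples: a single policy sees every task."""
    for episode in episodes:
        for step in episode["steps"]:
            yield step["images"], episode["instruction"], step["action"]
```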

In 72 hours, the prototype was functional: a speech-to-text module captures the surgeon's voice, the VLA interprets the request, and the arm executes the precise movement to deliver the tool.
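
Here is how the three pieces fit together, as a sketch: the speech-to-text part assumes the open-source openai-whisper package, while `record_audio`, `policy`, `camera_capture` and `follower_arm` are hypothetical placeholders for our actual modules.

```python
# End-to-end loop of the prototype, sketched: record audio, transcribe it,
# feed the instruction to the VLA policy, and stream actions to the arm.
import whisper  # assumes the open-source openai-whisper package

stt_model = whisper.load_model("base")

def serve_request(record_audio, policy, follower_arm, camera_capture, steps: int = 200):
    # 1. Speech-to-text: capture and transcribe the surgeon's spoken request.
    audio_path = record_audio()  # hypothetical helper, e.g. writes "request.wav"
    instruction = stt_model.transcribe(audio_path)["text"].strip()

    # 2. Vision-Language-Action: the instruction conditions every action.
    for _ in range(steps):
        observation = camera_capture()  # top-view + gripper-view frames
        action = policy.act(observation, instruction)
        follower_arm.write_joint_goals(action)
```

In the real prototype the action loop would run until the policy finishes the handover; the fixed `steps` budget here just keeps the sketch simple.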

The Fusion Between Bits and Atoms

This experience perfectly illustrates the current paradigm shift. Models like those released by Physical Intelligence that same week show that, with enough scale, the boundary between "seeing", "understanding" and "doing" fades away.

Until now, robotics has been held back by the need to generate proprietary data through hours of costly teleoperation. The advent of VLAs changes the game: AI is starting to be able to translate human video and text instructions directly into action.

We are witnessing an abrupt fusion between the world of bits and the world of atoms. Software is no longer content to process information; it is starting to reach into hardware to manipulate the real world. It is precisely on this software infrastructure layer that everything will play out over the next five years.