SI-Diff: A Framework for Learning Search and High-Precision Insertion with a Force-Domain Diffusion Policy

RA-L'26 & ICRA'27

From the Author

Due to IP policies, we cannot release a click-and-run version of SI-Diff. Instead, we provide supplementary details beyond the paper to help you reproduce the work.

We first give a brief introduction to the fundamentals of robot control to avoid confusion later. The second-order dynamic model of an n-degree-of-freedom (DoF) torque-controlled robot is as follows:
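For reference, a common textbook form of this model is sketched here. Only $\boldsymbol{\tau}_m$ is taken from this page; the other symbols are assumptions and may differ from the paper's notation: $\boldsymbol{q}$ for the joint positions, $\boldsymbol{M}$ for the inertia matrix, $\boldsymbol{C}$ for the Coriolis and centrifugal matrix, $\boldsymbol{g}$ for the gravity torque, and $\boldsymbol{\tau}_{ext}$ for the external joint torque.

$$
\boldsymbol{M}(\boldsymbol{q})\ddot{\boldsymbol{q}} + \boldsymbol{C}(\boldsymbol{q},\dot{\boldsymbol{q}})\dot{\boldsymbol{q}} + \boldsymbol{g}(\boldsymbol{q}) = \boldsymbol{\tau}_m + \boldsymbol{\tau}_{ext}
$$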

Among these terms, the joint torque $\boldsymbol{\tau}_m$ is the one we command to control the robot. When $\boldsymbol{\tau}_m$ is computed by the following control law, the robot is under impedance control.

Based on this, if a feedforward force term is added, the controller becomes a feedforward force-based impedance controller, which is the controller used in this work.
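The two control laws described above can be written in a common textbook form. This is a sketch under stated assumptions, not necessarily the exact equations of the paper: $\boldsymbol{J}$ is the EE Jacobian, $\boldsymbol{K}$ and $\boldsymbol{D}$ are Cartesian stiffness and damping matrices, $\boldsymbol{e}$ is the error between the desired and measured EE pose, $\boldsymbol{f}_{ff}$ is the 6-DoF feedforward force, and $\boldsymbol{\tau}_{comp}$ collects whatever model-compensation terms (for example Coriolis or gravity) your robot interface expects.

Impedance controller:

$$
\boldsymbol{\tau}_m = \boldsymbol{J}^{\top}(\boldsymbol{q})\left(\boldsymbol{K}\boldsymbol{e} + \boldsymbol{D}\dot{\boldsymbol{e}}\right) + \boldsymbol{\tau}_{comp}
$$

Feedforward force-based impedance controller:

$$
\boldsymbol{\tau}_m = \boldsymbol{J}^{\top}(\boldsymbol{q})\left(\boldsymbol{K}\boldsymbol{e} + \boldsymbol{D}\dot{\boldsymbol{e}} + \boldsymbol{f}_{ff}\right) + \boldsymbol{\tau}_{comp}
$$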

Our force diffusion policy learns how to predict the feedforward force.

Note that we rely on the error term e to drive the end effector (EE) to the desired position. In other words, we need to first define a desired position or trajectory. Although the feedforward force can also influence the motion of the EE, we only rely on it to handle misalignment or sticking situations.

Step 1: Impedance Controller

First, you need to build an impedance controller for your robot. If you are using a Franka Robotics robot, you can follow this demo. Once this step is completed, your robot should behave like the one shown in the following video.

595577416-4ef82801-d471-4a69-8b65-04aa87ca3d07.mp4
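As a rough illustration of what the controller computes in each cycle, the sketch below evaluates a Cartesian impedance law with NumPy. The function name, argument names, and gain values are hypothetical and are not taken from the Franka demo or from our implementation; a real controller reads the Jacobian, errors, and compensation terms from the robot interface at every control step.

```python
import numpy as np


def impedance_torque(jacobian, stiffness, damping, pose_error, pose_error_rate,
                     compensation, feedforward_wrench=None):
    """Joint torques tau = J^T (K e + D e_dot + f_ff) + tau_comp.

    All names are illustrative. Shapes assume a 7-joint arm:
    jacobian (6, 7), stiffness/damping (6, 6), errors and wrench (6,),
    compensation (7,).
    """
    wrench = stiffness @ pose_error + damping @ pose_error_rate
    if feedforward_wrench is not None:
        wrench = wrench + feedforward_wrench  # the extra term added in Step 2
    return jacobian.T @ wrench + compensation


# Placeholder values only, chosen so the result is easy to check by hand.
J = np.hstack([np.eye(6), np.zeros((6, 1))])   # toy Jacobian, not a real robot's
K = np.diag([100.0] * 6)                        # hypothetical stiffness gains
D = np.diag([10.0] * 6)                         # hypothetical damping gains
e = np.array([0.01, 0.0, 0.0, 0.0, 0.0, 0.0])   # 1 cm error along x
tau = impedance_torque(J, K, D, e, np.zeros(6), np.zeros(7))
print(tau)
```

With `feedforward_wrench` left at `None`, this is the plain impedance controller of Step 1; passing a 6-vector turns it into the feedforward variant of Step 2.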

Step 2: Feedforward-based Impedance Controller

On top of the impedance controller, you need to further add a feedforward force term to the controller. You can start by designing the feedforward force using a simple pattern. For example, you can set fz as a sinusoidal signal and set fx, fy, mx, my, and mz to zero. Then, your robot should behave as shown in the following video.

595577540-e5ae456f-881e-4a4d-90d8-ec9e37ff4f6c.mov
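The simple test pattern described above can be sketched as follows. The function name, amplitude, and frequency are hypothetical values chosen for illustration; pick values that are safe for your robot and setup.

```python
import math


def sinusoidal_feedforward(t, amplitude_n=2.0, frequency_hz=0.5):
    """Return the wrench (fx, fy, fz, mx, my, mz) at time t in seconds.

    Only fz oscillates; the other five components stay at zero.
    """
    fz = amplitude_n * math.sin(2.0 * math.pi * frequency_hz * t)
    return (0.0, 0.0, fz, 0.0, 0.0, 0.0)


# Sample the pattern over one second at a coarse step for inspection.
for step in range(5):
    t = 0.25 * step
    print(t, sinusoidal_feedforward(t))
```

The returned wrench is what would be added to the impedance term inside the controller at each control step.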

Step 3: Teacher Policy

Follow Algorithm 1 in our paper to design the teacher policy and collect training data. We provide one demonstration (robot_action_train_demo.pkl & robot_state_train_demo.pkl) in this repository to show what the training data look like.

Our diffusion policy learns to predict the robot action (output) from the robot state (input). The action is the 6-DoF feedforward force (fx, fy, fz, mx, my, and mz). The robot state is 37-dimensional: the first value is the mode prompt, and the remaining 36 values are identical to the observations in TacDiffusion. You can refer to the discussion here for details on these 36 values.
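A minimal sketch of how one state sample splits into its two parts is given below. The helper name is hypothetical, and the sketch assumes each sample is a flat sequence of 37 numbers with an integer-coded mode prompt; the exact layout inside the provided .pkl files is not described here, so inspect them before relying on this.

```python
STATE_DIM = 37   # 1 mode prompt + 36 observation values
ACTION_DIM = 6   # fx, fy, fz, mx, my, mz


def split_state(state):
    """Split one 37-dim state into (mode_prompt, 36-dim observation)."""
    if len(state) != STATE_DIM:
        raise ValueError(f"expected {STATE_DIM} values, got {len(state)}")
    mode_prompt = int(state[0])                 # assumed to be integer-coded
    observation = [float(v) for v in state[1:]]
    return mode_prompt, observation


# Synthetic sample for illustration only, not real robot data.
sample = [1.0] + [0.0] * 36
mode, obs = split_state(sample)
print(mode, len(obs))
```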

Once the teacher policy is ready, the robot can start searching. In the early stages, we manually created misalignments to collect data for the teacher policy. Later, we developed an automated data collection pipeline. It mirrors the evaluation process of the teacher policy, but records only successful demonstrations that meet our efficiency criterion (completed within 2 seconds). We kept running it until a sufficient number of expert demonstrations had been collected.

auto_data.mp4
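The filtering logic of such a pipeline can be sketched as below. This is not our actual pipeline: `run_episode` is a hypothetical callable standing in for one teacher-policy rollout, assumed to return a success flag, the episode duration in seconds, and the recorded trajectory.

```python
def collect_expert_demos(run_episode, n_required, max_duration_s=2.0,
                         max_attempts=10_000):
    """Keep only successful rollouts that finish within max_duration_s."""
    demos = []
    for _ in range(max_attempts):
        if len(demos) >= n_required:
            break
        success, duration_s, trajectory = run_episode()
        if success and duration_s <= max_duration_s:
            demos.append(trajectory)
    return demos


# Usage with scripted fake rollouts in place of a real robot.
fake_results = iter([
    (True, 1.5, "demo_a"),    # kept
    (False, 0.5, "demo_b"),   # dropped: failed
    (True, 2.5, "demo_c"),    # dropped: too slow
    (True, 1.9, "demo_d"),    # kept
])
print(collect_expert_demos(lambda: next(fake_results), n_required=2))
```

The `max_attempts` cap is an added safeguard so the loop terminates even if the teacher policy rarely meets the criterion.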

Step 4: Diffusion Policy

Our diffusion policy is built upon Imitating-Human-Behaviour-w-Diffusion and TacDiffusion. We recommend first becoming familiar with these two works, then following the instructions in our paper to add the mode embedding layers.

Step 5: Model Training

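Since the model needs to learn two modes simultaneously and the data are imbalanced between the two modes, we recommend using the BBS technique. The following code sketches the training loop: in each iteration, one batch is drawn from each mode's dataloader and the two are concatenated, so both modes are represented in every optimization step. Because the two loaders are zipped, each epoch ends when the shorter loader is exhausted.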

```python
import numpy as np
import torch
from tqdm import tqdm

# Assumed to be defined earlier in the training script: model (wrapped, hence
# model.module), optim, dataload_train_0 / dataload_train_1 (one loader per mode),
# n_epoch, lrate, device, rank, and writer (e.g., a TensorBoard SummaryWriter).
global_step = 0  # logging step counter, incremented once per optimization step

for ep in range(n_epoch):
    # Tell both samplers which epoch this is (as a DistributedSampler expects)
    dataload_train_0.sampler.set_epoch(ep)
    dataload_train_1.sampler.set_epoch(ep)

    model.train()
    # Cosine learning-rate decay from lrate towards 0 over n_epoch epochs
    optim.param_groups[0]["lr"] = lrate * ((np.cos((ep / n_epoch) * np.pi) + 1) / 2)

    # zip stops at the shorter loader, so every step sees one batch per mode
    pbar = zip(dataload_train_0, dataload_train_1)
    if rank == 0:
        pbar = tqdm(pbar, total=min(len(dataload_train_0), len(dataload_train_1)), desc=f"Epoch {ep}")

    for (x0, y0), (x1, y1) in pbar:
        # 1. Move tensors to the configured device asynchronously
        x0 = x0.to(device, non_blocking=True).float()
        y0 = y0.to(device, non_blocking=True).float()
        x1 = x1.to(device, non_blocking=True).float()
        y1 = y1.to(device, non_blocking=True).float()

        # 2. Extract the mode prompt from the first dimension (index 0)
        # Input shape: [B, 37] -> mode shape: [B], feature shape: [B, 36]
        mode0 = x0[:, 0].long()       # Cast to long for the embedding layer
        x0_feature = x0[:, 1:]        # Slice the remaining 36 dimensions for observations

        mode1 = x1[:, 0].long()
        x1_feature = x1[:, 1:]

        # 3. Concatenate the dual-source data into a single balanced batch
        x_batch = torch.cat([x0_feature, x1_feature], dim=0)  # Pure 36-dim observations
        y_batch = torch.cat([y0, y1], dim=0)
        mode_batch = torch.cat([mode0, mode1], dim=0)          # Combined mode prompts

        # 4. Forward pass and loss computation
        # Note: calling through model.module bypasses the wrapper's forward; if the
        # wrapper is DistributedDataParallel, verify that gradients are still synchronized.
        loss = model.module.loss_on_batch(x_batch, y_batch, mode_batch)

        # 5. Backward pass and optimization step
        optim.zero_grad()
        loss.backward()
        optim.step()

        if rank == 0:
            pbar.set_description(f"train loss: {loss.item():.4f}")
            writer.add_scalar('training_loss', loss.item(), global_step)
            global_step += 1
```
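To make the effect of pairing the two loaders concrete, here is a framework-free sketch of the same batching idea. It is an illustration only: the function name and the sample counts are hypothetical, and it uses plain Python lists in place of the PyTorch dataloaders above.

```python
import random


def balanced_batches(samples_mode0, samples_mode1, per_mode_batch, seed=0):
    """Yield batches holding per_mode_batch samples from each mode.

    Like zipping two dataloaders, iteration stops once the smaller
    dataset cannot supply another full half-batch.
    """
    rng = random.Random(seed)
    pool0, pool1 = list(samples_mode0), list(samples_mode1)
    rng.shuffle(pool0)
    rng.shuffle(pool1)
    n_batches = min(len(pool0), len(pool1)) // per_mode_batch
    for i in range(n_batches):
        lo, hi = i * per_mode_batch, (i + 1) * per_mode_batch
        yield pool0[lo:hi] + pool1[lo:hi]


# Imbalanced toy data: 100 samples of mode 0, only 10 of mode 1.
mode0 = [("mode0", i) for i in range(100)]
mode1 = [("mode1", i) for i in range(10)]
for batch in balanced_batches(mode0, mode1, per_mode_batch=5):
    n0 = sum(1 for label, _ in batch if label == "mode0")
    print(len(batch), n0, len(batch) - n0)
```

Every batch contains the same number of samples from each mode regardless of the dataset sizes. The cost is that only part of the larger dataset is visited per epoch, which is why reshuffling between epochs matters.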

Modifications to FORGE

FORGE is the RL-based state-of-the-art (SOTA) baseline in our paper. However, the released FORGE code cannot be used directly to learn peg-in-hole tasks with our objects. In particular, although the authors state in the paper that a noisy estimate of the fixed part’s 6-DoF pose (which lies in SE(3)) is used as input to the model, we found that their code implementation uses only the 3-DoF translation component.

FORGE paper

Released FORGE code

In the released code, fingertip refers to the end-effector. Note that pos is a 3D position vector rather than a 6D pose, while quat denotes the orientation quaternion. In other words, the released implementation does not fully reproduce the algorithm described in the paper.

This discrepancy is also reflected in how the “key points” are defined in the code. In their implementation, the key points defined on each object all lie on a single line, and the training objective is to align the key points on the peg (blue) with those on the hole (green). Under this definition, orientation becomes meaningless, since two parallel lines (both perpendicular to the ground) differ only by a translational error. We found experimentally that this setup works for the cylinder-like pegs used in the paper but fails for the cuboid peg, which requires additional constraints to guide alignment during insertion.

Therefore, we made the following two modifications to the FORGE code implementation.
First, we revised the definition of the key points, from collinear points to the four corners of a square, along with their normal direction. Please refer to the figure above. Comparing the two setups shows that orientation information becomes meaningful: unlike in the previous setup, the key points can only be aligned at a specific orientation.
Second, following the algorithm described in the FORGE paper, we added the orientation-related term fingertip_euler_rel_fixed, which represents the noisy estimate of the fixed part’s rotation, and modified the corresponding training and inference procedures accordingly.
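The effect of the first modification can be checked with a small numerical sketch. The key point coordinates below are hypothetical and are not taken from the FORGE code or from ours, and the sketch uses only the corner points, not the normal direction. It rotates each key point set about the vertical axis and measures how far the points move: collinear points on that axis do not move at all, so a yaw error is invisible to the alignment objective, whereas square corners do move.

```python
import math


def rotate_about_z(points, yaw_rad):
    """Rotate 3D points about the vertical (z) axis."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def mean_keypoint_distance(points_a, points_b):
    """Average Euclidean distance between corresponding key points."""
    total = sum(math.dist(p, q) for p, q in zip(points_a, points_b))
    return total / len(points_a)


# Hypothetical key points in metres.
collinear = [(0.0, 0.0, 0.00), (0.0, 0.0, 0.01), (0.0, 0.0, 0.02), (0.0, 0.0, 0.03)]
square = [(0.01, 0.01, 0.0), (-0.01, 0.01, 0.0), (-0.01, -0.01, 0.0), (0.01, -0.01, 0.0)]

yaw = math.radians(30.0)  # orientation error of the peg about the vertical axis
err_line = mean_keypoint_distance(collinear, rotate_about_z(collinear, yaw))
err_square = mean_keypoint_distance(square, rotate_about_z(square, yaw))
print(err_line, err_square)
```

For a cylinder-like peg this blind spot is harmless because of its rotational symmetry, but a cuboid peg must match the yaw of the hole before it can be inserted.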

Acknowledgments

Parts of this project page were adopted from the Nerfies page. We would like to thank the authors of Imitating-Human-Behaviour-w-Diffusion and TacDiffusion for their open-source contributions.
