My implementation of the Streaming Diffusion Policy paper: https://arxiv.org/pdf/2406.04806
Diffusion policies have very nice properties that make them appealing, especially for imitation learning. However, like all diffusion models, they suffer from slow inference.
Given a state (or a short history of states), a typical diffusion policy outputs a sequence of future actions. The agent can then execute the entire sequence, or only a prefix of it, before receiving a new state and computing the next sequence of actions.
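To make the baseline concrete, here is a minimal sketch of standard diffusion policy inference. `policy(actions, state, t)` is a hypothetical callable that performs one denoising step; names, shapes, and signatures are illustrative assumptions, not the code from this repo or the paper.

```python
import torch

def standard_dp_inference(policy, state, num_actions=8, action_dim=2, N=32):
    """Standard diffusion policy: run all N denoising steps from pure noise
    before a single action can be executed."""
    actions = torch.randn(num_actions, action_dim)  # start from Gaussian noise
    for t in reversed(range(N)):
        actions = policy(actions, state, t)         # one denoising step at level t
    return actions                                  # fully denoised action sequence
```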
Instead of denoising the entire sequence of future actions from scratch at every step, Streaming Diffusion Policies (SDPs) maintain a buffer of actions. This buffer is divided into chunks, each containing multiple actions.
SDPs denoise each chunk N/h times at each environment step, where N is the total number of denoising steps and h is the number of chunks in the buffer.
In this setup, the first chunk in the buffer has been denoised N times, the second N - N/h times, the third N - 2*N/h times, and so on. By keeping this buffer of actions in memory, the first chunk is always fully denoised (N steps) and ready to execute, yet each environment step requires only N/h denoising passes over the buffer, instead of the N passes a standard diffusion policy performs at every step. For example, with N = 32 and h = 4, that is 8 denoising passes per step.
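Below is a minimal sketch of that rolling buffer. It assumes a `policy(buffer, state, t)` callable that denoises the whole buffer jointly in one forward pass, advancing each chunk at its own diffusion timestep; this is an illustration of the mechanism described above, not the paper's reference code.

```python
import torch

class StreamingActionBuffer:
    """Sketch of the SDP rolling buffer (illustrative assumptions throughout).

    The buffer holds h chunks at staggered noise levels: chunk 0 is the
    cleanest, chunk h-1 is the noisiest.
    """

    def __init__(self, policy, h=4, chunk_len=8, action_dim=2, N=32):
        assert N % h == 0, "N must be divisible by the number of chunks"
        self.policy = policy
        self.h, self.chunk_len, self.action_dim, self.N = h, chunk_len, action_dim, N
        self.k = N // h  # denoising passes performed per environment step
        # All chunks start as pure noise; in practice the buffer needs a short
        # warm-up at the start of an episode, omitted here for brevity.
        self.buffer = torch.randn(h, chunk_len, action_dim)
        # Remaining denoising steps per chunk: chunk 0 needs N/h more,
        # chunk 1 needs 2N/h, ..., chunk h-1 needs the full N.
        self.remaining = torch.arange(1, h + 1) * self.k

    def step(self, state):
        # Only N/h joint denoising passes over the buffer per environment step.
        for _ in range(self.k):
            self.remaining -= 1
            self.buffer = self.policy(self.buffer, state, self.remaining)
        # Chunk 0 has now received N denoising steps in total: execute it.
        ready = self.buffer[0]
        # Shift the buffer forward and append a fresh pure-noise chunk.
        self.buffer = torch.cat(
            [self.buffer[1:], torch.randn(1, self.chunk_len, self.action_dim)]
        )
        self.remaining = torch.cat([self.remaining[1:], torch.tensor([self.N])])
        return ready
```

With h = 4 and N = 32, each environment step costs 8 network forwards instead of 32; the trade-off is that each forward operates on a buffer h times longer than a single chunk.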
I trained a standard diffusion policy and an SDP on the LunarLander environment, and evaluated both agents on 10 test episodes at the end of training.
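For reference, here is a sketch of the evaluation loop used to collect the numbers below. The `agent.act` interface, the environment id, and the continuous LunarLander variant are assumptions; adapt them to your setup.

```python
import time

import gymnasium as gym
import numpy as np
import torch

def evaluate(agent, num_episodes=10):
    """Run test episodes, recording episodic reward and per-step inference time."""
    # On older gymnasium versions the id is "LunarLander-v2".
    env = gym.make("LunarLander-v3", continuous=True)
    rewards, inference_times = [], []
    for _ in range(num_episodes):
        obs, _ = env.reset()
        done, ep_reward = False, 0.0
        while not done:
            start = time.perf_counter()
            with torch.no_grad():
                action = agent.act(torch.as_tensor(obs, dtype=torch.float32))
            inference_times.append(time.perf_counter() - start)
            obs, reward, terminated, truncated, _ = env.step(np.asarray(action))
            ep_reward += float(reward)
            done = terminated or truncated
        rewards.append(ep_reward)
    return float(np.mean(rewards)), float(np.mean(inference_times))
```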
Here we show the episodic reward during evaluation. As the plots show, there is not much difference between the standard and streaming policies.
However, if we plot the average inference time per episode, we can see that the SDP is roughly 2x faster than the standard diffusion policy (here with 4 chunks of 8 actions each, compared to a standard diffusion policy outputting 8 actions).
Here is a video of the policy trained with SDP:
And here is a video of the policy trained with standard diffusion:




