Great work! Are you considering a comparison with ASFT Loss [https://github.com/zhuchichi56/ASFT]?
I noticed that your paper compares SFT, SFT_kl, and DFT. The ASFT authors report that DFT_kl performs quite well in their experiments. Since both works explore forgetting mitigation in SFT, I was wondering whether a comparison with ASFT might provide additional insights. Of course, this is just a suggestion based on curiosity, and I would love to hear your thoughts on whether such a comparison would be relevant to your work.
Looking forward to your insights!