KD for Vision Transformer
Knowledge Distillation
Knowledge distillation has become a common trick for obtaining better student models in both industry and academia. Student models learn implicit information by imitating the logit outputs and intermediate layers of the teacher model. This article summarizes and demonstrates my work on “Transformer model distillation” at IDL, Baidu Research, from 2022-04-13 to 2022-07-10. (As Prof. Wang says, my work is more like a survey than a conference paper. 😢)
A central issue in knowledge distillation is how to transfer information from teacher models to student models. In CNNs, we can align the logit outputs as well as the intermediate features. We can do the same in transformers, but we want to align teacher and student at a finer granularity and look more deeply into the information flow between different layers.
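To make the two alignment targets concrete, here is a minimal sketch in plain Python (no deep learning framework, so it runs anywhere). It assumes the common formulation of logit distillation as a temperature-softened KL divergence, and feature alignment as a mean squared error between equally sized feature vectors. The function names, the temperature of 4.0, and the weight `alpha` are illustrative choices of mine, not the exact setup used in the experiments.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum_exp for z in scaled]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def feature_mse_loss(student_feat, teacher_feat):
    """Mean squared error between two equally sized feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("features must have the same length; project first")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def distillation_loss(student_logits, teacher_logits,
                      student_feat, teacher_feat, alpha=0.5):
    """Weighted sum of the logit term and the feature term (alpha is assumed)."""
    logit_term = logit_kd_loss(student_logits, teacher_logits)
    feat_term = feature_mse_loss(student_feat, teacher_feat)
    return alpha * logit_term + (1.0 - alpha) * feat_term


if __name__ == "__main__":
    teacher_logits = [2.0, 0.5, -1.0]
    student_logits = [1.5, 0.7, -0.5]
    teacher_feat = [0.2, -0.1, 0.4, 0.0]
    student_feat = [0.1, -0.2, 0.5, 0.1]
    print(distillation_loss(student_logits, teacher_logits,
                            student_feat, teacher_feat))
```

In practice the student and teacher features usually have different widths, so a learned projection is typically applied to the student features before the MSE term; the sketch simply raises an error in that case.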
Transformer
Transformer framework