According to the search results, the key improvement made by changing the prefix length from 1 to 4 in the CLIPxGPT Captioner model was an increase in performance:
"In my original idea, the length of the prefix was 1, but after reading publication, the length of the prefix was changed to 4, thanks to which the performance increased." [1]
The publication referenced is likely the ClipCap paper (arXiv:2111.09734), which also conditions a language model on a CLIP-derived prefix for image captioning. That work showed that mapping the CLIP embedding to multiple prefix tokens outperforms a single-token prefix, which is consistent with the CLIPxGPT author's choice of a length-4 prefix.
By using a longer prefix, the model is able to capture more of the semantic information encoded in the CLIP embedding, which helps guide the language model to generate more relevant and informative captions. The longer prefix provides richer context for the GPT-2 language model to condition its text generation on.
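The mapping described above can be sketched in PyTorch. This is a hypothetical illustration, not the repository's actual code: the module name `PrefixMapper`, the dimensions (512 for CLIP, 768 for GPT-2 small), and the single linear projection are all assumptions chosen to show how one CLIP embedding can be expanded into a 4-token prefix.

```python
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    """Hypothetical sketch: project one CLIP image embedding into a
    multi-token prefix of GPT-2-sized embeddings. Dimensions and the
    single-linear-layer design are illustrative assumptions."""

    def __init__(self, clip_dim: int = 512, gpt_dim: int = 768, prefix_len: int = 4):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt_dim = gpt_dim
        # One linear layer expands a single CLIP vector into
        # prefix_len embeddings of GPT-2 hidden size.
        self.proj = nn.Linear(clip_dim, prefix_len * gpt_dim)

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        # clip_emb: (batch, clip_dim)
        out = self.proj(clip_emb)                        # (batch, prefix_len * gpt_dim)
        return out.view(-1, self.prefix_len, self.gpt_dim)  # (batch, prefix_len, gpt_dim)

mapper = PrefixMapper()
prefix = mapper(torch.randn(2, 512))
print(prefix.shape)  # torch.Size([2, 4, 768])
```

The resulting `(batch, 4, 768)` tensor would be prepended to the caption's token embeddings (e.g. via GPT-2's `inputs_embeds`), giving the decoder four conditioning vectors instead of one.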
So in summary, the key improvement from increasing the prefix length from 1 to 4 tokens was a boost in the overall captioning performance of the CLIPxGPT Captioner model. The longer prefix allowed the model to better leverage the visual understanding encoded in the CLIP embedding.
Citations:
[1] https://github.com/jmisilo/clip-gpt-captioning
[2] https://arxiv.org/html/2312.04793v1
[3] https://github.com/topics/image-caption?l=python&o=desc&s=updated
[4] https://arxiv.org/abs/2111.09734
[5] https://www.researchgate.net/publication/356375969_ClipCap_CLIP_Prefix_for_Image_Captioning