You can download the BUFFET data from Hugging Face Datasets.
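For example, the data can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the repository id and configuration name below are placeholders, so please check the dataset page on the Hugging Face Hub for the exact names.

```python
# A minimal sketch of loading BUFFET with the Hugging Face `datasets` library.
# The repository id and configuration name below are placeholders; check the
# dataset page on the Hugging Face Hub for the exact names.
from datasets import load_dataset

buffet = load_dataset("ORG/BUFFET", "task_language_config")  # placeholder ids

print(buffet["train"][0])  # a few-shot training example
print(buffet["test"][0])   # a held-out evaluation example
```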
Despite remarkable advancements in few-shot generalization in natural language processing, the majority of models are developed and evaluated primarily in English. To enable fair model comparisons, we therefore propose a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples. BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer across a broad range of tasks and languages. Using BUFFET, we perform thorough evaluations of state-of-the-art multilingual large language models with different learning methods, namely in-context learning and fine-tuning. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer. In particular, ChatGPT with in-context learning often performs worse than much smaller mT5-base models fine-tuned on English task data and few-shot in-language examples. Our analysis suggests various avenues for future research in few-shot cross-lingual transfer, such as improved training, in-context learning, and future evaluations.
BUFFET (Benchmark of Unified Format FEw-shot Transfer Evaluation) is designed to enable rigorous evaluations and advance research on few-shot cross-lingual transfer. Similar to a rich buffet, BUFFET curates a diverse mix of tasks: 15 different tasks---including classification, structured prediction, and natural language generation---across 54 languages.
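As a rough illustration of what the unified sequence-to-sequence format implies, every task is cast into a text input paired with a text target. The field names and values below are illustrative assumptions, not the benchmark's exact schema.

```python
# Illustrative assumption of a BUFFET-style example after a task has been cast
# into the unified sequence-to-sequence format; field names are not the
# benchmark's exact schema.
example = {
    "task": "natural_language_inference",
    "language": "sw",                         # ISO code of the target language
    "input": "premise: ... hypothesis: ...",  # task input serialized as text
    "target": "entailment",                   # label verbalized as text
}
```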
We study six different language models (e.g., mT5, BLOOMZ, and ChatGPT) and diverse transfer methods, including both fine-tuning (FT) and in-context learning (ICL).
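Below is a minimal sketch of how an ICL prompt can be assembled from a fixed set of few-shot demonstrations. The instruction text, field names, and helper function are illustrative assumptions rather than the exact prompts used in the paper.

```python
# A minimal sketch of few-shot in-context learning (ICL): a fixed set of k
# demonstrations is concatenated in front of the test input and sent to a
# language model. The instruction text and field names are illustrative
# assumptions, not the exact prompts used in the paper.
def build_icl_prompt(instruction, demonstrations, test_input):
    """Concatenate an instruction, k demonstrations, and the test input."""
    parts = [instruction]
    for demo in demonstrations:
        parts.append(f"Input: {demo['input']}\nOutput: {demo['target']}")
    parts.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_icl_prompt(
    instruction="Decide whether the premise entails the hypothesis.",
    demonstrations=[
        {"input": "premise: ... hypothesis: ...", "target": "entailment"},
    ],
    test_input="premise: ... hypothesis: ...",
)
print(prompt)
```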
All transfer methods show high performance variance across different k-shot samples, with a more pronounced gap for ICL methods.
We also find that while adding more demonstrations often helps in fine-tuning, it can hurt performance in ICL, especially for instruction-tuned LLMs.
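As a hypothetical illustration of how such variance can be reported, one can evaluate the same method on several k-shot samples drawn with different random seeds and report the mean and standard deviation; the scores below are made up for the sketch.

```python
# Hypothetical illustration of measuring performance variance across k-shot
# samples: evaluate the same method on several few-shot sets drawn with
# different random seeds, then report the mean and standard deviation.
from statistics import mean, stdev

scores_per_sample = {"seed_0": 61.2, "seed_1": 55.8, "seed_2": 64.9}  # made-up scores

values = list(scores_per_sample.values())
print(f"mean = {mean(values):.1f}, std = {stdev(values):.1f}")
```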
Based on BUFFET and our large-scale experiments, we suggest exciting opportunities for future research in few-shot cross-lingual transfer across diverse languages.
More detailed discussions are in our paper.
@article{asai2023buffet,
  author = {Asai, Akari and Kudugunta, Sneha and Yu, Xinyan Velocity and Blevins, Terra and Gonen, Hila and Reid, Machel and Tsvetkov, Yulia and Ruder, Sebastian and Hajishirzi, Hannaneh},
  title = {{BUFFET}: Benchmarking Large Language Models for Cross-lingual Few-shot Transfer},
  year = {2023},
}