BUFFET: Benchmarking Large Language Models for Cross-lingual Few-shot Transfer

University of Washington, Google DeepMind, Allen Institute for AI

BUFFET is a new benchmark for a fair and scalable evaluation of few-shot cross-lingual transfer, covering 15 diverse tasks and 54 typologically diverse languages.

BUFFET teaser.

How to use BUFFET

You can download the BUFFET data from Huggingface Datasets.
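Below is a minimal loading sketch using the `datasets` library. The repository and configuration names (`"BUFFET"`, `"xnli.sw"`) and the split names are illustrative assumptions; check the dataset card on Hugging Face for the actual identifiers.

```python
# Minimal sketch of loading BUFFET from Hugging Face Datasets.
# The dataset/config/split names below are assumptions for illustration.
from datasets import load_dataset

# Hypothetical repository and configuration (e.g., XNLI in Swahili).
buffet = load_dataset("BUFFET", "xnli.sw")

few_shot = buffet["train"]  # the fixed few-shot demonstrations (assumed split name)
test = buffet["test"]       # evaluation split (assumed split name)

print(len(few_shot), "few-shot examples,", len(test), "test examples")
print(few_shot[0])          # inspect the unified sequence-to-sequence fields
```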

Abstract

Despite remarkable advancements in few-shot generalization in natural language processing, the majority of models are developed and evaluated primarily in English. To enable fair model comparisons, we therefore propose a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples. BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer across a broad range of tasks and languages. Using BUFFET, we perform thorough evaluations of state-of-the-art multilingual large language models with different learning methods, namely in-context learning and fine-tuning. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer. In particular, ChatGPT with in-context learning often performs worse than much smaller mT5-base models fine-tuned on English task data and few-shot in-language examples. Our analysis suggests various avenues for future research in few-shot cross-lingual transfer, such as improved training, in-context learning, and future evaluations.

The BUFFET Benchmark

BUFFET (Benchmark of Unified Format FEw-shot Transfer Evaluation) is designed to enable rigorous evaluations and advance research on few-shot cross-lingual transfer. Similar to a rich buffet, BUFFET curates a diverse mix of tasks: 15 different tasks---including classification, structured prediction, and natural language generation---across 54 languages.
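To illustrate the unified sequence-to-sequence format, the sketch below verbalizes a single NLI instance into an input/target text pair. The field names and the verbalizer template are assumptions for illustration, not the exact schema used in BUFFET.

```python
# Illustrative verbalization of one NLI instance into a text-to-text pair.
# Field names and the prompt template are assumptions, not BUFFET's exact schema.
def to_seq2seq(example: dict) -> dict:
    input_text = (
        f"Premise: {example['premise']}\n"
        f"Hypothesis: {example['hypothesis']}\n"
        "Question: Does the premise entail the hypothesis? "
        "Answer with entailment, neutral, or contradiction."
    )
    target_text = example["label"]  # e.g., "entailment"
    return {"input": input_text, "target": target_text}

sample = {
    "premise": "Mto unapita katikati ya mji.",  # Swahili: "The river flows through the city."
    "hypothesis": "Kuna mto mjini.",            # Swahili: "There is a river in the city."
    "label": "entailment",
}
print(to_seq2seq(sample)["input"])
```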

BUFFET overview.

Main Results

We study six different language models (e.g., mT5, BLOOMZ, and ChatGPT) and diverse transfer methods, including both fine-tuning (FT) and in-context learning (ICL).
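To make the ICL setup concrete, here is a sketch of how a k-shot prompt might be assembled from the fixed in-language demonstrations. The instruction wording and the prompt layout are assumptions, not the exact prompts used in the paper.

```python
# Sketch of k-shot in-context-learning prompt construction from fixed
# demonstrations. Instruction text and formatting are assumptions.
def build_icl_prompt(instruction: str, demos: list[dict], query: str, k: int = 32) -> str:
    parts = [instruction]
    for demo in demos[:k]:
        parts.append(f"Input: {demo['input']}\nOutput: {demo['target']}")
    parts.append(f"Input: {query}\nOutput:")  # the test instance to be completed
    return "\n\n".join(parts)

prompt = build_icl_prompt(
    instruction="Classify the relationship between the premise and hypothesis.",
    demos=[{"input": "Premise: ... Hypothesis: ...", "target": "entailment"}],
    query="Premise: ... Hypothesis: ...",
    k=32,
)
print(prompt)
```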

BUFFET results.
  • Under a comparable few-shot setup (i.e., each model has access to only 32 examples written in the target language), we find that much smaller fine-tuned models (mT5-base) often outperform SOTA LLMs, including ChatGPT.
  • Instruction-tuned LLMs, particularly ChatGPT, are competitive in high-resource (HR) languages, on tasks that require generation in the target language, and on tasks with limited data even in resource-rich languages.
  • In less-represented (low-resource; LR) languages, such LLMs with ICL struggle considerably, while smaller fine-tuned models trained on 32 examples retain competitive performance (e.g., outperforming ChatGPT by 10% on NLI in some indigenous languages of the Americas).
BUFFET results.

Analysis

High variance across different demonstrations.

All transfer methods show high performance variance across different k-shot samples, with a more significant gap for ICL methods.

BUFFET results.

The optimal transfer configuration varies across models and transfer methods.

We also find that while adding more demonstrations often helps fine-tuning, it can hurt ICL performance, especially for instruction-tuned LLMs.

BUFFET results.

Moving Forward

Based on BUFFET and our large-scale experiments, we suggest exciting opportunities for future research in few-shot cross-lingual transfer across diverse languages. In summary:

  • Improve instruction-tuning for better cross-lingual transfer
  • Expand evaluations to more diverse tasks (generation, reasoning, and knowledge) and languages (diverse local languages, including under-represented languages and their dialects)
  • Overcome data scarcity using LLMs

More detailed discussions are in our paper.

BibTeX

@article{asai2023buffet,
  author    = {Asai, Akari and Kudugunta, Sneha and Yu, Xinyan Velocity and Blevins, Terra and Gonen, Hila and Reid, Machel and Tsvetkov, Yulia and Ruder, Sebastian and Hajishirzi, Hannaneh},
  title     = {{BUFFET}: Benchmarking Large Language Models for Cross-lingual Few-shot Transfer},
  year      = {2023},
}