Collaborative benchmarking task (CBT)

The goal of the second GenBench Collaborative Benchmarking task (CBT), is to generate versions of existing evaluation datasets for LLMs which, given a particular training corpus, have a larger distribution shift from the training corpus than the original test set, or – in other words – that evaluate generalisation to a stronger degree. For this particular challenge, we focus on three training corpora: C4, RedPAjama-Data-1T and Dolma. All three corpora are publicly available, and participants can use What’s in My Big Data (WiMBD)’s API to programmatically search through these coropora. We will focus on three popular evaluation datasets: MMLU, HumanEval, and SiQA.

Specifically, submitters to the CBT are asked to design a way to assess distribution shift for one or more of these evaluation datasets, given particular features of the training corpus, and then generate one or more versions of the dataset that have a larger distribution shift according to this method.
Newly generated sets do not have to have the same size as the original test set, but should have at least 200 examples. Practically speaking, CBT submissions consist of:

The data/task artefact, submitted through https://github.com/GenBench/genbench_cbt
A paper describing the dataset and its method of construction, submitted through https://openreview.net/group?id=GenBench.org/2024/Workshop#tab-recent-activity

We accept submissions that consider only one pretraining dataset and evaluation dataset, but encourage submitters to apply their suggested protocols to both pretraining datasets. We also suggest that submitters include model results for models trained on these datasets, T5, Llama1 and RedPajama-INCITE and Olmo, for C4, RedPAjama and Dolma, respectively. Given enough high-quality submissions, we aim to write a paper with the combined results, to which submitters can be co-authors, if they wish so.

The paper will be reviewed via a regular paper review pipeline and included in the proceedings of the GenBench workshop. Tasks corresponding to accepted papers will be merged into the GenBench repository.

Important dates (tentative)

August 15, 2024 – Paper submission deadline
September 20, 2024 – Notification deadline
October 4, 2024 – Camera ready deadline
November 15 or 16, 2024 – Workshop

Submit a task to the collaborative benchmark

As described above, CBT submissions consist of a data/task artefact and an accompanying paper. The data/task artefacts are submitted via a pull request on the GenBenchCBT github repository, whereas the papers are submitted via openreview.

CBT papers

CBT submissions to the Genbench CBT github repository should be accompanied by (anonymous) papers, that can be submitted via the paper submission page of the GenBench workshop. In the submission form, please indicate that your paper describes a CBT submission, and add the number of your PR in the github repository. CBT submissions to the Genbench CBT github repository should be accompanied by anonymous papers, submitted via the Openreview page for the GenBench workshop. In the submission form, please indicate that your paper describes a CBT submission, and add the number of your PR in the github repository. The paper should at least contain the following information:

Discuss the pretraining set and evaluation set you work on
Discuss how you measure distribution shift or generalisation
Describe the method you used to manipulate the evaluation data to increase the distribution shift
present an evaluation of some models of your choosing
contain a GenBench evaluation card describing its values on the GenBench generalisation taxonomy.

You can submit both an archival paper or an extended abstract. Reviewers will be instructed to assess your submission similarly to regular paper submissions, and acceptance of your task is based on the paper, provided that your artefact technically fits within the framework. Tasks from accepted papers will be merged into the GenBench repository, and the archival CBT papers will be included in the proceedings of the GenBench workshop.”

FAQ

What is generalisation?

The answer to the question “what is generalisation” likely differs per person, and one of the goals of this year’s CBT is to get more actionable protocols to generate datasets that evaluate different forms of generalisation. We are looking forward to seeing your submissions!

Are there any minimal requirements for CBT papers?

CBT submissions should follow the EMNLP 2024 template. Other than that, there are no strict requirements, though we do ask that you add a GenBench evaluation card to your work to indicate what kind of generalisation you focus on.

Collaborative benchmarking task (CBT) – 2024

Important dates (tentative)

Submit a task to the collaborative benchmark

CBT papers

FAQ

What is generalisation?

Are there any minimal requirements for CBT papers?