One command evaluation API for fast and thorough evaluation of LMMs, providing multi-faceted insights on model performance with over 40 datasets.
In today’s world, we’re on a thrilling quest for Artificial General Intelligence (AGI), driven by a passion that reminds us of the excitement surrounding the 1960s moon landing. At the heart of this adventure are the incredible large language models (LLMs) and large multimodal models (LMMs). These models are like brilliant minds that can understand, learn, and interact with a vast array of human tasks, marking a significant leap toward our goal.
To truly understand how capable these models are, we’ve started to create and use a wide variety of evaluation benchmarks. These benchmarks help us map out a detailed chart of abilities, showing us how close we are to achieving true AGI. However, this journey is not without its challenges. The sheer number of benchmarks and datasets we need to look at is overwhelming. They’re all over the place - tucked away in someone’s Google Drive, scattered across Dropbox, and hidden in the corners of various school and research lab websites. It’s like embarking on a treasure hunt where the maps are spread far and wide.
In the field of language models, the work of lm-evaluation-harness has set a valuable precedent.
However, the evaluation of multimodal models is still in its infancy, and there is no unified framework for evaluating them across a wide range of datasets. To address this challenge, we introduce lmms-eval.
We humbly absorbed the exquisite and efficient design of lm-evaluation-harness. Building upon its foundation, we implemented our lmms-eval framework with performance optimizations specifically for LMMs.
We believe our effort could provide an efficient interface for the detailed comparison of publicly available models to discern their strengths and weaknesses. It’s also useful for research institutions and production-oriented companies to accelerate the development of Large Multimodal Models. With lmms-eval, we have significantly accelerated the lifecycle of model iteration. Inside the LLaVA team, lmms-eval has greatly improved the efficiency of the model development cycle: we can evaluate the hundreds of checkpoints trained each week on 20-30 datasets, identify their strengths and weaknesses, and then make targeted improvements.
For more usage guidance, please visit our GitHub repo.
You can evaluate models on multiple datasets with a single command. No model or data preparation is needed: run one command, wait a few minutes, and get the results. You get not just a final score but also detailed logs and samples, including the model args, input question, model response, and ground truth answer.
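As a rough illustration of what the single command can look like, the sketch below assembles one as an argument list. The flag names follow the project README as we recall it and may differ between versions, so treat them as assumptions and check the GitHub repo for the current form. The checkpoint id and the helper name `build_eval_command` are hypothetical choices made for this example.

```python
import shlex


def build_eval_command(pretrained, tasks, num_processes=4, output_path="./logs/"):
    """Assemble an lmms-eval launch command as an argument list.

    The flags below are assumed from the project README and are not verified
    against any particular release.
    """
    return [
        "accelerate", "launch", f"--num_processes={num_processes}",
        "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),      # several datasets in one run
        "--batch_size", "1",
        "--log_samples",                 # keep per-sample logs, not just scores
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # Hypothetical checkpoint id, used only for illustration.
    cmd = build_eval_command("liuhaotian/llava-v1.5-7b", ["mme", "gqa"])
    print(shlex.join(cmd))
    # To actually launch the evaluation: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a string keeps quoting safe when it is later passed to `subprocess.run`.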
We support the usage of accelerate to wrap the model for distributed evaluation, supporting multi-GPU and tensor parallelism. With Task Grouping, all instances from all tasks are grouped and evaluated in parallel, which significantly improves evaluation throughput.
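The idea behind pooling instances across tasks can be sketched in a few lines. This is a simplified illustration, not the lmms-eval implementation: the function names, the round-robin sharding, and the placeholder "inference" step are all assumptions made for the example.

```python
from collections import defaultdict


def flatten_tasks(tasks):
    """Merge instances from every task into one list of (task, index, instance)."""
    return [(name, i, inst)
            for name, insts in tasks.items()
            for i, inst in enumerate(insts)]


def shard(items, rank, world_size):
    """Round-robin slice of the pooled instances for one worker."""
    return items[rank::world_size]


def regroup(results):
    """Collect (task, index, output) triples back into per-task ordered lists."""
    per_task = defaultdict(dict)
    for name, i, out in results:
        per_task[name][i] = out
    return {name: [outs[i] for i in sorted(outs)]
            for name, outs in per_task.items()}


tasks = {"mme": ["q0", "q1", "q2"], "gqa": ["q0", "q1"]}
pool = flatten_tasks(tasks)

world_size = 2
results = []
for rank in range(world_size):  # each iteration stands in for one GPU process
    for name, i, inst in shard(pool, rank, world_size):
        # Placeholder for real model inference on this instance.
        results.append((name, i, f"answer({name}:{inst})"))

merged = regroup(results)
```

Because the pool is sharded as a whole, a small task no longer leaves workers idle while a large one finishes, and the original per-task order is restored after all workers report back.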
Below is the total runtime on different datasets using 4 x A100 40G.
Dataset (#num) | LLaVA-v1.5-7b | LLaVA-v1.5-13b |
---|---|---|
mme (2374) | 2 mins 43 seconds | 3 mins 27 seconds |
gqa (12578) | 10 mins 43 seconds | 14 mins 23 seconds |
scienceqa_img (2017) | 1 mins 58 seconds | 2 mins 52 seconds |
ai2d (3088) | 3 mins 17 seconds | 4 mins 12 seconds |
coco2017_cap_val (5000) | 14 mins 13 seconds | 19 mins 58 seconds |
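
To compare datasets of different sizes, the runtimes in the table can be converted into throughput. The short sketch below does that arithmetic; it assumes the number in parentheses after each dataset name is the count of evaluated samples, and the helper names are ours.

```python
import re

# (samples, LLaVA-v1.5-7b runtime, LLaVA-v1.5-13b runtime), copied from the table.
RUNTIMES = {
    "mme": (2374, "2 mins 43 seconds", "3 mins 27 seconds"),
    "gqa": (12578, "10 mins 43 seconds", "14 mins 23 seconds"),
    "scienceqa_img": (2017, "1 mins 58 seconds", "2 mins 52 seconds"),
    "ai2d": (3088, "3 mins 17 seconds", "4 mins 12 seconds"),
    "coco2017_cap_val": (5000, "14 mins 13 seconds", "19 mins 58 seconds"),
}


def to_seconds(text):
    """Convert a string like '2 mins 43 seconds' into a number of seconds."""
    match = re.fullmatch(r"(\d+) mins (\d+) seconds", text)
    if match is None:
        raise ValueError(f"unrecognized runtime format: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


def samples_per_second(num_samples, runtime):
    """Average throughput over the whole run."""
    return num_samples / to_seconds(runtime)


for name, (n, t7, t13) in RUNTIMES.items():
    print(f"{name}: 7b {samples_per_second(n, t7):.1f}/s, "
          f"13b {samples_per_second(n, t13):.1f}/s")
```

For example, mme on the 7b model works out to 2374 samples in 163 seconds, or roughly 14.6 samples per second across the four GPUs.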
We are hosting more than 40 (and increasing) datasets on huggingface/lmms-lab. We carefully converted these datasets from their original sources and included all variants, versions, and splits, so they can now be accessed directly without any data preprocessing. The hosted copies also make it easy to visualize the data and get a sense of the distribution of evaluation tasks.
We provide detailed logging utilities to help you understand the evaluation process and results. The logs include the model args, generation parameters, input question, model response, and ground truth answer. You can also record every detail and visualize it inside runs on Weights & Biases.
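To make the shape of such a per-sample log concrete, here is a minimal sketch that stores the fields listed above as JSON Lines. The `SampleLog` class, its field names, and the file layout are hypothetical and chosen for illustration; they are not the schema lmms-eval actually writes.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class SampleLog:
    """One evaluated sample (hypothetical schema, for illustration only)."""
    task: str
    model_args: str
    gen_kwargs: dict
    question: str
    response: str
    target: str


def write_jsonl(records, path):
    """Write one JSON object per line so logs can be streamed or grepped."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load the log back as a list of plain dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.jsonl")
    write_jsonl(
        [SampleLog(
            task="mme",
            model_args="pretrained=example/checkpoint",  # made-up value
            gen_kwargs={"max_new_tokens": 16, "temperature": 0},
            question="Is there a dog in the image?",
            response="yes",
            target="yes",
        )],
        path,
    )
    loaded = read_jsonl(path)
```

Keeping the question, response, and ground truth side by side in each record is what makes it possible to inspect individual failures rather than only the aggregate score.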
In the extensive table below, we aim to give readers detailed information about the datasets included in lmms-eval, along with some specifics about each of them (we remain grateful for any corrections readers may have regarding our evaluation process).
We provide a Google Sheet for the detailed results of the LLaVA series models on different datasets. You can access the sheet here. It’s a live sheet, and we are updating it with new results.
We also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. You can access the raw data here.
Different models perform best with specific prompt strategies, and implementing these well requires the model developers’ expert knowledge. We prefer not to hastily integrate a model without a thorough understanding of it. Our focus is on supporting more tasks, and we encourage model developers to incorporate our framework into their development process and, when appropriate, integrate their model implementations into our framework through PRs (Pull Requests). Under this strategy, we support a wide range of datasets and a selective set of model series, including:
Names inside () indicate the actual task names used in the config files.
(ai2d)
(chartqa)
(cmmmu)
(cmmmu_val)
(cmmmu_test)
(coco_cap)
(coco2014_cap)
(coco2014_cap_val)
(coco2014_cap_test)
(coco2017_cap)
(coco2017_cap_val)
(coco2017_cap_test)
(docvqa)
(docvqa_val)
(docvqa_test)
(ferret)
(flickr30k)
(ferret_test)
(gqa)
(hallusion_bench_image)
(info_vqa)
(info_vqa_val)
(info_vqa_test)
(llava_bench_wild)
(llava_bench_coco)
(mathvista)
(mathvista_testmini)
(mathvista_test)
(mmbench)
(mmbench_en)
(mmbench_en_dev)
(mmbench_en_test)
(mmbench_cn)
(mmbench_cn_dev)
(mmbench_cn_test)
(mme)
(mmmu)
(mmmu_val)
(mmmu_test)
(mmvet)
(multidocvqa)
(multidocvqa_val)
(multidocvqa_test)
… and more.
@misc{lmms_eval2024,
title={LMMs-Eval: Accelerating the Development of Large Multimodal Models},
url={https://github.com/EvolvingLMMs-Lab/lmms-eval},
author={Bo Li*, Peiyuan Zhang*, Kaicheng Zhang*, Fanyi Pu*, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li and Ziwei Liu},
publisher = {Zenodo},
version = {v0.1.0},
month={March},
year={2024}
}