- 2024.05.28: We release the instruction-tuning data for LATIN-Tuning.
- 2023.09.07: We support the Llama2 and Llama2-chat models.
- 2023.09.06: We introduce LATIN-Tuning for Alpaca, which enhances its zero-shot performance on DocVQA from 0.3567 to 0.6697.
- 2023.06.30: We now provide implementations based on Azure OpenAI text-davinci-003 Completion.
- 2023.06.29: We now provide implementations based on Alpaca-7B and Vicuna-13B.
- DUE OCR results
- Alpaca 7B
- Vicuna 13B
- Azure OpenAI gpt-3.5-turbo + Completion
- Azure OpenAI gpt-3.5-turbo + ChatCompletion (in progress)
- Azure OpenAI text-davinci-003 + Completion
- MPT-30B-Chat (todo)
- Orca (todo)
- GPT-4 and official OpenAI API (todo; we are working to obtain access to the official OpenAI API)
- LLaMA2-chat

pip install -r requirements.txt

Set the ANTHROPIC_API_KEY for Claude. Please refer to the script ./utils/claude.py.

export ANTHROPIC_API_KEY="Your API key"

Set the OPENAI_API_KEY and OPENAI_API_BASE for Azure OpenAI. Please refer to the script ./utils/openai_api.py. For the differences between Azure OpenAI and OpenAI, see here.

export OPENAI_API_KEY="Your API key"
export OPENAI_API_BASE="Your base url"

Note: Due to resource constraints, all of our experiments currently use the Azure OpenAI API. We are still seeking access to the official OpenAI API. If you can provide relevant resources, please contact the author; we would be very grateful.

export DATAS_DIR="Your data directory"

- Download the DocVQA dataset with Azure OCR results from DUE Benchmark and put it into the DATAS_DIR.
- Download the DocVQA dataset with Official OCR results from Robust Reading Competition and put it into the DATAS_DIR.
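
Because the evaluation scripts depend on several environment variables, it can help to check them before launching a run. The following is a minimal sketch, not part of this repository: the variable names are taken from the setup steps above, while the helper `missing_env_vars` and the grouping in `REQUIRED_VARS` are hypothetical and introduced only for illustration.

```python
import os

# Variable names come from the setup steps above.
# The grouping into "claude" / "azure_openai" / "data" is an assumption.
REQUIRED_VARS = {
    "claude": ["ANTHROPIC_API_KEY"],
    "azure_openai": ["OPENAI_API_KEY", "OPENAI_API_BASE"],
    "data": ["DATAS_DIR"],
}


def missing_env_vars(groups, env=None):
    """Return the variables in the given groups that are unset or empty."""
    env = os.environ if env is None else env
    return [
        name
        for group in groups
        for name in REQUIRED_VARS[group]
        if not env.get(name)
    ]


if __name__ == "__main__":
    # Example: check what a Claude run on the DocVQA data would need.
    missing = missing_env_vars(["claude", "data"])
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Environment looks complete.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real shell environment.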
To configure model paths, refer to the script ./utils/model_path_config.py.
bash script/claude_eval.sh 0 claude docvqa_due_azure task_instruction_space
bash script/claude_eval.sh 0 claude docvqa_due_azure plain
bash script/claude_eval.sh 0 claude docvqa_due_azure task_instruction
bash script/claude_eval.sh 0 claude docvqa_due_azure space
bash script/claude_eval.sh 0 gpt-35 docvqa_due_azure task_instruction_space
bash script/claude_eval.sh 0 gpt-35 docvqa_due_azure plain
bash script/claude_eval.sh 0 gpt-35-chat docvqa task_instruction_space
bash script/claude_eval.sh 0 text-davinci-003 docvqa_due_azure task_instruction_space
bash script/llama_eval.sh 0 alpaca-7b docvqa_due_azure task_instruction_space
bash script/vllm_eval.sh 0 vicuna-13b docvqa_due_azure task_instruction_space
bash script/vllm_eval.sh 0 llama2-13b-chat docvqa_due_azure task_instruction_space

By default, the results in this table are based on the Azure OCR results provided in DUE Benchmark. Entries marked Official OCR are based on the OCR results provided in Robust Reading Competition.

| Model | Prompt | Test ANLS | Test ⬆ | Val ANLS | Val ⬆ |
|---|---|---|---|---|---|
| Claude | Plain | 0.2298 | - | 0.2144 | - |
| Claude | LATIN | 0.8366 | +0.6038 | 0.8311 | +0.6167 |
| Azure OpenAI ChatGPT (Completion) | Plain | 0.6866 | - | 0.6795 | - |
| Azure OpenAI ChatGPT (Completion) | LATIN | 0.8255 | +0.1389 | 0.8135 | +0.1340 |
| Azure OpenAI ChatGPT (ChatCompletion) | Plain | TODO | - | TODO | - |
| Azure OpenAI ChatGPT (ChatCompletion) | LATIN | TODO | TODO | 0.5954 (Official OCR) | TODO |
| Azure OpenAI text-davinci-003 (Completion) | LATIN | - | - | 0.8188 | - |
| Alpaca (7B) | Plain | 0.3567 | - | 0.3506 | - |
| Alpaca (7B) | LATIN | 0.4200 | +0.0633 | 0.4304 | +0.0798 |
| Alpaca (7B) | LATIN-Tuning + LATIN-Prompt | 0.6697 | +0.3130 | 0.6668 | +0.3162 |
| Vicuna (13B) | Plain | 0.0710 | - | 0.0688 | - |
| Vicuna (13B) | LATIN | 0.4725 | +0.4015 | 0.4597 | +0.3909 |
| Llama2-13b-chat | Plain | 0.1783 | - | 0.1863 | - |
| Llama2-13b-chat | LATIN | 0.4283 | +0.2500 | 0.4435 | +0.2572 |
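
The table reports ANLS (Average Normalized Levenshtein Similarity), the metric commonly used for DocVQA. For readers unfamiliar with it, here is a minimal sketch of the usual definition: each prediction is compared against every accepted answer, the similarity is 1 minus the normalized edit distance, scores whose normalized distance reaches the threshold (commonly 0.5) are set to 0, and the best score per question is averaged. The function names `levenshtein` and `anls` are illustrative, and the lowercasing and whitespace stripping are assumptions; the evaluation code used for the numbers above may differ in such details.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average over questions of the best thresholded similarity.

    predictions: list of predicted answer strings.
    references: list of lists of accepted answer strings, one list per question.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()  # normalization is an assumption
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

For example, a prediction that matches one accepted answer exactly contributes 1 to the average, while a prediction that shares no characters with any accepted answer contributes 0.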
@misc{wang2023layout,
title={Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering},
author={Wenjin Wang and Yunhao Li and Yixin Ou and Yin Zhang},
year={2023},
eprint={2306.00526},
archivePrefix={arXiv},
primaryClass={cs.CL}
}