[Installation&Bug]: First example. TypeError: CommonMetadataBuilder.build() missing 1 required positional argument: 'block_state'

Thank you for the great work and the pre-print! I have a question about running the code and would appreciate your help.

For installation, I followed the standard steps:
```
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:24.04-py3
pip install -e .
```
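
To rule out a stale or shadowed install, a check along the following lines can show which copy of a package Python resolves. This is only a sketch: `module_origin` is a helper name I made up, and I am assuming the editable install should resolve `vllm` to the checkout directory rather than to `dist-packages`.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# After `pip install -e .` I would expect this to point into the repository checkout;
# None means the package is not importable at all.
print(module_origin("vllm"))
```

If the printed path is outside the checkout, a different vLLM build is probably being imported.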


Then I tried a simple LongBench run:

`python run_longbench.py --dataset narrativeqa --model llama3 --protected-window-size 8 --prefill-metric-collection-window-size 8 --max-cache-tokens 512`

However, I get a `missing 1 required positional argument: 'block_state'` error. The full traceback is as follows:


```
/workspace/vllm-kvcompress-main/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm.commit_id'
  from vllm.version import __version__ as VLLM_VERSION
WARNING 10-28 20:18:13 config.py:632] Model has sliding window configured, but it will be disabled due to incompatibility with KV-Compress.
WARNING 10-28 20:18:13 config.py:380] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-28 20:18:13 llm_engine.py:219] Initializing an LLM engine (v0.6.0) with config: model='daryl149/llama-2-7b-chat-hf', speculative_config=None, tokenizer='daryl149/llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=daryl149/llama-2-7b-chat-hf, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Allocating context_lens - Mem: 0.0
Allocating block table - Mem: 0.001048576
Allocating head bias - Mem: 0.13526732800000002
INFO 10-28 20:18:14 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:14 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-28 20:18:15 model_runner.py:964] Starting to load model daryl149/llama-2-7b-chat-hf...
INFO 10-28 20:18:15 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:15 selector.py:116] Using XFormers backend.
INFO 10-28 20:18:15 weight_utils.py:236] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
/workspace/vllm-kvcompress-main/vllm/model_executor/model_loader/weight_utils.py:416: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards:  50% Completed | 1/2 [00:06<00:06,  6.06s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00,  3.74s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00,  4.08s/it]

INFO 10-28 20:18:23 model_runner.py:975] Loading model weights took 12.5518 GB
[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 185, in <module>
[rank0]:     main(args)
[rank0]:   File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 63, in main
[rank0]:     model = LLM(
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/entrypoints/llm.py", line 177, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 584, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 359, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 494, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks(kv_metrics))
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/executor/gpu_executor.py", line 122, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks(kv_metrics)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/worker/worker.py", line 237, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1175, in profile_run
[rank0]:     model_input = self.prepare_model_input(
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1430, in prepare_model_input
[rank0]:     model_input = self._prepare_model_input_tensors(
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1091, in _prepare_model_input_tensors
[rank0]:     return builder.build()  # type: ignore
[rank0]:   File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 784, in build
[rank0]:     attn_metadata = self.attn_metadata_builder.build(
[rank0]: TypeError: CommonMetadataBuilder.build() missing 1 required positional argument: 'block_state'
```

Could you help me with how to fix this?
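
In case it helps with triage, here is a minimal sketch of how I understand this kind of `TypeError` to arise: a `build()` method requires an argument that its caller does not pass. `ToyMetadataBuilder` and `missing_required_args` are hypothetical stand-ins I wrote for illustration, not the actual vllm-kvcompress code, and I am only assuming the real mismatch has this shape.

```python
import inspect


class ToyMetadataBuilder:
    """Hypothetical stand-in for a metadata builder whose build() needs block_state."""

    def build(self, seq_lens, query_lens, block_state):
        return {
            "seq_lens": seq_lens,
            "query_lens": query_lens,
            "block_state": block_state,
        }


def missing_required_args(func, supplied):
    """Return the required parameter names of `func` that are absent from `supplied`."""
    params = inspect.signature(func).parameters
    return [
        name
        for name, param in params.items()
        if param.default is inspect.Parameter.empty
        and param.kind
        not in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
        and name not in supplied
    ]


builder = ToyMetadataBuilder()

# A caller written against an older signature omits block_state.
try:
    builder.build([4], [4])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Inspecting the signature pinpoints which argument the call site must supply.
print(missing_required_args(builder.build, {"seq_lens", "query_lens"}))
```

If that reading is right, the fix would be on the calling side (the `build(` call at `model_runner.py` line 784 in the traceback) or in the backend-specific builder that gets selected, but I have not confirmed this.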

### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
