Thank you for the great work and the pre-print! I have a question about running the code, and I would appreciate it if you could answer it.
For installation, I followed the standard steps:
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:24.04-py3
pip install -e .
Then I tried a simple LongBench run:
python run_longbench.py --dataset narrativeqa --model llama3 --protected-window-size 8 --prefill-metric-collection-window-size 8 --max-cache-tokens 512
However, I am getting a "missing 1 required positional argument: 'block_state'" error. The full log and traceback are as follows:
/workspace/vllm-kvcompress-main/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm.commit_id'
from vllm.version import __version__ as VLLM_VERSION
WARNING 10-28 20:18:13 config.py:632] Model has sliding window configured, but it will be disabled due to incompatibility with KV-Compress.
WARNING 10-28 20:18:13 config.py:380] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 10-28 20:18:13 llm_engine.py:219] Initializing an LLM engine (v0.6.0) with config: model='daryl149/llama-2-7b-chat-hf', speculative_config=None, tokenizer='daryl149/llama-2-7b-chat-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=daryl149/llama-2-7b-chat-hf, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Allocating context_lens - Mem: 0.0
Allocating block table - Mem: 0.001048576
Allocating head bias - Mem: 0.13526732800000002
INFO 10-28 20:18:14 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:14 selector.py:116] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-28 20:18:15 model_runner.py:964] Starting to load model daryl149/llama-2-7b-chat-hf...
INFO 10-28 20:18:15 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-28 20:18:15 selector.py:116] Using XFormers backend.
INFO 10-28 20:18:15 weight_utils.py:236] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
/workspace/vllm-kvcompress-main/vllm/model_executor/model_loader/weight_utils.py:416: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 50% Completed | 1/2 [00:06<00:06, 6.06s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 3.74s/it]
Loading pt checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 4.08s/it]
INFO 10-28 20:18:23 model_runner.py:975] Loading model weights took 12.5518 GB
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 185, in <module>
[rank0]: main(args)
[rank0]: File "/workspace/vllm-kvcompress-main/experiments/run_longbench.py", line 63, in main
[rank0]: model = LLM(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/entrypoints/llm.py", line 177, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 584, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 359, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/engine/llm_engine.py", line 494, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks(kv_metrics))
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/executor/gpu_executor.py", line 122, in determine_num_available_blocks
[rank0]: return self.driver_worker.determine_num_available_blocks(kv_metrics)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/worker.py", line 237, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1175, in profile_run
[rank0]: model_input = self.prepare_model_input(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1430, in prepare_model_input
[rank0]: model_input = self._prepare_model_input_tensors(
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 1091, in _prepare_model_input_tensors
[rank0]: return builder.build() # type: ignore
[rank0]: File "/workspace/vllm-kvcompress-main/vllm/worker/model_runner.py", line 784, in build
[rank0]: attn_metadata = self.attn_metadata_builder.build(
[rank0]: TypeError: CommonMetadataBuilder.build() missing 1 required positional argument: 'block_state'
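My guess, which I have not verified against the repository, is that the caller in model_runner.py and the build() method it reaches disagree about the method's signature. The sketch below reproduces the same class of error in isolation and lists the parameters a method requires. The class MetadataBuilder, its parameter names other than block_state, and the helper required_positional are hypothetical stand-ins for illustration, not the actual vLLM or KV-Compress code.

```python
import inspect


class MetadataBuilder:
    """Hypothetical stand-in; NOT the real CommonMetadataBuilder."""

    def build(self, seq_lens, query_lens, block_state):
        # A newer signature that requires an extra 'block_state' argument.
        return {
            "seq_lens": seq_lens,
            "query_lens": query_lens,
            "block_state": block_state,
        }


def required_positional(func):
    """Return names of positional parameters without defaults, excluding self."""
    params = inspect.signature(func).parameters.values()
    return [
        p.name
        for p in params
        if p.name != "self"
        and p.default is inspect.Parameter.empty
        and p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    ]


builder = MetadataBuilder()
error_message = None
try:
    # A caller written against an older signature omits 'block_state'.
    builder.build([4], [4])
except TypeError as exc:
    error_message = str(exc)

# Inspecting the method shows which arguments the caller must supply.
required = required_positional(MetadataBuilder.build)
print(error_message)
print(required)
```

Running the same kind of inspection on the real builder class selected by the attention backend might show whether its build() expects block_state while the call site at model_runner.py line 784 does not pass it.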
Could you help me fix this?