
Commit 9664332: Fix comments 3

1 parent 589d9ee commit 9664332

2 files changed: +11 -6 lines


Diff for: samples/python/text_generation/limit_checker.py (+6 -5)
@@ -43,7 +43,7 @@ def retry_request(func, retries=5):
         "ServiceUnavailable",
         "InternalServerError"
     ]
-
+
     for attempt in range(retries):
         try:
             return func()
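The hunk above touches only whitespace inside `retry_request`, which retries a callable when the error looks transient. A minimal sketch of that pattern, assuming a name-matching policy against the `"ServiceUnavailable"`/`"InternalServerError"` list shown in the context lines (the `delay` parameter and `TransientError`-style matching by class name are illustrative assumptions, not the sample's exact logic):

```python
import time

# Exception class names treated as transient, per the list visible in the diff.
TRANSIENT_ERRORS = [
    "ServiceUnavailable",
    "InternalServerError",
]

def retry_request(func, retries=5, delay=0.0):
    """Call func(), retrying up to `retries` times on transient errors.

    `delay` is a hypothetical backoff knob added for illustration.
    """
    last_exc = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:
            # Treat the error as transient only if its class name is listed;
            # anything else propagates immediately.
            if type(exc).__name__ not in TRANSIENT_ERRORS:
                raise
            last_exc = exc
            time.sleep(delay)
    # All attempts failed with transient errors: surface the last one.
    raise last_exc
```

A non-transient exception is re-raised on the first attempt, so only the listed server-side failures consume retries.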
@@ -126,10 +126,10 @@ def run_and_write_metrics(model, prompt, generation_config, report_file):
     print(f"result length: {result_length}")
     print()

-    if args.report is not None:
-        with open(args.report, 'a') as f:
+    if report_file is not None:
+        with open(report_file, 'a') as f:
             csv_writer = csv.writer(f)
-            csv_writer.writerow([generation_length, result_length, pipeline_opt_metrics.avg_cache_usage, pipeline_opt_metrics.max_cache_usage, rss_usage_gb])
+            csv_writer.writerow([generation_config.max_new_tokens - 1, result_length, pipeline_opt_metrics.avg_cache_usage, pipeline_opt_metrics.max_cache_usage, rss_usage_gb])
     return pipeline_opt_metrics.max_cache_usage


@@ -194,6 +194,8 @@ def run_and_write_metrics(model, prompt, generation_config, report_file):
             break

         generation_length *= 2
+
+        del data_dict
     elif args.mode == "gen_throughput":
         dataset = load_samsum_dataset(args.data)
         prompt_throughput = 1
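The loop this hunk edits doubles `generation_length` each iteration until a limit check trips, which approximates the largest workload that fits. A sketch of that doubling search under stated assumptions (`fits` is a hypothetical predicate standing in for the sample's cache/memory checks; it is not a function from the sample):

```python
def find_limit(fits, start=1):
    """Double the workload until `fits` fails; return the last passing size.

    Returns None if even `start` does not fit.
    """
    length = start
    last_ok = None
    while fits(length):
        last_ok = length
        length *= 2
    return last_ok
```

The added `del data_dict` inside the loop (moved here from the end of the script in the second hunk below) releases the per-iteration dataset before the next, larger attempt allocates its own.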
@@ -236,5 +238,4 @@ def run_and_write_metrics(model, prompt, generation_config, report_file):
 
 
     print(f"Approximate highest throughput: {prompt_throughput} prompts")
-    del data_dict
 

Diff for: site/docs/concepts/optimization-techniques/kvcache-eviction-algorithm.md (+5 -1)
@@ -15,7 +15,7 @@ The KV cache for each sequence is divided into three logical areas:
 
 * Start Area: Initial tokens that are never evicted
 * Evictable Area: Tokens that can be evicted based on importance scores
-* Recent Area: Most recent tokens that are preserved (never evicted)
+* Recent Area: Most recent tokens that are preserved (not evicted while in this area, but naturally migrating toward the evictable area as the text generation goes on)
 
 The sizes of all three areas can be configured by modifying corresponding fields in a `CacheEvictionConfig` struct, which itself is a part of the pipeline-wide `SchedulerConfig`.
 As the generation starts, the blocks in respective logical areas are filled token-by-token, and once at least one block past the "recent" area is filled, eviction may take place.
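The three-area layout in the doc text above can be modeled with a small stand-in structure. This is a sketch only: the dataclass below is hypothetical and its field names follow the prose, not the actual `CacheEvictionConfig` fields in OpenVINO GenAI:

```python
from dataclasses import dataclass

@dataclass
class CacheEvictionConfigSketch:
    """Hypothetical model of the three logical KV-cache areas, in tokens."""
    start_size: int       # initial tokens, never evicted
    evictable_size: int   # tokens scored and evicted by importance
    recent_size: int      # newest tokens, preserved while in this area
    apply_rotation: bool = False  # re-rotate RoPE phases after eviction

    def max_cache_size(self) -> int:
        # Total logical KV cache budget per sequence is the sum of the areas.
        return self.start_size + self.evictable_size + self.recent_size

cfg = CacheEvictionConfigSketch(start_size=32, evictable_size=512,
                                recent_size=128)
```

As generation proceeds, tokens conceptually enter through the recent area and migrate into the evictable area, so only `evictable_size` worth of tokens is ever subject to importance-based eviction.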
@@ -55,4 +55,8 @@ This may impact the ability of the model to correctly recognize the relative pos
 Cache rotation seeks to alleviate this by "re-rotating" corresponding blocks so that the blocks that remain after each eviction are once again "continuous" in terms of the effective RoPE embedding.
 It can be enabled by setting the `CacheEvictionConfig.apply_rotation` field to `true` (default is `false`).
 
+## Current limitations
 
+* Cache rotation is only targeted for the regular, linear LLaMa-like RoPE application and may degrade accuracy on models that use other RoPE schemes.
+
+* Cache rotation is currently only supported for the models with uniform V embedding sizes across the layers.
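The "re-rotation" idea added to the doc can be illustrated on a single RoPE frequency. This is a toy, not the implementation: real cache rotation operates on per-head key embeddings across many frequencies, while the sketch below rotates one 2-D pair and assumes a linear phase `theta = position * freq`, matching the linear LLaMa-like RoPE the limitations section targets:

```python
import math

def rope(vec, pos, freq=0.1):
    """Apply a linear RoPE rotation for one frequency at position `pos`."""
    c, s = math.cos(pos * freq), math.sin(pos * freq)
    x, y = vec
    return (x * c - y * s, x * s + y * c)

def rerotate(vec, old_pos, new_pos, freq=0.1):
    """Shift an already-rotated vector from `old_pos` to `new_pos`.

    Because the phase is linear in position, undoing the old rotation and
    applying the new one collapses into a single rotation by the delta.
    """
    delta = (new_pos - old_pos) * freq
    c, s = math.cos(delta), math.sin(delta)
    x, y = vec
    return (x * c - y * s, x * s + y * c)
```

After eviction removes earlier tokens, a surviving token originally encoded at position 7 can be re-rotated to behave as position 3, restoring a continuous effective position sequence. The linearity assumption is also why non-linear RoPE schemes fall outside this approach, per the limitations above.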
