log

/usr/local/lib/python3.11/dist-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/usr/local/lib/python3.11/dist-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO:server:Set CUDA_VISIBLE_DEVICES to 0
INFO:server:http://0.0.0.0:30000, ports: PortArgs(tokenizer_port=10000, router_port=10001, detokenizer_port=10002, nccl_port=10003, migrate_port=10004, model_rpc_ports=[10005, 10006, 10007])
INFO:model_rpc:Use sleep forwarding: False
INFO:model_rpc:schedule_heuristic: fcfs-s
INFO:model_runner:Rank 0: load weight begin.
INFO:model_runner:Rank 0: load weight end.
INFO:model_runner:kv one token size: 32 * 128 * 32 * 2 * 2 = 524288 bytes
INFO:model_runner:kv one token size: 32 * 128 * 32 * 2 * 2 = 524288 bytes
INFO:model_runner:total_cpu_memory_GB : 197.07050323486328, max_total_num_token : 7806, max_cpu_num_token : 283959
INFO:model_rpc:Rank 0: max_total_num_token=7806, max_prefill_num_token=33768, context_len=33768, 
INFO:model_rpc:server_args: enable_flashinfer=True, attention_reduce_in_fp32=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_disk_cache=False, 
INFO:sglang.srt.managers.router.radix_cache:using RadixCache
INFO:     Started server process [5172]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
INFO:model_rpc:Cache flushed successfully!
INFO:model_rpc:GPU 0: decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.3921 -> 0.4421
INFO:model_rpc:GPU 0: decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.4407 -> 0.4907
INFO:model_rpc:GPU 0: decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.4902 -> 0.5402
INFO:model_rpc:GPU 0: decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.5321 -> 0.5821
INFO:model_rpc:GPU 0: decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.5813 -> 0.6313
INFO:model_rpc:GPU 0: decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.6294 -> 0.6794
INFO:model_rpc:GPU 0: decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.6614 -> 0.7114
INFO:sglang.srt.managers.router.radix_cache:len(self.cnt_time): 0
You pressed Ctrl+C! Shutting down all remote servers...
INFO:     Shutting down
model /hy-tmp/ loaded.
len(self.cnt_time): 0
You pressed Ctrl+C! Shutting down all remote servers...
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [5172]
Server is on port 30000 on host 0.0.0.0 on pid 5462
You pressed Ctrl+C! Shutting down all remote servers...
Loading runtimes at ['http://0.0.0.0:30000/generate']
You pressed Ctrl+C! Shutting down all remote servers...
You pressed Ctrl+C! Shutting down all remote servers...