These instructions cover the usage of the latest stable release of SHARK. For a more bleeding-edge release, please install the [nightly releases](nightly_releases.md).

> [!TIP]
> Please note that while we prepare the next stable release, you should use the [nightly releases](nightly_releases.md).

## Prerequisites

Our current user guide requires that you have:

```
python -m shortfin_apps.sd.server --help
```

## Getting started

As part of our current release we support serving [SDXL](https://stablediffusionxl.com/) and [Llama 3.1](https://ai.meta.com/blog/meta-llama-3-1/) variants, as well as an initial release of `sharktank`, SHARK's model development toolkit, which is used to compile these models for high performance.

### SDXL

To get started with SDXL, please follow the [SDXL User Guide](../shortfin/python/shortfin_apps/sd/README.md#Start-SDXL-Server).
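
The linked guide walks you through starting the SDXL server and generating images with the bundled client. As a rough illustration of that flow (the flag values below are examples taken from the server's `--help` flag list; pick the ones that match your hardware and follow the guide for the full walkthrough):

```
# Start the SDXL server (amdgpu / gfx942 are the supported device and target in this release)
python -m shortfin_apps.sd.server --device amdgpu --target gfx942 --build_preference precompiled

# In a second terminal, generate images interactively with the bundled client
python -m shortfin_apps.sd.simple_client --interactive
```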

### Llama 3.1

To get started with Llama 3.1, please follow the [Llama User Guide](shortfin/llm/user/llama_serving.md).

* Once you've set up the Llama server using the guide above, we recommend pairing it with the [SGLang frontend](https://sgl-project.github.io/frontend/frontend.html) by following the [Using `shortfin` with `sglang`](shortfin/llm/user/shortfin_with_sglang_frontend_language.md) guide.
* If you would like to deploy Llama on a Kubernetes cluster, we also provide a simple set of instructions and a deployment configuration [here](shortfin/llm/user/llama_serving_on_kubernetes.md).
* Finally, if you'd like to run the instructions above against a different Llama 3.1 variant, that is supported; however, you will need to generate a GGUF dataset for that variant. To do this, use [Hugging Face](https://huggingface.co/)'s [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) together with [llama.cpp](https://github.com/ggerganov/llama.cpp)'s `convert_hf_to_gguf.py`, as sketched below. In future releases, we plan to streamline these instructions to make it easier to compile your own models from Hugging Face.
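
A minimal sketch of that conversion workflow is below. The model ID, directory names, and output file are placeholders, and the exact `convert_hf_to_gguf.py` options may differ depending on your llama.cpp checkout:

```
# Download a Llama 3.1 variant from Hugging Face (gated models require `huggingface-cli login`)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./llama-3.1-8b-instruct

# Fetch llama.cpp and the dependencies of its conversion script
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt

# Convert the Hugging Face checkpoint to a GGUF file
python llama.cpp/convert_hf_to_gguf.py ./llama-3.1-8b-instruct \
  --outfile llama-3.1-8b-instruct-f16.gguf --outtype f16
```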