-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Local LLM! #160
Comments
Some notes on tech. Quantization. you'll see things like Parameters. 3b, 7b, 13b, etc. You only see good results 13b and up. Only in the 65b range are we competing with GPT-3.5. No models yet push GPT4, so GPT3 is the current benchmark. Take note of that! If you're ok on the OpenAI privacy stuff, you should just do it - it's cheaper, easier, and significantly higher quality. Ok, so if 65b is ideal quality, why go as low as 7b? It's because 3-13 is the range which can run on consumer hardware. Or even cloud hardware, if not parallelized across multiple machines. 13b is something of a consumer-hardware upper limit, 3b is too strong a quality sacrifice, so 7b is the sweet spot. So: unless someone knows better with more recent models, |
I gave koboldcpp a spin, it seems pretty role-play centric. It might be worth looking into KoboldAI (not sure how it's different from koboldcpp, except maybe it's a more one-size solution?). I tried text-generation-webui - it looks really promising. You download a model (eg ggml models from TheBloke) into a folder, choose the model and select the prompt/instruct template from a dropdown (each model has different templates, like Anyway, one thing I love about text-generation-webui is it has options for exposing the service as an API, including via ngrok - all through the UI. It also has installers (eg Windows installer), so it's dirt simple for users to setup for their own usage. If anyone has more ideas on models or tools (to host as API), LMK |
It looks like Microsoft has partnered with Meta for Llama 2 deployment on Windows & Azure. My thinking is we can get Llama 2 70b on an Azure deployment for most users (premium), so they can choose between Llama vs OpenAI. And free users can proxy to localhost for a WSL2-running local Llama 2 instance (or however Windows enables Llama, I haven't looked into it). This could be a very awesome answer to this ticket, if I'm understanding it correctly. As for OpenAI models themselves. I just did some digging, and it looks like OpenAI only keeps data submitted for 30 days, and does not use the data to fine-tune. So if 30days of retention is comfortable as long as the data isn't otherwise used, then the concerns here are less than their prior data policy. I should express this more clearly somewhere on the site. HOWEVER, even better - Azure's GPT4 policies indicate NO sort of retention, usage, or anything around submitted data. This could be a significant improvement if I'm understanding it correctly. Would love someone's thoughts if I haven't updated here before. |
https://github.com/ocdevel/gnothi/tree/llama2 has Llama2 7b quantized running on Lambda. But ctransformers which runs TheBloke's quantized version depends on GLIBC_2.29, which isn't available in Amazon Linux 2 https://github.com/marella/ctransformers/issues?q=glibc. I tried a custom Dockerfile for the Lambda function extending Amazon Linux 2023, but it's too hard to get a custom Dockerfile to behave like Lambda (passing Incidentally this also rules out ctransformers for SageMaker, since SM uses Amazon Linux 2 as well! So our best bet is AWS Batch with an Alpine / Debian / AL2023 container. Which is fine, I've been needing to get out of Lambda-land for ML inference anyway. I'm gonna table this for now though, and I'm really hoping AWS launches Titan soon enough. |
[Edit 7/20/23]: Let's use Llama 2. AWS / Azure might have hosted versions too, so no local needed.
If there's any ticket I need engagement from the community, it's this one. Adding the ability for users to use their own locally-hosted Large Language Model like Llama, Vicuna, etc.
Current state: OpenAI
Going Premium unlocks OpenAI for (1) summaries & themes; (2) prompt. Summaries & themes without OpenAI are lower accuracy, based on pre-trained huggingface models less sophistication. But they shine for users who prefer isolated compute vs OpenAI. So it's a sort of trade: quality vs transparency. Prompt, on the other hand, is binary - you only get it if using Premium. One reason we charge for OpenAI usage is API costs, obviously. But another reason is the manual step someone goes through to put the site into OpenAI-mode, and there's messaging around what's about to happen, OpenAI's T&S, etc. So it's a gate. FWIW, we're looking into AWS's Titan, and Google's LLM API. AWS is disinterested in user data, so we'd prefer them - but haven't heard a peep for a while on Titan. Google on the other hand, while they have a history with privacy, they also have a history fixing said issues; eg with GDPR. It's all new territory, and we're keeping on top of it.
I want to note: @BirdieLady and I use and trust OpenAI (Prompt) with our journals. If you want my opinion, trust it. But also follow your heart. We find this feature almost as valuable as the whole site itself. It veritably doubled our value of Gnothi, with one single feature. There has been incredible personal depth exploration, new insights, and even actionable take-aways which have bettered our lives by using this feature. So it would be a travesty for it not to be used / explored due to the selection of backend API. So:
Ideal state: OpenAI or BYO LLM
For users who don't want to use OpenAI, but do want Prompt and the added quality of LLM summaries/themes; add the option to use one's own LLM. This would require:
We'll need to create a Wiki for setting all this up; for a list of recommended models, which models to avoid for wellness purposes, etc.
Hosting tools
Most of the action is happening here https://www.reddit.com/r/LocalLLaMA/. This user recommended exploring Kobold.cpp as a consistent/simple API setup on the users' machine:
Alternatives include llama.cpp and text-generation-web-ui. I've personally played with llama.cpp, it wasn't the level of simplicity I'd like to expect of our BYO-model users. There's also a project web-llm which would run the model in the user's browser, but that would require a very strong machine, and I fear it could more easily be misunderstood for its resource requirements (since it's so easy to "turn on") than if someone setup a BYO box at home. If anyone has experience / opinions here, please chime in!
Models
TL:DR: pick something from HF Leaderboard. Keep an eye on Meta's Llama v2.
As for the models themselves. Firstly, our best bet is anything quantized by TheBloke. His various GGML models take the original LLM and "minify" it via quantization to the point where a 7B or 13B parameter model, previously requiring cloud GPUs, can be run on local hardware. We should really be targeting 7B models, as that's the sweet-spot around 16-32gb RAM requirements; 13B models get very taxing.
When I left of my exploration (April 2023), I was poking around the list below. However, ignore this list and instead go to HF Leaderboard. That's constantly changing with SOTA. Also, Meta, whose first model Llama sparked the revolution but which has licensing constraints, will release a v2 which will likely clobber the leaderboard (and without licensing issues). My old list:
T5-Flan. Simple, terse. It's not a chat-bot, more a 1-word answerer. Very easily fine-tuned (trained), unlike the others (trains well with few samples, trains fast). See sample training, prompts, model, model docsBackground info:
Tasks
Consider web-llmNo-go, too intense requirement in-browser for little payoff. Good for fiddling, not for this.The text was updated successfully, but these errors were encountered: