Commit
Merge branch 'main' into dima-rag
# Conflicts:
#	README.md
DmitryKey committed Dec 6, 2024
2 parents 9dcf843 + 20757aa commit 8cbfd9c
Showing 22 changed files with 14,298 additions and 7,089 deletions.
Binary file modified .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
@@ -240,3 +240,6 @@ cython_debug/
.idea/

*.pyc
.sesskey

*.html
37 changes: 23 additions & 14 deletions README.md
@@ -1,5 +1,8 @@
# Large Language Models and Generative AI for NLP

**THIS REPOSITORY WILL EVOLVE OVER THE DURATION OF THE COURSE. WE WILL ADD CONTENT AS WE GO.**


**Time:** Fall 2024 / Period 2\
**Target group:** Master's students\
**Teachers:**
@@ -17,28 +20,34 @@ This hands-on course delves into the world of Large Language Models (LLMs) and t

## Evaluation

* Weekly lab exercise submissions (25%)
* Final group project submission (75%)
The course will be evaluated based on the submission of a final report.

Students must submit a final report that covers all the labs:

* What was done in each lab?
* What was the motivation behind your solutions?
* What did you learn?
* What challenges did you encounter?

## Syllabus

| Week | Dates | Topic / Lecture | Format | Teacher |
|------|-----------|--------------------------------------------------------------------------|-------------------------------------------------------------------------------------|---------|
| 1 | 29/31.10. | [Introduction to Generative AI and Large Language Models (LLM)](week-1/) | 45 min lecture and 45 min coding lab | Aarne |
| 2 | 05/07.11. | [Using LLMs and Prompting-based approaches](week-2/) | 45 min lecture and 45 min coding lab | Aarne |
| 3 | 12/14.11. | [Evaluating LLMs](week-3/) | 45 min lecture and 45 min coding lab | Jussi |
| 4 | 9/21.11. | [Fine-tuning LLMs](week-4/) | 45 min lecture and 45 min coding lab | Aarne |
| 5 | 26/28.11. | [Retrieval Augmented Generation (RAG)](week-5/) | [45 min lecture](https://www.youtube.com/watch?v=1GtBArPD-UA) and 45 min coding lab | Dmitry |
| 6 | 03/05.12. | [Use cases and applications of LLMs](week-6/) | [45 min lecture](https://www.youtube.com/watch?v=8LkR35wNZnU) and 45 min coding lab | Dmitry |
| 7 | 10/12.12. | [Group project presentations](week-7/) | Student project presentations | Aarne |
| 1 | 29/31.10. | [Introduction to Generative AI and Large Language Models (LLM)](week-1/) | 90 min lecture and 90 min lab | Aarne |
| 2 | 05/07.11. | [Using LLMs and Prompting-based approaches](week-2/) | 90 min lecture and 90 min coding lab | Aarne |
| 3 | 12/14.11. | [Evaluating LLMs](week-3/) | 90 min lecture and 90 min coding lab | Jussi |
| 4 | 19/21.11. | [Fine-tuning LLMs](week-4/) | 90 min lecture and 90 min coding lab | Aarne |
| 5 | 26/28.11. | [Retrieval Augmented Generation (RAG)](week-5/) | [90 min lecture](https://www.youtube.com/watch?v=1GtBArPD-UA) and 90 min coding lab | Dmitry |
| 6 | 03/05.12. | [Use cases and applications of LLMs](week-6/) | [90 min lecture](https://www.youtube.com/watch?v=8LkR35wNZnU) and 90 min coding lab | Dmitry |
| 7 | 10/12.12. | [Final report preparation](week-7/) | Student work on their final report | Aarne |


### Detailed Syllabus:

**Week 1: Introduction to Generative AI and Large Language Models (LLM)**
* Overview of Generative AI and its applications in NLP
* Introduction to Large Language Models (LLMs) and their architecture
* Hands-on lab: Getting started with LLMs using popular libraries
* Lab: Learn about tokenizers

**Week 2: Using LLMs and Prompting-based approaches**
* Understanding prompt engineering and its importance in working with LLMs
@@ -65,10 +74,10 @@ This hands-on course delves into the world of Large Language Models (LLMs) and t
* Discussing the potential impact of LLMs on different industries
* Hands-on lab: query tables and generate synthetic data

**Week 7: Group project presentations**
* Students present their final group projects plan, showcasing their understanding and application of LLMs in NLP.
**Week 7: Final report preparation**
* Students work on their final reports, showcasing their understanding of the labs and the concepts learned.

**Group Project submission**
* Final projects are submitted by 31st December 2024
* Final reports are submitted by 31st December 2024

Note: This syllabus is subject to change at the discretion of the instructors. Any modifications will be communicated to the students in a timely manner.
Note: This syllabus is subject to change at the discretion of the instructors. Any modifications will be communicated to the students in a timely manner.
Binary file added week-1/LLM-Course Lecture 1.pdf
Binary file not shown.
17 changes: 16 additions & 1 deletion week-1/README.md
@@ -1 +1,16 @@
# Introduction to Generative AI and Large Language Models (LLM)
# Introduction to Generative AI and Large Language Models (LLM)

Week 1 is the most theory-heavy week of the course. You can find the lecture slides here: [Week 1 Slides](https://github.com/Helsinki-NLP/LLM-course-2024/blob/main/week-1/LLM-Course%20Lecture%201.pdf).

## Lab Assignment

Research tokenizers and write a section in your final report reflecting on the following questions:
* What are tokenizers?
* Why are they important for language modeling and LLMs?
* What different tokenization algorithms are there, which are the most popular, and why?

**Some references:**
* Neural Machine Translation of Rare Words with Subword Units:
[https://arxiv.org/abs/1508.07909](https://arxiv.org/abs/1508.07909)
* SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing:
[https://arxiv.org/abs/1808.06226](https://arxiv.org/abs/1808.06226)
Binary file added week-2/.DS_Store
Binary file not shown.
Binary file added week-2/LLM-Course Lecture 2.pdf
Binary file not shown.
35 changes: 34 additions & 1 deletion week-2/README.md
@@ -1 +1,34 @@
# Using LLMs and Prompting-based approaches
# Using LLMs and Prompting-based approaches

## Lecture

Slides can be found here: [Week 2 Slides](https://github.com/Helsinki-NLP/LLM-course-2024/blob/main/week-2/LLM-Course%20Lecture%202.pdf).


## Preparation for the lab

**Gemini API Key**
* Use your existing Google account or create a new free account
* Go to https://aistudio.google.com/apikey
* Get your API key and store it in a safe place

**Create a HuggingFace account**
* https://huggingface.co/join

**Set up Python & Jupyter environment**
* I use Python 3.12 for running code locally
* For inference or fine-tuning requiring GPU, I use Google Colab: https://colab.research.google.com/


## Lab Exercise

For this exercise, use the gemini-chatbot and/or the prompting-notebook found in the week-2 folder of this GitHub repository.
Select a domain (e.g. finance, sports, cooking), a tone of voice, a style, a persona (e.g. a pirate), and a question/task you want to accomplish (e.g. write a blog post).
* Modify the gemini-chatbot and test the different prompting approaches discussed in the lecture to achieve the task.
* Do the same for the prompting-notebook (run this in Google Colab using a T4 GPU backend).
* Write a section in your report explaining what you did and what your findings were. Which prompting approach worked best, and why?

Modify the in-context-learning notebook (you can run this locally or in Google Colab):
* Modify the prompt so the output is a table with strengths and weaknesses in separate columns. (Markdown printing should render the table correctly. If you have time, modify the HTML printing to show the updated style as a table.)
* If you have time, modify the notebook to use an open-source model from Hugging Face instead of Gemini.
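The ingredients above (domain, tone, persona, task, plus optional few-shot examples) can be sketched as a small prompt-builder helper. This is not code from the course notebooks; every name and example value here is hypothetical:

```python
def build_prompt(domain, tone, persona, task, examples=None):
    """Compose a prompt from the lab's ingredients: persona, domain, tone, task."""
    prompt = (
        f"You are {persona}, an expert in {domain}. "
        f"Answer in a {tone} tone.\n"
    )
    if examples:  # few-shot prompting: prepend worked question/answer pairs
        for q, a in examples:
            prompt += f"Q: {q}\nA: {a}\n"
    prompt += f"Task: {task}"
    return prompt

p = build_prompt(
    domain="cooking", tone="playful", persona="a pirate",
    task="write a blog post about sourdough bread",
)
print(p)
```

The resulting string can then be sent to the model (e.g. via the chatbot's `generate_content` call) to compare zero-shot and few-shot variants of the same task.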

61 changes: 61 additions & 0 deletions week-2/gemini-chatbot/basic_chatbot.py
@@ -0,0 +1,61 @@
from fasthtml.common import *
import os  # needed for os.environ below
import google.generativeai as genai
import strip_markdown
import configparser

API_KEY = os.environ.get("GEMINI_API_KEY")
genai.configure(api_key=API_KEY)
LLM = "gemini-1.5-flash"
model = genai.GenerativeModel(LLM)

# Read system prompts from config file
prompts = configparser.ConfigParser()
prompts.read('prompts.env')

# Set system prompt
#system_prompt = prompts.get("SYSTEM_PROMPTS", "IT_HELPDESK")
system_prompt = f'Summarize the following text about {prompts.get("TEMPLATES", "TOPIC")} in {prompts.get("TEMPLATES", "NUMBER")} bullet points:'

# Set up the app, including daisyui and tailwind for the chat component
hdrs = (picolink, Script(src="https://cdn.tailwindcss.com"),
Link(rel="stylesheet", href="https://cdn.jsdelivr.net/npm/[email protected]/dist/full.min.css"))
app = FastHTML(hdrs=hdrs, cls="p-4 max-w-lg mx-auto")

# Chat message component (renders a chat bubble)
def ChatMessage(msg, user):
bubble_class = "chat-bubble-primary" if user else 'chat-bubble-secondary'
chat_class = "chat-end" if user else 'chat-start'
return Div(cls=f"chat {chat_class}")(
Div('user' if user else 'assistant', cls="chat-header"),
Div(msg, cls=f"chat-bubble {bubble_class}"),
Hidden(msg, name="messages")
)

# The input field for the user message.
def ChatInput():
return Input(name='msg', id='msg-input', placeholder="Type a message",
cls="input input-bordered w-full", hx_swap_oob='true')

# The main screen
@app.get
def index():
page = Form(hx_post=send, hx_target="#chatlist", hx_swap="beforeend")(
Div(id="chatlist", cls="chat-box h-[73vh] overflow-y-auto"),
Div(cls="flex space-x-2 mt-2")(
Group(ChatInput(), Button("Send", cls="btn btn-primary"))
)
)
return Titled('Simple chatbot demo', page)

# Handle the form submission
@app.post
def send(msg:str, messages:list[str]=None):
if not messages: messages = [system_prompt]
messages.append(msg.rstrip())
r = model.generate_content(messages).text
return (ChatMessage(msg, True), # The user's message
ChatMessage(strip_markdown.strip_markdown(r.rstrip()), False), # The chatbot's response
ChatInput()) # And clear the input field

serve()
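The script reads `prompts.env` with `configparser`, so the file is expected to be INI-formatted. A hypothetical example matching the section/option names the code queries (`SYSTEM_PROMPTS`/`IT_HELPDESK`, `TEMPLATES`/`TOPIC`, `TEMPLATES`/`NUMBER`) might look like:

```ini
[SYSTEM_PROMPTS]
IT_HELPDESK = You are a friendly IT helpdesk assistant. Ask clarifying questions before proposing a fix.

[TEMPLATES]
TOPIC = machine learning
NUMBER = 3
```

With these values, the active system prompt becomes "Summarize the following text about machine learning in 3 bullet points:".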
