Skip to content

Commit

Permalink
Updating _toc.yml
Browse files Browse the repository at this point in the history
  • Loading branch information
dmatekenya committed Nov 7, 2024
1 parent cfa0b8b commit fea8c84
Show file tree
Hide file tree
Showing 25 changed files with 308 additions and 30 deletions.
13 changes: 13 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@

all: build serve

build:
jupyter-book build . --config docs/_config.yml --toc docs/_toc.yml

serve:
echo "Enter Serve" && \
BASE_DIR="$(shell pwd)" && \
echo "Base Directory: $$BASE_DIR" && \
FULL_PATH="$$BASE_DIR/_build/html/index.html" && \
echo "Full Path: $$FULL_PATH" && \
start chrome "$$FULL_PATH"
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ This course has been delivered in different formats to cater to various audience
## Repository Structure and Contents
This repository serves as the primary resource for accessing course content, including slides, Python programming labs, example applications using LLMs, and additional materials to support learning about Generative AI and building applications with LLMs. For easy navigation, use the link and contents outlined below.

### Contents
## Contents

```{tableofcontents}
```
Expand Down
53 changes: 29 additions & 24 deletions docs/_toc.yml
Original file line number Diff line number Diff line change
Expand Up @@ -7,35 +7,40 @@ parts:
- file: docs/course-requirements/python-environment
- file: docs/course-requirements/data-science
- file: docs/course-requirements/platforms

- caption: Tunisia, May 2024
chapters:
- file: docs/tunisia-may-24/README
- file: docs/tunisia-may-24/module-1
- file: docs/tunisia-may-24/module-2
- file: docs/tunisia-may-24/module-3
- file: docs/tunisia-may-24/module-4
- file: docs/tunisia-may-24/project-ideas
- file: notebooks/tunisia-may-24/tunisia
- file: docs/tunisia-may-24/tunisia
- file: docs/tunisia-may-24/tun-modules
sections:
- file: notebooks/tunisia-may-24/1-text2sqL-demo.ipynb
- file: notebooks/tunisia-may-24/2-document-classification-with-sklearn.ipynb
- file: notebooks/tunisia-may-24/3-intro-langchain.ipynb

- caption: Malawi, Upcoming, November 2024
- file: docs/tunisia-may-24/tun-module-1
- file: docs/tunisia-may-24/tun-module-2
- file: docs/tunisia-may-24/tun-module-3
- file: docs/tunisia-may-24/tun-module-4
- file: docs/tunisia-may-24/tun-project
sections:
- file: docs/tunisia-may-24/tun-project-ideas
- file: docs/tunisia-may-24/streamlit-app-deployment
- file: notebooks/tunisia-may-24/tun-notebooks-intro
sections:
- file: notebooks/tunisia-may-24/text2sql-demo.ipynb
- file: notebooks/tunisia-may-24/document-classification-with-sklearn.ipynb
- file: notebooks/tunisia-may-24/intro-langchain.ipynb
- caption: Malawi, Upcoming (Dec 24)
chapters:
- file: docs/malawi-nov-24/README
- file: docs/malawi-nov-24/module-1
- file: docs/malawi-nov-24/module-2
- file: docs/malawi-nov-24/module-3
- file: docs/malawi-nov-24/module-4
- file: docs/malawi-nov-24/project-ideas
- file: notebooks/malawi-nov-24/malawi
- file: docs/malawi-nov-24/malawi
- file: docs/malawi-nov-24/mw-modules
sections:
- file: docs/malawi-nov-24/mw-module-1
- file: docs/malawi-nov-24/mw-module-2
- file: docs/malawi-nov-24/mw-module-3
- file: docs/malawi-nov-24/mw-module-4
- file: docs/malawi-nov-24/mw-project.md
sections:
- file: docs/malawi-nov-24/mw-project-ideas
- file: docs/malawi-nov-24/mw-streamlit-app
- file: notebooks/malawi-nov-24/mw-notebooks-intro
sections:
- file: notebooks/malawi-nov-24/1-text2sqL-demo.ipynb
- file: notebooks/malawi-nov-24/2-document-classification-with-sklearn.ipynb
- file: notebooks/malawi-nov-24/3-intro-langchain.ipynb

- caption: Acknowledgements
chapters:
- file: docs/team

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions docs/malawi-nov-24/mw-modules.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# Topics Covered
File renamed without changes.
1 change: 1 addition & 0 deletions docs/malawi-nov-24/mw-project.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# Min Project Instructions
55 changes: 55 additions & 0 deletions docs/malawi-nov-24/mw-streamlit-app.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
# Deploying a Chatbot on Streamlit
In this activity, you will use the knowledge gained from the LangChain Tutorial to explore a chatbot deployed on Streamlit. You will deploy this app on your computer and interact with it.

## About Streamlit

As discussed in the lectures, Streamlit is a platform that enables data scientists to deploy dynamic, data-based apps. It’s ideal for prototyping demonstration apps and sharing them with stakeholders before full-scale production deployment.

## Initial Setup and Getting the Chatbot Files

1. **Get OpenAI and Hugging Face API Credentials**
The chatbot uses OpenAI models, so you’ll need to sign up for an OpenAI developer account and obtain an API key. For a step-by-step guide on creating an OpenAI API key, search for instructions on ChatGPT. Similarly, create a Hugging Face account and obtain an API token.

2. **Try the Chatbot on Streamlit Community Cloud**
Before downloading anything, you can try the chatbot on the Streamlit Community Cloud with just the OpenAI and Hugging Face keys.

3. **Download or Clone the Project Repository**
To get the project files on your computer, either clone the GitHub repository (if familiar with Git) or download the repository as a zipped file.

## Deploying the Streamlit App Locally

1. **Unzip and Navigate to the Project Folder**
Once unzipped, open the project folder and follow the instructions on the GitHub page to deploy the chatbot.

2. **Follow steps on GitHub project repository**. [Streamlit app repo](https://github.com/worldbank/RAG-Based-ChatBot-Example)


3. **Install Required Packages**
The `requirements.txt` file contains a list of all required packages. If you encounter a missing package error, try installing the package again (ensuring your virtual environment is activated).

4. **Run the App Locally**
Run the app with the following command:
```bash
streamlit run streamlit_app.py
```
5. **Test and Check**. When deployed locally, you can browse the files being used in the app.

## Explore Important Scripts

The essential components for building a chatbot with LangChain are organized into distinct, modular Python scripts. Let’s explore some of these elements. You can use VS Code or your preferred text editor for this task.

### Loading Files
In real-life applications, you may need to load hundreds of documents, requiring a versatile function for file loading. This project includes two types of loaders:
- **`remote_loader.py`**: For loading documents from websites.
- **`local_loader.py`**: For loading documents from the local `data` folder.

### Document Splitting
The `splitter.py` module uses the `RecursiveCharacterTextSplitter` strategy, with a chunk size of 1000 and an overlap of 0. This method helps in breaking down large documents into manageable sections for processing.

### Prompt Chains
In the `full_chain.py`, `base_chain.py`, and `rag_chain.py` modules, you’ll find configurations for the specific LLM models and prompting strategies used. The project utilizes OpenAI chat models, with customized chains designed to guide interactions effectively.

### Memory Management
Memory management strategies are also implemented to optimize the chatbot’s performance, particularly for long interactions or when processing large datasets.


35 changes: 35 additions & 0 deletions docs/tunisia-may-24/tun-module-1.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Module 1: AI Foundations

### Module Objectives
The goal of this module is to introduce learners to the fields of machine learning and deep learning. By the end of this module, learners should understand how predictive models are built in Python and be able to distinguish between simple machine learning models, such as linear regression, and deep learning models. Learners will also gain an appreciation for how data is used to build ML models, the process of developing ML models and deploying them to production, and the infrastructure required to support ML systems.

### Module Topics
- **Machine Learning (ML) and Neural Networks**
- Problem formulation and techniques: Regression, Nearest Neighbors, Tree-Based Models, Clustering, Principal Component Analysis.
- **Major ML Application Areas**
- Natural Language Processing (NLP), Computer Vision, Recommender Systems.
- **Platforms for Building ML Models**
- Python for ML and Data Science.
- **Machine Learning vs. Statistics**
- Similarities and Differences.
- **Tools and Platforms**
- Python, scikit-learn, PyTorch, and cloud-based platforms.
- **Building ML Systems**
- Data preparation, model training and evaluation, model deployment, and serving.

### ML Use Cases

### Practical Labs
- **Traditional ML**
- Build a predictive model to replace/impute missing data.
- Build a predictive model for predicting poverty from LSM data.
- **Deep Learning**
- Build a simple computer vision model.
- **Deep Learning-NLP**: Build a document classification system.

### Case Studies
- **[World Bank]** Small area estimation of poverty.
- **[World Bank]** Object detection from high-resolution satellite imagery.

### Assessment
- To be determined (TBD).
58 changes: 58 additions & 0 deletions docs/tunisia-may-24/tun-module-2.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# Module 2: Introduction to Generative AI and LLMs

### Module Objectives
This module provides foundational knowledge on Large Language Models (LLMs), covering key concepts such as pretraining, foundational models, and adapting LLMs through fine-tuning. Additionally, the module introduces various open-source and proprietary LLMs currently available on the market.

### Module Topics
- **Introducing Generative AI**
- What is Generative AI?
- How Gen AI differs from Predictive AI.
- Brief history of Gen AI.
- Capabilities of Gen AI and major use cases.
- Different categories of Gen AI (LLMs, image generators, video generators).

- **Understanding Large Language Models (LLMs)**
- Overview and history of language models—LLMs vs. SLMs.
- Categories of LLMs: Foundation models and other concepts.
- Building LLMs: Transformer architecture and sequence-to-sequence architectures.
- Overview of common LLMs: OpenAI models, Mistral AI, Llama, Gemini, and others.
- Adapting and customizing LLMs: fine-tuning, pre-training, RLHF.

- **Building and Evaluating LLM Apps**
- Key concepts for LLM apps: prompt engineering, prompt-tuning, vector embeddings, RAG.
- Ecosystem of commercial and open-source tools for building LLM apps (e.g., LangChain).
- Customizing LLMs for specific use cases: prompt engineering, RAG, fine-tuning, RLHF.
- Selecting and evaluating LLMs and LLM apps.

- **Deploying LLM Apps with LangChain**
- Overview of LangChain features and capabilities.
- Preprocessing and loading data in LangChain.
- Working with different LangChain agents (e.g., SQL).
- Deploying LangChain applications (e.g., with Streamlit, WhatsApp, and web apps).
- Evaluating LangChain apps.

### Practical Labs

- **Lab 1: Demonstration of Building an LLM App with Commercial Tools (OpenAI)**
Since participants won’t have access to a paid OpenAI subscription, the instructor will demonstrate available capabilities for building LLM apps. The lab will include:
- Exploring OpenAI features using the ChatGPT GUI (paid version), showing functionalities and how to create assistants.
- Demonstrating a simple RAG-based chatbot using the OpenAI playground.
- Demonstrating a simple RAG-based chatbot using the OpenAI API.

- **Lab 2: Building LLM Apps Using LangChain (RAG-Based Chatbot)**
Participants will use provided documents to build a RAG-based app with an open-source LLM to query the documents. The output will be a Streamlit app for sharing. Tasks include:
- Setting up the development environment and installing required packages.
- Preparing source data (e.g., health documents).
- Setting up a vector database.
- Preprocessing documents and loading them into the vector database.
- Integrating with an LLM (including selecting the LLM).
- Developing the user interface in Streamlit.
- Deploying and testing the app.

### Case Study
- **Agricultural Information Q&A System in Malawi**
A RAG-based chatbot deployed in Malawi answers questions from Agricultural Extension workers. This example app uses ChatGPT (OpenAI) integrated with agricultural documents from Malawi and is deployed on WhatsApp.

### Assessment
- **Build an LLM App with LangChain**
Participants will receive a notebook to create an app that answers questions based on their selected website. Additionally, a quiz with five multiple-choice questions will be administered.
39 changes: 39 additions & 0 deletions docs/tunisia-may-24/tun-module-3.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
# Module 3: Gen AI and LLM Applications in Statistics

### Module Objectives
This module explores both current and potential applications of Gen AI and LLMs in the field of statistics. Covering the entire statistical life cycle—from data collection and processing to analysis and dissemination—we examine how LLMs can enhance each stage. For instance, Ask a Question (AAQ) platforms can interpret and respond to natural language queries, providing relevant statistical information. Learners will be introduced to tools for creating accessible platforms, like a WhatsApp bot, that can answer questions using statistical data as its knowledge base.

### Module Topics
- **Qualitative and Multi-Modal Data Analysis with LLMs**
- **Advanced Image Analysis in Statistical Data Collection**
- **Text Data Analysis with LLMs**
- Applications like sentiment analysis, parsing web-scraped price data, analyzing qualitative research data, and more.
- **Audio Data Processing and Analysis with Speech Models**
- For example, processing data from focus group discussions (FGDs) or interview data.
- **LLM Applications in Data Dissemination**
- LLMs in data discovery: Semantic search vs. keyword search.
- Enhancing and automating metadata generation with LLMs.
- Statbots: Chatbots that can respond to statistical queries.

- **Concepts in LLM Statbots**
- LLMs' quantitative reasoning abilities and capacity to work with tabular data.
- Strategies for connecting an LLM to statistical data: Text2SQL, Text2API, Text2Code, and more.
- Tools for parsing and working with tabular documents (e.g., DocumentLLM, LangChain SQL agent).
- Security considerations for Text2SQL and database connections.

- **Building a Statbot**
- LLM selection guide.
- Tool selection.
- Deploying statbots on platforms like WhatsApp, websites, and more.

### Practical Labs

- **Lab 1: Building a Health Statbot**
Participants will use a set of provided documents to build a RAG-based app using an open-source LLM to query the data. The lab will result in a Streamlit app that can be shared.

### Case Study
- **Accessing Databases with LLMs**
(Details TBD)

### Assessment
- To be determined (TBD).
26 changes: 26 additions & 0 deletions docs/tunisia-may-24/tun-module-4.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Module 5: Case Studies and Project Work

### Module Objectives
This module focuses on real-world use cases, guiding participants through practical applications of LLM technology.

### Module Topics
- **Implementing LLM Apps**
- Differences between open-source and proprietary models.
- Approaches such as fine-tuning vs. RAG.
- Platform selection and performance evaluation.
- **Major Use Cases in Data Applications**
- Existing use cases (TBD) and potential applications in low-income regions.
- Implementation challenges.
- **Capstone Project**
- Hands-on project work where participants apply the concepts learned throughout the course.

### Practical Labs
- **Building an LLM Project**
Participants will work on an LLM project designed to apply the knowledge gained in a data-centric use case.

### Case Study
- **Building an LLM Project for Data Applications**
(Details TBD)

### Assessment
- **LLM Project Work**
1 change: 1 addition & 0 deletions docs/tunisia-may-24/tun-modules.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# Topics Covered
44 changes: 44 additions & 0 deletions docs/tunisia-may-24/tun-project-ideas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# AI and LLM Project Project Ideas

In this document, we will guide you through three key considerations for implementing your project in this course: choosing a project, defining acceptable outputs, and understanding the project selection process.

## Recommended Projects

This section addresses the question, "What project can I do?" Based on the course content, we provide recommended projects, but you are encouraged to explore other ideas.

### QA Chatbots
One common use case for generative AI is creating conversational systems like chatbots that can answer questions on specific topics. While chat-GPT and other models can handle general questions, they lack access to custom organizational data. For example, in the health domain, you may want to create a chatbot that answers questions on public health issues in your country. By using LLMs, you can create custom chatbots with access to specialized documents or websites for local knowledge.

> Note: For QA chatbots, we will focus on those that respond to textual questions rather than numeric or data-intensive information.
### Statsbots
Similar to the QA chatbot, a Statsbot is designed to answer quantitative questions. LLMs traditionally struggle with numeric data, so specialized tools are necessary for chatbots that work with tabular data and provide accurate, data-driven answers.

### Miscellaneous Document Analysis
LLMs are highly effective for analyzing documents, classifying them, and performing various NLP tasks. Examples of document analysis projects include:
- Sentiment Analysis
- Topic Classification
- Intent Classification
- Named Entity Recognition (NER)
- Document Type Classification
- Key Phrase Extraction
- Toxicity and Hate Speech Detection

## Guidelines for Choosing a Project

Select your project thoughtfully, given the limited time available. Here are factors to consider:

- **Data Availability**: Ensure that the necessary data or documents are accessible for the project.
- **Skills and Knowledge**: Assess the required platforms or tools and confirm that team members are willing to learn and work with them.
- **Effort**: Be realistic about the project's scope. Certain tasks, like fine-tuning an LLM, may require additional time and resources.
- **Cost**: Some LLM platforms and tools may require subscriptions or fees. For example, using the chat-GPT API requires a developer account with sufficient funds. While paid platforms are sometimes necessary, ensure that you understand the associated requirements.

## Permissible Project Outputs

We recommend including three key components as project outputs:

1. **User Interface**
Implementing LLMs often involves facilitating user interaction with documents, data, or other elements. For a more user-friendly experience, we suggest creating a user interface, such as a web-based UI, WhatsApp chatbot, or command-line tool.

2. **Documentation on GitHub**
As this is a technical project, you’ll write substantial code. Using a version control system like GitHub is recommended to track yo
1 change: 1 addition & 0 deletions docs/tunisia-may-24/tun-project.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
# Min Project Instructions
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Generative AI and LLMs for Data Literacy

The first iteration of this course was delivered in Tunis, Tunisia, from May 27 to May 31, as part of the Data in Health Program organized by the World Bank Group and the African Development Bank.
# Session Overview
The first iteration of this course was delivered in Tunis, Tunisia, from May 27 to May 31, as part of the Data in Health Program organized by the World Bank Group and the African Development Bank.
> Course Title: Generative AI and LLMs for Data Literacy
## Session Details

Expand Down
File renamed without changes.
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
# Programming Activities for the Course

# Programming Activities
This document outlines the programming activities for the course, focusing on hands-on projects to apply the concepts learned. The document is organized into two main sections: an introduction to LLM capabilities and LangChain, followed by a practical exercise on deploying a chatbot with Streamlit.

## LLM Foundations-understanding the ML Process
Expand Down

0 comments on commit fea8c84

Please sign in to comment.