diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index 3aed3267..6b9a0e55 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -17,7 +17,7 @@ jobs: package_name: cookbook path_to_docs: cookbook/notebooks/ additional_args: --not_python_module - languages: en zh-CN + languages: en zh-CN ru convert_notebooks: true secrets: hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} \ No newline at end of file diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index 64aaf9fe..a1a89b02 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -20,5 +20,5 @@ jobs: package_name: cookbook path_to_docs: cookbook/notebooks/ additional_args: --not_python_module - languages: en zh-CN + languages: en zh-CN ru convert_notebooks: true \ No newline at end of file diff --git a/notebooks/ru/_toctree.yml b/notebooks/ru/_toctree.yml new file mode 100644 index 00000000..edebc08c --- /dev/null +++ b/notebooks/ru/_toctree.yml @@ -0,0 +1,11 @@ +- title: Книга Open-Source AI рецептов + sections: + - local: index + title: Книга Open-Source AI рецептов + +- title: Рецепты LLM и RAG с использованием сторонних библиотек + sections: + - local: rag_with_hugging_face_gemma_mongodb + title: Создание системы RAG с помощью Gemma, MongoDB и моделей с открытым исходным кодом + - local: rag_zephyr_langchain + title: Простой RAG для проблем GitHub с использованием Hugging Face Zephyr и LangChain \ No newline at end of file diff --git a/notebooks/ru/index.md b/notebooks/ru/index.md new file mode 100644 index 00000000..a39c21a3 --- /dev/null +++ b/notebooks/ru/index.md @@ -0,0 +1,19 @@ +# Книга Open-Source AI рецептов + +Книга Open-Source AI рецептов - это коллекция блокнотов, иллюстрирующих практические аспекты создания +приложений AI и решения различных задач машинного обучения с помощью инструментов и моделей с 
открытым исходным кодом. + +## Последние блокноты + +Изучите недавно добавленные блокноты: + +- [Создание системы RAG с помощью Gemma, MongoDB и моделей с открытым исходным кодом](rag_with_hugging_face_gemma_mongodb) +- [Простой RAG для проблем GitHub с использованием Hugging Face Zephyr и LangChain](rag_zephyr_langchain) + +Вы также можете посмотреть блокноты в [репозитории книги рецептов на GitHub](https://github.com/huggingface/cookbook). + +## Вклад + +Книга Open-Source AI рецептов - это работа сообщества, и мы приветствуем вклад каждого! +Изучите [Руководство по внесению вклада](https://github.com/huggingface/cookbook/blob/main/README.md) в книгу рецептов, чтобы +узнать, как добавить свой рецепт. diff --git a/notebooks/ru/rag_with_hugging_face_gemma_mongodb.ipynb b/notebooks/ru/rag_with_hugging_face_gemma_mongodb.ipynb new file mode 100644 index 00000000..51eb30d3 --- /dev/null +++ b/notebooks/ru/rag_with_hugging_face_gemma_mongodb.ipynb @@ -0,0 +1,4039 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Создание системы RAG с помощью Gemma, MongoDB и моделей с открытым исходным кодом\n", + "\n", + "Автор: [Ричмонд Алейк](https://huggingface.co/RichmondMongo)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Шаг 1: Установка библиотек\n", + "\n", + "\n", + "Приведенная ниже последовательность команд оболочки устанавливает библиотеки для использования открытых больших языковых моделей (LLM), моделей эмбеддингов и функций взаимодействия с базами данных. 
Эти библиотеки упрощают разработку системы RAG, снижая сложность до небольшого количества кода:\n", + "\n", + "\n", + "- PyMongo: Библиотека Python для взаимодействия с MongoDB, позволяющая подключаться к кластеру и запрашивать данные, хранящиеся в коллекциях и документах.\n", + "- Pandas: Предоставляет структуру данных для эффективной обработки и анализа данных с помощью Python.\n", + "- Hugging Face datasets: Содержит аудио-, визуальные и текстовые датасеты.\n", + "- Hugging Face Accelerate: Абстрагирует сложность написания кода, использующего аппаратные ускорители, такие как GPU. Accelerate используется в этой реализации для выполнения модели Gemma на ресурсах GPU.\n", + "- Hugging Face Transformers: Предоставляет доступ к обширной коллекции предварительно обученных моделей.\n", + "- Hugging Face Sentence Transformers: Предоставляет доступ к эмбеддингам предложений, текстов и изображений." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gVSo_nNOUsdn", + "outputId": "907f4738-a3b0-4c0f-b293-eff65c665c07" + }, + "outputs": [], + "source": [ + "!pip install datasets pandas pymongo sentence_transformers\n", + "!pip install -U transformers\n", + "# Установите библиотеку ниже, если вы используете GPU\n", + "!pip install accelerate" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Шаг 2: Сбор и подготовка данных\n", + "\n", + "\n", + "Данные, используемые в этом руководстве, взяты из датасетов Hugging Face, а именно \n", + "[AIatMongoDB/embedded_movies](https://huggingface.co/datasets/AIatMongoDB/embedded_movies). 
" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 747 + }, + "id": "5gCzss27UwWw", + "outputId": "212cca18-a0d7-4289-bce0-ee6259fc2dba" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"dataset_df\",\n \"rows\": 1500,\n \"fields\": [\n {\n \"column\": \"num_mflix_comments\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27,\n \"min\": 0,\n \"max\": 158,\n \"num_unique_values\": 40,\n \"samples\": [\n 117,\n 134,\n 124\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"genres\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"countries\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"directors\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fullplot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla returns in a brand-new movie that ignores all preceding movies except for the original with a brand new look and a powered up atomic ray. This time he battles a mysterious UFO that later transforms into a mysterious kaiju dubbed Orga. 
They meet up for the final showdown in the city of Shinjuku.\",\n \"Relationships become entangled in an emotional web.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"writers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"awards\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"runtime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 42.09038552453906,\n \"min\": 6.0,\n \"max\": 1256.0,\n \"num_unique_values\": 139,\n \"samples\": [\n 152.0,\n 127.0,\n 96.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"series\",\n \"movie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rated\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"TV-MA\",\n \"TV-14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"metacritic\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16.861995960390892,\n \"min\": 9.0,\n \"max\": 97.0,\n \"num_unique_values\": 83,\n \"samples\": [\n 50.0,\n 97.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"poster\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1368,\n \"samples\": [\n \"https://m.media-amazon.com/images/M/MV5BNWE5MzAwMjQtNzI1YS00YjZhLTkxNDItM2JjNjM3ZjI5NzBjXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SY1000_SX677_AL_.jpg\",\n \"https://m.media-amazon.com/images/M/MV5BMTgwNjIyNTczMF5BMl5BanBnXkFtZTcwODI5MDkyMQ@@._V1_SY1000_SX677_AL_.jpg\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"languages\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n 
\"description\": \"\"\n }\n },\n {\n \"column\": \"imdb\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1429,\n \"samples\": [\n \"A New York City architect becomes a one-man vigilante squad after his wife is murdered by street punks in which he randomly goes out and kills would-be muggers on the mean streets after dark.\",\n \"As the daring thief Ars\\u00e8ne Lupin (Duris) ransacks the homes of wealthy Parisians, the police, with a secret weapon in their arsenal, attempt to ferret him out.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cast\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot_embedding\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1435,\n \"samples\": [\n \"Turbo: A Power Rangers Movie\",\n \"Neon Genesis Evangelion: Death & Rebirth\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "dataset_df" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
num_mflix_commentsgenrescountriesdirectorsfullplotwritersawardsruntimetyperatedmetacriticposterlanguagesimdbplotcastplot_embeddingtitle
00[Action][USA][Louis J. Gasnier, Donald MacKenzie]Young Pauline is left a lot of money when her ...[Charles W. Goddard (screenplay), Basil Dickey...{'nominations': 0, 'text': '1 win.', 'wins': 1}199.0movieNoneNaNhttps://m.media-amazon.com/images/M/MV5BMzgxOD...[English]{'id': 4465, 'rating': 7.6, 'votes': 744}Young Pauline is left a lot of money when her ...[Pearl White, Crane Wilbur, Paul Panzer, Edwar...[0.00072939653, -0.026834568, 0.013515796, -0....The Perils of Pauline
10[Comedy, Short, Action][USA][Alfred J. Goulding, Hal Roach]As a penniless man worries about how he will m...[H.M. Walker (titles)]{'nominations': 1, 'text': '1 nomination.', 'w...22.0movieTV-GNaNhttps://m.media-amazon.com/images/M/MV5BNzE1OW...[English]{'id': 10146, 'rating': 7.0, 'votes': 639}A penniless young man tries to save an heiress...[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...[-0.022837115, -0.022941574, 0.014937485, -0.0...From Hand to Mouth
20[Action, Adventure, Drama][USA][Herbert Brenon]Michael \"Beau\" Geste leaves England in disgrac...[Herbert Brenon (adaptation), John Russell (ad...{'nominations': 0, 'text': '1 win.', 'wins': 1}101.0movieNoneNaNNone[English]{'id': 16634, 'rating': 6.9, 'votes': 222}Michael \"Beau\" Geste leaves England in disgrac...[Ronald Colman, Neil Hamilton, Ralph Forbes, A...[0.00023330493, -0.028511643, 0.014653289, -0....Beau Geste
31[Adventure, Action][USA][Albert Parker]A nobleman vows to avenge the death of his fat...[Douglas Fairbanks (story), Jack Cunningham (a...{'nominations': 0, 'text': '1 win.', 'wins': 1}88.0movieNoneNaNhttps://m.media-amazon.com/images/M/MV5BMzU0ND...None{'id': 16654, 'rating': 7.2, 'votes': 1146}Seeking revenge, an athletic young man joins t...[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...[-0.005927917, -0.033394486, 0.0015323418, -0....The Black Pirate
40[Action, Comedy, Romance][USA][Sam Taylor]The Uptown Boy, J. Harold Manners (Lloyd) is a...[Ted Wilde (story), John Grey (story), Clyde B...{'nominations': 1, 'text': '1 nomination.', 'w...58.0moviePASSEDNaNhttps://m.media-amazon.com/images/M/MV5BMTcxMT...[English]{'id': 16895, 'rating': 7.6, 'votes': 918}An irresponsible young millionaire changes his...[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...[-0.0059373598, -0.026604708, -0.0070914757, -...For Heaven's Sake
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + " num_mflix_comments genres countries \\\n", + "0 0 [Action] [USA] \n", + "1 0 [Comedy, Short, Action] [USA] \n", + "2 0 [Action, Adventure, Drama] [USA] \n", + "3 1 [Adventure, Action] [USA] \n", + "4 0 [Action, Comedy, Romance] [USA] \n", + "\n", + " directors \\\n", + "0 [Louis J. Gasnier, Donald MacKenzie] \n", + "1 [Alfred J. Goulding, Hal Roach] \n", + "2 [Herbert Brenon] \n", + "3 [Albert Parker] \n", + "4 [Sam Taylor] \n", + "\n", + " fullplot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 As a penniless man worries about how he will m... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 A nobleman vows to avenge the death of his fat... \n", + "4 The Uptown Boy, J. Harold Manners (Lloyd) is a... \n", + "\n", + " writers \\\n", + "0 [Charles W. Goddard (screenplay), Basil Dickey... \n", + "1 [H.M. Walker (titles)] \n", + "2 [Herbert Brenon (adaptation), John Russell (ad... \n", + "3 [Douglas Fairbanks (story), Jack Cunningham (a... \n", + "4 [Ted Wilde (story), John Grey (story), Clyde B... \n", + "\n", + " awards runtime type rated \\\n", + "0 {'nominations': 0, 'text': '1 win.', 'wins': 1} 199.0 movie None \n", + "1 {'nominations': 1, 'text': '1 nomination.', 'w... 22.0 movie TV-G \n", + "2 {'nominations': 0, 'text': '1 win.', 'wins': 1} 101.0 movie None \n", + "3 {'nominations': 0, 'text': '1 win.', 'wins': 1} 88.0 movie None \n", + "4 {'nominations': 1, 'text': '1 nomination.', 'w... 58.0 movie PASSED \n", + "\n", + " metacritic poster languages \\\n", + "0 NaN https://m.media-amazon.com/images/M/MV5BMzgxOD... [English] \n", + "1 NaN https://m.media-amazon.com/images/M/MV5BNzE1OW... [English] \n", + "2 NaN None [English] \n", + "3 NaN https://m.media-amazon.com/images/M/MV5BMzU0ND... None \n", + "4 NaN https://m.media-amazon.com/images/M/MV5BMTcxMT... 
[English] \n", + "\n", + " imdb \\\n", + "0 {'id': 4465, 'rating': 7.6, 'votes': 744} \n", + "1 {'id': 10146, 'rating': 7.0, 'votes': 639} \n", + "2 {'id': 16634, 'rating': 6.9, 'votes': 222} \n", + "3 {'id': 16654, 'rating': 7.2, 'votes': 1146} \n", + "4 {'id': 16895, 'rating': 7.6, 'votes': 918} \n", + "\n", + " plot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 A penniless young man tries to save an heiress... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 Seeking revenge, an athletic young man joins t... \n", + "4 An irresponsible young millionaire changes his... \n", + "\n", + " cast \\\n", + "0 [Pearl White, Crane Wilbur, Paul Panzer, Edwar... \n", + "1 [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ... \n", + "2 [Ronald Colman, Neil Hamilton, Ralph Forbes, A... \n", + "3 [Billie Dove, Tempe Pigott, Donald Crisp, Sam ... \n", + "4 [Harold Lloyd, Jobyna Ralston, Noah Young, Jim... \n", + "\n", + " plot_embedding title \n", + "0 [0.00072939653, -0.026834568, 0.013515796, -0.... The Perils of Pauline \n", + "1 [-0.022837115, -0.022941574, 0.014937485, -0.0... From Hand to Mouth \n", + "2 [0.00023330493, -0.028511643, 0.014653289, -0.... Beau Geste \n", + "3 [-0.005927917, -0.033394486, 0.0015323418, -0.... The Black Pirate \n", + "4 [-0.0059373598, -0.026604708, -0.0070914757, -... 
For Heaven's Sake " + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Загрузка датасета\n", + "from datasets import load_dataset\n", + "import pandas as pd\n", + "\n", + "# https://huggingface.co/datasets/AIatMongoDB/embedded_movies\n", + "dataset = load_dataset(\"AIatMongoDB/embedded_movies\")\n", + "\n", + "# Конвертация датасета во фрейм данных Pandas\n", + "dataset_df = pd.DataFrame(dataset[\"train\"])\n", + "\n", + "dataset_df.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Операции в следующем фрагменте кода направлены на обеспечение целостности и качества данных. \n", + "1. Первый процесс гарантирует, что атрибут `fullplot` каждой точки данных не пуст, поскольку это первичные данные, которые мы используем в процессе получения эмбеддингов. \n", + "2. Этот шаг также гарантирует, что мы удалим атрибут `plot_embedding` из всех точек данных, поскольку он будет заменен новыми эмбеддингами, созданными с помощью другой модели эмбеддингов, `gte-large`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "ARdz6j7SUxqi", + "outputId": "c53c458a-512d-4b7e-93b4-514f6de9d497" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Number of missing values in each column after removal:\n", + "num_mflix_comments 0\n", + "genres 0\n", + "countries 0\n", + "directors 12\n", + "fullplot 0\n", + "writers 13\n", + "awards 0\n", + "runtime 14\n", + "type 0\n", + "rated 279\n", + "metacritic 893\n", + "poster 78\n", + "languages 1\n", + "imdb 0\n", + "plot 0\n", + "cast 1\n", + "plot_embedding 1\n", + "title 0\n", + "dtype: int64\n" + ] + }, + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"dataset_df\",\n \"rows\": 1452,\n \"fields\": [\n {\n \"column\": \"num_mflix_comments\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27,\n \"min\": 0,\n \"max\": 158,\n \"num_unique_values\": 40,\n \"samples\": [\n 117,\n 134,\n 124\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"genres\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"countries\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"directors\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fullplot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla returns in a brand-new movie that ignores all preceding movies except for the original with a brand new look and a powered up atomic ray. 
This time he battles a mysterious UFO that later transforms into a mysterious kaiju dubbed Orga. They meet up for the final showdown in the city of Shinjuku.\",\n \"Relationships become entangled in an emotional web.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"writers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"awards\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"runtime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 42.5693352357647,\n \"min\": 6.0,\n \"max\": 1256.0,\n \"num_unique_values\": 137,\n \"samples\": [\n 60.0,\n 151.0,\n 110.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"series\",\n \"movie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rated\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"TV-MA\",\n \"TV-14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"metacritic\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16.855402595666057,\n \"min\": 9.0,\n \"max\": 97.0,\n \"num_unique_values\": 83,\n \"samples\": [\n 50.0,\n 97.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"poster\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1332,\n \"samples\": [\n \"https://m.media-amazon.com/images/M/MV5BMTQ2NTMxODEyNV5BMl5BanBnXkFtZTcwMDgxMjA0MQ@@._V1_SY1000_SX677_AL_.jpg\",\n \"https://m.media-amazon.com/images/M/MV5BMTY5OTg1ODk0MV5BMl5BanBnXkFtZTcwMTEwMjU1MQ@@._V1_SY1000_SX677_AL_.jpg\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"languages\",\n \"properties\": {\n 
\"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla saves Tokyo from a flying saucer that transforms into the beast Orga.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cast\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1391,\n \"samples\": [\n \"Superhero Movie\",\n \"Hooper\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "dataset_df" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
num_mflix_commentsgenrescountriesdirectorsfullplotwritersawardsruntimetyperatedmetacriticposterlanguagesimdbplotcasttitle
00[Action][USA][Louis J. Gasnier, Donald MacKenzie]Young Pauline is left a lot of money when her ...[Charles W. Goddard (screenplay), Basil Dickey...{'nominations': 0, 'text': '1 win.', 'wins': 1}199.0movieNoneNaNhttps://m.media-amazon.com/images/M/MV5BMzgxOD...[English]{'id': 4465, 'rating': 7.6, 'votes': 744}Young Pauline is left a lot of money when her ...[Pearl White, Crane Wilbur, Paul Panzer, Edwar...The Perils of Pauline
10[Comedy, Short, Action][USA][Alfred J. Goulding, Hal Roach]As a penniless man worries about how he will m...[H.M. Walker (titles)]{'nominations': 1, 'text': '1 nomination.', 'w...22.0movieTV-GNaNhttps://m.media-amazon.com/images/M/MV5BNzE1OW...[English]{'id': 10146, 'rating': 7.0, 'votes': 639}A penniless young man tries to save an heiress...[Harold Lloyd, Mildred Davis, 'Snub' Pollard, ...From Hand to Mouth
20[Action, Adventure, Drama][USA][Herbert Brenon]Michael \"Beau\" Geste leaves England in disgrac...[Herbert Brenon (adaptation), John Russell (ad...{'nominations': 0, 'text': '1 win.', 'wins': 1}101.0movieNoneNaNNone[English]{'id': 16634, 'rating': 6.9, 'votes': 222}Michael \"Beau\" Geste leaves England in disgrac...[Ronald Colman, Neil Hamilton, Ralph Forbes, A...Beau Geste
31[Adventure, Action][USA][Albert Parker]A nobleman vows to avenge the death of his fat...[Douglas Fairbanks (story), Jack Cunningham (a...{'nominations': 0, 'text': '1 win.', 'wins': 1}88.0movieNoneNaNhttps://m.media-amazon.com/images/M/MV5BMzU0ND...None{'id': 16654, 'rating': 7.2, 'votes': 1146}Seeking revenge, an athletic young man joins t...[Billie Dove, Tempe Pigott, Donald Crisp, Sam ...The Black Pirate
40[Action, Comedy, Romance][USA][Sam Taylor]The Uptown Boy, J. Harold Manners (Lloyd) is a...[Ted Wilde (story), John Grey (story), Clyde B...{'nominations': 1, 'text': '1 nomination.', 'w...58.0moviePASSEDNaNhttps://m.media-amazon.com/images/M/MV5BMTcxMT...[English]{'id': 16895, 'rating': 7.6, 'votes': 918}An irresponsible young millionaire changes his...[Harold Lloyd, Jobyna Ralston, Noah Young, Jim...For Heaven's Sake
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "text/plain": [ + " num_mflix_comments genres countries \\\n", + "0 0 [Action] [USA] \n", + "1 0 [Comedy, Short, Action] [USA] \n", + "2 0 [Action, Adventure, Drama] [USA] \n", + "3 1 [Adventure, Action] [USA] \n", + "4 0 [Action, Comedy, Romance] [USA] \n", + "\n", + " directors \\\n", + "0 [Louis J. Gasnier, Donald MacKenzie] \n", + "1 [Alfred J. Goulding, Hal Roach] \n", + "2 [Herbert Brenon] \n", + "3 [Albert Parker] \n", + "4 [Sam Taylor] \n", + "\n", + " fullplot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 As a penniless man worries about how he will m... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 A nobleman vows to avenge the death of his fat... \n", + "4 The Uptown Boy, J. Harold Manners (Lloyd) is a... \n", + "\n", + " writers \\\n", + "0 [Charles W. Goddard (screenplay), Basil Dickey... \n", + "1 [H.M. Walker (titles)] \n", + "2 [Herbert Brenon (adaptation), John Russell (ad... \n", + "3 [Douglas Fairbanks (story), Jack Cunningham (a... \n", + "4 [Ted Wilde (story), John Grey (story), Clyde B... \n", + "\n", + " awards runtime type rated \\\n", + "0 {'nominations': 0, 'text': '1 win.', 'wins': 1} 199.0 movie None \n", + "1 {'nominations': 1, 'text': '1 nomination.', 'w... 22.0 movie TV-G \n", + "2 {'nominations': 0, 'text': '1 win.', 'wins': 1} 101.0 movie None \n", + "3 {'nominations': 0, 'text': '1 win.', 'wins': 1} 88.0 movie None \n", + "4 {'nominations': 1, 'text': '1 nomination.', 'w... 58.0 movie PASSED \n", + "\n", + " metacritic poster languages \\\n", + "0 NaN https://m.media-amazon.com/images/M/MV5BMzgxOD... [English] \n", + "1 NaN https://m.media-amazon.com/images/M/MV5BNzE1OW... [English] \n", + "2 NaN None [English] \n", + "3 NaN https://m.media-amazon.com/images/M/MV5BMzU0ND... None \n", + "4 NaN https://m.media-amazon.com/images/M/MV5BMTcxMT... 
[English] \n", + "\n", + " imdb \\\n", + "0 {'id': 4465, 'rating': 7.6, 'votes': 744} \n", + "1 {'id': 10146, 'rating': 7.0, 'votes': 639} \n", + "2 {'id': 16634, 'rating': 6.9, 'votes': 222} \n", + "3 {'id': 16654, 'rating': 7.2, 'votes': 1146} \n", + "4 {'id': 16895, 'rating': 7.6, 'votes': 918} \n", + "\n", + " plot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 A penniless young man tries to save an heiress... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 Seeking revenge, an athletic young man joins t... \n", + "4 An irresponsible young millionaire changes his... \n", + "\n", + " cast title \n", + "0 [Pearl White, Crane Wilbur, Paul Panzer, Edwar... The Perils of Pauline \n", + "1 [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ... From Hand to Mouth \n", + "2 [Ronald Colman, Neil Hamilton, Ralph Forbes, A... Beau Geste \n", + "3 [Billie Dove, Tempe Pigott, Donald Crisp, Sam ... The Black Pirate \n", + "4 [Harold Lloyd, Jobyna Ralston, Noah Young, Jim... For Heaven's Sake " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Подготовка данных\n", + "\n", + "# Удалим точки данных, в которых отсутствует значение в столбце fullplot\n", + "dataset_df = dataset_df.dropna(subset=[\"fullplot\"])\n", + "print(\"\\nNumber of missing values in each column after removal:\")\n", + "print(dataset_df.isnull().sum())\n", + "\n", + "# Удаляем plot_embedding из каждой точки данных датасета, так как мы собираемся создать новые эмбеддинги с помощью модели эмбеддингов с открытым исходным кодом из Hugging Face\n", + "dataset_df = dataset_df.drop(columns=[\"plot_embedding\"])\n", + "dataset_df.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Шаг 3: Генерация эмбеддингов\n", + "\n", + "**В ячейке кода описаны следующие шаги:**\n", + "1. Импортируем класс `SentenceTransformer` для доступа к моделям эмбеддингов.\n", + "2. 
Загрузим модель эмбеддингов с помощью конструктора `SentenceTransformer` для инстанцирования модели эмбеддингов `gte-large`.\n", + "3. Определим функцию `get_embedding`, которая принимает на вход текстовую строку и возвращает список значений с плавающей точкой, представляющих эмбеддинги. Сначала функция проверяет, не пуст ли входной текст (после удаления пробельных символов). Если текст пуст, она возвращает пустой список. В противном случае она генерирует эмбеддинги, используя загруженную модель.\n", + "4. Генерируем эмбеддинги, применяя функцию `get_embedding` к столбцу \"fullplot\" фрейма данных `dataset_df`, генерируя эмбеддинги для каждого сюжета фильма. Полученный список эмбеддингов присваивается новому столбцу с именем embedding.\n", + "\n", + "*Примечание: нет необходимости разбивать текст на фрагменты (chunk) в полном описании сюжета, так как мы можем гарантировать, что длина текста остается в пределах допустимого диапазона*.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 747 + }, + "id": "ZX8zJNN5UzPK", + "outputId": "81bc1a57-7d96-4311-ba94-4748c34c20e3" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"dataset_df\",\n \"rows\": 1452,\n \"fields\": [\n {\n \"column\": \"num_mflix_comments\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 27,\n \"min\": 0,\n \"max\": 158,\n \"num_unique_values\": 40,\n \"samples\": [\n 117,\n 134,\n 124\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"genres\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"countries\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"directors\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": 
\"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"fullplot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla returns in a brand-new movie that ignores all preceding movies except for the original with a brand new look and a powered up atomic ray. This time he battles a mysterious UFO that later transforms into a mysterious kaiju dubbed Orga. They meet up for the final showdown in the city of Shinjuku.\",\n \"Relationships become entangled in an emotional web.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"writers\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"awards\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"runtime\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 42.5693352357647,\n \"min\": 6.0,\n \"max\": 1256.0,\n \"num_unique_values\": 137,\n \"samples\": [\n 60.0,\n 151.0,\n 110.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"type\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"series\",\n \"movie\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"rated\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 12,\n \"samples\": [\n \"TV-MA\",\n \"TV-14\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"metacritic\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 16.855402595666057,\n \"min\": 9.0,\n \"max\": 97.0,\n \"num_unique_values\": 83,\n \"samples\": [\n 50.0,\n 97.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"poster\",\n \"properties\": {\n \"dtype\": 
\"string\",\n \"num_unique_values\": 1332,\n \"samples\": [\n \"https://m.media-amazon.com/images/M/MV5BMTQ2NTMxODEyNV5BMl5BanBnXkFtZTcwMDgxMjA0MQ@@._V1_SY1000_SX677_AL_.jpg\",\n \"https://m.media-amazon.com/images/M/MV5BMTY5OTg1ODk0MV5BMl5BanBnXkFtZTcwMTEwMjU1MQ@@._V1_SY1000_SX677_AL_.jpg\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"languages\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"imdb\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"plot\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1409,\n \"samples\": [\n \"An undercover cop infiltrates a gang of thieves who plan to rob a jewelry store.\",\n \"Godzilla saves Tokyo from a flying saucer that transforms into the beast Orga.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"cast\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"title\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1391,\n \"samples\": [\n \"Superhero Movie\",\n \"Hooper\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"embedding\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "dataset_df" + }, + "text/html": [ + "\n", + "
\n" + ], + "text/plain": [ + " num_mflix_comments genres countries \\\n", + "0 0 [Action] [USA] \n", + "1 0 [Comedy, Short, Action] [USA] \n", + "2 0 [Action, Adventure, Drama] [USA] \n", + "3 1 [Adventure, Action] [USA] \n", + "4 0 [Action, Comedy, Romance] [USA] \n", + "\n", + " directors \\\n", + "0 [Louis J. Gasnier, Donald MacKenzie] \n", + "1 [Alfred J. Goulding, Hal Roach] \n", + "2 [Herbert Brenon] \n", + "3 [Albert Parker] \n", + "4 [Sam Taylor] \n", + "\n", + " fullplot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 As a penniless man worries about how he will m... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 A nobleman vows to avenge the death of his fat... \n", + "4 The Uptown Boy, J. Harold Manners (Lloyd) is a... \n", + "\n", + " writers \\\n", + "0 [Charles W. Goddard (screenplay), Basil Dickey... \n", + "1 [H.M. Walker (titles)] \n", + "2 [Herbert Brenon (adaptation), John Russell (ad... \n", + "3 [Douglas Fairbanks (story), Jack Cunningham (a... \n", + "4 [Ted Wilde (story), John Grey (story), Clyde B... \n", + "\n", + " awards runtime type rated \\\n", + "0 {'nominations': 0, 'text': '1 win.', 'wins': 1} 199.0 movie None \n", + "1 {'nominations': 1, 'text': '1 nomination.', 'w... 22.0 movie TV-G \n", + "2 {'nominations': 0, 'text': '1 win.', 'wins': 1} 101.0 movie None \n", + "3 {'nominations': 0, 'text': '1 win.', 'wins': 1} 88.0 movie None \n", + "4 {'nominations': 1, 'text': '1 nomination.', 'w... 58.0 movie PASSED \n", + "\n", + " metacritic poster languages \\\n", + "0 NaN https://m.media-amazon.com/images/M/MV5BMzgxOD... [English] \n", + "1 NaN https://m.media-amazon.com/images/M/MV5BNzE1OW... [English] \n", + "2 NaN None [English] \n", + "3 NaN https://m.media-amazon.com/images/M/MV5BMzU0ND... None \n", + "4 NaN https://m.media-amazon.com/images/M/MV5BMTcxMT... 
[English] \n", + "\n", + " imdb \\\n", + "0 {'id': 4465, 'rating': 7.6, 'votes': 744} \n", + "1 {'id': 10146, 'rating': 7.0, 'votes': 639} \n", + "2 {'id': 16634, 'rating': 6.9, 'votes': 222} \n", + "3 {'id': 16654, 'rating': 7.2, 'votes': 1146} \n", + "4 {'id': 16895, 'rating': 7.6, 'votes': 918} \n", + "\n", + " plot \\\n", + "0 Young Pauline is left a lot of money when her ... \n", + "1 A penniless young man tries to save an heiress... \n", + "2 Michael \"Beau\" Geste leaves England in disgrac... \n", + "3 Seeking revenge, an athletic young man joins t... \n", + "4 An irresponsible young millionaire changes his... \n", + "\n", + " cast title \\\n", + "0 [Pearl White, Crane Wilbur, Paul Panzer, Edwar... The Perils of Pauline \n", + "1 [Harold Lloyd, Mildred Davis, 'Snub' Pollard, ... From Hand to Mouth \n", + "2 [Ronald Colman, Neil Hamilton, Ralph Forbes, A... Beau Geste \n", + "3 [Billie Dove, Tempe Pigott, Donald Crisp, Sam ... The Black Pirate \n", + "4 [Harold Lloyd, Jobyna Ralston, Noah Young, Jim... For Heaven's Sake \n", + "\n", + " embedding \n", + "0 [-0.009285838343203068, -0.005062104668468237,... \n", + "1 [-0.0024393785279244184, 0.02309592440724373, ... \n", + "2 [0.012204292230308056, -0.01145575474947691, -... \n", + "3 [0.004541348200291395, -0.0006100579630583525,... \n", + "4 [-0.0022256041411310434, 0.011567804962396622,... 
" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "\n", + "# https://huggingface.co/thenlper/gte-large\n", + "embedding_model = SentenceTransformer(\"thenlper/gte-large\")\n", + "\n", + "\n", + "def get_embedding(text: str) -> list[float]:\n", + " if not text.strip():\n", + " print(\"Attempted to get embedding for empty text.\")\n", + " return []\n", + "\n", + " embedding = embedding_model.encode(text)\n", + "\n", + " return embedding.tolist()\n", + "\n", + "\n", + "dataset_df[\"embedding\"] = dataset_df[\"fullplot\"].apply(get_embedding)\n", + "\n", + "dataset_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Database setup and connection\n", + "\n", + "MongoDB acts as both an operational and a vector database. It offers a database solution that efficiently stores, queries, and retrieves vector embeddings; the advantages of this lie in the simplicity of database maintenance, management, and cost.\n", + "\n", + "**To create a new MongoDB database, set up a database cluster:**\n", + "\n", + "1. Go to the official MongoDB site and register for a [free MongoDB Atlas account](https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&utm_source=community&utm_medium=cta&utm_content=Partner%20Cookbook&utm_term=richmond.alake), or, for existing users, [sign in to MongoDB Atlas](https://account.mongodb.com/account/login?utm_campaign=devrel&utm_source=community&utm_medium=cta&utm_content=Partner%20Cookbook&utm_term=richmond.alakee).\n", + "\n", + "2. Select the 'Database' option in the left-hand pane. This opens the Database Deployment page, which shows the deployment specification of any existing cluster. Create a new database cluster by clicking the \"+Create\" button.\n", + "\n", + "3. 
Select all the applicable configurations for the database cluster. Once all the configuration options are selected, click the “Create Cluster” button to deploy the newly created cluster. MongoDB also lets you create free clusters on the \"Shared Tab\".\n", + "\n", + "*Note: Don't forget to whitelist the IP address of the Python host, or 0.0.0.0/0 for any IP address, when creating a proof of concept.*\n", + "\n", + "4. After the cluster has been created and deployed, it becomes accessible on the ‘Database Deployment’ page.\n", + "\n", + "5. Click the “Connect” button of the cluster to see the options for connecting to it via various language drivers.\n", + "\n", + "6. This tutorial only requires the cluster's URI (uniform resource identifier). Copy the URI and paste it into Google Colab Secrets in a variable named `MONGO_URI`, or add it to a .env file or equivalent.\n", + "\n", + "### 4.1 Database and collection setup\n", + "\n", + "Before moving forward, make sure the following prerequisites are met:\n", + "- A database cluster is set up on MongoDB Atlas\n", + "- You have obtained the URI of your cluster\n", + "\n", + "For help with setting up the database cluster and obtaining the URI, refer to our guides on [setting up a MongoDB cluster](https://www.mongodb.com/docs/guides/atlas/cluster/) and [getting your connection string](https://www.mongodb.com/docs/guides/atlas/connection-string/).\n", + "\n", + "Once the cluster is created, create a database and a collection within the MongoDB Atlas cluster by clicking + Create Database on the cluster overview page. 
\n", + "\n", + "Here is a guide for [creating a database and collection](https://www.mongodb.com/basics/create-database).\n", + "\n", + "**The database will be named `movies`.**\n", + "\n", + "**The collection will be named `movie_collection_2`.**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Create a vector search index\n", + "\n", + "At this point, make sure that your vector index is created via MongoDB Atlas. The search code in Step 7 refers to the index by the name `vector_index`.\n", + "\n", + "This step is mandatory for running efficient and accurate vector searches over the vector embeddings stored in the documents of the `movie_collection_2` collection. \n", + "\n", + "Creating a vector search index makes it possible to traverse the documents efficiently and retrieve those whose embeddings match the query embedding based on vector similarity. \n", + "\n", + "Go here to read more about the [MongoDB Vector Search Index](https://www.mongodb.com/docs/atlas/atlas-search/field-types/knn-vector/).\n", + "\n", + "\n", + "```\n", + "{\n", + " \"fields\": [{\n", + " \"numDimensions\": 1024,\n", + " \"path\": \"embedding\",\n", + " \"similarity\": \"cosine\",\n", + " \"type\": \"vector\"\n", + " }]\n", + "}\n", + "\n", + "```\n", + "\n", + "The value `1024` in the `numDimensions` field corresponds to the dimension of the vector generated by the gte-large embedding model. If you use the `gte-base` or `gte-small` embedding models, the `numDimensions` value in the vector search index must be set to 768 and 384, respectively."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Establish a data connection\n", + "\n", + "The code snippet below uses PyMongo to create a MongoDB client object, which represents the connection to the cluster and provides access to its databases and collections.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Oi0l9POtU0iP", + "outputId": "d3fe3cc4-8c08-4435-ddfc-8cfcc5ada572" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Connection to MongoDB successful\n" + ] + } + ], + "source": [ + "import pymongo\n", + "from google.colab import userdata\n", + "\n", + "\n", + "def get_mongo_client(mongo_uri):\n", + " \"\"\"Establish a connection to MongoDB.\"\"\"\n", + " try:\n", + " client = pymongo.MongoClient(mongo_uri)\n", + " # MongoClient connects lazily, so ping the server to verify the connection\n", + " client.admin.command(\"ping\")\n", + " print(\"Connection to MongoDB successful\")\n", + " return client\n", + " except pymongo.errors.ConnectionFailure as e:\n", + " print(f\"Connection failed: {e}\")\n", + " return None\n", + "\n", + "\n", + "mongo_uri = userdata.get(\"MONGO_URI\")\n", + "if not mongo_uri:\n", + " print(\"MONGO_URI not set in environment variables\")\n", + "\n", + "mongo_client = get_mongo_client(mongo_uri)\n", + "\n", + "# Select the database and collection that the data will be ingested into\n", + "db = mongo_client[\"movies\"]\n", + "collection = db[\"movie_collection_2\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "F7XXXa-OU1u9", + "outputId": "7bd1eb43-e933-4150-990a-fa20bad84e9a" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "DeleteResult({'n': 1452, 'electionId': ObjectId('7fffffff000000000000000c'), 'opTime': {'ts': Timestamp(1708554945, 1452), 't': 12}, 
'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1708554945, 1452), 'signature': {'hash': 
b'\\x99\\x89\\xc0\\x00Cn!\\xd6\\xaf\\xb3\\x96\\xdf\\xc3\\xda\\x88\\x11\\xf5\\t\\xbd\\xc0', 'keyId': 7320226449804230661}}, 'operationTime': Timestamp(1708554945, 1452)}, acknowledged=True)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Delete any existing records in the collection\n", + "collection.delete_many({})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ingesting data into a MongoDB collection from a pandas DataFrame is straightforward: convert the DataFrame into a list of dictionaries, then pass the converted dataset records to the collection's `insert_many` method.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "XrfMY4QBU2-l", + "outputId": "e2b5c534-2ba0-4ffa-bca8-1e96bef14c54" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Data ingestion into MongoDB completed\n" + ] + } + ], + "source": [ + "documents = dataset_df.to_dict(\"records\")\n", + "collection.insert_many(documents)\n", + "\n", + "print(\"Data ingestion into MongoDB completed\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 7: Perform vector search on user queries\n", + "\n", + "The next step implements a function that returns a vector search result by generating a query embedding and defining a MongoDB aggregation pipeline.\n", + "\n", + "The pipeline, consisting of the `$vectorSearch` and `$project` stages, runs the query using the generated vector and formats the results to include only the required information, such as the plot, title, and genres, along with a search score for each result."
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "kWucnQBEU35k" + }, + "outputs": [], + "source": [ + "def vector_search(user_query, collection):\n", + " \"\"\"\n", + " Perform a vector search in the MongoDB collection based on the user query.\n", + "\n", + " Args:\n", + " user_query (str): The user's query string.\n", + " collection (MongoCollection): The MongoDB collection to search.\n", + "\n", + " Returns:\n", + " list: A list of matching documents.\n", + " \"\"\"\n", + "\n", + " # Generate an embedding for the user query\n", + " query_embedding = get_embedding(user_query)\n", + "\n", + " # get_embedding returns an empty list for empty input, so test for emptiness rather than None\n", + " if not query_embedding:\n", + " print(\"Invalid query or embedding generation failed.\")\n", + " return []\n", + "\n", + " # Define the vector search pipeline\n", + " pipeline = [\n", + " {\n", + " \"$vectorSearch\": {\n", + " \"index\": \"vector_index\",\n", + " \"queryVector\": query_embedding,\n", + " \"path\": \"embedding\",\n", + " \"numCandidates\": 150, # Number of candidate matches to consider\n", + " \"limit\": 4, # Return the top 4 matches\n", + " }\n", + " },\n", + " {\n", + " \"$project\": {\n", + " \"_id\": 0, # Exclude the _id field\n", + " \"fullplot\": 1, # Include the fullplot field\n", + " \"title\": 1, # Include the title field\n", + " \"genres\": 1, # Include the genres field\n", + " \"score\": {\"$meta\": \"vectorSearchScore\"}, # Include the search score\n", + " }\n", + " },\n", + " ]\n", + "\n", + " # Execute the search\n", + " results = collection.aggregate(pipeline)\n", + " return list(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 8: Handling user queries and loading Gemma\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0ka4WLTmU5L4" + }, + "outputs": [], + "source": [ + "def get_search_result(query, collection):\n", + 
"\n", + " get_knowledge = vector_search(query, collection)\n", + "\n", + " search_result = \"\"\n", + " for 
result in get_knowledge:\n", + " search_result += f\"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\\n\"\n", + "\n", + " return search_result" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Z4L4SfueU6PY", + "outputId": "11ea30ca-8cac-4e4c-9ab6-780e043c6345" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: What is the best romantic movie to watch and why?\n", + "Continue to answer the query by using the Search Results:\n", + "Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?\n", + "Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as \"Pearl Harbor.\"\n", + "Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. 
Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.\n", + "Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.\n", + ".\n" + ] + } + ], + "source": [ + "# Выполнение запроса с извлечением источников\n", + "query = \"What is the best romantic movie to watch and why?\"\n", + "source_information = get_search_result(query, collection)\n", + "combined_information = f\"Query: {query}\\nContinue to answer the query by using the Search Results:\\n{source_information}.\"\n", + "\n", + "print(combined_information)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 209, + "referenced_widgets": [ + "60c4d6d5e7a84fa493f101cc47dadef9", + "fa0c528cca744cff8da0a4fa21fdb4b5", + "d7d4a9f444fe4ebb9135035e2166a3a5", + "4e62b9ec821348cc94b34cfbc010c2a4", + "9d9a247c6569458abd0dcd6e0d717079", + "e5a6d300bbf441b8904aa9afb89e6f31", + "88226abe35534278bbd427d8eff0f5f8", + "2e081a17ddc04104882893a30c902265", + "938f8f60901442f2902eb51e86c27961", + "ce6a3a655d2f4ce2ab351c766568bed5", + "19fd13ad5b2740aa8be2a7d62488fdaf", + "c7c0c34a71954d6ea976c774573c49c5", + "0d6ec3bab579406fa4e6fc2b3d6b6998", + "a37f2164e11d4e5f851a4a09a12c663c", + "ef32431228f24a5498810a36b9cf6506", + "c06e354cb8294e66a3d7590a576571e0", + "e2998c2c6b1f4d489a5e39f2076838e4", + "4300755179d9465db871b14ae78dabc6", + "f14106c7f60f411199acf47f530443fd", + 
"bf88ee6dc83648d8a61d75bb4466b1e3", + "5776e818d9d34e009f95833056522876", + "06b1a069317041c8a9174c14fdc867bc", + "0e27bfa4f64f427d9996de0451e9edd9", + "d5e9f339fe7e4ab9955531cc125f071e", + "fa9cf3e72280417d8711ef7227a95d34", + "c3a1b520140444fbb40b7ac789f7ac0e", + "2c84bc5c158641f49f421a7d28da1518", + "6a15f1cf54a141fc9d6bb790366c6bdd", + "8813b56cd89744b58ace2787206e1501", + "edc37210db734d01a8afce596698bb27", + "eba6048eb694485693656fcbf4a4f297", + "30885be6a7c84f0f9f02bc2ea11679bc", + "29178e51df9e47489fff623763b130ed", + "5266bebcf8bb4b0798b14831a38e2a8c", + "7c638aaf734c423fbe54daddff97040f", + "4c6736981923464db2f754339c60cd0d", + "57383c03fc854a92a2ff732cbdd80a70", + "8a302ae0412b4393a17b50861afe36b5", + "b2fdc502d6ee491bb826fd616e35d407", + "2677891ce891475b8dc7c2ae287c58d7", + "fddbae43ce7f477cafaff89b81e47fc7", + "592d069be51e43f99212d47be9c11dcf", + "9a4c90a767c746659ea535d7c36d40a5", + "43fcf04b360f4a75be9fb99ab69fbe38", + "b7c439aa6d584c5784b46980050e503d", + "8aa8805651d34181b1851d302ccc47e2", + "713f1d91e445411288e565a33ce4b271", + "55941e08c602404c9342d00b7ee26918", + "87da02f5606d404ea242c3bd1f9ac38c", + "947f9b7e19dc4be4bd21b1b021e91f9d", + "0b7f3d233b8f4912bef4deae2e395001", + "6ccbd7b9ae924b5e843efd3114dfb2c5", + "9e0bccbc6072461fbf96482b870ed8d5", + "d7a00f1f114e4f008f4d5a48c1c69f53", + "faf25fd219f24bdbaa2e3202548c97d9", + "a0996675df13484aaa519e6ff45c5476", + "0bfb4937ed5547b3ba464ca47ac77f1a", + "7f59906980724a8b840dec85ce400f89", + "80f3d29327bf429481ad191b1abe556f", + "6d7c024126ac4c34825fae522234ebca", + "a0600fb407034c2d8df6ae5830d601db", + "c1d37ab1952b4d268d9786b74b6902d7", + "e7f471604a5a42e095d35d8ad399c6fe", + "feb438afda6b4c148a3a62ee7e03da74", + "e68cf53b04a845ac9d6f4047600ebc21", + "33fef11f829f49e2aa9555201d4a0e42" + ] + }, + "id": "OYGmKVv9mm8g", + "outputId": "ff41bfed-daa0-4ed8-8cc4-0aa138e697a1" + }, + "outputs": [], + "source": [ + "from transformers import AutoTokenizer, AutoModelForCausalLM\n", + "\n", + "tokenizer = 
AutoTokenizer.from_pretrained(\"google/gemma-2b-it\")\n", + "# To run on CPU, uncomment the line below 👇🏽\n", + "# model = AutoModelForCausalLM.from_pretrained(\"google/gemma-2b-it\")\n", + "# To run on GPU, use the line below 👇🏽\n", + "model = AutoModelForCausalLM.from_pretrained(\"google/gemma-2b-it\", device_map=\"auto\")" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wDA9SdXhsFyM", + "outputId": "c3300fa5-586c-48bd-9abb-b12a4390a294" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query: What is the best romantic movie to watch and why?\n", + "Continue to answer the query by using the Search Results:\n", + "Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?\n", + "Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. 
While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as \"Pearl Harbor.\"\n", + "Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.\n", + "Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.\n", + ".\n", + "\n", + "Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. 
The movie is funny, heartwarming, and thought-provoking.\n" + ] + } + ], + "source": [ + "# Move the input tensors to the same device as the model\n", + "input_ids = tokenizer(combined_information, return_tensors=\"pt\").to(model.device)\n", + "response = model.generate(**input_ids, max_new_tokens=500)\n", + "print(tokenizer.decode(response[0]))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "id": "FhMmFmUBwBcy" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "A100", + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + }, + "widgets": { + "application/vnd.jupyter.widget-state+json": { + "06b1a069317041c8a9174c14fdc867bc": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "0b7f3d233b8f4912bef4deae2e395001": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + 
"0bfb4937ed5547b3ba464ca47ac77f1a": { + 
"model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_a0600fb407034c2d8df6ae5830d601db", + "placeholder": "​", + "style": "IPY_MODEL_c1d37ab1952b4d268d9786b74b6902d7", + "value": "generation_config.json: 100%" + } + }, + "0d6ec3bab579406fa4e6fc2b3d6b6998": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e2998c2c6b1f4d489a5e39f2076838e4", + "placeholder": "​", + "style": "IPY_MODEL_4300755179d9465db871b14ae78dabc6", + "value": "Downloading shards: 100%" + } + }, + "0e27bfa4f64f427d9996de0451e9edd9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_d5e9f339fe7e4ab9955531cc125f071e", + "IPY_MODEL_fa9cf3e72280417d8711ef7227a95d34", + "IPY_MODEL_c3a1b520140444fbb40b7ac789f7ac0e" + ], + "layout": "IPY_MODEL_2c84bc5c158641f49f421a7d28da1518" + } + }, + 
"19fd13ad5b2740aa8be2a7d62488fdaf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2677891ce891475b8dc7c2ae287c58d7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "29178e51df9e47489fff623763b130ed": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "2c84bc5c158641f49f421a7d28da1518": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + 
"grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "2e081a17ddc04104882893a30c902265": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "30885be6a7c84f0f9f02bc2ea11679bc": { + "model_module": "@jupyter-widgets/base", + 
"model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "33fef11f829f49e2aa9555201d4a0e42": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "4300755179d9465db871b14ae78dabc6": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "43fcf04b360f4a75be9fb99ab69fbe38": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "4c6736981923464db2f754339c60cd0d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_fddbae43ce7f477cafaff89b81e47fc7", + "max": 67121608, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_592d069be51e43f99212d47be9c11dcf", + "value": 67121608 + } + }, + "4e62b9ec821348cc94b34cfbc010c2a4": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_ce6a3a655d2f4ce2ab351c766568bed5", + "placeholder": "​", + "style": "IPY_MODEL_19fd13ad5b2740aa8be2a7d62488fdaf", + "value": " 13.5k/13.5k [00:00<00:00, 1.10MB/s]" + } + }, + 
"5266bebcf8bb4b0798b14831a38e2a8c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_7c638aaf734c423fbe54daddff97040f", + "IPY_MODEL_4c6736981923464db2f754339c60cd0d", + "IPY_MODEL_57383c03fc854a92a2ff732cbdd80a70" + ], + "layout": "IPY_MODEL_8a302ae0412b4393a17b50861afe36b5" + } + }, + "55941e08c602404c9342d00b7ee26918": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_d7a00f1f114e4f008f4d5a48c1c69f53", + "placeholder": "​", + "style": "IPY_MODEL_faf25fd219f24bdbaa2e3202548c97d9", + "value": " 2/2 [00:04<00:00,  1.94s/it]" + } + }, + "57383c03fc854a92a2ff732cbdd80a70": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_9a4c90a767c746659ea535d7c36d40a5", + "placeholder": "​", + "style": "IPY_MODEL_43fcf04b360f4a75be9fb99ab69fbe38", + "value": " 
67.1M/67.1M [00:00<00:00, 465MB/s]" + } + }, + "5776e818d9d34e009f95833056522876": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "592d069be51e43f99212d47be9c11dcf": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "60c4d6d5e7a84fa493f101cc47dadef9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_fa0c528cca744cff8da0a4fa21fdb4b5", + "IPY_MODEL_d7d4a9f444fe4ebb9135035e2166a3a5", + "IPY_MODEL_4e62b9ec821348cc94b34cfbc010c2a4" + ], + "layout": "IPY_MODEL_9d9a247c6569458abd0dcd6e0d717079" + } + }, + "6a15f1cf54a141fc9d6bb790366c6bdd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6ccbd7b9ae924b5e843efd3114dfb2c5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + 
"_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "6d7c024126ac4c34825fae522234ebca": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": 
null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "713f1d91e445411288e565a33ce4b271": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_6ccbd7b9ae924b5e843efd3114dfb2c5", + "max": 2, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_9e0bccbc6072461fbf96482b870ed8d5", + "value": 2 + } + }, + "7c638aaf734c423fbe54daddff97040f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_b2fdc502d6ee491bb826fd616e35d407", + "placeholder": "​", + "style": "IPY_MODEL_2677891ce891475b8dc7c2ae287c58d7", + "value": "model-00002-of-00002.safetensors: 100%" + } + }, + "7f59906980724a8b840dec85ce400f89": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": 
"@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e7f471604a5a42e095d35d8ad399c6fe", + "max": 137, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_feb438afda6b4c148a3a62ee7e03da74", + "value": 137 + } + }, + "80f3d29327bf429481ad191b1abe556f": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e68cf53b04a845ac9d6f4047600ebc21", + "placeholder": "​", + "style": "IPY_MODEL_33fef11f829f49e2aa9555201d4a0e42", + "value": " 137/137 [00:00<00:00, 11.9kB/s]" + } + }, + "87da02f5606d404ea242c3bd1f9ac38c": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, 
+ "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8813b56cd89744b58ace2787206e1501": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "88226abe35534278bbd427d8eff0f5f8": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "8a302ae0412b4393a17b50861afe36b5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + 
"grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "8aa8805651d34181b1851d302ccc47e2": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_947f9b7e19dc4be4bd21b1b021e91f9d", + "placeholder": "​", + "style": "IPY_MODEL_0b7f3d233b8f4912bef4deae2e395001", + "value": "Loading checkpoint shards: 100%" + } + }, + "938f8f60901442f2902eb51e86c27961": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "947f9b7e19dc4be4bd21b1b021e91f9d": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + 
"_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9a4c90a767c746659ea535d7c36d40a5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": 
null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9d9a247c6569458abd0dcd6e0d717079": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "9e0bccbc6072461fbf96482b870ed8d5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "a0600fb407034c2d8df6ae5830d601db": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + 
"model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "a0996675df13484aaa519e6ff45c5476": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_0bfb4937ed5547b3ba464ca47ac77f1a", + "IPY_MODEL_7f59906980724a8b840dec85ce400f89", + "IPY_MODEL_80f3d29327bf429481ad191b1abe556f" + ], + "layout": "IPY_MODEL_6d7c024126ac4c34825fae522234ebca" + } + }, + "a37f2164e11d4e5f851a4a09a12c663c": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": 
[], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_f14106c7f60f411199acf47f530443fd", + "max": 2, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_bf88ee6dc83648d8a61d75bb4466b1e3", + "value": 2 + } + }, + "b2fdc502d6ee491bb826fd616e35d407": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "b7c439aa6d584c5784b46980050e503d": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_8aa8805651d34181b1851d302ccc47e2", + "IPY_MODEL_713f1d91e445411288e565a33ce4b271", + "IPY_MODEL_55941e08c602404c9342d00b7ee26918" + ], + "layout": "IPY_MODEL_87da02f5606d404ea242c3bd1f9ac38c" + } + }, + "bf88ee6dc83648d8a61d75bb4466b1e3": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "c06e354cb8294e66a3d7590a576571e0": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + 
"object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "c1d37ab1952b4d268d9786b74b6902d7": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "c3a1b520140444fbb40b7ac789f7ac0e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_30885be6a7c84f0f9f02bc2ea11679bc", + "placeholder": "​", + "style": "IPY_MODEL_29178e51df9e47489fff623763b130ed", + "value": " 4.95G/4.95G [00:16<00:00, 216MB/s]" + } + }, + "c7c0c34a71954d6ea976c774573c49c5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HBoxModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HBoxModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HBoxView", + "box_style": "", + "children": [ + "IPY_MODEL_0d6ec3bab579406fa4e6fc2b3d6b6998", + "IPY_MODEL_a37f2164e11d4e5f851a4a09a12c663c", + "IPY_MODEL_ef32431228f24a5498810a36b9cf6506" + ], + "layout": 
"IPY_MODEL_c06e354cb8294e66a3d7590a576571e0" + } + }, + "ce6a3a655d2f4ce2ab351c766568bed5": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "d5e9f339fe7e4ab9955531cc125f071e": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_6a15f1cf54a141fc9d6bb790366c6bdd", + "placeholder": "​", + "style": "IPY_MODEL_8813b56cd89744b58ace2787206e1501", + "value": "model-00001-of-00002.safetensors: 100%" + } + }, + 
"d7a00f1f114e4f008f4d5a48c1c69f53": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "d7d4a9f444fe4ebb9135035e2166a3a5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_2e081a17ddc04104882893a30c902265", + "max": 13489, + "min": 0, + "orientation": "horizontal", + "style": "IPY_MODEL_938f8f60901442f2902eb51e86c27961", + "value": 13489 + } + }, + 
"e2998c2c6b1f4d489a5e39f2076838e4": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e5a6d300bbf441b8904aa9afb89e6f31": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": 
null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e68cf53b04a845ac9d6f4047600ebc21": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "e7f471604a5a42e095d35d8ad399c6fe": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": 
"LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "eba6048eb694485693656fcbf4a4f297": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + }, + "edc37210db734d01a8afce596698bb27": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + 
"flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "ef32431228f24a5498810a36b9cf6506": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_5776e818d9d34e009f95833056522876", + "placeholder": "​", + "style": "IPY_MODEL_06b1a069317041c8a9174c14fdc867bc", + "value": " 2/2 [00:17<00:00,  7.35s/it]" + } + }, + "f14106c7f60f411199acf47f530443fd": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": 
null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "fa0c528cca744cff8da0a4fa21fdb4b5": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "HTMLModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "HTMLModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "HTMLView", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_e5a6d300bbf441b8904aa9afb89e6f31", + "placeholder": "​", + "style": "IPY_MODEL_88226abe35534278bbd427d8eff0f5f8", + "value": "model.safetensors.index.json: 100%" + } + }, + "fa9cf3e72280417d8711ef7227a95d34": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "FloatProgressModel", + "state": { + "_dom_classes": [], + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "FloatProgressModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/controls", + "_view_module_version": "1.5.0", + "_view_name": "ProgressView", + "bar_style": "success", + "description": "", + "description_tooltip": null, + "layout": "IPY_MODEL_edc37210db734d01a8afce596698bb27", + "max": 4945242264, + "min": 0, + "orientation": "horizontal", + "style": 
"IPY_MODEL_eba6048eb694485693656fcbf4a4f297", + "value": 4945242264 + } + }, + "faf25fd219f24bdbaa2e3202548c97d9": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "DescriptionStyleModel", + "state": { + "_model_module": "@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "DescriptionStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "description_width": "" + } + }, + "fddbae43ce7f477cafaff89b81e47fc7": { + "model_module": "@jupyter-widgets/base", + "model_module_version": "1.2.0", + "model_name": "LayoutModel", + "state": { + "_model_module": "@jupyter-widgets/base", + "_model_module_version": "1.2.0", + "_model_name": "LayoutModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "LayoutView", + "align_content": null, + "align_items": null, + "align_self": null, + "border": null, + "bottom": null, + "display": null, + "flex": null, + "flex_flow": null, + "grid_area": null, + "grid_auto_columns": null, + "grid_auto_flow": null, + "grid_auto_rows": null, + "grid_column": null, + "grid_gap": null, + "grid_row": null, + "grid_template_areas": null, + "grid_template_columns": null, + "grid_template_rows": null, + "height": null, + "justify_content": null, + "justify_items": null, + "left": null, + "margin": null, + "max_height": null, + "max_width": null, + "min_height": null, + "min_width": null, + "object_fit": null, + "object_position": null, + "order": null, + "overflow": null, + "overflow_x": null, + "overflow_y": null, + "padding": null, + "right": null, + "top": null, + "visibility": null, + "width": null + } + }, + "feb438afda6b4c148a3a62ee7e03da74": { + "model_module": "@jupyter-widgets/controls", + "model_module_version": "1.5.0", + "model_name": "ProgressStyleModel", + "state": { + "_model_module": 
"@jupyter-widgets/controls", + "_model_module_version": "1.5.0", + "_model_name": "ProgressStyleModel", + "_view_count": null, + "_view_module": "@jupyter-widgets/base", + "_view_module_version": "1.2.0", + "_view_name": "StyleView", + "bar_color": null, + "description_width": "" + } + } + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/notebooks/ru/rag_zephyr_langchain.ipynb b/notebooks/ru/rag_zephyr_langchain.ipynb new file mode 100644 index 00000000..03ea25f7 --- /dev/null +++ b/notebooks/ru/rag_zephyr_langchain.ipynb @@ -0,0 +1,520 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Kih21u1tyr-I" + }, + "source": [ + "# Простой RAG для проблем GitHub с использованием Hugging Face Zephyr и LangChain\n", + "\n", + "_Автор: [Мария Халусова](https://github.com/MKhalusova)_\n", + "\n", + "Этот блокнот демонстрирует, как можно быстро создать RAG (Retrieval Augmented Generation) для проблем (issues) проекта на GitHub, используя модель [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), и LangChain.\n", + "\n", + "\n", + "**Что такое RAG?**\n", + "\n", + "RAG - это популярный подход к решению проблемы, связанной с тем, что мощная LLM не знает о конкретном контенте, поскольку его нет в ее обучающих данных, или галлюцинирует, даже если видела его ранее. Такой специфический контент может быть проприетарным, конфиденциальным или, как в данном примере, недавно появившимся и часто обновляемым.\n", + "\n", + "Если ваши данные статичны и не меняются регулярно, вы можете рассмотреть возможность дообучения большой модели. Однако во многих случаях дообучение может быть дорогостоящим, а при многократном повторении (например, для устранения дрейфа данных (address data drift) приводить к \"сдвигу модели (model shift)\". 
This is when the model's behavior changes in an undesirable way.\n", + "\n", + "**RAG (Retrieval Augmented Generation)** does not require fine-tuning the model. Instead, RAG works by providing the LLM with additional context retrieved from relevant data, so that it can generate a better-informed response.\n", + "\n", + "Here's a quick illustration:\n", + "\n", + "![RAG diagram](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/rag-diagram.png)\n", + "\n", + "* The external data is converted into embedding vectors with a separate embedding model, and the vectors are stored in a database. Embedding models are typically small, so updating the embedding vectors on a regular basis is faster, cheaper, and easier than fine-tuning a model.\n", + "\n", + "* At the same time, the fact that fine-tuning is not required gives you the freedom to swap your LLM for a more powerful one when it becomes available, or to switch to a smaller distilled version if you need faster inference.\n", + "\n", + "Let's illustrate building a RAG using an open-source LLM, an embedding model, and LangChain.\n", + "\n", + "First, install the required dependencies:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lC9frDOlyi38" + }, + "outputs": [], + "source": [ + "! 
pip install -q torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "-aYENQwZ-p_c" + }, + "outputs": [], + "source": [ + "# If you are running in Google Colab, you may need to run this cell to make sure UTF-8 is used when installing LangChain\n", + "import locale\n", + "locale.getpreferredencoding = lambda: \"UTF-8\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W5HhMZ2c-NfU" + }, + "outputs": [], + "source": [ + "! pip install -q langchain" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8po01vMWzXL" + }, + "source": [ + "## Prepare the data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3cCmQywC04x6" + }, + "source": [ + "In this example, we'll load all of the issues (both open and closed) from the [PEFT library's repo](https://github.com/huggingface/peft).\n", + "\n", + "First, you need to acquire a [GitHub personal access token](https://github.com/settings/tokens?type=beta) to access the GitHub API." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8MoD7NbsNjlM" + }, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "ACCESS_TOKEN = getpass(\"YOUR_GITHUB_PERSONAL_TOKEN\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fccecm3a10N6" + }, + "source": [ + "Next, we'll load all of the issues in the [huggingface/peft](https://github.com/huggingface/peft) repo:\n", + "- By default, pull requests are considered issues as well, but here we chose to exclude them from the data by setting `include_prs=False`.\n", + "- Setting `state = \"all\"` means we will load both open and closed issues." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "8EKMit4WNDY8" + }, + "outputs": [], + "source": [ + "from langchain.document_loaders import GitHubIssuesLoader\n", + "\n", + "loader = GitHubIssuesLoader(\n", + " repo=\"huggingface/peft\",\n", + " access_token=ACCESS_TOKEN,\n", + " include_prs=False,\n", + " state=\"all\"\n", + ")\n", + "\n", + "docs = loader.load()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CChTrY-k2qO5" + }, + "source": [ + "The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to use all of the available content, we need to split the documents into appropriately sized chunks.\n", + "\n", + "The most common and straightforward approach to chunking is to define a fixed chunk size and whether there should be any overlap between chunks. Keeping some overlap allows us to preserve some semantic context across neighboring chunks. The recommended splitter for generic text is the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), and that is what we'll use here. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OmsXOf59Pmm-" + }, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "\n", + "splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n", + "\n", + "chunked_docs = splitter.split_documents(docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DAt_zPVlXOn7" + }, + "source": [ + "## Создание эмбеддингов + retriever" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mvat6JQl4yp" + }, + "source": [ + "Теперь, когда все документы имеют подходящий размер, мы можем создать базу данных с их эмбеддингами.\n", + "\n", + "Для создания фрагментов эмбеддингов документов мы будем использовать модель эмбеддингов `HuggingFaceEmbeddings` и [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5). Есть много других моделей эмбеддингов, доступных на Hub, и вы можете отслеживать самые эффективные из них, проверяя [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).\n", + "\n", + "\n", + "Для создания базы векторов мы воспользуемся библиотекой `FAISS`, разработанной Facebook AI. Эта библиотека обеспечивает эффективный поиск сходства и кластеризацию плотных векторов (dense vectors), что нам и нужно. В настоящее время FAISS является одной из наиболее используемых библиотек для NN поиска в массивных наборах данных.\n", + "\n", + "Мы получим доступ к модели эмбеддингов и FAISS через LangChain API." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ixmCdRzBQ5gu" + }, + "outputs": [], + "source": [ + "from langchain.vectorstores import FAISS\n", + "from langchain.embeddings import HuggingFaceEmbeddings\n", + "\n", + "db = FAISS.from_documents(chunked_docs,\n", + " HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2iCgEPi0nnN6" + }, + "source": [ + "We need a way to retrieve documents given an unstructured query. For that, we'll use the `as_retriever` method with `db` as the backbone:\n", + "- `search_type=\"similarity\"` means we want to perform a similarity search between the query and the documents\n", + "- `search_kwargs={'k': 4}` instructs the retriever to return the top 4 results.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "mBTreCQ9noHK" + }, + "outputs": [], + "source": [ + "retriever = db.as_retriever(\n", + " search_type=\"similarity\",\n", + " search_kwargs={'k': 4}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WgEhlISJpTgj" + }, + "source": [ + "The vector database and the retriever are now set up. What remains is to set up the next piece of the chain: the model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tzQxx0HkXVFU" + }, + "source": [ + "## Load the quantized model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9jy1cC65p_GD" + }, + "source": [ + "For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but capable model.\n", + "\n", + "With many models being released every week, you may want to substitute this model with the latest and greatest. 
The best way to keep track of open-source LLMs is to follow the [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n", + "\n", + "To make inference faster, we will load the quantized version of the model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L-ggaa763VRo" + }, + "outputs": [], + "source": [ + "import torch\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n", + "\n", + "model_name = 'HuggingFaceH4/zephyr-7b-beta'\n", + "\n", + "bnb_config = BitsAndBytesConfig(\n", + " load_in_4bit=True,\n", + " bnb_4bit_use_double_quant=True,\n", + " bnb_4bit_quant_type=\"nf4\",\n", + " bnb_4bit_compute_dtype=torch.bfloat16\n", + ")\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)\n", + "tokenizer = AutoTokenizer.from_pretrained(model_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hVNRJALyXYHG" + }, + "source": [ + "## Set up the LLM chain" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RUUNneJ1smhl" + }, + "source": [ + "Finally, we have all the pieces we need to set up the LLM chain.\n", + "\n", + "First, create a text-generation pipeline using the loaded model and its tokenizer.\n", + "\n", + "Next, create a prompt template. It should follow the format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "cR0k1cRWz8Pm" + }, + "outputs": [], + "source": [ + "from langchain.llms import HuggingFacePipeline\n", + "from langchain.prompts import PromptTemplate\n", + "from transformers import pipeline\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "\n", + "text_generation_pipeline = pipeline(\n", + " model=model,\n", + " tokenizer=tokenizer,\n", + " task=\"text-generation\",\n", + " temperature=0.2,\n", + " do_sample=True,\n", + " repetition_penalty=1.1,\n", + " return_full_text=True,\n", + " max_new_tokens=400,\n", + ")\n", + "\n", + "llm = HuggingFacePipeline(pipeline=text_generation_pipeline)\n", + "\n", + "prompt_template = \"\"\"\n", + "<|system|>\n", + "Answer the question based on your knowledge. Use the following context to help:\n", + "\n", + "{context}\n", + "\n", + "\n", + "<|user|>\n", + "{question}\n", + "\n", + "<|assistant|>\n", + "\n", + " \"\"\"\n", + "\n", + "prompt = PromptTemplate(\n", + " input_variables=[\"context\", \"question\"],\n", + " template=prompt_template,\n", + ")\n", + "\n", + "llm_chain = prompt | llm | StrOutputParser()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l19UKq5HXfSp" + }, + "source": [ + "Note: _You can also use `tokenizer.apply_chat_template` to convert a list of messages (as dicts: `{'role': 'user', 'content': '(...)'}`) into a string with the appropriate chat format._\n", + "\n", + "\n", + "Finally, we need to combine the `llm_chain` with the retriever to create a RAG chain. 
At the final generation step, we pass the original question along with the retrieved context documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "_rI3YNp9Xl4s" + }, + "outputs": [], + "source": [ + "from langchain_core.runnables import RunnablePassthrough\n", + "\n", + "retriever = db.as_retriever()\n", + "\n", + "rag_chain = (\n", + "    {\"context\": retriever, \"question\": RunnablePassthrough()}\n", + "    | llm_chain\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UsCOhfDDXpaS" + }, + "source": [ + "## Comparing the results\n", + "\n", + "Let's see how RAG affects the answers generated for library-specific questions." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "W7F07fQLXusU" + }, + "outputs": [], + "source": [ + "question = \"How do you combine multiple adapters?\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KC0rJYU1x1ir" + }, + "source": [ + "First, let's see what answer we get from the model alone, with no added context:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 125 + }, + "id": "GYh-HG1l0De5", + "outputId": "277d8e89-ce9b-4e04-c11b-639ad2645759" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "\" To combine multiple adapters, you need to ensure that they are compatible with each other and the devices you want to connect. Here's how you can do it:\\n\\n1. Identify the adapters you need: Determine which adapters you require to connect the devices you want to use together. For example, if you want to connect a USB-C device to an HDMI monitor, you may need a USB-C to HDMI adapter and a USB-C to USB-A adapter (if your computer only has USB-A ports).\\n\\n2. 
Connect the first adapter: Plug in the first adapter into the device you want to connect. For instance, if you're connecting a USB-C laptop to an HDMI monitor, plug the USB-C to HDMI adapter into the laptop's USB-C port.\\n\\n3. Connect the second adapter: Next, connect the second adapter to the first one. In this case, connect the USB-C to USB-A adapter to the USB-C port of the USB-C to HDMI adapter.\\n\\n4. Connect the final device: Finally, connect the device you want to use to the second adapter. For example, connect the HDMI cable from the monitor to the HDMI port on the USB-C to HDMI adapter.\\n\\n5. Test the connection: Turn on both devices and check whether everything is working correctly. If necessary, adjust the settings on your devices to ensure optimal performance.\\n\\nBy combining multiple adapters, you can connect a variety of devices together, even if they don't have the same type of connector. Just be sure to choose adapters that are compatible with all the devices you want to connect and test the connection thoroughly before relying on it for critical tasks.\"" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "llm_chain.invoke({\"context\":\"\", \"question\": question})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i-TIWr3wx9w8" + }, + "source": [ + "As you can see, the model interpreted the question as one about physical computer adapters, whereas in the context of PEFT, \"adapters\" refers to LoRA adapters.\n", + "Let's see whether adding context from GitHub issues helps the model give a more relevant answer:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 125 + }, + "id": "FZpNA3o10H10", + "outputId": "31f9aed3-3dd7-4ff8-d1a8-866794fefe80" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + 
"text/plain": [ + "\" Based on the provided context, it seems that combining multiple adapters is still an open question in the community. Here are some possibilities:\\n\\n 1. Save the output from the base model and pass it to each adapter separately, as described in the first context snippet. This allows you to run multiple adapters simultaneously and reuse the output from the base model. However, this approach requires loading and running each adapter separately.\\n\\n 2. Export everything into a single PyTorch model, as suggested in the second context snippet. This would involve saving all the adapters and their weights into a single model, potentially making it larger and more complex. The advantage of this approach is that it would allow you to run all the adapters simultaneously without having to load and run them separately.\\n\\n 3. Merge multiple Lora adapters, as mentioned in the third context snippet. This involves adding multiple distinct, independent behaviors to a base model by merging multiple Lora adapters. It's not clear from the context how this would be done, but it suggests that there might be a recommended way of doing it.\\n\\n 4. Combine adapters through a specific architecture, as proposed in the fourth context snippet. This involves merging multiple adapters into a single architecture, potentially creating a more complex model with multiple behaviors. Again, it's not clear from the context how this would be done.\\n\\n Overall, combining multiple adapters is still an active area of research, and there doesn't seem to be a widely accepted solution yet. 
If you're interested in exploring this further, it might be worth reaching out to the Hugging Face community or checking out their documentation for more information.\"" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "rag_chain.invoke(question)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hZQedZKSyrwO" + }, + "source": [ + "As we can see, the added context does help the very same model give a much more relevant and informed answer to the library-specific question.\n", + "\n", + "Notably, combining multiple adapters for inference has been added to the library, and this is covered in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings." + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}