Commit f16ca12

Fix pdf template for latest datatype refactorings
1 parent f663b8e commit f16ca12

42 files changed: +265 -1084 lines changed

.gitignore

+1
@@ -197,3 +197,4 @@ test/sleep.json
 /deploy/testenv/requirements.txt
 /superduper/rest/superdupertmp
 /example*
+.cache

CHANGELOG.md

+1-1
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 **Before you create a Pull Request, remember to update the Changelog with your changes.**

-## Changes Since Last Release
+## Changes Since Last Release

 #### Changed defaults / behaviours


plugins/sentence_transformers/superduper_sentence_transformers/model.py

+2-4
@@ -2,7 +2,6 @@

 from sentence_transformers import SentenceTransformer as _SentenceTransformer
 from superduper.backends.query_dataset import QueryDataset
-from superduper.base.enums import DBType
 from superduper.components.component import ensure_initialized
 from superduper.components.model import Model, Signature, _DeviceManaged

@@ -62,9 +61,9 @@ def __post_init__(self, db, example):
         self.object = _SentenceTransformer(self.model, device=self.device)
         self._default_model = True

-    def dict(self, metadata: bool = True, defaults: bool = True):
+    def dict(self, metadata: bool = True, defaults: bool = True, refs: bool = False):
         """Serialize as a dictionary."""
-        r = super().dict(metadata=metadata, defaults=defaults)
+        r = super().dict(metadata=metadata, defaults=defaults, refs=refs)
         if self._default_model:
             del r['object']
         return r
@@ -125,4 +124,3 @@ def _pre_create(self, db):
         """
         if self.datatype is not None:
             return
-
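
The `dict` change above widens an overriding method so that it accepts the new `refs` keyword and forwards it to the parent. A minimal sketch of that pattern follows, using hypothetical stand-in classes (`Base`, `Child`) rather than the real superduper components; what `refs` actually controls in superduper is not shown in this diff, so the stand-in only records the flag.

```python
class Base:
    """Hypothetical stand-in for the parent component class (not superduper's)."""

    def __init__(self):
        self.identifier = "demo"
        self.object = object()

    def dict(self, metadata: bool = True, defaults: bool = True, refs: bool = False):
        r = {"identifier": self.identifier, "object": self.object}
        # Record the flag so that the forwarding can be observed in the result.
        r["_refs"] = refs
        return r


class Child(Base):
    """Hypothetical subclass mirroring the shape of the override in the diff."""

    _default_model = True

    def dict(self, metadata: bool = True, defaults: bool = True, refs: bool = False):
        # Forward every keyword, including the newly added `refs`.
        r = super().dict(metadata=metadata, defaults=defaults, refs=refs)
        if self._default_model:
            # Drop the loaded object, as the override in the diff does.
            del r["object"]
        return r


result = Child().dict(refs=True)
```

If the override kept the old two-argument signature, a caller passing `refs=...` would hit a `TypeError`, which is presumably why the signature and the `super()` call were updated together.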

templates/pdf_rag/Dockerfile

+8-10
@@ -2,15 +2,13 @@ FROM python:3.10

 WORKDIR /app

-COPY ./superduper .
-COPY ./plugins .
+COPY ./superduper superduper
+COPY pyproject.toml pyproject.toml
+COPY ./plugins plugins
 COPY ./templates/pdf_rag templates/pdf_rag

-RUN pip install -e . && pip install plugins/mongodb && pip install streamlit
-RUN pip install --no-cache-dir ./templates/pdf_rag/requirements.txt
-
-RUN superduper bootstrap ./templates/pdf_rag
-RUN python3 add_data.py
-RUN superduper apply pdf_rag
-
-CMD ["streamlit", "run", "templates/pdf_rag/streamlit.py", "--server.runOnSave", "true"]
+RUN pip install . plugins/mongodb streamlit ipython jupyter -r ./templates/pdf_rag/requirements.txt
+RUN chmod +x templates/pdf_rag/install.sh
+RUN ./templates/pdf_rag/install.sh
+
+CMD ["/bin/bash"]

templates/pdf_rag/README.md

+68
@@ -0,0 +1,68 @@
+## PDF RAG template
+
+**Clone the code**
+
+```bash
+git clone --branch main --depth 1 git@github.com:superduper-io/superduper.git
+```
+
+**Build the docker**
+
+```bash
+docker build . -f templates/pdf_rag/Dockerfile -t pdf-rag
+```
+
+**Start the docker container with the ports opened and data mounted**
+
+You need to mount a local directory to `/app/data` in the container.
+
+***The PDF directory should be a sub-directory of this directory!***
+
+```bash
+docker run -it -p 8501:8501 -v ~/.superduper/:/root/.superduper -v ./data/:/app/data pdf-rag bash
+```
+
+**Set your OpenAI or Ollama credentials**
+
+(Run this in the container)
+
+```bash
+export OPENAI_API_KEY=ollama # or sk-<secret> for openai
+export OPENAI_BASE_URL=... # URL of ollama server if applicable
+```
+
+**Set your MongoDB connection**
+
+```bash
+export SUPERDUPER_DATA_BACKEND=mongodb://host.docker.internal:27017/test_db
+export SUPERDUPER_ARTIFACT_STORE=filesystem://./data
+```
+
+**Prepare the app with your choice of models and data**
+
+```bash
+bash templates/pdf_rag/start.sh bodybuilder <embedding_model> <llm_model>
+```
+
+For example, on OpenAI:
+
+```bash
+bash templates/pdf_rag/start.sh bodybuilder text-embedding-ada-002 gpt-3.5-turbo
+```
+
+For example, on Ollama:
+
+```bash
+bash templates/pdf_rag/start.sh bodybuilder nomic-embed-text:latest llama3.1:70b
+```
+
+**Run the app's frontend**
+
+```bash
+python3 -m streamlit run templates/pdf_rag/streamlit.py
+```
+
+### Notes
+
+- If you exit the docker container, you will need to reset your environment variables (`export $...`)
+- If you are using Ollama, you need to use models which are installed on the server
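
The README's first note says the environment variables must be set again after leaving the container. As a hedged illustration, the sketch below checks that the variables exported in the README are present before the app is started. The helper name `missing_env_vars` and the choice to treat `OPENAI_BASE_URL` as optional are assumptions made for this example; the commit itself contains no such check.

```python
import os

# Variable names taken from the README; OPENAI_BASE_URL is treated as optional
# here because the README marks it "if applicable" (an assumption of this sketch).
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "SUPERDUPER_DATA_BACKEND",
    "SUPERDUPER_ARTIFACT_STORE",
)


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running this inside the container before `start.sh` would surface a forgotten `export` early instead of as a connection or authentication error later.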

templates/pdf_rag/add_data.py

+18
@@ -0,0 +1,18 @@
+import os
+import sys
+
+from superduper import Schema, Table
+from superduper.components.datatype import file
+
+
+pdf_folder = sys.argv[1]
+pdf_names = [pdf for pdf in os.listdir(pdf_folder) if pdf.endswith(".pdf")]
+pdf_paths = [os.path.join(pdf_folder, pdf) for pdf in pdf_names]
+data = [{"url": pdf_path, "file": pdf_path} for pdf_path in pdf_paths]
+from superduper import superduper
+db = superduper()
+COLLECTION_NAME = next(x for x in pdf_folder.split('/')[::-1] if x)
+schema = Schema(identifier="myschema", fields={'url': 'str', 'file': file})
+table = Table(identifier=COLLECTION_NAME, schema=schema)
+db.apply(table, force=True)
+db[COLLECTION_NAME].insert(data).execute()
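
The `COLLECTION_NAME` expression in `add_data.py` takes the last non-empty path component of the folder argument, so a trailing slash does not produce an empty table name. The sketch below isolates that one expression in a hypothetical helper, `collection_name_from_folder`, which is not part of the commit; the example paths are illustrative.

```python
def collection_name_from_folder(pdf_folder: str) -> str:
    """Return the last non-empty '/'-separated component of a folder path."""
    # Split on '/', walk the components from the end, skip empty strings
    # (these appear when the path ends with '/').
    return next(x for x in pdf_folder.split('/')[::-1] if x)


names = [
    collection_name_from_folder("data/bodybuilder"),
    collection_name_from_folder("data/bodybuilder/"),
    collection_name_from_folder("/app/data/bodybuilder//"),
]
print(names)
```

All three calls yield `bodybuilder`. A path with no non-empty component, such as `"/"`, raises `StopIteration` because the generator is exhausted. On POSIX paths, `os.path.basename(os.path.normpath(pdf_folder))` is a standard-library alternative for the cases shown.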
Binary files not shown (29 files).
