
Commit

update code
xavieryao committed Feb 16, 2023
1 parent 74fa8db commit 6b56762
Showing 250 changed files with 2,005 additions and 1,393 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ snc/src/config.py
tmp/
*/tmp/
.env
**/.DS_Store
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2022 WDM@UofA
Copyright (c) 2023 WDM@UofA

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
226 changes: 129 additions & 97 deletions README.md
@@ -1,120 +1,39 @@
nlpworkbench
====

## What-Is
- [The purpose of this project](#about)
- [The layout of this project](#layout)
- [The call stack of an API call](#api-call-stack)
## How-To
- [Deploy the whole thing on a single machine](#deployment)
- [Add a new NLP tool](#extension)
- [Write and run tests](#testing)
- [Move some NLP tools to another machine](#distributed-deployment)
- [Backup and restore the workbench](#restoring-from-backups)
- [Understand an API call stack](#api-call-stack)

## About
Please refer to the paper.

## Deployment
Docker is the preferred way of deployment.
### Docker
Requires a recent Docker release with the Docker Compose plugin. Tested with Docker v20.10.16.

On the host machine, prepare the folders for persisting data
```bash
mkdir /path/to/neo4j/data # folder for storing neo4j data
mkdir /path/to/es/data # folder for storing elasticsearch data
mkdir /path/to/sqlite/data # folder for storing embeddings
touch /path/to/sqlite/data/embeddings.sqlite3 # database file for storing embeddings

mkdir -p /path/to/neo4j/certs/bolt/ # folder for storing neo4j certificates
cp /path/to/privkey.pem /path/to/neo4j/certs/bolt/private.key
cp /path/to/fullchain.pem /path/to/neo4j/certs/bolt/public.crt
cp /path/to/fullchain.pem /path/to/neo4j/certs/bolt/trusted/public.crt

# change permission to writable
chmod a+rwx /path/to/neo4j/data
chmod a+rwx /path/to/es/data
chmod a+rwx /path/to/sqlite/data/embeddings.sqlite3

chown -R 7474:7474 /path/to/neo4j/certs/
# change permissions of neo4j certificates following https://neo4j.com/docs/operations-manual/current/security/ssl-framework/#ssl-bolt-config
# just for example,
chmod 0755 /path/to/neo4j/certs/bolt/private.key
```

Modify the `docker-compose.yml` file to mount the volumes to the correct locations (the folders you created above). Search for `volumes:` or `# CHANGE THIS` in `docker-compose.yml` and replace `source: ` with the correct path.

Follow this [document](https://www.elastic.co/guide/en/kibana/current/docker.html) to set elasticsearch passwords and generate enrollment tokens for kibana.
```bash
# set password for user elastic
docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic -i
# set password for user kibana_system
docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u kibana_system -i
# generate an enrollment token for kibana
docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s kibana
```
Open kibana in a browser and use the enrollment token to set up kibana.


Modify the port mapping in the `docker-compose.yml` file under `services -> frontend -> ports` to change the exposed port. The current value is 8080, which means `http://localhost:8080` is the URL for the workbench.

Clone the repositories and build docker images:
```bash
# clone thirdparty nlp tools and frontend
git submodule init
git submodule update
# build images
docker compose --profile non-gpu --profile gpu build
docker compose -f docker-compose.dev.yml --profile non-gpu --profile gpu build
# run
docker compose --profile non-gpu --profile gpu up
docker compose -f docker-compose.dev.yml --profile non-gpu --profile gpu up
```

Finally, run some quick tests. The entity linker will not work because the knowledge graph is empty. "Feelin' Lucky" also will not work because the article collection in ES is empty. But entity recognition, semantic parsing, and relation extraction should be working fine.

Paste a news link in the input box and click "Load News" to run some tests.

### Manual deployment
You shouldn't need to deploy things manually; the Docker setup performs all of the steps below.

#### Run a development server
Create a new virtual environment and run `pip3 install -r requirements-api.txt`.

Use `flask run --host=127.0.0.1 --port=3000` to start a development server.
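
For orientation, the gunicorn command in `build/Dockerfile.api` points at `workbench.wsgi:create_app()`, i.e. a Flask application factory. A minimal sketch of such a factory is shown below; the route and its response are purely illustrative, not the workbench's actual API.
```python
# Illustrative Flask application factory; the real workbench.wsgi module differs.
from flask import Flask

def create_app() -> Flask:
    app = Flask(__name__)

    @app.route("/api/health")  # hypothetical endpoint, handy for a quick smoke test
    def health():
        return {"status": "ok"}

    return app
```
With such a factory, `flask --app "workbench.wsgi:create_app()" run --host=127.0.0.1 --port=3000` (newer Flask) or the `FLASK_APP` environment variable (older Flask) selects the app for the development server.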

#### Deploy API server for production
1. Install nginx and configure SSL certificates (for example, with Let's Encrypt).
2. Put `confs/nginx-conf` in `/etc/nginx/sites-enabled/`. You can do this by `ln -s confs/nginx-conf /etc/nginx/sites-enabled/nlpworkbench-backend`
3. Run `sudo nginx -s reload` to load the site configuration.
4. Put `confs/demo-api.service` in `/etc/systemd/system/`. You can do this by `ln -s confs/demo-api.service /etc/systemd/system/demo-api.service`.
5. Run `sudo systemctl daemon-reload` to load the service configuration.
6. Run `sudo systemctl start demo-api` to start the backend.

#### Setting up components
##### Article DB
Articles are stored and indexed in Elasticsearch. After setting up Elasticsearch, modify config.py and point `es_url` and `es_auth` to the es server, and change `es_article_collection` to the collection name.
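
As a rough illustration, these settings in `config.py` might look like the following sketch; the URL, credentials, and index name are placeholders, and the exact format of `es_auth` is an assumption.
```python
# Sketch of the Elasticsearch-related entries in config.py (placeholder values).
es_url = "https://localhost:9200"        # address of the ES server (assumed)
es_auth = ("elastic", "changeme")        # credentials; exact format is an assumption
es_article_collection = "news-articles"  # name of the article index (placeholder)
```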

##### Named entity recognition
Download the pre-trained model from [https://nlp.cs.princeton.edu/projects/pure/ace05_models/ent-bert-ctx300.zip](https://nlp.cs.princeton.edu/projects/pure/ace05_models/ent-bert-ctx300.zip) and decompress it.

Clone the NER system from `[email protected]:UAlberta/nlpwokkbench/pure-ner.git`.

In `config.py`, point `ner_script` to `run_ner.py` in the PURE NER repository, and `ner_model` to the model folder.

##### Entity linking
Entities are indexed in elasticsearch by their aliases. Point `es_entity_collection` to the entity collection in ES.

After generating candidates from an ES query, our system rescores all candidates by comparing sentence embeddings. For this we need pre-computed embeddings for all the entities (stored in an SQLite database, [download here](https://drive.google.com/file/d/17cvpeiwifVMBJ-Sidqq_f-pskvyyAlhv/view?usp=sharing)). Change `embeddings_db` in `config.py` to the SQLite file.
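
For intuition, a hedged sketch of this rescoring step is shown below. It is not the workbench's actual code: the SQLite schema, function name, and ranking details are assumptions; only the model name (`multi-qa-mpnet-base-dot-v1`, the one cached in the linker image) comes from the repository.
```python
# Illustrative candidate rescoring with sentence embeddings.
# The sqlite schema and function are hypothetical, not the workbench's real code.
import sqlite3
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # model cached in the linker image

def rescore(mention_sentence: str, candidate_ids: list, db_path: str = "embeddings.sqlite3"):
    """Rank candidate entities by dot-product similarity to the mention's sentence."""
    query_vec = model.encode(mention_sentence)
    conn = sqlite3.connect(db_path)
    scored = []
    for entity_id in candidate_ids:
        # assumed table layout: embeddings(entity_id TEXT, embedding BLOB)
        row = conn.execute(
            "SELECT embedding FROM embeddings WHERE entity_id = ?", (entity_id,)
        ).fetchone()
        if row is None:
            continue
        entity_vec = np.frombuffer(row[0], dtype=np.float32)
        scored.append((entity_id, float(np.dot(query_vec, entity_vec))))
    conn.close()
    return sorted(scored, key=lambda item: item[1], reverse=True)
```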

Neo4j stores entity attributes and descriptions. Set `neo4j_url` and `neo4j_auth` properly.

##### Semantic parsing
Clone `[email protected]:UAlberta/nlpwokkbench/amrbart.git`

Download the pre-trained model by running
```bash
# apt install git-lfs
git lfs install
git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMR2Text
```
The workbench should be up and running on `http://localhost:8085`. Paste a news link in the input box and click "Load News" to run some tests.

Point `amr_script` to `inference_amr.py` in the AMRBART repository, and set `amr_model` properly.
Without further configuration some parts will not work: Kibana needs to be paired with Elasticsearch, the entity linker will not work because the knowledge graph is empty, and "Feelin' Lucky" will not work because the article collection in ES is empty. Entity recognition, semantic parsing, and relation extraction should work fine.

By default Docker creates temporary volumes to store data. In production we want to persist data, which is done by binding locations on the host to the containers. We also need to configure Neo4j and Kibana to pair with the servers. Details on deploying in production mode are documented [here](docs/deploy-production.md).
## Extension
![Architecture](arch.svg)
![Architecture](docs/arch.svg)
The architecture of the nlpworkbench is shown in the above figure. Each NLP tool or model runs in its own container and communicates with the API server using Celery, or alternatively any protocol you like.
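
Concretely, exposing a tool over Celery can look like the sketch below. This is not the workbench's actual code: the broker URL, task name, and the toy `analyze` function are placeholders; the repository's own `create_celery` helper in `workbench/rpc.py` wraps similar boilerplate.
```python
# Hedged sketch of wrapping an NLP tool as a Celery task (placeholder names/URLs).
from celery import Celery

app = Celery(
    "my_tool",
    broker="redis://localhost:6379/0",   # assumed broker; the workbench may use a different one
    backend="redis://localhost:6379/1",  # assumed result backend
)

@app.task(name="my_tool.analyze")
def analyze(text: str) -> dict:
    # Stand-in for a call into the actual model.
    return {"length": len(text)}
```
The API server can then invoke it from anywhere that reaches the broker, e.g. `app.send_task("my_tool.analyze", args=["some text"]).get()`, which is what makes it easy to move a function to another machine.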

The goal of using Celery is that you can move any Python function to any physical machine and
@@ -275,7 +194,7 @@ docker run -it --rm \
**neo4j image version must match dump version and dbms version!!**

## API Call Stack
This [diagram](callstack.pdf) shows what happens behind the scenes when an API call is made to run NER on a document.
This [diagram](docs/callstack.pdf) shows what happens behind the scenes when an API call is made to run NER on a document.

When a REST API is called, the NGINX reverse proxy (running in `frontend` container) decrypts the HTTPS request, and passes it to the `api` container. Inside the `api` container, `gunicorn` passes the request to one of the Flask server processes. Numbers below correspond to labels in the diagram.

@@ -290,4 +209,117 @@ When a REST API is called, the NGINX reverse proxy (running in `frontend` contai
9. The `call` function is a wrapper in PURE NER's code base. PURE NER could originally only be invoked from the command line (handled in `__main__`), and this wrapper function pretends the inputs come from the command line.
10. We also have lazy-loading helper functions so that models are only loaded once.
11. The output of PURE NER is automatically stored in Elasticsearch by the `es_cache` decorator (a hedged sketch of both patterns follows this list).
12. The NER output is formatted to suit the needs of the frontend and returned to the user.
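
As an illustration only, lazy loading and an `es_cache`-style decorator could be written roughly as follows; the real implementations in the workbench may differ, and `load_pure_ner_model` is a hypothetical stand-in for however the PURE NER checkpoint is actually loaded.
```python
# Illustrative sketches; names, signatures, and the ES document layout are assumptions.
import functools
import hashlib
import json
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("https://localhost:9200")  # placeholder connection

@functools.lru_cache(maxsize=None)
def get_ner_model():
    """Load the model on the first call only; later calls reuse the cached instance."""
    return load_pure_ner_model("/path/to/ent-bert-ctx300")  # hypothetical loader

def es_cache(index: str):
    """Cache a function's output in Elasticsearch, keyed by a hash of its arguments."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = hashlib.sha1(json.dumps([args, kwargs], default=str).encode()).hexdigest()
            try:
                return es.get(index=index, id=key)["_source"]["result"]
            except NotFoundError:
                result = fn(*args, **kwargs)
                es.index(index=index, id=key, document={"result": result})
                return result
        return wrapper
    return decorator
```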

## Layout
```
build/
|
|-- Dockerfile.api
|-- Dockerfile.service
frontend/
requirements/
|
|-- api.txt
|-- service.txt
workbench/
|
|-- __init__.py
|-- rpc.py
|-- snc/
|   |
|   |-- __init__.py
|-- thirdparty/
|   |-- amrbart/
|   |-- pure-ner/
docker-compose.yml
```
The `build/` folder contains all `Dockerfile`s, and the `requirements/` folder contains a `requirements.txt` for each micro-service.
`workbench/` contains all Python code. The folder and all of its subfolders (except `thirdparty/`) are [Python packages](https://docs.python.org/3/tutorial/modules.html#packages). A `__init__.py` file must be present in every subfolder. [Relative imports](https://docs.python.org/3/tutorial/modules.html#intra-package-references) (`from . import config`, or `from ..rpc import create_celery`) are the preferred way to reference modules within the `workbench` package.
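
For instance, a new module added as `workbench/my_tool.py` (a hypothetical name) would reference its siblings with relative imports along these lines; the call to `create_celery` is an assumed usage, see `rpc.py` for the real signature.
```python
# workbench/my_tool.py -- hypothetical module illustrating package-relative imports.
from . import config              # resolves to workbench/config.py
from .rpc import create_celery    # resolves to workbench/rpc.py

celery = create_celery()          # assumed usage; the real helper may take arguments

def run(text: str) -> dict:
    # placeholder entry point for the hypothetical tool
    return {"length": len(text)}
```
Because of this package layout, such a module is started as `python3 -m workbench.my_tool`, matching how the Dockerfiles below launch `workbench.semantic`, `workbench.background`, and `workbench.linker`.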
## Testing
We are moving towards test-driven development. The infrastructure for unit tests is available.
### Writing unit tests
We use the [pytest](https://docs.pytest.org/en/7.2.x/) framework to test Python code. It is a very lightweight framework: to write tests, create a file `tests/test_some_module.py` containing functions named `test_some_feature` that make `assert` statements.
Here's a snippet from `tests/test_sentiment.py` that tests the VADER sentiment analyzer:
```python
from workbench import vader
def test_classify_positive_sents():
positive_sents = [
"I love this product.",
"Fantastic!",
"I am so happy.",
"This is a great movie."
]
for sent in positive_sents:
output = vader.run_vader(sent)
assert output["polarity_compound"] > 0.3
```

Running `python3 -m pytest tests/test_sentiment.py` will produce a report for this set of unit tests like:
```
==================== test session starts ====================
platform linux -- Python 3.9.12, pytest-7.1.1, pluggy-1.0.0
rootdir: /data/local/workbench-dev, configfile: pyproject.toml
plugins: anyio-3.5.0
collected 3 items
tests/test_sentiment.py ... [100%]
============== 3 passed, 3 warnings in 0.37s ==============
```

In practice we don't run code directly; instead we use Docker. Unit tests are added to a Docker image separate from the one used to run the service, using a [multi-stage build](https://docs.docker.com/language/java/run-tests/). Still using VADER as the example, the Dockerfile after adding tests becomes:
```Dockerfile
FROM python:3.7 AS base
WORKDIR /app
RUN mkdir /app/cache && mkdir /app/vader_log && mkdir /app/lightning_logs
COPY requirements/vader.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install -r requirements.txt
COPY workbench/ ./workbench
ENV PYTHONUNBUFFERED=TRUE

FROM base as prod
CMD ["python3", "-m", "workbench.vader"]

FROM base as test
COPY .coveragerc ./
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install pytest==7.2 coverage==7.0.5
COPY tests ./tests
CMD ["coverage", "run", "--data-file=cov/.coverage", "--source=workbench/", "--module", "pytest", "tests/test_sentiment.py"]
```
The `base` image contains all the source code and dependencies to run VADER. The `prod` image starts the process that serves VADER. The `test` image is used to run testing, where tests are added and test frameworks are installed. Running a container with the `test` image will invoke the tests.

After adding the multi-stage build, `docker-compose.dev.yml` needs to be changed so that the default build stage is `prod`:
```yaml
vader:
build:
dockerfile: ./build/Dockerfile.vader
target: ${COMPOSE_TARGET:-prod}
```
### Running tests locally
`run-test.sh` provides scripts to run tests on your local machine using Docker. To test VADER:
```bash
./run-test.sh build vader # build the `test` stage image for vader
./run-test.sh test vader # run a container with the `test` image
# repeat the process for other services.
# `vader` can be replaced with other services defined in `docker-compose.dev.yml`
./run-test.sh coverage # combine coverage info from all tests and print coverage report
```

### Automated testing
Once your commits are pushed to GitLab, a pipeline is triggered to automatically run tests. The pipeline badge indicates whether the tests pass, and the coverage badge shows the line coverage percentage.

![pipeline](https://gitlab.com/UAlberta/nlpwokkbench/workbench-api/badges/dev/pipeline.svg)
![coverage](https://gitlab.com/UAlberta/nlpwokkbench/workbench-api/badges/dev/coverage.svg)

The tests triggered by the push are defined in `.gitlab-ci.yml`. When adding new tests, a `build-something` job and a `test-something` job should be added following the structure of existing jobs in the file.

The test jobs will be executed by a local runner on one of our own machines (rather than on a shared runner provided by GitLab). The local GitLab Runner is installed on our machines as a Docker container, following [the official tutorial](https://docs.gitlab.com/runner/install/docker.html). Our local runner is then [registered with the repository](https://docs.gitlab.com/runner/register/index.html). The default image for the docker executors is `docker:20.10.16`. We are using [Docker socket binding](https://docs.gitlab.com/ee/ci/docker/using_docker_build.html#use-docker-socket-binding) so that docker images / containers created within docker containers will be running **on the host system**, instead of becoming nested containers. This is beneficial for caching and reusing layers.
1 change: 0 additions & 1 deletion arch.svg

This file was deleted.

14 changes: 7 additions & 7 deletions build/Dockerfile.amr
@@ -1,19 +1,19 @@
FROM continuumio/miniconda3:4.10.3
WORKDIR /app
RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs
COPY thirdparty/amrbart/requirements.yml .
RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs && mkdir -p /app/workbench/thirdparty
COPY workbench/thirdparty/amrbart/requirements.yml .
RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \
apt-get update && apt-get install -y git-lfs && \
rm -rf /var/lib/apt/lists/* && \
conda create --name amrbart && \
conda init bash && \
conda env update --file requirements.yml --name amrbart && \
git lfs install && \
git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing thirdparty/AMRBART-large-finetuned-AMR3.0-AMRParsing && \
git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing workbench/thirdparty/AMRBART-large-finetuned-AMR3.0-AMRParsing && \
conda run --no-capture-output -n amrbart python3 -m spacy download en_core_web_sm
COPY thirdparty/amrbart /app/thirdparty/amrbart
RUN conda run --no-capture-output -n amrbart pip3 install -e thirdparty/amrbart/spring && \
COPY workbench/thirdparty/amrbart /app/workbench/thirdparty/amrbart
RUN conda run --no-capture-output -n amrbart pip3 install -e workbench/thirdparty/amrbart/spring && \
conda run --no-capture-output -n amrbart python3 -c "from spring_amr.tokenization_bart import PENMANBartTokenizer; PENMANBartTokenizer.from_pretrained('facebook/bart-large')"
ENV AMRPARSING_RPC=1
COPY . .
CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "semantic.py"]
COPY workbench/ ./workbench
CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "-m", "workbench.semantic"]
8 changes: 4 additions & 4 deletions build/Dockerfile.amr2text
@@ -1,7 +1,7 @@
FROM continuumio/miniconda3:4.10.3
WORKDIR /app
RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs
COPY thirdparty/amrbart/requirements.yml .
COPY workbench/thirdparty/amrbart/requirements.yml .
RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \
apt-get update && apt-get install -y git-lfs && \
rm -rf /var/lib/apt/lists/* && \
@@ -11,9 +11,9 @@ RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \
git lfs install && \
git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMR2Text thirdparty/AMRBART-large-finetuned-AMR3.0-AMR2Text && \
conda run --no-capture-output -n amrbart python3 -m spacy download en_core_web_sm
COPY thirdparty/amrbart /app/thirdparty/amrbart
COPY workbench/thirdparty/amrbart /app/thirdparty/amrbart
RUN conda run --no-capture-output -n amrbart pip3 install -e thirdparty/amrbart/spring && \
conda run --no-capture-output -n amrbart python3 -c "from spring_amr.tokenization_bart import PENMANBartTokenizer; PENMANBartTokenizer.from_pretrained('facebook/bart-large')"
ENV AMR2TEXT_RPC=1
COPY . .
CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "semantic.py"]
COPY workbench/ ./workbench
CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "-m", "workbench.semantic"]
8 changes: 4 additions & 4 deletions build/Dockerfile.api
@@ -2,10 +2,10 @@ FROM python:3.7
WORKDIR /app
EXPOSE 50050
RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs
COPY requirements-api.txt .
COPY requirements/api.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install -r requirements-api.txt && \
pip3 install -r requirements.txt && \
python3 -m spacy download en_core_web_sm
COPY . .
COPY workbench/ ./workbench
ENV PYTHONUNBUFFERED=TRUE
CMD ["gunicorn", "--workers=33", "--timeout", "1500", "--bind", "0.0.0.0:50050", "wsgi:create_app()", "--log-level", "debug"]
CMD ["gunicorn", "--workers=1", "--timeout", "1500", "--bind", "0.0.0.0:50050", "workbench.wsgi:create_app()", "--log-level", "debug"]
8 changes: 4 additions & 4 deletions build/Dockerfile.background
@@ -1,10 +1,10 @@
FROM python:3.7
WORKDIR /app
RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs
COPY requirements-background.txt .
COPY requirements/background.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install scipy && \
pip3 install -r requirements-background.txt && \
pip3 install -r requirements.txt && \
python3 -m spacy download en_core_web_sm
COPY . .
CMD ["python3", "background.py"]
COPY workbench/ ./workbench
CMD ["python3", "-m", "workbench.background"]
8 changes: 4 additions & 4 deletions build/Dockerfile.linker
@@ -2,12 +2,12 @@ FROM python:3.7
WORKDIR /app
VOLUME [ "/app/db" ]
RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs
COPY requirements-linker.txt .
COPY requirements/linker.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install -r requirements-linker.txt && \
pip3 install -r requirements.txt && \
python3 -m spacy download en_core_web_sm
# cache sentence transformer model
RUN python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('multi-qa-mpnet-base-dot-v1')"
COPY . .
COPY workbench/ ./workbench
ENV PYTHONUNBUFFERED=TRUE
CMD ["python3", "linker.py"]
CMD ["python3", "-m", "workbench.linker"]
