diff --git a/.gitignore b/.gitignore index 0be931c..b7c1825 100644 --- a/.gitignore +++ b/.gitignore @@ -9,3 +9,4 @@ snc/src/config.py tmp/ */tmp/ .env +**/.DS_Store \ No newline at end of file diff --git a/LICENSE b/LICENSE index f884453..1c6140c 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,6 @@ MIT License -Copyright (c) 2022 WDM@UofA +Copyright (c) 2023 WDM@UofA Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/README.md index 22ad0cf..e463853 100644 --- a/README.md +++ b/README.md @@ -1,120 +1,39 @@ nlpworkbench ==== +## What-Is +- [The purpose of this project](#about) +- [The layout of this project](#layout) +- [The call stack of an API call](#api-call-stack) ## How-To - [Deploy the whole thing on a single machine](#deployment) - [Add a new NLP tool](#extension) +- [Write and run tests](#testing) - [Move some NLP tools to another machine](#distributed-deployment) - [Backup and restore the workbench](#restoring-from-backups) -- [Understand an API call stack](#api-call-stack) + +## About +Please refer to the paper. ## Deployment Docker is the preferred way of deployment. -### Docker Requires a newer docker and docker compose plugin. Tested with docker v20.10.16. -On the host machine, prepare the folders for persisting data -```bash -mkdir /path/to/neo4j/data # folder for storing neo4j data -mkdir /path/to/es/data # folder for storing elasticsearch data -mkdir /path/to/sqlite/data # folder for storing embeddings -touch /path/to/sqlite/data/embeddings.sqlite3 # database file for storing embeddings - -mkdir -p /path/to/neo4j/certs/bolt/ # folder for storing neo4j certificates -cp /path/to/privkey.pem /path/to/neo4j/certs/bolt/private.key -cp /path/to/fullchain.pem /path/to/neo4j/certs/bolt/public.crt -cp /path/to/fullchain.pem /path/to/neo4j/certs/bolt/trusted/public.crt - -# change permission to writable -chmod a+rwx /path/to/neo4j/data -chmod a+rwx /path/to/es/data -chmod a+rwx /path/to/sqlite/data/embeddings.sqlite3 - -chown -R 7474:7474 /path/to/neo4j/certs/ -# change permissions of neo4j certificates following https://neo4j.com/docs/operations-manual/current/security/ssl-framework/#ssl-bolt-config -# just for example, -chmod 0755 /path/to/neo4j/certs/bolt/private.key -``` - -Modify `docker-compose.yml` file to mount the volumes to the correct locations (the folders you created above). Search for `volumes:` or `# CHANGE THIS` in `docker-compose.yml` and replace `source: ` with the correct path. - -Follow this [document](https://www.elastic.co/guide/en/kibana/current/docker.html) to set elasticsearch passwords and generate enrollment tokens for kibana. -```bash -# set password for user elastic -docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic -i -# set password for user kibana_system -docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u kibana_system -i -# generate an enrollment token for kibana -docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s kibana -``` -Open kibana in a browser and use the enrollment token to set up kibana. - - -Modify the port mapping in `docker-compose.yml` file under `services -> frontend -> ports` to change the exposed port. The current one is 8080, which means `http://localhost:8080` is the url for the workbench. 
- Clone the repositories and build docker images: ```bash -# clone thirdparty nlp tools and frontend -git submodule init -git submodule update # build images -docker compose --profile non-gpu --profile gpu build +docker compose -f docker-compose.dev.yml --profile non-gpu --profile gpu build # run -docker compose --profile non-gpu --profile gpu up +docker compose -f docker-compose.dev.yml --profile non-gpu --profile gpu up ``` -Finally, run some quick tests. The entity linker will not work because the knowledge graph is empty. "Feelin' Lucky" also will not work because the article collection in ES is empty. But entity recognition, semantic parsing, and relation extraction should be working fine. - -Paste a news link in the input box and click "Load News" to run some tests. - -### Manual deployment -You shouldn't need to deploy things manually. The docker service does all the steps here. - -#### Run a development server -Create a new virtual environment and run `pip3 install -r requirements-api.txt` - -Use `flask run --host=127.0.0.1 --port=3000` to start a development server. - -#### Deploy API server for production -1. Install nginx, and configure SSL certificates (probably with let's encrypt). -2. Put `confs/nginx-conf` in `/etc/nginx/sites-enabled/`. You can do this by `ln -s confs/nginx-conf /etc/nginx/sites-enabled/nlpworkbench-backend` -3. Run `sudo nginx -s reload` to load the site configuration. -4. Put `confs/demo-api.service` in `/etc/systemd/system/`. You can do this by `ln -s confs/demo-api.service /etc/systemd/system/demo-api.service`. -5. Run `sudo systemd daemon-reload` to load the service configuration. -6. Run `sudo systemd start demo-api` to start the backend. - -#### Setting up components -##### Article DB -Articles are stored and indexed in Elasticsearch. After setting up Elasticsearch, modify config.py and point `es_url` and `es_auth` to the es server, and change `es_article_collection` to the collection name. - -##### Named entity recognition -Download the pre-trained model from [https://nlp.cs.princeton.edu/projects/pure/ace05_models/ent-bert-ctx300.zip](https://nlp.cs.princeton.edu/projects/pure/ace05_models/ent-bert-ctx300.zip) and decompress it. - -Clone the NER system from `git@gitlab.com:UAlberta/nlpwokkbench/pure-ner.git`. - -In `config.py`, point `ner_script` to `run_ner.py` in the PURE NER repository, and `ner_model` to the model folder. - -##### Entity linking -Entities are indexed in elasticsearch by their aliases. Point `es_entity_collection` to the entity collection in ES. - -After generating candidates from a ES query, our system rescores all candidates by comparing sentence embeddings. For this we'll need pre-computed embeddings for all the entities (stored in an sqlite db, [download here](https://drive.google.com/file/d/17cvpeiwifVMBJ-Sidqq_f-pskvyyAlhv/view?usp=sharing)). Change `embeddings_db` in `config.py` to the sqlite file. - -Neo4j stores entity attributes and descriptions. Set `neo4j_url` and `neo4j_auth` properly. - -##### Semantic parsing -Clone `git@gitlab.com:UAlberta/nlpwokkbench/amrbart.git` - -Download the pre-trained model by running -```bash -# apt install git-lfs -git lfs install -git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMR2Text -``` +The workbench should be up and running on `http://localhost:8085`. Paste a news link in the input box and click "Load News" to run some tests. -Point `amr_script` to `inference_amr.py` in the AMRBART repository, and set `amr_model` properly. 
+Without further configuration, some parts will not work: Kibana needs to be paired with Elasticsearch. The entity linker will not work because the knowledge graph is empty. "Feelin' Lucky" also will not work because the article collection in ES is empty. But entity recognition, semantic parsing, and relation extraction should be working fine. +By default Docker creates temporary volumes to store data. In production we want the data to persist, which is done by binding locations on the host to the containers. We also need to configure Neo4j and pair Kibana with Elasticsearch. Details on deploying in production mode are documented [here](docs/deploy-production.md). ## Extension -![Architecture](arch.svg) +![Architecture](docs/arch.svg) The architecture of the nlpworkbench is shown in the above figure. Each NLP tool / model runs in its independent container, and communicates with the API server using Celery, or alternatively any protocol you like. The goal of using Celery is that you can move any Python function to any physical machine and @@ -275,7 +194,7 @@ docker run -it --rm \ **neo4j image version must match dump version and dbms version!!** ## API Call Stack -This [diagram](callstack.pdf) shows what's happening behind the scene when an API call is made to run NER on a document. +This [diagram](docs/callstack.pdf) shows what's happening behind the scenes when an API call is made to run NER on a document. When a REST API is called, the NGINX reverse proxy (running in `frontend` container) decrypts the HTTPS request, and passes it to the `api` container. Inside the `api` container, `gunicorn` passes the request to one of the Flask server processes. Numbers below correspond to labels in the diagram. @@ -290,4 +209,117 @@ When a REST API is called, the NGINX reverse proxi 9. `call` function is a wrapper in PURE NER's code base. PURE NER initially can only be called via command line (handled in `__main__`) and this wrapper function pretends inputs are from the command line. 10. We also have lazy loading helper functions so that models are only loaded once. 11. Output of PURE NER is automatically stored in Elasticsearch by the `es_cache` decorator. -12. NER output is formatted to suit the need of the frontend, and responded to the user. \ No newline at end of file +12. NER output is formatted to suit the needs of the frontend, and returned to the user. + +## Layout +``` +build/ + | + |-- Dockerfile.api + |-- Dockerfile.service +frontend/ +requirements/ + | + |-- api.txt + |-- service.txt +workbench/ + | + |-- __init__.py + |-- rpc.py + |-- snc/ + | + |-- __init__.py + |-- thirdparty/ + | -- amrbart/ + | -- pure-ner/ +docker-compose.yml +``` + +The `build/` folder contains all `Dockerfile`s, and the `requirements/` folder contains a `requirements.txt` for each micro-service. + +`workbench/` contains all Python code. The folder and all of its subfolders (except `thirdparty/`) are [Python packages](https://docs.python.org/3/tutorial/modules.html#packages). An `__init__.py` file must be present in every subfolder. [Relative imports](https://docs.python.org/3/tutorial/modules.html#intra-package-references) (`from . import config`, or `from ..rpc import create_celery`) are the preferred way to reference modules within the `workbench` package. + +## Testing +We are moving towards test-driven development. The infrastructure for unit tests is available. + +### Writing unit tests +We use the [pytest](https://docs.pytest.org/en/7.2.x/) framework to test Python code.
It is a very light-weight framework: to write tests, one creates a file `tests/test_some_module.py` containing functions with names like `test_some_feature` that make `assert` statements. + +Here's a snippet from `tests/test_sentiment.py` that tests the VADER sentiment analyzer: +```python +from workbench import vader + +def test_classify_positive_sents(): + positive_sents = [ + "I love this product.", + "Fantastic!", + "I am so happy.", + "This is a great movie." + ] + for sent in positive_sents: + output = vader.run_vader(sent) + assert output["polarity_compound"] > 0.3 +``` + +Running `python3 -m pytest tests/test_sentiment.py` will produce a report for this set of unit tests like: +``` +==================== test session starts ==================== +platform linux -- Python 3.9.12, pytest-7.1.1, pluggy-1.0.0 +rootdir: /data/local/workbench-dev, configfile: pyproject.toml +plugins: anyio-3.5.0 +collected 3 items + +tests/test_sentiment.py ... [100%] +============== 3 passed, 3 warnings in 0.37s ============== +``` + +In practice we don't run the code directly; instead we use Docker. Unit tests are added to a Docker image separate from the one used to run the service, using a [multi-stage build](https://docs.docker.com/language/java/run-tests/). Still using VADER as the example, the Dockerfile after adding tests becomes: +```Dockerfile +FROM python:3.7 AS base +WORKDIR /app +RUN mkdir /app/cache && mkdir /app/vader_log && mkdir /app/lightning_logs +COPY requirements/vader.txt requirements.txt +RUN --mount=type=cache,target=/root/.cache/pip \ + pip3 install -r requirements.txt +COPY workbench/ ./workbench +ENV PYTHONUNBUFFERED=TRUE + +FROM base as prod +CMD ["python3", "-m", "workbench.vader"] + +FROM base as test +COPY .coveragerc ./ +RUN --mount=type=cache,target=/root/.cache/pip \ + pip3 install pytest==7.2 coverage==7.0.5 +COPY tests ./tests +CMD ["coverage", "run", "--data-file=cov/.coverage", "--source=workbench/", "--module", "pytest", "tests/test_sentiment.py"] +``` +The `base` image contains all the source code and dependencies needed to run VADER. The `prod` image starts the process that serves VADER. The `test` image is used to run the tests: it adds the test files and installs the test frameworks. Running a container with the `test` image will invoke the tests. + +After adding the multi-stage build, `docker-compose.dev.yml` needs to be changed to specify `prod` as the default build stage: +```yaml + vader: + build: + dockerfile: ./build/Dockerfile.vader + target: ${COMPOSE_TARGET:-prod} +``` + +### Run tests locally +`run-test.sh` is a helper script for running tests on your local machine using Docker. To test VADER: +```bash +./run-test.sh build vader # build the `test` stage image for vader +./run-test.sh test vader # run a container with the `test` image +# repeat the process for other services. +# `vader` can be replaced with other services defined in `docker-compose.dev.yml` +./run-test.sh coverage # combine coverage info from all tests and print coverage report +``` + +### Automated testing +Once your commits are pushed to GitLab, a pipeline is triggered to automatically run tests. The pipeline badge indicates whether the tests pass, and the coverage badge shows the line coverage percentage. + +![pipeline](https://gitlab.com/UAlberta/nlpwokkbench/workbench-api/badges/dev/pipeline.svg) +![coverage](https://gitlab.com/UAlberta/nlpwokkbench/workbench-api/badges/dev/coverage.svg) + +The tests triggered by the push are defined in `.gitlab-ci.yml`.
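For illustration, a matching pair of jobs for the VADER service could look roughly like the sketch below. The job names, stage names, and script lines here are hypothetical; they are not copied from the actual `.gitlab-ci.yml`, which remains the source of truth:
```yaml
# Hypothetical sketch of a build/test job pair (not the actual .gitlab-ci.yml).
# Assumes the run-test.sh wrapper described above and the default docker:20.10.16 executor image.
stages:
  - build
  - test

build-vader:
  stage: build
  script:
    - ./run-test.sh build vader   # build the `test` stage image for vader

test-vader:
  stage: test
  needs: ["build-vader"]
  script:
    - ./run-test.sh test vader    # run a container with the `test` image
```
Because the runner uses Docker socket binding (described below), the image built in `build-vader` remains on the host and can be reused by `test-vader`.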
When adding new tests, a `build-something` job and a `test-something` job should be added following the structure of existing jobs in the file. + +The test jobs will be executed by a local runner on one of our own machines (rather than on a shared runner provided by GitLab). The local GitLab Runner is installed on our machines as a Docker container, following [the official tutorial](https://docs.gitlab.com/runner/install/docker.html). Our local runner is then [registered with the repository](https://docs.gitlab.com/runner/register/index.html). The default image for the docker executors is `docker:20.10.16`. We are using [Docker socket binding](https://docs.gitlab.com/ee/ci/docker/using_docker_build.html#use-docker-socket-binding) so that docker images / containers created within docker containers will be running **on the host system**, instead of becoming nested containers. This is beneficial for caching and reusing layers. \ No newline at end of file diff --git a/arch.svg b/arch.svg deleted file mode 100644 index 8125597..0000000 --- a/arch.svg +++ /dev/null @@ -1 +0,0 @@ - \ No newline at end of file diff --git a/build/Dockerfile.amr b/build/Dockerfile.amr index 7cf47c4..8820107 100644 --- a/build/Dockerfile.amr +++ b/build/Dockerfile.amr @@ -1,7 +1,7 @@ FROM continuumio/miniconda3:4.10.3 WORKDIR /app -RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs -COPY thirdparty/amrbart/requirements.yml . +RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs && mkdir -p /app/workbench/thirdparty +COPY workbench/thirdparty/amrbart/requirements.yml . RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \ apt-get update && apt-get install -y git-lfs && \ rm -rf /var/lib/apt/lists/* && \ @@ -9,11 +9,11 @@ RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \ conda init bash && \ conda env update --file requirements.yml --name amrbart && \ git lfs install && \ - git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing thirdparty/AMRBART-large-finetuned-AMR3.0-AMRParsing && \ + git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMRParsing workbench/thirdparty/AMRBART-large-finetuned-AMR3.0-AMRParsing && \ conda run --no-capture-output -n amrbart python3 -m spacy download en_core_web_sm -COPY thirdparty/amrbart /app/thirdparty/amrbart -RUN conda run --no-capture-output -n amrbart pip3 install -e thirdparty/amrbart/spring && \ +COPY workbench/thirdparty/amrbart /app/workbench/thirdparty/amrbart +RUN conda run --no-capture-output -n amrbart pip3 install -e workbench/thirdparty/amrbart/spring && \ conda run --no-capture-output -n amrbart python3 -c "from spring_amr.tokenization_bart import PENMANBartTokenizer; PENMANBartTokenizer.from_pretrained('facebook/bart-large')" ENV AMRPARSING_RPC=1 -COPY . . -CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "semantic.py"] +COPY workbench/ ./workbench +CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "-m", "workbench.semantic"] diff --git a/build/Dockerfile.amr2text b/build/Dockerfile.amr2text index a4308d8..63fc4c9 100644 --- a/build/Dockerfile.amr2text +++ b/build/Dockerfile.amr2text @@ -1,7 +1,7 @@ FROM continuumio/miniconda3:4.10.3 WORKDIR /app RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs -COPY thirdparty/amrbart/requirements.yml . +COPY workbench/thirdparty/amrbart/requirements.yml . 
RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \ apt-get update && apt-get install -y git-lfs && \ rm -rf /var/lib/apt/lists/* && \ @@ -11,9 +11,9 @@ RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \ git lfs install && \ git clone https://huggingface.co/xfbai/AMRBART-large-finetuned-AMR3.0-AMR2Text thirdparty/AMRBART-large-finetuned-AMR3.0-AMR2Text && \ conda run --no-capture-output -n amrbart python3 -m spacy download en_core_web_sm -COPY thirdparty/amrbart /app/thirdparty/amrbart +COPY workbench/thirdparty/amrbart /app/thirdparty/amrbart RUN conda run --no-capture-output -n amrbart pip3 install -e thirdparty/amrbart/spring && \ conda run --no-capture-output -n amrbart python3 -c "from spring_amr.tokenization_bart import PENMANBartTokenizer; PENMANBartTokenizer.from_pretrained('facebook/bart-large')" ENV AMR2TEXT_RPC=1 -COPY . . -CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "semantic.py"] +COPY workbench/ ./workbench +CMD ["conda", "run", "--no-capture-output", "-n", "amrbart", "python3", "-m", "workbench.semantic"] diff --git a/build/Dockerfile.api b/build/Dockerfile.api index 423253b..5e7295a 100644 --- a/build/Dockerfile.api +++ b/build/Dockerfile.api @@ -2,10 +2,10 @@ FROM python:3.7 WORKDIR /app EXPOSE 50050 RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs -COPY requirements-api.txt . +COPY requirements/api.txt requirements.txt RUN --mount=type=cache,target=/root/.cache/pip \ - pip3 install -r requirements-api.txt && \ + pip3 install -r requirements.txt && \ python3 -m spacy download en_core_web_sm -COPY . . +COPY workbench/ ./workbench ENV PYTHONUNBUFFERED=TRUE -CMD ["gunicorn", "--workers=33", "--timeout", "1500", "--bind", "0.0.0.0:50050", "wsgi:create_app()", "--log-level", "debug"] +CMD ["gunicorn", "--workers=1", "--timeout", "1500", "--bind", "0.0.0.0:50050", "workbench.wsgi:create_app()", "--log-level", "debug"] diff --git a/build/Dockerfile.background b/build/Dockerfile.background index 1217b7c..5ba9810 100644 --- a/build/Dockerfile.background +++ b/build/Dockerfile.background @@ -1,10 +1,10 @@ FROM python:3.7 WORKDIR /app RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs -COPY requirements-background.txt . +COPY requirements/background.txt requirements.txt RUN --mount=type=cache,target=/root/.cache/pip \ pip3 install scipy && \ - pip3 install -r requirements-background.txt && \ + pip3 install -r requirements.txt && \ python3 -m spacy download en_core_web_sm -COPY . . -CMD ["python3", "background.py"] \ No newline at end of file +COPY workbench/ ./workbench +CMD ["python3", "-m", "workbench.background"] \ No newline at end of file diff --git a/build/Dockerfile.linker b/build/Dockerfile.linker index 5fb9170..8317f35 100644 --- a/build/Dockerfile.linker +++ b/build/Dockerfile.linker @@ -2,12 +2,12 @@ FROM python:3.7 WORKDIR /app VOLUME [ "/app/db" ] RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs -COPY requirements-linker.txt . +COPY requirements/linker.txt requirements.txt RUN --mount=type=cache,target=/root/.cache/pip \ - pip3 install -r requirements-linker.txt && \ + pip3 install -r requirements.txt && \ python3 -m spacy download en_core_web_sm # cache sentence transformer model RUN python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('multi-qa-mpnet-base-dot-v1')" -COPY . . 
+COPY workbench/ ./workbench ENV PYTHONUNBUFFERED=TRUE -CMD ["python3", "linker.py"] \ No newline at end of file +CMD ["python3", "-m", "workbench.linker"] \ No newline at end of file diff --git a/build/Dockerfile.ner b/build/Dockerfile.ner index 9ddfd57..3e63606 100644 --- a/build/Dockerfile.ner +++ b/build/Dockerfile.ner @@ -1,17 +1,26 @@ -FROM continuumio/miniconda3:4.10.3 +FROM continuumio/miniconda3:4.10.3 AS base WORKDIR /app RUN --mount=type=cache,target=/opt/conda/pkgs,sharing=private \ apt-get update && apt-get install -y unzip build-essential && rm -rf /var/lib/apt/lists/* && \ conda install python=3.7 pytorch==1.11.0 cudatoolkit=11.3 -c pytorch -RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs && mkdir /app/thirdparty -COPY requirements-ner.txt . +RUN mkdir /app/cache && mkdir /app/ner_log && mkdir /app/runs && mkdir /app/lightning_logs && mkdir -p /app/workbench/thirdparty +COPY requirements/ner.txt requirements.txt RUN --mount=type=cache,target=/root/.cache/pip \ conda run --no-capture-output pip3 install scipy && \ - conda run --no-capture-output pip3 install -r requirements-ner.txt && \ + conda run --no-capture-output pip3 install -r requirements.txt && \ conda run --no-capture-output python3 -m spacy download en_core_web_sm RUN --mount=type=cache,target=/root/.cache/huggingface \ conda run --no-capture-output python3 -c "from transformers import AutoModel, AutoTokenizer; AutoModel.from_pretrained('bert-base-uncased'); AutoTokenizer.from_pretrained('bert-base-uncased')" -RUN wget https://nlp.cs.princeton.edu/projects/pure/ace05_models/ent-bert-ctx300.zip -O /app/thirdparty/ent-bert-ctx300.zip -nv && \ - unzip /app/thirdparty/ent-bert-ctx300.zip -d /app/thirdparty/ -COPY . . -CMD ["conda", "run", "--no-capture-output", "python3", "ner.py"] +RUN wget https://nlp.cs.princeton.edu/projects/pure/ace05_models/ent-bert-ctx300.zip -O /app/workbench/thirdparty/ent-bert-ctx300.zip -nv && \ - unzip /app/workbench/thirdparty/ent-bert-ctx300.zip -d /app/workbench/thirdparty/ +COPY workbench/ ./workbench + +FROM base AS test +COPY .coveragerc ./ +RUN --mount=type=cache,target=/root/.cache/pip \ + conda run --no-capture-output pip3 install pytest==7.2 coverage==7.0.5 +COPY tests ./tests +CMD ["conda", "run", "--no-capture-output", "coverage", "run", "--data-file=cov/.coverage", "--source=workbench/", "--module", "pytest", "tests/test_ner.py"] + +FROM base AS prod +CMD ["conda", "run", "--no-capture-output", "python3", "-m", "workbench.ner"] \ No newline at end of file diff --git a/build/Dockerfile.patching b/build/Dockerfile.patching new file mode 100644 index 0000000..26043d2 --- /dev/null +++ b/build/Dockerfile.patching @@ -0,0 +1,5 @@ +# environment for running patch_compose.py +FROM python:3.9-alpine +RUN pip install pyyaml +ADD patch_compose.py /patch_compose.py +ENTRYPOINT [ "python3", "/patch_compose.py" ] \ No newline at end of file diff --git a/relation_extraction/Dockerfile b/build/Dockerfile.rel similarity index 69% rename from relation_extraction/Dockerfile rename to build/Dockerfile.rel index 4067d5d..ab2c462 100644 --- a/relation_extraction/Dockerfile +++ b/build/Dockerfile.rel @@ -1,11 +1,11 @@ FROM python:3.7 WORKDIR /app RUN mkdir /app/cache && mkdir /app/re_log && mkdir /app/runs -COPY requirements.txt .
+COPY requirements/rel.txt requirements.txt RUN --mount=type=cache,target=/root/.cache/pip \ pip3 install --use-feature=fast-deps -r requirements.txt && \ python3 -m spacy download en_core_web_lg ENV PYTHONUNBUFFERED=TRUE WORKDIR /app -COPY . . -CMD ["python3", "run.py"] \ No newline at end of file +COPY workbench/ ./workbench +CMD ["python3", "-m", "workbench.relation_extraction"] \ No newline at end of file diff --git a/build/Dockerfile.snc b/build/Dockerfile.snc index ace3589..fcbf2e5 100644 --- a/build/Dockerfile.snc +++ b/build/Dockerfile.snc @@ -1,10 +1,18 @@ -FROM python:3.9.15 +FROM python:3.9 AS base WORKDIR /app -RUN mkdir /app/data && mkdir /app/src -COPY src/ /app/src/ -COPY /data /app/data/ -COPY requirements-snc.txt . -RUN python3 -m pip install --upgrade pip setuptools wheel -RUN python3 -m pip install -r requirements-snc.txt -COPY snc.py . -CMD ["python3", "snc.py"] \ No newline at end of file +RUN mkdir /app/data +COPY workbench/snc/requirements.txt . +RUN --mount=type=cache,target=/root/.cache/pip \ + python3 -m pip install --upgrade pip setuptools wheel && \ + python3 -m pip install -r requirements.txt +COPY workbench/ ./workbench + +FROM base as test +COPY .coveragerc ./ +RUN --mount=type=cache,target=/root/.cache/pip \ + pip3 install pytest==7.2 coverage==7.0.5 +COPY tests ./tests +CMD ["coverage", "run", "--data-file=cov/.coverage", "--source=workbench/", "--module", "pytest", "tests/test_snc.py"] + +FROM base AS prod +CMD ["python3", "-m", "workbench.snc"] \ No newline at end of file diff --git a/build/Dockerfile.vader b/build/Dockerfile.vader index 30db7ab..eef25a5 100644 --- a/build/Dockerfile.vader +++ b/build/Dockerfile.vader @@ -1,10 +1,18 @@ -FROM python:3.7 +FROM python:3.7 AS base WORKDIR /app -VOLUME [ "/app/db" ] -RUN mkdir /app/cache && mkdir /app/vader_log && mkdir /app/runs && mkdir /app/lightning_logs -COPY requirements-vader.txt . +RUN mkdir /app/cache && mkdir /app/vader_log && mkdir /app/lightning_logs +COPY requirements/vader.txt requirements.txt RUN --mount=type=cache,target=/root/.cache/pip \ - pip3 install -r requirements-vader.txt -COPY . . 
+ pip3 install -r requirements.txt +COPY workbench/ ./workbench ENV PYTHONUNBUFFERED=TRUE -CMD ["python3", "vader.py"] + +FROM base as test +COPY .coveragerc ./ +RUN --mount=type=cache,target=/root/.cache/pip \ + pip3 install pytest==7.2 coverage==7.0.5 +COPY tests ./tests +CMD ["coverage", "run", "--data-file=cov/.coverage", "--source=workbench/", "--module", "pytest", "tests/test_sentiment.py"] + +FROM base as prod +CMD ["python3", "-m", "workbench.vader"] \ No newline at end of file diff --git a/docker-compose.dev.yml b/docker-compose.dev.yml new file mode 100644 index 0000000..f474bf7 --- /dev/null +++ b/docker-compose.dev.yml @@ -0,0 +1,195 @@ +version: "3" +name: "nlp-workbench-dev" +services: + api: + ports: + - "50050:50050" + build: + dockerfile: ./build/Dockerfile.api + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - non-gpu + - debug + + frontend: + build: + context: ./frontend + dockerfile: ./Dockerfile + depends_on: + - api + - neo4j + - kibana + - flower + ports: + - "8085:80" + profiles: + - non-gpu + - debug + + ner: + build: + dockerfile: ./build/Dockerfile.ner + target: ${COMPOSE_TARGET:-prod} + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - gpu + + linker: + build: + dockerfile: ./build/Dockerfile.linker + depends_on: + - redis + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - non-gpu + + amr: + build: + dockerfile: ./build/Dockerfile.amr + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - gpu + + amr2text: + build: + dockerfile: ./build/Dockerfile.amr2text + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - gpu + + vader: + build: + dockerfile: ./build/Dockerfile.vader + target: ${COMPOSE_TARGET:-prod} + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - non-gpu + + relation_extraction: + build: + dockerfile: ./build/Dockerfile.rel + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - non-gpu + + background: + build: + dockerfile: ./build/Dockerfile.background + depends_on: + - redis + environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + profiles: + - non-gpu + + flower: + image: mher/flower:latest + command: celery --broker=${RPC_BROKER:-redis://redis} --result-backend=redis://redis${RPC_BROKER:-redis://redis} flower --url_prefix=flower + ports: + - 5555 # unpublish port + depends_on: + - redis + profiles: + - non-gpu + - debug + + redis: + image: redis:7.0-alpine + command: redis-server --save "" --appendonly no ${REDIS_AUTH} + ports: + - 8081:6379 + profiles: + - non-gpu + - debug + + neo4j: + build: + context: ./neo4j + ports: + - 7687 + - 7474 + stop_grace_period: 2m + profiles: + - non-gpu + - debug + + snc-neo4j: + build: + context: ./neo4j + environment: + - NEO4J_AUTH=neo4j/snc123 + - NEO4J_dbms_security_procedures_unrestricted=gds.*,apoc.* + - NEO4J_dbms_security_procedures_allowlist=gds.*,apoc.* + - NEO4JLABS_PLUGINS=["apoc", "graph-data-science"] + profiles: + - non-gpu + - debug + stop_grace_period: 2m + + snc: + build: + dockerfile: ./build/Dockerfile.snc + target: ${COMPOSE_TARGET:-prod} + profiles: + - non-gpu + - debug + 
environment: + - RPC_BROKER=${RPC_BROKER:-redis://redis} + - RPC_BACKEND=${RPC_BACKEND:-redis://redis} + - BEARER_TOKEN=${BEARER_TOKEN} + - ELASTIC_PASSWORD=${ELASTIC_PASSWORD} + depends_on: + - snc-neo4j + - redis + - elasticsearch + + elasticsearch: + image: "elasticsearch:8.2.2" + environment: + - discovery.type=single-node + - xpack.security.enabled=false # disable SSL + - xpack.security.http.ssl.enabled=false # disable SSL + - xpack.security.transport.ssl.enabled=false # disable SSL + - bootstrap.memory_lock=true # disable swap + - logger.level=ERROR + #- path.repo=/repo # for restoring from backup + mem_limit: "16g" # TODO: increase later + ulimits: + memlock: + soft: -1 + hard: -1 + profiles: + - non-gpu + - debug + + kibana: + image: "kibana:8.2.2" + ports: + - 5601 # unpublish port + environment: + - SERVER_NAME=kibana + - SERVER_BASEPATH=/kibana + - ELASTICSEARCH_HOSTS=http://elasticsearch:9200 + - ELASTICSEARCH_USERNAME=kibana_system + - ELASTICSEARCH_PASSWORD=kibana # CHANGE THIS + - LOGGING_ROOT_LEVEL=warn + depends_on: + - elasticsearch + profiles: + - non-gpu + - debug diff --git a/docker-compose.yml b/docker-compose.yml index 02d2b79..625cc3a 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -113,6 +113,7 @@ services: vader: build: dockerfile: ./build/Dockerfile.vader + target: ${COMPOSE_TARGET:-prod} environment: - RPC_BROKER=${RPC_BROKER:-"redis://redis"} - RPC_BACKEND=${RPC_BACKEND:-"redis://redis"} @@ -127,7 +128,7 @@ services: relation_extraction: build: - context: ./relation_extraction + context: ./workbench/relation_extraction environment: - RPC_BROKER=${RPC_BROKER:-"redis://redis"} - RPC_BACKEND=${RPC_BACKEND:-"redis://redis"} @@ -208,7 +209,7 @@ services: source: /home/ubuntu/workbench/docker-data/neo4j-certs # CHANGE THIS target: /var/lib/neo4j/certificates environment: - - NEO4J_dbms_connector_bolt_client__auth=NONE + - NEO4J_dbms_connector_bolt_client__auth=OPTIONAL - NEO4J_dbms_connector_bolt_listen__address=0.0.0.0:7687 - NEO4J_dbms_connector_bolt_advertised__address=newskg.wdmuofa.ca:9201 - NEO4J_dbms_connector_bolt_tls__level=OPTIONAL @@ -242,8 +243,7 @@ services: snc: build: - dockerfile: Dockerfile.snc - context: ./snc/ + dockerfile: ./build/Dockerfile.snc profiles: - non-gpu - debug diff --git a/docs/arch.svg b/docs/arch.svg new file mode 100644 index 0000000..d044f69 --- /dev/null +++ b/docs/arch.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/callstack.pdf b/docs/callstack.pdf similarity index 100% rename from callstack.pdf rename to docs/callstack.pdf diff --git a/docs/deploy-production.md b/docs/deploy-production.md new file mode 100644 index 0000000..ee79421 --- /dev/null +++ b/docs/deploy-production.md @@ -0,0 +1,47 @@ +On the host machine, prepare the folders for persisting data +```bash +mkdir /path/to/neo4j/data # folder for storing neo4j data +mkdir /path/to/es/data # folder for storing elasticsearch data +mkdir /path/to/sqlite/data # folder for storing embeddings +touch /path/to/sqlite/data/embeddings.sqlite3 # database file for storing embeddings + +mkdir -p /path/to/neo4j/certs/bolt/ # folder for storing neo4j certificates +cp /path/to/privkey.pem /path/to/neo4j/certs/bolt/private.key +cp /path/to/fullchain.pem /path/to/neo4j/certs/bolt/public.crt +cp /path/to/fullchain.pem /path/to/neo4j/certs/bolt/trusted/public.crt + +# change permission to writable +chmod a+rwx /path/to/neo4j/data +chmod a+rwx /path/to/es/data +chmod a+rwx /path/to/sqlite/data/embeddings.sqlite3 + +chown -R 7474:7474 /path/to/neo4j/certs/ +# change 
permissions of neo4j certificates following https://neo4j.com/docs/operations-manual/current/security/ssl-framework/#ssl-bolt-config +# just for example, +chmod 0755 /path/to/neo4j/certs/bolt/private.key +``` + +Modify `docker-compose.yml` file to mount the volumes to the correct locations (the folders you created above). Search for `volumes:` or `# CHANGE THIS` in `docker-compose.yml` and replace `source: ` with the correct path. + +Follow this [document](https://www.elastic.co/guide/en/kibana/current/docker.html) to set elasticsearch passwords and generate enrollment tokens for kibana. +```bash +# set password for user elastic +docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic -i +# set password for user kibana_system +docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-reset-password -u kibana_system -i +# generate an enrollment token for kibana +docker exec -it nlp-workbench-elasticsearch-1 /usr/share/elasticsearch/bin/elasticsearch-create-enrollment-token -s kibana +``` +Open kibana in a browser and use the enrollment token to set up kibana. + + +Modify the port mapping in `docker-compose.yml` file under `services -> frontend -> ports` to change the exposed port. The current one is 8080, which means `http://localhost:8080` is the url for the workbench. + + +Clone the repositories and build docker images: +```bash +# build images +docker compose --profile non-gpu --profile gpu build +# run +docker compose --profile non-gpu --profile gpu up +``` \ No newline at end of file diff --git a/extract_text.py b/extract_text.py deleted file mode 100644 index dd099a2..0000000 --- a/extract_text.py +++ /dev/null @@ -1,123 +0,0 @@ -import itertools - -from semantic import * - - -# TODO: -# better way to find predicates for triples -# handle ARG-of -# something with dates - -text = """ -(z0 / and - :op1 (z1 / close-01 - :ARG0 (z2 / company - :wiki "Google" - :name (z3 / name - :op1 "Google" - :op2 "Inc")) - :ARG1 (z4 / service - :purpose (z5 / search-01) - :ARG1-of (z6 / base-01 - :location (z7 / country - :wiki "China" - :name (z8 / name - :op1 "China"))) - :poss z2) - :time (z9 / date-entity - :weekday (z10 / monday))) - :op2 (z11 / begin-01 - :ARG0 z2 - :ARG1 (z12 / redirect-01 - :ARG0 z2 - :ARG1 (z13 / person - :ARG0-of (z14 / search-01 - :ARG1 (z15 / web))) - :ARG2 (z16 / site - :ARG1-of (z17 / censor-01 - :polarity -) - :location (z18 / city - :wiki "Hong_Kong" - :name (z19 / name - :op1 "Hong" - :op2 "Kong"))))) - :ARG0-of (z20 / cause-01 - :ARG1 (z21 / draw-02 - :ARG0 z2 - :ARG1 (z22 / thing - :ARG1-of (z23 / comment-01 - :ARG0 (z24 / city - :wiki "Beijing" - :name (z25 / name - :op1 "Beijing"))) - :ARG1-of (z26 / harsh-02) - :ARG0-of (z27 / raise-01 - :ARG1 (z28 / doubt-01 - :ARG1 (z29 / future - :poss z2 - :location (z30 / market - :mod (z31 / internet) - :ARG1-of (z32 / have-degree-91 - :ARG2 (z33 / large) - :ARG3 (z34 / most) - :ARG5 (z35 / market - :location (z36 / world)))))))))) - :ARG1-of (z37 / describe-01 - :ARG0 (z38 / publication - :wiki "Reuters" - :name (z39 / name - :op1 "Reuters")))) -""" - -tree = parse_single_amr_output(text) -graph = amr_tree_to_graph(tree) - -def is_verb(node): - # TODO: not like this - return type(node) is not AMRConstant and '-0' in node.concept and len(node._edges) > 0 - -def extract_verb_triples(node): - if type(node) == AMRConstant: return - #print(node) - arg0 = node._edges[0].var2 - for edge in node._edges[1:]: - yield (arg0.name, node.name, 
edge.var2.name) - -def join_name(node): - return " ".join([e.var2.value for e in node._edges]) - -def build_alias_dict(graph): - alias_graph = {} - for node in graph.nodes: - - # not applicable to constants - if type(node) is AMRConstant: continue - - # add the "concept" to the dict - alias_graph[node.name] = alias_graph.get(node.name, []) + [node.concept] - - for edge in node._edges: - # add all "wiki" names - if edge.relationship == "wiki": - alias_graph[node.name] = alias_graph.get(node.name, []) + [edge.var2.value] - # add all "name" names - #if edge.relationship == "name": - # alias_graph[node.name] = alias_graph.get(node.name, []) + [join_name(edge.var2)] - return alias_graph - -def extract_all(graph): - triples = [] - for node in graph.nodes: - if is_verb(node): - triples.extend(list(extract_verb_triples(node))) - return triples - -def expand_triples(list_of_triples, alias_dict): - for triple in list_of_triples: - expanded = itertools.product(*[alias_dict[name] for name in triple]) - for e in expanded: print(e) - -if __name__ == "__main__": - triples = extract_all(graph) - alias_dict = build_alias_dict(graph) - expand_triples(triples, alias_dict) diff --git a/frontend/README.md b/frontend/README.md index 22dc4d1..cad7d1d 100644 --- a/frontend/README.md +++ b/frontend/README.md @@ -1,3 +1,5 @@ +**Archived - code moved to the main repository [https://gitlab.com/UAlberta/nlpwokkbench/workbench-api](https://gitlab.com/UAlberta/nlpwokkbench/workbench-api)** + # newskg-demo-web Web interface for the news KG pipeline demo. diff --git a/frontend/snc/package.json b/frontend/snc/package.json index fe820dc..3ff29d3 100644 --- a/frontend/snc/package.json +++ b/frontend/snc/package.json @@ -8,7 +8,7 @@ "@fontsource/roboto": "^4.5.8", "@mui/icons-material": "^5.10.9", "@mui/lab": "^5.0.0-alpha.105", - "@mui/material": "^5.10.11", + "@mui/material": "^5.10", "@testing-library/jest-dom": "^5.16.5", "@testing-library/react": "^13.4.0", "@testing-library/user-event": "^13.5.0", diff --git a/frontend/snc/public/index.html b/frontend/snc/public/index.html index a70c020..188a9af 100644 --- a/frontend/snc/public/index.html +++ b/frontend/snc/public/index.html @@ -2,19 +2,19 @@