This page describes how to obtain and set the Azure resource keys required to make Verseagility work end-to-end.
- In the root directory of the Verseagility repository, you will find a file named `config.sample.ini`, which has the following content:
```ini
[environ]
aml-ws-name=
aml-ws-sid=
aml-ws-rg=
text-analytics-name=
text-analytics-key=
cosmos-db-name=
cosmos-db-key=
```
- Create a copy of this file and name it `config.ini`. Store it in the same folder as the `config.sample.ini`.
- Go to your newly created resource group in the Azure portal and open the following resources one after another:
- Azure Machine Learning
  - The main page of Azure Machine Learning has all the information you need, starting with the name of your resource, which you insert in `aml-ws-name`.
  - Secondly, insert your subscription ID, as shown in the screenshot below, in the second row, `aml-ws-sid`.
  - Also, insert the name of your resource group in the line `aml-ws-rg`.
- Text Analytics Service
  - On the main page, called Quick start, copy the following elements:
    - Copy Key1 and insert it in the `text-analytics-key` row of your `ini` file.
    - Mark and copy the endpoint name of the Text Analytics service as shown in the frame and insert it in `text-analytics-name` of your file. You do not need the whole URL; the name is sufficient.
    - Please verify with the icon you see below that you are actually using the Text Analytics resource and not Computer Vision.
- Cosmos DB
  - After entering your Cosmos DB resource, click on Keys in the left menu.
  - From there, copy the resource name from the URI field, which matches the name of the resource. As with the Text Analytics Service, you do not need the entire URL; the name is sufficient. Insert it in `cosmos-db-name` of your file.
  - Last but not least, copy the Primary Key and set it in the `cosmos-db-key` row.
- Your file should look similar to this one (the values below are random):
```ini
[environ]
aml-ws-name=vers-aml-workspace-ikkavgc641vq2
aml-ws-sid=1dca2144-0815-3301-b6a5-8a97ef7632a5
aml-ws-rg=verseagility
text-analytics-name=verstaxhksd72s
text-analytics-key=9b82deffedca46bd9b4938cd8029a355
cosmos-db-name=verscosmosiqkkxah120vq2
cosmos-db-key=XH0mse9II5wYaXJMFb5sycyDcaAWwATwJTAdvVhBD18QdQaYsZe23mupgD378VZW751yHP6v4YbOZitgxSXSg==
```
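Once your `config.ini` is filled in, its values can be read with Python's standard `configparser` module. The snippet below only illustrates the file format, using the random sample values from above and an inline string instead of a file; Verseagility's own loading code may differ:

```python
import configparser

# Illustrative only: parse the [environ] section of a config.ini
# (these are the random sample values shown above)
sample = """
[environ]
aml-ws-name=vers-aml-workspace-ikkavgc641vq2
text-analytics-key=9b82deffedca46bd9b4938cd8029a355
"""

config = configparser.ConfigParser()
config.read_string(sample)  # with a real file, use config.read('config.ini')

aml_workspace = config['environ']['aml-ws-name']
print(aml_workspace)  # vers-aml-workspace-ikkavgc641vq2
```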
As described before, the following tasks are currently supported:
- Text/Document classification
- Named Entity Recognition
- Question Answering

The sections below briefly cover what these tasks consist of, which dependencies they have, and how you can customize them further.
This section describes the models used to train classification assets. Both multi-class and multi-label approaches are supported and facilitated by the FARM framework. We primarily use so-called Transformer models for classification.
- Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
- Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets, and then share them with the community on the model hub. At the same time, each Python module defining an architecture can be used standalone and modified to enable quick research experiments.
- Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with seamless integration between them, allowing you to train your models with one and then load them for inference with the other. In this setup, we use PyTorch.
The models used are defined in `src/helper.py`, and the dictionary below can be extended with other model names and languages. The list of pretrained models for many purposes can be found on HuggingFace.
```python
farm_model_lookup = {
    'bert': {
        'xx': 'bert-base-multilingual-cased',
        'en': 'bert-base-cased',
        'de': 'bert-base-german-cased',
        'fr': 'camembert-base',
        'cn': 'bert-base-chinese'
    },
    'roberta': {
        'en': 'roberta-base',
        'de': 'roberta-base',
        'fr': 'roberta-base',
        'es': 'roberta-base',
        'it': 'roberta-base'
        # All languages use roberta-base because of multi_classification
    },
    'xlm-roberta': {
        'xx': 'xlm-roberta-base'
    },
    'albert': {
        'en': 'albert-base-v2'
    },
    'distilbert': {
        'xx': 'distilbert-base-multilingual-cased',
        'de': 'distilbert-base-german-cased'
    }
}
```
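To illustrate how such a lookup can be consumed, the sketch below resolves a model name for a given model type and language, falling back to the multilingual `'xx'` entry when no language-specific model exists. The helper function and the shortened dictionary are illustrative assumptions, not Verseagility's actual API:

```python
# Shortened copy of the lookup above, for illustration only
farm_model_lookup = {
    'bert': {
        'xx': 'bert-base-multilingual-cased',
        'en': 'bert-base-cased',
        'de': 'bert-base-german-cased',
    },
    'distilbert': {
        'xx': 'distilbert-base-multilingual-cased',
        'de': 'distilbert-base-german-cased',
    },
}

def resolve_model(model_type, language):
    """Hypothetical helper: prefer a language-specific model, then fall back to 'xx'."""
    languages = farm_model_lookup.get(model_type, {})
    name = languages.get(language) or languages.get('xx')
    if name is None:
        raise ValueError(f'No pretrained model found for {model_type}/{language}')
    return name

print(resolve_model('bert', 'de'))        # bert-base-german-cased
print(resolve_model('distilbert', 'fr'))  # falls back to the multilingual model
```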
The toolkit supports and includes different approaches and frameworks for recognizing relevant entities in text paragraphs, known as Named Entity Recognition (NER):
- Azure Text Analytics API
- spaCy Pre-trained NER
- flairNLP Pre-Trained NER
- FARM/Transformer Custom NER
- Basic approaches (like regex and term lists)
The central components can be found in the script `code/ner.py`.
Azure Text Analytics is a Cognitive Service providing API-based extraction of relevant entities from texts. You can find the documentation for the Azure Text Analytics API here. In order to use it within the NLP toolkit, you need to set up a Cognitive Service in your personal Azure subscription and insert the relevant subscription keys. A basic service is free, yet it has request limitations. A description of how to set your keys can be found at the top of this page.
We leverage the pre-trained models from spaCy, an open-source framework that speeds up your NER tasks. There is a language-specific model referenced in `src/helper.py`. Besides the models, we leverage the language pipelines provided by spaCy, also to be able to process multiple models.
Flair is a framework for state-of-the-art NLP tasks, especially NER. You can leverage powerful, pre-trained models and deploy them to your endpoint. Depending on your use case, you can choose different models and reference them in the lookup in `src/helper.py`.
The most basic approach to Named Entity Recognition in text files is to make use of term lists or regular expressions (regex). You can add your own terms to the toolkit scope by following these steps:
- Go to the subfolder `assets/` of the repository folder.
- You will find two relevant files in there:
  - `names.txt`: stop word list
  - `ner.txt`: value-key pairs of terms and their category, in a tab-delimited file
- Open them in a text editor of your choice and add further value-key pairs to `ner.txt`, or continue to extend the list of stop words in `names.txt`. Stop words are words which are filtered out before or after processing, as they are too common and frequent to bring value to the analysis.
- Make sure the values are all lower-case; the keys should be properly formatted.
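As an illustration of the list-based approach, the sketch below builds a term-to-category dictionary from tab-delimited pairs (the sample terms are made up) and matches them against a lower-cased text with word-boundary regexes. The actual matching logic in `code/ner.py` may differ:

```python
import re

# Tab-delimited value-key pairs, as in assets/ner.txt (made-up sample content)
ner_lines = [
    "windows 10\tProduct",
    "outlook\tProduct",
    "munich\tLocation",
]

# Build a term -> category dictionary; values are lower-case by convention
term_lookup = {}
for line in ner_lines:
    value, key = line.split('\t')
    term_lookup[value.lower()] = key

def match_terms(text):
    """Return (term, category) pairs found in the lower-cased text."""
    found = []
    for term, category in term_lookup.items():
        if re.search(r'\b' + re.escape(term) + r'\b', text.lower()):
            found.append((term, category))
    return found

print(match_terms('I reinstalled Windows 10 while in Munich.'))
# [('windows 10', 'Product'), ('munich', 'Location')]
```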
This section is devoted to the question-answering component of the NLP toolkit and describes how the answer suggestions are ranked during runtime. Please keep in mind that this component of the toolkit requires a large number of potential answers for each text, trained along with the input texts, in order to work properly.
The current version of Verseagility supports the Okapi BM25 information retrieval algorithm to sort historical question-answer pairs by relevance. BM25 is a ranking approach used by search engines to estimate the relevance of a document to a given search query. It is implemented using the gensim library. The ranking framework is accessed via the file `code/rank.py`.
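To make the scoring idea concrete, here is a small, self-contained Okapi BM25 sketch in plain Python. Verseagility relies on the gensim implementation, so treat this only as an illustration of the formula, not as the toolkit's actual code:

```python
import math

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(documents)
    avgdl = sum(len(doc) for doc in documents) / n_docs
    scores = []
    for doc in documents:
        score = 0.0
        for term in query:
            # Document frequency: in how many documents the term occurs
            df = sum(1 for d in documents if term in d)
            if df == 0:
                continue
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

# Toy "historical" answers, tokenized by whitespace
docs = [
    "how do i reset my password".split(),
    "the weather is nice today".split(),
    "password reset link expired".split(),
]
scores = bm25_scores("reset password".split(), docs)
# The shortest document containing both query terms ranks highest
print(scores.index(max(scores)))
```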
Due to the modular setup, this section can be extended to support QnA Maker from Microsoft, or custom question answering algorithms using Transformers and FARM. Support for these may be added in coming versions of Verseagility.
The section below describes how you can enrich your Verseagility API with an opinion mining feature. Opinion mining is a feature of sentiment analysis. Also known as aspect-based sentiment analysis in Natural Language Processing (NLP), this feature provides more granular information about the opinions related to words (such as the attributes of products or services) in text.
Verseagility now supports the Opinion Mining feature of Azure Cognitive Services. All you need to bring is an API key from an Azure Text Analytics resource. Please see the Language Support page to check whether your desired language is supported.
- Create a config file for your end-to-end Verseagility project in the subfolder `project/` (located in the root folder of the repository). We recommend following this naming convention: `[name of your project]_[language code, two letters].config.json`, e.g. `msforum_en.config.json`.
See the following JSON snippet as an example:
```json
{
    "name": "msforum_en",
    "language": "en",
    "environment": "dev",
    "data_dir": "./run/",
    "prepare": {
        "data_type": "json"
    },
    "tasks": {
        "1": {
            "label": "subcat",
            "type": "classification",
            "model_type": "roberta",
            "max_seq_len": 256,
            "embeds_dropout": 0.3,
            "learning_rate": 3e-5,
            "prepare": true
        },
        "2": {
            "type": "multi_classification",
            "model_type": "roberta",
            "max_seq_len": 256,
            "embeds_dropout": 0.3,
            "learning_rate": 2e-5,
            "prepare": true
        },
        "3": {
            "type": "ner",
            "prepare": false,
            "models": ["textanalytics", "flair", "custom", "regex", "nerlist"]
        },
        "4": {
            "type": "qa",
            "model_type": "historical",
            "prepare": true
        },
        "5": {
            "type": "om",
            "prepare": false
        }
    },
    "deploy": {
        "type": "ACI",
        "memory": 2,
        "cpu": 1
    }
}
```
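Since the project file is plain JSON, you can inspect it with standard tooling. The loader below is a rough sketch of what reading such a config might look like; the actual helper `he.get_project_config` lives in the repository and may differ:

```python
import json

# Illustrative loader: parse a (shortened) project config and list its task types
config_text = '''
{
  "name": "msforum_en",
  "language": "en",
  "tasks": {
    "1": {"type": "classification", "model_type": "roberta", "prepare": true},
    "3": {"type": "ner", "models": ["flair", "regex"], "prepare": false}
  }
}
'''

params = json.loads(config_text)  # with a real file, use json.load(open(path))
task_types = [task["type"] for task in params["tasks"].values()]
print(task_types)  # ['classification', 'ner']
```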
You see that there are multiple tasks in the project file. To help you understand what potential changes could look like, use the following examples:
- If you only want to go for classification, keep only task 1 in the config.
- Would you like to use a different pre-trained BERT model in task 1 or 2? Adjust `"model_type": "roberta"` respectively.
- In case you do not want to integrate Multi-Label Classification, Named Entity Recognition, Question Answering, or Opinion Mining, simply remove it from your JSON.
- If you only want specific NER models to be considered when scoring, adjust the array of models `["textanalytics", "flair", "custom", "regex", "nerlist"]` and keep only the ones you would like to use.
See the following example file, covering only classification (task 1) and NER (task 3), with only Flair and RegEx:
```json
{
    "name": "msforum_en",
    "language": "en",
    "environment": "dev",
    "data_dir": "./run/",
    "prepare": {
        "data_type": "json"
    },
    "tasks": {
        "1": {
            "label": "subcat",
            "type": "classification",
            "model_type": "roberta",
            "max_seq_len": 256,
            "embeds_dropout": 0.3,
            "learning_rate": 3e-5,
            "prepare": true
        },
        "3": {
            "type": "ner",
            "prepare": false,
            "models": ["flair", "regex"]
        }
    },
    "deploy": {
        "type": "ACI",
        "memory": 2,
        "cpu": 1
    }
}
```
- After creating the JSON file, you need to make a slight change in the `custom.py` script:
  - Look for this line within the script: `params = he.get_project_config('[INSERT CONFIG NAME HERE]')`
  - Insert the file name of the JSON you just created in the first step of this section (e.g. `msforum_en.config.json`). You do not need to pass a folder name as long as the file is located in `project/`, which we highly recommend. The line of code should then look like this: `params = he.get_project_config('msforum_en.config.json')`
Your project is now set up! Next, we will explain the Data Cleaning Steps.