This page describes how to obtain and set the Azure resource keys required to make Verseagility work end-to-end.
- In the root directory of the Verseagility repository, you will find a file named `config.sample.ini`, which has the following content:
```ini
[environ]
aml-ws-name=
aml-ws-sid=
aml-ws-rg=
text-analytics-name=
text-analytics-key=
cosmos-db-name=
cosmos-db-key=
```
- Create a copy of this file and name it `config.ini`. Store it in the same folder as the `config.sample.ini`.
- Go to your newly created resource group in the Azure portal and open the following resources one after another:
- Azure Machine Learning
  - The main page of Azure Machine Learning has all the information you need, starting with the name of your resource, which you insert in `aml-ws-name`.
  - Secondly, insert your subscription ID, as shown in the screenshot below, in the second row, `aml-ws-sid`.
  - Also, insert the name of your resource group in the line `aml-ws-rg`.
- Text Analytics Service
  - On the main page, called Quick start, copy the following elements:
    - Copy Key1 and insert it in the `text-analytics-key` row of your `ini` file.
    - Mark and copy the endpoint name of the Text Analytics service as shown in the frame and insert it in `text-analytics-name` of your file. You do not need the whole URL; the name is sufficient.
    - Please verify with the icon you see below that you are actually using the Text Analytics resource and not Computer Vision.
- Cosmos DB
  - After entering your Cosmos DB resource, click on Keys in the left menu.
  - From there, copy the resource name from the URI field, which matches the name of the resource. As with the Text Analytics Service, you do not need the entire URL; the name is sufficient. Insert it in `cosmos-db-name` of your file.
  - Last but not least, copy the Primary Key and set it in the `cosmos-db-key` row.
- Your file should look similar to this one (the values below are random):
```ini
[environ]
aml-ws-name=vers-aml-workspace-ikkavgc641vq2
aml-ws-sid=1dca2144-0815-3301-b6a5-8a97ef7632a5
aml-ws-rg=verseagility
text-analytics-name=verstaxhksd72s
text-analytics-key=9b82deffedca46bd9b4938cd8029a355
cosmos-db-name=verscosmosiqkkxah120vq2
cosmos-db-key=XH0mse9II5wYaXJMFb5sycyDcaAWwATwJTAdvVhBD18QdQaYsZe23mupgD378VZW751yHP6v4YbOZitgxSXSg==
```
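Once your `config.ini` is filled in, its values can be read with Python's standard `configparser` module. The snippet below only illustrates the file format, using the random sample values from above and an inline string instead of a file; Verseagility's own loading code may differ:

```python
import configparser

# Illustrative only: parse the [environ] section of a config.ini
# (these are the random sample values shown above)
sample = """
[environ]
aml-ws-name=vers-aml-workspace-ikkavgc641vq2
text-analytics-key=9b82deffedca46bd9b4938cd8029a355
"""

config = configparser.ConfigParser()
config.read_string(sample)  # with a real file, use config.read('config.ini')

aml_workspace = config['environ']['aml-ws-name']
print(aml_workspace)  # vers-aml-workspace-ikkavgc641vq2
```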
As described before, the following tasks are currently supported:
- Text/Document classification
- Named Entity Recognition
- Question Answering

The sections below briefly cover what these tasks consist of, which dependencies they have, and how you can customize them further.
This section describes the models used to train classification assets. Both multi-class and multi-label approaches are supported and facilitated by the FARM framework. We primarily use so-called Transformer models for classification.
- Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, and text generation in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.
- Transformers provides APIs to quickly download and use those pretrained models on a given text, fine-tune them on your own datasets, and then share them with the community on the model hub. At the same time, each Python module defining an architecture can be used standalone and modified to enable quick research experiments.
- Transformers is backed by the two most popular deep learning libraries, PyTorch and TensorFlow, with seamless integration between them, allowing you to train your models with one and then load them for inference with the other. In this setup, we use PyTorch.
The models used are defined in `src/helper.py`, and the dictionary below can be extended with other model names and languages. The list of pretrained models for many purposes can be found on HuggingFace.
```python
farm_model_lookup = {
    'bert': {
        'xx': 'bert-base-multilingual-cased',
        'en': 'bert-base-cased',
        'de': 'bert-base-german-cased',
        'fr': 'camembert-base',
        'cn': 'bert-base-chinese'
    },
    'roberta': {
        'en': 'roberta-base',
        'de': 'roberta-base',
        'fr': 'roberta-base',
        'es': 'roberta-base',
        'it': 'roberta-base'
        # All languages use roberta-base because of multi_classification
    },
    'xlm-roberta': {
        'xx': 'xlm-roberta-base'
    },
    'albert': {
        'en': 'albert-base-v2'
    },
    'distilbert': {
        'xx': 'distilbert-base-multilingual-cased',
        'de': 'distilbert-base-german-cased'
    }
}
```
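To illustrate how such a lookup can be consumed, the sketch below resolves a model name for a given model type and language, falling back to the multilingual `'xx'` entry when no language-specific model exists. The helper function and the shortened dictionary are illustrative assumptions, not Verseagility's actual API:

```python
# Shortened copy of the lookup above, for illustration only
farm_model_lookup = {
    'bert': {
        'xx': 'bert-base-multilingual-cased',
        'en': 'bert-base-cased',
        'de': 'bert-base-german-cased',
    },
    'distilbert': {
        'xx': 'distilbert-base-multilingual-cased',
        'de': 'distilbert-base-german-cased',
    },
}

def resolve_model(model_type, language):
    """Hypothetical helper: prefer a language-specific model, then fall back to 'xx'."""
    languages = farm_model_lookup.get(model_type, {})
    name = languages.get(language) or languages.get('xx')
    if name is None:
        raise ValueError(f'No pretrained model found for {model_type}/{language}')
    return name

print(resolve_model('bert', 'de'))        # bert-base-german-cased
print(resolve_model('distilbert', 'fr'))  # falls back to the multilingual model
```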
The toolkit supports and includes different approaches and frameworks for recognizing relevant entities in text paragraphs, known as Named Entity Recognition (NER):
- Azure Text Analytics API
- spaCy Pre-trained NER
- flairNLP Pre-Trained NER
- FARM/Transformer Custom NER
- Basic approaches (like regex and term lists)
The central components can be found in the script `code/ner.py`.
Azure Text Analytics is a Cognitive Service providing API-based extraction of relevant entities from texts. You can find the documentation for the Azure Text Analytics API here. In order to use it within the NLP toolkit, you need to set up a Cognitive Service in your personal Azure subscription and insert the relevant subscription keys. A basic service is free, yet it has request limitations. A description of how to set your keys can be found at the top of this page.
We leverage the pre-trained models from spaCy, an open-source framework that speeds up your NER tasks. There is a language-specific model referenced in `src/helper.py`. Besides the models, we leverage the language pipelines provided by spaCy, also to be able to process multiple models.
Flair is a framework for state-of-the-art NLP tasks, especially NER. You can leverage powerful, pre-trained models and deploy them to your endpoint. Depending on your use case, you can choose different models and reference them in the lookup in `src/helper.py`.
The most basic approach to Named Entity Recognition in text files is to make use of term lists or regular expressions (regex). You can add your own terms to the toolkit scope by following these steps:
- Go to the subfolder `assets/` of the repository folder.
- You will find two relevant files in there:
  - `names.txt`: stop word list
  - `ner.txt`: value-key pairs of terms and their category, in a tab-delimited file
- Open them in a text editor of your choice and add further value-key pairs to `ner.txt`, or continue to extend the list of stop words in `names.txt`. Stop words are words which are filtered out before or after processing, as they are too common and frequent to bring value to the analysis.
- Make sure the values are all lower-case; the keys should be properly formatted.
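As an illustration of the list-based approach, the sketch below builds a term-to-category dictionary from tab-delimited pairs (the sample terms are made up) and matches them against a lower-cased text with word-boundary regexes. The actual matching logic in `code/ner.py` may differ:

```python
import re

# Tab-delimited value-key pairs, as in assets/ner.txt (made-up sample content)
ner_lines = [
    "windows 10\tProduct",
    "outlook\tProduct",
    "munich\tLocation",
]

# Build a term -> category dictionary; values are lower-case by convention
term_lookup = {}
for line in ner_lines:
    value, key = line.split('\t')
    term_lookup[value.lower()] = key

def match_terms(text):
    """Return (term, category) pairs found in the lower-cased text."""
    found = []
    for term, category in term_lookup.items():
        if re.search(r'\b' + re.escape(term) + r'\b', text.lower()):
            found.append((term, category))
    return found

print(match_terms('I reinstalled Windows 10 while in Munich.'))
# [('windows 10', 'Product'), ('munich', 'Location')]
```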
This section is devoted to the question-answering component of the NLP toolkit and describes how the answer suggestions are ranked during runtime. Please keep in mind that this component of the toolkit requires a large number of potential answers for each text, trained along with the input texts, in order to work properly.
The current version of Verseagility supports the Okapi BM25 information retrieval algorithm to sort historical question-answer pairs by relevance. BM25 is a ranking approach used by search engines to estimate the relevance of a document to a given search query. It is implemented using the gensim library. The ranking framework is accessed via the file `code/rank.py`.
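To make the scoring idea concrete, here is a small, self-contained Okapi BM25 sketch in plain Python. Verseagility relies on the gensim implementation, so treat this only as an illustration of the formula, not as the toolkit's actual code:

```python
import math

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(documents)
    avgdl = sum(len(doc) for doc in documents) / n_docs
    scores = []
    for doc in documents:
        score = 0.0
        for term in query:
            # Document frequency: in how many documents the term occurs
            df = sum(1 for d in documents if term in d)
            if df == 0:
                continue
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

# Toy "historical" answers, tokenized by whitespace
docs = [
    "how do i reset my password".split(),
    "the weather is nice today".split(),
    "password reset link expired".split(),
]
scores = bm25_scores("reset password".split(), docs)
# The shortest document containing both query terms ranks highest
print(scores.index(max(scores)))
```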
Due to the modular setup, this section can be extended to support QnA Maker from Microsoft, or custom question answering algorithms using Transformers and FARM. Support for these may be added in coming versions of Verseagility.
The section below describes how you can enrich your Verseagility API with an opinion mining feature. Opinion mining is a feature of sentiment analysis. Also known as aspect-based sentiment analysis in Natural Language Processing (NLP), this feature provides more granular information about the opinions related to words (such as the attributes of products or services) in text.
Verseagility now supports the Opinion Mining feature of Azure Cognitive Services. All you need to bring is an API key from an Azure Text Analytics resource. Please see the Language Support page to check whether your desired language is supported.
- Create a config file for your end-to-end Verseagility project in the subfolder `project/` (located in the root folder of the repository). We recommend following this naming convention: `[name of your project]_[language code, two letters].config.json`, e.g. `msforum_en.config.json`.
See the following JSON snippet as an example:
```json
{
    "name": "msforum_en",
    "language": "en",
    "environment": "dev",
    "data_dir": "./run/",
    "prepare": {
        "data_type": "json"
    },
    "tasks": {
        "1": {
            "label": "subcat",
            "type": "classification",
            "model_type": "roberta",
            "max_seq_len": 256,
            "embeds_dropout": 0.3,
            "learning_rate": 3e-5,
            "prepare": true
        },
        "2": {
            "type": "multi_classification",
            "model_type": "roberta",
            "max_seq_len": 256,
            "embeds_dropout": 0.3,
            "learning_rate": 2e-5,
            "prepare": true
        },
        "3": {
            "type": "ner",
            "prepare": false,
            "models": ["textanalytics", "flair", "custom", "regex", "nerlist"]
        },
        "4": {
            "type": "qa",
            "model_type": "historical",
            "prepare": true
        },
        "5": {
            "type": "om",
            "prepare": false
        }
    },
    "deploy": {
        "type": "ACI",
        "memory": 2,
        "cpu": 1
    }
}
```
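Since the project file is plain JSON, you can inspect it with standard tooling. The loader below is a rough sketch of what reading such a config might look like; the actual helper `he.get_project_config` lives in the repository and may differ:

```python
import json

# Illustrative loader: parse a (shortened) project config and list its task types
config_text = '''
{
  "name": "msforum_en",
  "language": "en",
  "tasks": {
    "1": {"type": "classification", "model_type": "roberta", "prepare": true},
    "3": {"type": "ner", "models": ["flair", "regex"], "prepare": false}
  }
}
'''

params = json.loads(config_text)  # with a real file, use json.load(open(path))
task_types = [task["type"] for task in params["tasks"].values()]
print(task_types)  # ['classification', 'ner']
```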
You see that there are multiple tasks in the project file. To help you understand what potential changes could look like, use the following examples:
- If you only want to go for classification, keep only task 1 in the config.
- Would you like to use a different pre-trained BERT model in task 1 or 2? Adjust `"model_type": "roberta"` respectively.
- In case you do not want to integrate Multi-Label Classification, Named Entity Recognition, Question Answering, or Opinion Mining, simply remove it from your JSON.
- If you only want specific NER models to be considered when scoring, adjust the array of models `["textanalytics", "flair", "custom", "regex", "nerlist"]` and keep only the ones you would like to use.
See the following example file, covering only classification (task 1) and NER (task 3), with only Flair and RegEx:
```json
{
    "name": "msforum_en",
    "language": "en",
    "environment": "dev",
    "data_dir": "./run/",
    "prepare": {
        "data_type": "json"
    },
    "tasks": {
        "1": {
            "label": "subcat",
            "type": "classification",
            "model_type": "roberta",
            "max_seq_len": 256,
            "embeds_dropout": 0.3,
            "learning_rate": 3e-5,
            "prepare": true
        },
        "3": {
            "type": "ner",
            "prepare": false,
            "models": ["flair", "regex"]
        }
    },
    "deploy": {
        "type": "ACI",
        "memory": 2,
        "cpu": 1
    }
}
```
- After creating the JSON file, you need to make a slight change in the `custom.py` script:
  - Look for this line within the script: `params = he.get_project_config('[INSERT CONFIG NAME HERE]')`
  - Insert the file name of the JSON you just created in the first step of this section (e.g. `msforum_en.config.json`). You do not need to pass a folder name as long as the file is located in `project/`, which we highly recommend. The line of code should then look like this: `params = he.get_project_config('msforum_en.config.json')`
Your project is now set up! Next, we will explain the Data Cleaning Steps.