
TextWiser: Text Featurization Library

TextWiser (AAAI'21) is a research library that provides a unified framework for text featurization. It offers a rich set of methods and takes advantage of pretrained models provided by state-of-the-art libraries.

The main contributions include:

  • Rich Set of Embeddings: A wide range of available embeddings and transformations to choose from.
  • Fine-Tuning: Designed to support a PyTorch backend, and therefore retains the ability to fine-tune featurizations for downstream tasks. If you pass the resulting fine-tunable embeddings to a training method, the features are optimized automatically for your application.
  • Parameter Optimization: Interoperable with the standard scikit-learn pipeline for hyper-parameter tuning and rapid experimentation. All underlying parameters are exposed to the user.
  • Grammar of Embeddings: Introduces a novel approach to designing embeddings from components. The compound embedding allows forming arbitrarily complex embeddings in accordance with a context-free grammar that defines a formal language for valid text featurization.
  • GPU Native: Built with GPUs in mind. If it detects available hardware, the relevant models are automatically placed on the GPU.

TextWiser is developed by the Artificial Intelligence Center of Excellence at Fidelity Investments. Documentation is available at fidelity.github.io/textwiser. A video of the paper presentation at AAAI 2021 is also available.

Quick Start

# Conceptually, TextWiser is composed of an Embedding, potentially with a pretrained model,
# that can be chained into zero or more Transformations
from textwiser import TextWiser, Embedding, Transformation, WordOptions, PoolOptions

# Data
documents = ["Some document", "More documents. Including multi-sentence documents."]

# Model: TFIDF `min_df` parameter gets passed to sklearn automatically
emb = TextWiser(Embedding.TfIdf(min_df=1))

# Model: TFIDF followed by NMF and then SVD
emb = TextWiser(Embedding.TfIdf(min_df=1), [Transformation.NMF(n_components=30), Transformation.SVD(n_components=10)])

# Model: Word2Vec with no pretraining that learns from the input data
emb = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained=None), Transformation.Pool(pool_option=PoolOptions.min))

# Model: BERT with the pretrained bert-base-uncased embedding
emb = TextWiser(Embedding.Word(word_option=WordOptions.bert), Transformation.Pool(pool_option=PoolOptions.first))

# Features
vecs = emb.fit_transform(documents)

Available Embeddings

| Embeddings | Notes |
| --- | --- |
| Bag of Words (BoW) | Supported by scikit-learn. Defaults to training from scratch. |
| Term Frequency Inverse Document Frequency (TfIdf) | Supported by scikit-learn. Defaults to training from scratch. |
| Document Embeddings (Doc2Vec) | Supported by gensim. Defaults to training from scratch. |
| Universal Sentence Encoder (USE) | Supported by tensorflow, see requirements. Defaults to large v5. |
| Compound Embedding | Supported by a context-free grammar. |
| Word Embedding: Word2Vec | Supported by these pretrained embeddings. Common pretrained options include crawl, glove, extvec, twitter, and en-news. When the pretrained option is None, trains a new model from the given data. Defaults to en, FastText embeddings trained on news. |
| Word Embedding: Character | Initialized randomly and not pretrained. Useful when trained for a downstream task. Enable fine-tuning to get good embeddings. |
| Word Embedding: BytePair | Supported by these pretrained embeddings. Pretrained options can be specified with the string `<lang>_<dim>_<vocab_size>`. Default options can be omitted like `en`, `en_100`, or `en__10000`. Defaults to `en`, which is equal to `en_100_10000`. |
| Word Embedding: ELMo | Supported by these pretrained embeddings from TensorflowHub. Defaults to original. |
| Word Embedding: Flair | Supported by these pretrained embeddings. Defaults to news-forward-fast. |
| Word Embedding: BERT | Supported by these pretrained embeddings. Defaults to bert-base-uncased. |
| Word Embedding: OpenAI GPT | Supported by these pretrained embeddings. Defaults to openai-gpt. |
| Word Embedding: OpenAI GPT2 | Supported by these pretrained embeddings. Defaults to gpt2-medium. |
| Word Embedding: TransformerXL | Supported by these pretrained embeddings. Defaults to transfo-xl-wt103. |
| Word Embedding: XLNet | Supported by these pretrained embeddings. Defaults to xlnet-large-cased. |
| Word Embedding: XLM | Supported by these pretrained embeddings. Defaults to xlm-mlm-en-2048. |
| Word Embedding: RoBERTa | Supported by these pretrained embeddings. Defaults to roberta-base. |
| Word Embedding: DistilBERT | Supported by these pretrained embeddings. Defaults to distilbert-base-uncased. |
| Word Embedding: CTRL | Supported by these pretrained embeddings. Defaults to ctrl. |
| Word Embedding: ALBERT | Supported by these pretrained embeddings. Defaults to albert-base-v2. |
| Word Embedding: T5 | Supported by these pretrained embeddings. Defaults to t5-base. |
| Word Embedding: XLM-RoBERTa | Supported by these pretrained embeddings. Defaults to xlm-roberta-base. |
| Word Embedding: BART | Supported by these pretrained embeddings. Defaults to facebook/bart-base. |
| Word Embedding: ELECTRA | Supported by these pretrained embeddings. Defaults to google/electra-base-generator. |
| Word Embedding: DialoGPT | Supported by these pretrained embeddings. Defaults to microsoft/DialoGPT-small. |
| Word Embedding: Longformer | Supported by these pretrained embeddings. Defaults to allenai/longformer-base-4096. |
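
The BytePair option string can be easier to follow with a small example. The sketch below is not TextWiser's implementation: `parse_bytepair_option` and `BytePairOption` are hypothetical names, and it assumes the defaults of 100 dimensions and a vocabulary of 10000 (taken from `en` being equal to `en_100_10000`) apply to every language.

```python
from typing import NamedTuple


class BytePairOption(NamedTuple):
    """Hypothetical container for a resolved BytePair option string."""
    lang: str
    dim: int
    vocab_size: int


# Assumed defaults, inferred from "en" being equal to "en_100_10000".
DEFAULT_DIM = 100
DEFAULT_VOCAB_SIZE = 10000


def parse_bytepair_option(option: str) -> BytePairOption:
    """Resolve '<lang>_<dim>_<vocab_size>', filling omitted fields with defaults."""
    parts = option.split("_")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"Invalid BytePair option: {option!r}")
    # Pad with empty fields so "en" and "en_100" are handled like "en__".
    parts += [""] * (3 - len(parts))
    lang, dim, vocab_size = parts
    return BytePairOption(
        lang=lang,
        dim=int(dim) if dim else DEFAULT_DIM,
        vocab_size=int(vocab_size) if vocab_size else DEFAULT_VOCAB_SIZE,
    )


for option in ("en", "en_100", "en__10000", "de_300_50000"):
    print(option, "->", parse_bytepair_option(option))
```

Under these assumptions, `en`, `en_100`, and `en__10000` all resolve to the same language, dimension, and vocabulary size.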

Available Transformations

| Transformations | Notes |
| --- | --- |
| Singular Value Decomposition (SVD) | Differentiable |
| Latent Dirichlet Allocation (LDA) | Not differentiable |
| Non-negative Matrix Factorization (NMF) | Not differentiable |
| Uniform Manifold Approximation and Projection (UMAP) | Not differentiable |
| Pooling Word Vectors | Applies to word embeddings only. Reduces word-level vectors to document-level. Pool options include max, min, mean, first, and last. Defaults to max. |
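
To illustrate how pooling reduces word-level vectors to a single document-level vector, here is a minimal pure-Python sketch. It is not how TextWiser implements pooling (the library operates on tensors). The `pool` function and `POOL_FUNCTIONS` table are hypothetical, and the sketch assumes `first` and `last` select the vector of the first and last word respectively.

```python
from statistics import fmean
from typing import Callable, Dict, List, Sequence

Vector = List[float]

# Each function reduces the values of one dimension, taken across all words,
# to a single number.
POOL_FUNCTIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "max": max,
    "min": min,
    "mean": fmean,
    "first": lambda column: column[0],
    "last": lambda column: column[-1],
}


def pool(word_vectors: Sequence[Vector], pool_option: str = "max") -> Vector:
    """Reduce a (num_words x dim) list of word vectors to one dim-sized vector."""
    if not word_vectors:
        raise ValueError("Cannot pool an empty document")
    reduce_column = POOL_FUNCTIONS[pool_option]
    # zip(*...) transposes the matrix so each column holds one dimension.
    return [reduce_column(column) for column in zip(*word_vectors)]


# A three-word document with two-dimensional word vectors.
document = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]
for option in POOL_FUNCTIONS:
    print(option, pool(document, option))
```

Whatever the number of words in a document, the pooled result always has the dimensionality of a single word vector, which is what allows document-level transformations to follow.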

Usage Examples

Examples can be found under the notebooks folder.

Installation

TextWiser requires Python 3.8+ and can be installed from PyPI using pip install textwiser. To install from PyPI with all optional dependencies, use pip install textwiser[full]. Alternatively, build from source by following the instructions in our documentation.

Compound Embedding

A unique research contribution of TextWiser is its novel approach to creating embeddings from components, called the Compound Embedding.

This method allows forming arbitrarily complex embeddings, thanks to a context-free grammar that defines a formal language for valid text featurization. You can see the details in our documentation and in the usage example.
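
The idea of a grammar that accepts only valid featurizations can be sketched with a toy example. The grammar below is assumed purely for illustration and is not TextWiser's actual grammar, which is defined in the documentation. The schema shape, the `is_valid` function, and the `EMBEDDINGS` and `TRANSFORMATIONS` name sets are all hypothetical. In this toy language, a featurizer is either an embedding name, a `transform` of one featurizer followed by one or more transformations, or a `concat` of two or more featurizers.

```python
# Toy vocabulary, assumed for illustration only.
EMBEDDINGS = {"bow", "tfidf", "doc2vec", "use", "word2vec"}
TRANSFORMATIONS = {"svd", "lda", "nmf", "umap", "pool"}


def is_valid(schema) -> bool:
    """Check a nested schema against the toy grammar:

    featurizer := EMBEDDING
                | {"transform": [featurizer, TRANSFORMATION, ...]}
                | {"concat": [featurizer, featurizer, ...]}
    """
    if isinstance(schema, str):
        return schema in EMBEDDINGS
    if isinstance(schema, dict) and len(schema) == 1:
        (key, items), = schema.items()
        if not isinstance(items, list) or len(items) < 2:
            return False
        if key == "transform":
            # The first item produces features; the rest must transform them.
            return is_valid(items[0]) and all(
                isinstance(t, str) and t in TRANSFORMATIONS for t in items[1:]
            )
        if key == "concat":
            # Every branch must itself be a valid featurizer.
            return all(is_valid(item) for item in items)
    return False


schema = {
    "transform": [
        {
            "concat": [
                {"transform": ["word2vec", "pool"]},
                {"transform": ["tfidf", "nmf"]},
            ]
        },
        "svd",
    ]
}
print(is_valid(schema))
print(is_valid({"transform": ["svd", "tfidf"]}))
```

Because the rules are recursive, featurizers can be nested to any depth, while sequences that make no sense, such as a transformation with no embedding to act on, are rejected.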

Fine-Tuning for Downstream Tasks

All Word2Vec and transformer-based embeddings, and any embedding followed by an SVD transformation, are fine-tunable for downstream tasks. In other words, if you pass the resulting fine-tunable embedding to a PyTorch training method, the features will automatically be trained for your application. You can see the details in our documentation and in the usage example.

Tokenization

In general, text data should be whitespace-tokenized before being fed into TextWiser. Customized tokenization is also supported, as described in more detail in our documentation.
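
As one possible way to prepare raw text, the sketch below separates words and punctuation with single spaces using only the standard library. The `whitespace_tokenize` helper and its regular expression are assumptions made for illustration. They are not part of TextWiser, and a real pipeline may call for a language-aware tokenizer instead.

```python
import re

# A word is a run of word characters; any other non-space character
# (punctuation, symbols) becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def whitespace_tokenize(text: str) -> str:
    """Return the text with tokens separated by single spaces."""
    return " ".join(TOKEN_PATTERN.findall(text))


documents = ["Some document", "More documents. Including multi-sentence documents."]
tokenized = [whitespace_tokenize(doc) for doc in documents]
print(tokenized)
```

The tokenized strings can then be passed on in place of the raw documents.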

Support

Please submit bug reports, questions and feature requests as Issues.

Citation

If you use TextWiser in a publication, please cite it as:

  @article{textwiser2021,
    author={Kilitcioglu, Doruk and Kadioglu, Serdar},
    title={Representing the Unification of Text Featurization using a Context-Free Grammar},
    url={https://github.com/fidelity/textwiser},
    journal={Proceedings of the AAAI Conference on Artificial Intelligence},
    volume={35},
    number={17},
    year={2021},
    month={May},
    pages={15439-15445}
  }

License

TextWiser is licensed under the Apache License 2.0.