
Commit b40b5ad

Merge branch 'master' of https://github.com/CODAIT/text-extensions-for-pandas into notebook_review

2 parents fef2807 + 7b04910


55 files changed: +6178 −49540 lines

.readthedocs.yml

Lines changed: 2 additions & 2 deletions
@@ -5,9 +5,9 @@
 # Required
 version: 2
 
-# Build documentation in the docs/ directory with Sphinx
+# Build documentation in the api_docs/ directory with Sphinx
 sphinx:
-   configuration: docs/conf.py
+   configuration: api_docs/conf.py
 
 # Build documentation with MkDocs
 #mkdocs:

.travis.yml

Lines changed: 19 additions & 19 deletions
@@ -1,34 +1,34 @@
 language: python
-python:
-- "3.7"
 
 jobs:
   include:
+    - python: "3.6"
     - name: "Pandas 1.0.x"
+      python: "3.7"
       env: PANDAS_VERSION=1.0.*
     - name: "Pandas 1.1.x"
+      python: "3.7"
      env: PANDAS_VERSION=1.1.*
 
 install:
-#install conda
-- sudo apt-get update
-- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
-- bash miniconda.sh -b -p $HOME/miniconda
-- source "$HOME/miniconda/etc/profile.d/conda.sh"
-- hash -r
-- conda config --set always_yes yes --set changeps1 no
-- conda update -q conda
+  #install conda
+  - sudo apt-get update
+  - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
+  - bash miniconda.sh -b -p $HOME/miniconda
+  - source "$HOME/miniconda/etc/profile.d/conda.sh"
+  - hash -r
+  - conda config --set always_yes yes --set changeps1 no
+  - conda update -q conda
 
-- conda info -a
+  - conda info -a
 
-- CONDA_HOME="${HOME}/miniconda" ./env.sh
+  - CONDA_HOME="${HOME}/miniconda" ./env.sh
 
 
 script:
-
-#activate python virtual environment
-- conda activate pd
-#check that doc generation is possible
-- ./generate_docs.sh
-#run unit tests
-- pytest -v text_extensions_for_pandas
+  #activate python virtual environment
+  - conda activate pd
+  #check that doc generation is possible
+  - ./generate_docs.sh
+  #run unit tests
+  - pytest -v text_extensions_for_pandas
File renamed without changes.

README.md

Lines changed: 92 additions & 75 deletions
@@ -1,114 +1,131 @@
+
 # Text Extensions for Pandas
+
+[![Documentation Status](https://readthedocs.org/projects/text-extensions-for-pandas/badge/?version=latest)](https://text-extensions-for-pandas.readthedocs.io/en/latest/?badge=latest)
+
 Natural language processing support for Pandas dataframes.
 
-**This project is under development.** Releases are not yet available.
+Text Extensions for Pandas turns Pandas DataFrames into a universal data
+structure for representing intermediate data in all phases of your NLP
+application development workflow.
 
-## Purpose of this Project
+## Features
 
-Natural language processing (NLP) applications tend to consist of multiple components tied together in a complex pipeline. These components can range from deep parsers and machine learning models to lookup tables and business rules. All of them work by creating and manipulating data structures that represent data about the target text --- things like tokens, entities, parse trees, and so on.
+### SpanArray: A Pandas extension type for *spans* of text
 
-Libraries for common NLP tasks tend to implement their own custom data structures. They also implement basic low-level operations like filtering and pattern matching over these data structures. For example, `nltk` represents named entities as a list of Python objects:
+* Connect features with regions of a document
+* Visualize the internal data of your NLP application
+* Analyze the accuracy of your models
+* Combine the results of multiple models
 
-```python
->>> entities = nltk.chunk.ne_chunk(tagged)
->>> entities
-Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'),
-('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'),
-Tree('PERSON', [('Arthur', 'NNP')]),
-('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'),
-('very', 'RB'), ('good', 'JJ'), ('.', '.')])
-```
+### TensorArray: A Pandas extension type for tensors
 
-...while SpaCy represents named entities with the an `Iterable` of `Span` objects:
+* Represent BERT embeddings in a Pandas series
+* Store logits and other feature vectors in a Pandas series
+* Store an entire time series in each cell of a Pandas series
 
-```python
->>> doc = nlp("At eight o'clock on Thursday morning, Arthur didn't feel very good.")
->>> ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
->>> ents
-[("eight o'clock", 3, 16, 'TIME'), ('Thursday', 20, 28, 'DATE'), ('morning', 29, 36, 'TIME'), ('Arthur', 38, 44, 'PERSON')]
-```
+### Pandas front-ends for popular NLP toolkits
+
+* [SpaCy](https://spacy.io/)
+* [Transformers](https://github.com/huggingface/transformers)
+* [IBM Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding)
+* [IBM Watson Discovery Table Understanding](https://cloud.ibm.com/docs/discovery-data?topic=discovery-data-understanding_tables)
+
+
+## Installation
 
-...or an `Iterable` of `Token` objects with tags:
+This library requires Python 3.7+, Pandas, and Numpy.
 
-```python
->>> doc = nlp("At eight o'clock on Thursday morning, Arthur didn't feel very good.")
->>> token_info = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
->>> token_info
-[('At', 'O', ''), ('eight', 'B', 'TIME'), ("o'clock", 'I', 'TIME'), ('on', 'O', ''), ('Thursday', 'B', 'DATE'), ('morning', 'B', 'TIME'), (',', 'O', ''), ('Arthur', 'B', 'PERSON'), ('did', 'O', ''), ("n't", 'O', ''), ('feel', 'O', ''), ('very', 'O', ''), ('good', 'O', ''), ('.', 'O', '')]
+To install the latest release, just run:
 ```
+pip install text-extensions-for-pandas
+```
+
+Depending on your use case, you may also need the following additional
+packages:
+* `spacy` (for SpaCy support)
+* `transformers` (for Transformers support)
+* `ibm_watson` (for IBM Watson support)
 
-...and IBM Watson Natural Language Understanding represents named entities as an array of JSON records:
-
-```JSON
-{
-  "entities": [
-    {
-      "type": "Person",
-      "text": "Arthur",
-      "count": 1,
-      "confidence": 0.986158
-    }
-  ]
-}
+## Installation from Source
+
+If you'd like to try out the very latest version of our code,
+you can install directly from the head of the master branch:
+```
+pip install git+https://github.com/CODAIT/text-extensions-for-pandas
 ```
 
-This duplication leads to a great deal of redundant work when building NLP applications. Developers need to understand and remember how every component represents every type of data. They need to write code to convert among different representations, and they and need to implement common operations like pattern matching multiple times for different, equivalent data structures.
+You can also directly import our package from your local copy of the
+`text_extensions_for_pandas` source tree. Just add the root of your local copy
+of this repository to the front of `sys.path`.
 
-It is our belief that, with a few targeted improvements, we can make [Pandas](https://pandas.pydata.org/) dataframes into a universal representation for all the data that flows through NLP applications. Such a universal data structure would eliminate redundancy and make application code simpler, faster, and easier to debug.
+## Documentation
 
-This project aims to create the extensions that will turn Pandas into this universal data structure. In particular, we plan to add three categories of extension:
+For examples of how to use the library, take a look at the notebooks in
+[this directory](https://github.com/CODAIT/text-extensions-for-pandas/tree/master/notebooks).
 
-* **New Pandas series types to cover spans and tensors.** These types of data are very important for NLP applications but are cumbersome to represent with "out-of-the-box" Pandas. The new [extensions API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.extensions.ExtensionArray.html) that Pandas released in 2019 makes it possible to create performant extension types. We will use this API to add three new series types: CharSpan, TokenSpan (span with token offsets), and Tensor.
-* **An implementation of spanner algebra over Pandas dataframes.** The core operations of the [Document Spanners](https://researcher.watson.ibm.com/researcher/files/us-fagin/jacm15.pdf) formalism represent tasks that occur repeatedly in NLP applications. Many of these core operations are already present in Pandas. We will create high-performance implementations of the remaining operations over Pandas dataframes. This work will build directly on our Pandas extension types for representing spans.
+API documentation can be found at [https://text-extensions-for-pandas.readthedocs.io/en/latest/](https://text-extensions-for-pandas.readthedocs.io/en/latest/)
 
-## Getting Started
 
-### Contents of this repository
+## Contents of this repository
 
 * **`text_extensions_for_pandas`**: Source code for the `text_extensions_for_pandas` module.
-* **notebooks**: demo notebooks
-* **resources**: various input files used by the demo notebooks
-* **env.sh**: Script to create an conda environment `pd` capable of running the notebooks in this directory
-
-### Instructions to run a demo notebook
+* **env.sh**: Script to create a conda environment `pd` capable of running the notebooks and test cases in this project
+* **generate_docs.sh**: Script to build the [API documentation](https://readthedocs.org/projects/text-extensions-for-pandas/)
+* **api_docs**: Configuration files for `generate_docs.sh`
+* **config**: Configuration files for `env.sh`.
+* **docs**: Project web site
+* **notebooks**: example notebooks
+* **resources**: various input files used by our example notebooks
+* **test_data**: data files for regression tests. The tests themselves are
+  located adjacent to the library code files.
+* **tutorials**: Detailed tutorials on using Text Extensions for Pandas to
+  cover complex end-to-end NLP use cases (work in progress).
+
+
+## Instructions to run a demo notebook
 1. Check out a copy of this repository
 1. (optional) Use the script `env.sh` to set up an Anaconda environment for running the code in this repository.
 1. Type `jupyter lab` from the root of your local source tree to start a [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) environment.
-1. Navigate to the example notebook `notebooks/Person.ipynb`
+1. Navigate to the `notebooks` directory and choose any of the notebooks there
+
+
+## Contributing
+
+This project is an IBM open source project. We are developing the code in the open under the [Apache License](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/LICENSE), and we welcome contributions from both inside and outside IBM.
 
-### Installation instructions
+To contribute, just open a Github issue or submit a pull request. Be sure to include a copy of the [Developer's Certificate of Origin 1.1](https://elinux.org/Developer_Certificate_Of_Origin) along with your pull request.
 
-We have not yet posted a release of this project, but you can install by
-building a `pip` package or by directly importing the contents of the
-`text_extensions_for_pandas` source tree.
 
-To build a pip package from your local copy:
-1. (optional) Activate the `pd` environment that `env.sh` creates
-1. `python3 setup.py sdist bdist_wheel`
-1. The package's `.whl` file will appear under the `dist` directory.
+## Building and Running Tests
 
-To build and install a pip package from the head of the master branch:
+Before building the code in this repository, we recommend that you use the
+provided script `env.sh` to set up a consistent build environment:
 ```
-pip install git+https://github.com/CODAIT/text-extensions-for-pandas
+$ ./env.sh myenv
+$ conda activate myenv
 ```
+(replace `myenv` with your choice of environment name).
 
-To directly import the contents of the `text_extensions_for_pandas` source tree
-as a Python package:
-1. Add the root directory of your local copy of this repository to the
-front of
-```python
-import text_extensions_for_pandas as tp
+To run tests, navigate to the root of your local copy and run:
+```
+pytest
 ```
 
-## Contributing
+To build pip and source code packages:
 
-This project is an IBM open source project. We are developing the code in the open under the [Apache License](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/LICENSE), and we welcome contributions from both inside and outside IBM.
+```
+python setup.py sdist bdist_wheel
+```
+
+(outputs go into `./dist`).
+
+To build API documentation, run:
+
+```
+./generate_docs.sh
+```
 
-To contribute, just open a Github issue or submit a pull request. Be sure to include a copy of the [Developer's Certificate of Origin 1.1](https://elinux.org/Developer_Certificate_Of_Origin) along with your pull request.
 
-## Running Tests
 
-To run regression tests:
-1. (optional) Use the script `env.sh` to set up an Anaconda environment
-1. Run `python -m unittest discover` from the root of your local copy
 
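The Features section of the rewritten README lends itself to a quick end-to-end illustration. The following is a minimal sketch, not taken from the commit itself: it assumes a `SpanArray` constructor that takes the target text plus parallel begin/end character offsets, and a `TensorArray` that wraps a NumPy array with one row per cell (the span offsets come from the spaCy example the old README used); consult the API documentation for the exact signatures.

```python
import numpy as np
import pandas as pd
import text_extensions_for_pandas as tp

text = "At eight o'clock on Thursday morning, Arthur didn't feel very good."

# SpanArray: each element marks a [begin, end) character region of `text`,
# keeping NLP results connected to the document they describe.
# (Constructor signature is an assumption; check the API docs.)
spans = tp.SpanArray(text, [3, 20, 38], [16, 28, 44])

df = pd.DataFrame({
    "span": spans,
    "label": ["TIME", "DATE", "PERSON"],
})

# TensorArray: one fixed-shape tensor per row -- for example, a
# 768-dimensional BERT-style embedding vector for each span above.
df["embedding"] = tp.TensorArray(np.zeros((3, 768), dtype=np.float32))

print(df.dtypes)  # "span" and "embedding" show up as Pandas extension dtypes
```

Because spans and tensors live in ordinary Pandas extension arrays, the standard DataFrame machinery (filtering, joins, groupby) applies directly to NLP intermediate data, which is the premise of the new README.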

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.

docs/conf.py renamed to api_docs/conf.py

Lines changed: 23 additions & 10 deletions
@@ -43,20 +43,33 @@
 # What Sphinx extensions to activate. If something is not on this list, it
 # won't run.
 extensions = [
-    "sphinxcontrib.apidoc",
-    "sphinx.ext.autodoc",  # Needed for the sphinxcontrib.apidoc extension
-    # "sphinx.ext.coverage",
-    # "sphinx.ext.napoleon",
-    # "sphinx.ext.autosummary",
-    # "sphinx.ext.intersphinx",
+    "sphinx.ext.autodoc",
+    "sphinx.ext.coverage",
+    "sphinx.ext.napoleon",
+    "sphinx.ext.autosummary",
+    "sphinx.ext.intersphinx",
+
+    # Uncomment the following line to enable full automatic generation of
+    # API documentation files from code (currently we hard-code an
+    # entry point for each module and rely on autodoc)
+    # "sphinxcontrib.apidoc"
 ]
 
-# Configure the sphinxcontrib.apidoc extension
+# Configure the sphinx.ext.autodoc extension
+autodoc_default_options = {
+    # TODO: Re-enable this once readthedocs.org upgrades to a version of
+    # Sphinx where True is an acceptable value for the "members" option.
+    # Then remove the redundant :members: and :undoc-members: annotations
+    # from index.rst.
+    #"members": True,
+    #"undoc-members": True,
+}
+
+# Configure the sphinxcontrib.apidoc extension (currently not used)
 apidoc_module_dir = "../text_extensions_for_pandas"
 apidoc_output_dir = "."
-apidoc_excluded_paths = ["*test_*"]
-apidoc_separate_modules = False
-
+apidoc_excluded_paths = ["test_*.py"]
+apidoc_separate_modules = True
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']
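The TODO comment in the new `autodoc_default_options` block spells out the intended end state. As a sketch: once readthedocs.org runs a Sphinx release that accepts `True` for these options, the block would reduce to the following, and the per-module `:members:`/`:undoc-members:` annotations could be dropped from `index.rst`.

```python
# Hypothetical future form of this block in api_docs/conf.py (not part of
# this commit): document every member, including ones without docstrings.
autodoc_default_options = {
    "members": True,
    "undoc-members": True,
}
```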
