# Text Extensions for Pandas

[![Documentation Status](https://readthedocs.org/projects/text-extensions-for-pandas/badge/?version=latest)](https://text-extensions-for-pandas.readthedocs.io/en/latest/?badge=latest)

Natural language processing support for Pandas dataframes.

Text Extensions for Pandas turns Pandas DataFrames into a universal data structure for representing intermediate data in all phases of your NLP application development workflow.

## Features

### SpanArray: A Pandas extension type for *spans* of text

* Connect features with regions of a document
* Visualize the internal data of your NLP application
* Analyze the accuracy of your models
* Combine the results of multiple models
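
As a quick sketch of how this looks in practice (the `SpanArray` constructor arguments shown here, the target text plus parallel begin/end character offsets, are an assumption based on the feature description; see the example notebooks for the authoritative API):

```python
import pandas as pd
import text_extensions_for_pandas as tp

text = "Arthur didn't feel very good."

# A SpanArray ties [begin, end) character offsets back to the target text,
# so each cell holds a region of the document rather than a bare string.
spans = tp.SpanArray(text, [0, 7, 14], [6, 13, 18])  # "Arthur", "didn't", "feel"

# Spans live in a regular DataFrame column alongside other feature columns.
df = pd.DataFrame({"span": spans, "pos": ["NNP", "VBD", "VB"]})
```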

### TensorArray: A Pandas extension type for tensors

* Represent BERT embeddings in a Pandas series
* Store logits and other feature vectors in a Pandas series
* Store an entire time series in each cell of a Pandas series
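
A minimal sketch of the pattern (assuming `TensorArray` wraps a NumPy array whose leading axis is the row axis; the shapes and values here are purely illustrative):

```python
import numpy as np
import pandas as pd
import text_extensions_for_pandas as tp

# Four tokens, one 768-dimensional embedding per token, all in one column.
embeddings = tp.TensorArray(np.zeros((4, 768), dtype=np.float32))
df = pd.DataFrame({"token": ["At", "eight", "o'clock", "on"],
                   "embedding": embeddings})
```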

### Pandas front-ends for popular NLP toolkits

* [SpaCy](https://spacy.io/)
* [Transformers](https://github.com/huggingface/transformers)
* [IBM Watson Natural Language Understanding](https://www.ibm.com/cloud/watson-natural-language-understanding)
* [IBM Watson Discovery Table Understanding](https://cloud.ibm.com/docs/discovery-data?topic=discovery-data-understanding_tables)
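
For instance, the SpaCy front-end can turn a parsed document into a DataFrame with one row per token. The sketch below assumes a `tp.io.spacy.make_tokens_and_features` entry point; consult the example notebooks for the exact calls:

```python
import spacy
import text_extensions_for_pandas as tp

nlp = spacy.load("en_core_web_sm")

# One row per token, with spans and part-of-speech tags as columns.
# (Entry point name assumed; see the example notebooks for the exact API.)
tokens_df = tp.io.spacy.make_tokens_and_features(
    "Arthur didn't feel very good.", nlp)
```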

## Installation

This library requires Python 3.7+, Pandas, and NumPy.

To install the latest release, just run:
```
pip install text-extensions-for-pandas
```

Depending on your use case, you may also need the following additional packages:
* `spacy` (for SpaCy support)
* `transformers` (for Transformers support)
* `ibm_watson` (for IBM Watson support)

## Installation from Source

If you'd like to try out the very latest version of our code, you can install directly from the head of the master branch:
```
pip install git+https://github.com/CODAIT/text-extensions-for-pandas
```

You can also directly import our package from your local copy of the `text_extensions_for_pandas` source tree. Just add the root of your local copy of this repository to the front of `sys.path`.
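
A minimal sketch (the relative path assumes you start Python from the repository root; substitute the actual location of your checkout):

```python
import sys

# Make the local source tree shadow any installed copy of the package.
sys.path.insert(0, ".")  # or the absolute path to your repository root

import text_extensions_for_pandas as tp
```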

## Documentation

For examples of how to use the library, take a look at the notebooks in [this directory](https://github.com/CODAIT/text-extensions-for-pandas/tree/master/notebooks).

API documentation can be found at [https://text-extensions-for-pandas.readthedocs.io/en/latest/](https://text-extensions-for-pandas.readthedocs.io/en/latest/).

## Contents of this repository

* **`text_extensions_for_pandas`**: Source code for the `text_extensions_for_pandas` module.
* **`env.sh`**: Script to create a conda environment `pd` capable of running the notebooks and test cases in this project
* **`generate_docs.sh`**: Script to build the [API documentation](https://readthedocs.org/projects/text-extensions-for-pandas/)
* **`api_docs`**: Configuration files for `generate_docs.sh`
* **`config`**: Configuration files for `env.sh`
* **`docs`**: Project web site
* **`notebooks`**: Example notebooks
* **`resources`**: Various input files used by our example notebooks
* **`test_data`**: Data files for regression tests. The tests themselves are located adjacent to the library code files.
* **`tutorials`**: Detailed tutorials on using Text Extensions for Pandas to cover complex end-to-end NLP use cases (work in progress)

## Instructions to run a demo notebook
1. Check out a copy of this repository.
1. (optional) Use the script `env.sh` to set up an Anaconda environment for running the code in this repository.
1. Type `jupyter lab` from the root of your local source tree to start a [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) environment.
1. Navigate to the `notebooks` directory and choose any of the notebooks there.

## Contributing

This project is an IBM open source project. We are developing the code in the open under the [Apache License](https://github.com/CODAIT/text-extensions-for-pandas/blob/master/LICENSE), and we welcome contributions from both inside and outside IBM.

To contribute, just open a GitHub issue or submit a pull request. Be sure to include a copy of the [Developer's Certificate of Origin 1.1](https://elinux.org/Developer_Certificate_Of_Origin) along with your pull request.
80 | 98 |
|
81 |
| -We have not yet posted a release of this project, but you can install by |
82 |
| -building a `pip` package or by directly importing the contents of the |
83 |
| -`text_extensions_for_pandas` source tree. |
84 | 99 |
|
85 |
| -To build a pip package from your local copy: |
86 |
| -1. (optional) Activate the `pd` environment that `env.sh` creates |
87 |
| -1. `python3 setup.py sdist bdist_wheel` |
88 |
| -1. The package's `.whl` file will appear under the `dist` directory. |
| 100 | +## Building and Running Tests |

Before building the code in this repository, we recommend that you use the provided script `env.sh` to set up a consistent build environment:
```
$ ./env.sh myenv
$ conda activate myenv
```
(replace `myenv` with your choice of environment name).

To run tests, navigate to the root of your local copy and run:
```
pytest
```

To build pip and source code packages:

```
python setup.py sdist bdist_wheel
```

(outputs go into `./dist`).

To build API documentation, run:

```
./generate_docs.sh
```