
Commit 25b0fa2

Wrote a basic README file

1 parent 3281dcd commit 25b0fa2

File tree

2 files changed: +90, -4 lines

Diff for: CONTRIBUTING.md

This project is an IBM open source project. We are developing the code in the open under the [Apache License](https://github.com/frreiss/text-extensions-for-pandas/blob/master/LICENSE), and we welcome contributions from both inside and outside IBM.

To contribute, just open a GitHub issue or submit a pull request. Be sure to include a copy of the [Developer's Certificate of Origin 1.1](https://elinux.org/Developer_Certificate_Of_Origin) along with your pull request.

Diff for: README.md

# Text Extensions for Pandas

Natural language processing support for Pandas dataframes.

**This project is under development.** Releases are not yet available.

## Purpose of this Project

Natural language processing (NLP) applications typically consist of a large number of components tied together in a complex pipeline. These components can range from deep parsers and machine learning models to lookup tables and business rules. All of them work by creating and manipulating data structures that represent data about the target text: things like tokens, entities, parse trees, and so on.

Libraries for common NLP tasks tend to implement their own custom data structures. They also implement basic low-level operations, like filtering and pattern matching, over these data structures. For example, `nltk` represents named entities as a list of Python objects:

```python
>>> entities = nltk.chunk.ne_chunk(tagged)
>>> entities
Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'),
('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'),
Tree('PERSON', [('Arthur', 'NNP')]),
('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'),
('very', 'RB'), ('good', 'JJ'), ('.', '.')])
```

...while SpaCy represents named entities with an `Iterable` of `Span` objects:

```python
>>> doc = nlp("At eight o'clock on Thursday morning, Arthur didn't feel very good.")
>>> ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
>>> ents
[("eight o'clock", 3, 16, 'TIME'), ('Thursday', 20, 28, 'DATE'), ('morning', 29, 36, 'TIME'), ('Arthur', 38, 44, 'PERSON')]
```

...or an `Iterable` of `Token` objects with tags:

```python
>>> doc = nlp("At eight o'clock on Thursday morning, Arthur didn't feel very good.")
>>> token_info = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]
>>> token_info
[('At', 'O', ''), ('eight', 'B', 'TIME'), ("o'clock", 'I', 'TIME'), ('on', 'O', ''), ('Thursday', 'B', 'DATE'), ('morning', 'B', 'TIME'), (',', 'O', ''), ('Arthur', 'B', 'PERSON'), ('did', 'O', ''), ("n't", 'O', ''), ('feel', 'O', ''), ('very', 'O', ''), ('good', 'O', ''), ('.', 'O', '')]
```

...and IBM Watson Natural Language Understanding represents named entities as an array of JSON records:

```json
{
  "entities": [
    {
      "type": "Person",
      "text": "Arthur",
      "count": 1,
      "confidence": 0.986158
    }
  ]
}
```

This duplication leads to a great deal of redundant work when building NLP applications. Developers need to understand and remember how every component represents every type of data. They need to write code to convert among different representations, and they need to implement common operations like pattern matching multiple times for different, equivalent data structures.
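
As a small illustration of that conversion burden, here is the kind of one-off glue code an application might need to turn the `nltk` tree above into token-level triples like SpaCy's (a sketch only; `tree_to_triples` is a hypothetical helper, not part of any of these libraries):

```python
from nltk import Tree

# The nltk output from the example above, reconstructed literally.
entities = Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'),
                      ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'),
                      Tree('PERSON', [('Arthur', 'NNP')]),
                      ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'),
                      ('very', 'RB'), ('good', 'JJ'), ('.', '.')])

def tree_to_triples(tree):
    """Flatten an ne_chunk-style tree into (token, tag, entity_type) triples."""
    triples = []
    for node in tree:
        if isinstance(node, Tree):   # an entity subtree, e.g. Tree('PERSON', ...)
            triples.extend((tok, tag, node.label()) for tok, tag in node.leaves())
        else:                        # a plain (token, tag) pair outside any entity
            triples.append(node + ('',))
    return triples
```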

It is our belief that, with a few targeted improvements, we can make [Pandas](https://pandas.pydata.org/) dataframes into a universal representation for all the data that flows through NLP applications. Such a universal data structure would eliminate redundancy and make application code simpler, faster, and easier to debug.
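
For example, the entity tuples from the SpaCy output above map directly onto a dataframe, even before any extensions (a minimal sketch using out-of-the-box Pandas):

```python
import pandas as pd

# The same (text, start_char, end_char, label) tuples that SpaCy produced above.
ents = [("eight o'clock", 3, 16, 'TIME'), ('Thursday', 20, 28, 'DATE'),
        ('morning', 29, 36, 'TIME'), ('Arthur', 38, 44, 'PERSON')]

entities = pd.DataFrame(ents, columns=["text", "start_char", "end_char", "label"])

# Ordinary dataframe operations now double as NLP operations,
# e.g. selecting all person mentions:
people = entities[entities["label"] == "PERSON"]
```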

This project aims to create the extensions that will turn Pandas into this universal data structure. In particular, we plan to add three categories of extension:

* **New Pandas series types to cover spans and tensors.** These types of data are very important for NLP applications but are cumbersome to represent with "out-of-the-box" Pandas. The new extensions API that Pandas released in 2019 makes it possible to create performant extension types. We will use this API to add three new series types: `CharSpan`, `TokenSpan` (a span with token offsets), and `Tensor` (a rough illustration of the span idea follows this list).
* **An implementation of spanner algebra over Pandas dataframes.** The core operations of the [Document Spanners](https://researcher.watson.ibm.com/researcher/files/us-fagin/jacm15.pdf) formalism represent tasks that occur repeatedly in NLP applications. Many of these core operations are already present in Pandas. We will create high-performance implementations of the remaining operations over Pandas dataframes. This work will build directly on our Pandas extension types for representing spans.
* **An implementation of the Gremlin graph query language over Pandas dataframes.** As one of the most widely used graph query languages, [Gremlin](https://tinkerpop.apache.org/gremlin.html) is a natural choice for NLP tasks that involve parse trees and knowledge graphs. There are many graph database systems that support Gremlin, including Apache TinkerPop, JanusGraph, Neo4j, Amazon Neptune, Azure CosmosDB, Apache Spark, and Apache Giraph. However, using Gremlin in Python programs is difficult today, as the Python support of existing Gremlin providers is generally weak. We will create an embedded Gremlin engine that operates directly over Pandas dataframes. This embedded engine will give NLP developers the power of a graph query language without having to manage an external graph database.
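
As the rough illustration promised above: plain Pandas can already approximate character-offset spans with its built-in interval extension type, which hints at why a dedicated span type is a natural fit for the extensions API. This is purely an analogy, not the project's API; the planned `CharSpan` type would play a similar role while also remembering the target text.

```python
import pandas as pd

text = "At eight o'clock on Thursday morning, Arthur didn't feel very good."

# Character-offset spans, approximated as half-open intervals.
spans = pd.arrays.IntervalArray.from_tuples(
    [(3, 16), (20, 28), (29, 36), (38, 44)], closed="left")
entities = pd.DataFrame({"span": spans,
                         "label": ["TIME", "DATE", "TIME", "PERSON"]})

# Recover the covered text of each span from its character offsets.
entities["covered_text"] = [text[int(iv.left):int(iv.right)]
                            for iv in entities["span"]]
```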

## Getting Started

### Contents of this repository

* **`text_extensions_for_pandas`**: Source code for the `text_extensions_for_pandas` module.
* **`notebooks`**: Demo notebooks.
* **`resources`**: Various input files used by the demo notebooks.
* **`env.sh`**: Script to create a conda environment `pd` capable of running the notebooks in this directory.

### Instructions to run a demo notebook

1. Check out a copy of this repository.
1. (optional) Use the script `env.sh` to set up an Anaconda environment for running the code in this repository.
1. Type `jupyter lab` from the root of your local source tree to start a [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) environment.
1. Navigate to the example notebook `notebooks/Person.ipynb`.
79+
80+
### Installation instructions
81+
82+
We have not yet implemented scripts to build `pip` packages, but you can directly import the contents of the `text_extensions_for_pandas` source tree as a Python package:
83+
84+
```python
85+
import text_extensions_for_pandas as tp
86+
```
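
If you start Python somewhere other than the root of your local source tree, one way to make the package importable is to put your checkout on `sys.path` first (a minimal sketch; the path below is a placeholder for wherever you cloned the repository):

```python
import sys

# Hypothetical location of your local checkout; adjust to taste.
sys.path.insert(0, "/path/to/text-extensions-for-pandas")

import text_extensions_for_pandas as tp
```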

## Contributing

This project is an IBM open source project. We are developing the code in the open under the [Apache License](https://github.com/frreiss/text-extensions-for-pandas/blob/master/LICENSE), and we welcome contributions from both inside and outside IBM.

To contribute, just open a GitHub issue or submit a pull request. Be sure to include a copy of the [Developer's Certificate of Origin 1.1](https://elinux.org/Developer_Certificate_Of_Origin) along with your pull request.
