This repository was archived by the owner on Jun 1, 2023. It is now read-only.

Commit 5baf493

Merge branch 'master' of https://github.com/kevinlu1248/pyate

2 parents: 51843a7 + 28d6e4d

6 files changed (+20 −39 lines)

.deepsource.toml (+10 lines)

@@ -0,0 +1,10 @@
+version = 1
+
+test_patterns = ["**.py"]
+
+[[analyzers]]
+name = "python"
+enabled = true
+
+[analyzers.meta]
+runtime_version = "3.x.x"

README.md (+3 −3 lines)

@@ -20,7 +20,7 @@ pip install pyate https://github.com/explosion/spacy-models/releases/download/en
 ```
 
 ## :rocket: Quickstart
-To get started, simply call one of the implemented algorithms. According to Astrakhantsev 2016, `combo_basic` is the most precise of the five algorithms, though `basic` and `cvalue` is not too far behind (see Precision). The same study shows that PU-ATR and KeyConceptRel have higher precision than `combo_basic` but are not implemented and PU-ATR take significantly more time since it uses machine learning.
+To get started, simply call one of the implemented algorithms. According to Astrakhantsev 2016, `combo_basic` is the most precise of the five algorithms, though `basic` and `cvalues` is not too far behind (see Precision). The same study shows that PU-ATR and KeyConceptRel have higher precision than `combo_basic` but are not implemented and PU-ATR take significantly more time since it uses machine learning.
 ```python3
 from pyate import combo_basic
 
@@ -88,7 +88,7 @@ __init__(
 where `func` is essentially your term extracting algorithm that takes in a corpus (either a string or iterator of strings) and outputs a Pandas Series of term-value pairs of terms and their respective termhoods. `func` is by default `combo_basic`. `args` and `kwargs` are for you to overide default values for the function, which you can find by running `help` (might document later on).
 
 ### Summary of functions
-Each of `cvalue, basic, combo_basic, weirdness` and `term_extractor` take in a string or an iterator of strings and outputs a Pandas Series of term-value pairs, where higher values indicate higher chance of being a domain specific term. Furthermore, `weirdness` and `term_extractor` take a `general_corpus` key word argument which must be an iterator of strings which defaults to the General Corpus described below.
+Each of `cvalues, basic, combo_basic, weirdness` and `term_extractor` take in a string or an iterator of strings and outputs a Pandas Series of term-value pairs, where higher values indicate higher chance of being a domain specific term. Furthermore, `weirdness` and `term_extractor` take a `general_corpus` key word argument which must be an iterator of strings which defaults to the General Corpus described below.
 
 All functions only take the string of which you would like to extract terms from as the mandatory input (the `technical_corpus`), as well as other tweakable settings, including `general_corpus` (contrasting corpus for `weirdness` and `term_extractor`), `general_corpus_size`, `verbose` (whether to print a progress bar), `weights`, `smoothing`, `have_single_word` (whether to have a single word count as a phrase) and `threshold`. If you have not read the papers and are unfamiliar with the algorithms, I recommend just using the default settings. Again, use `help` to find the details regarding each algorithm since they are all different.
 
@@ -117,7 +117,7 @@ Here is the average precision of some of the implemented algorithms using the Av
 ## :stars: Motivation
 This project was planned to be a tool to be connected to a Google Chrome Extension that highlights and defines key terms that the reader probably does not know of. Furthermore, term extraction is an area where there is not a lot of focused research on in comparison to other areas of NLP and especially recently is not viewed to be very practical due to the more general tool of NER tagging. However, modern NER tagging usually incorporates some combination of memorized words and deep learning which are spatially and computationally heavy. Furthermore, to generalize an algorithm to recognize terms to the ever growing areas of medical and AI research, a list of memorized words will not do.
 
-Of the five implemented algorithms, none are expensive, in fact, the bottleneck of the space allocation and computation expense is from the spaCy model and spaCy POS tagging. This is because they most rely simply on POS patterns, word frequencies, and the existence of embedded term candidates. For example, the term candidate "breast cancer" implies that "malignant breast cancer" is probably not a term and simply a form of "breast cancer" that is "malignant" (implemented in C-Value).
+Of the five implemented algorithms, none are expensive, in fact, the bottleneck of the space allocation and computation expense is from the spaCy model and spaCy POS tagging. This is because they mostly rely simply on POS patterns, word frequencies, and the existence of embedded term candidates. For example, the term candidate "breast cancer" implies that "malignant breast cancer" is probably not a term and simply a form of "breast cancer" that is "malignant" (implemented in C-Value).
 
 ## :pushpin: Todo
 * Add PU-ATR algorithm since its precision is a lot higher, though more computationally expensive
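The embedded-candidate idea from the Motivation paragraph in this README hunk can be sketched in a few lines (a hypothetical illustration only: `nested_counts` is not part of pyate, and the real C-Value score also weighs candidate length and frequency):

```python
# Sketch of the embedded term candidate idea: a candidate that occurs
# mostly inside a longer candidate (e.g. "breast cancer" inside
# "malignant breast cancer") is discounted, per the C-Value discussion.
from collections import Counter

def nested_counts(candidates, documents):
    """Count total occurrences of each candidate and how many of those
    occurrences sit inside a longer candidate."""
    # For each candidate, the longer candidates that contain it.
    longer = {
        c: [d for d in candidates if c in d and c != d] for c in candidates
    }
    freq, nested = Counter(), Counter()
    for doc in documents:
        for cand in candidates:
            occurrences = doc.count(cand)
            embedded = sum(doc.count(sup) for sup in longer[cand])
            freq[cand] += occurrences
            # Each occurrence can be embedded at most once.
            nested[cand] += min(embedded, occurrences)
    return freq, nested

docs = ["breast cancer is common", "malignant breast cancer is rarer"]
freq, nested = nested_counts(
    ["breast cancer", "malignant breast cancer"], docs
)
assert freq["breast cancer"] == 2      # two occurrences in total
assert nested["breast cancer"] == 1    # one embedded in the longer term
assert nested["malignant breast cancer"] == 0
```

A candidate whose occurrences are almost all embedded is likely just a fragment of the longer term rather than a term in its own right, which is the signal C-Value exploits.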

src/pyate/cvalues.py (−1 line)

@@ -1,6 +1,5 @@
 # c_value
 
-import time
 import math
 from typing import List, Mapping
 

src/pyate/term_extraction.py (+7 −25 lines)

@@ -1,41 +1,23 @@
-# c_value
+# term_extraction.py
 
-import pickle
-import time
-import math
-from collections import Iterable
+import collections
+from collections import defaultdict
+import pkg_resources
 from multiprocessing import Pool
 from typing import Iterable, Union, Sequence, Callable
-from distutils.sysconfig import get_python_lib
 
 import spacy
 from spacy.matcher import Matcher
 from tqdm import tqdm
 import pandas as pd
-from collections import defaultdict
 import ahocorasick
 import numpy as np
-import pkg_resources
 
 start_ = 0
 tmp = 0
 doctime, matchertime = 0, 0
 Corpus = Union[str, Sequence[str]]
 
-# import glob
-# print(get_python_lib())
-# print(glob.glob("/home/kevin/PycharmProjects/pyate/venv/lib/python3.6/site-packages/*.csv"))
-
-
-def start():
-    global start_
-    start_ = time.time()
-
-
-def end():
-    global start_
-    print(time.time() - start_)
-
 
 class TermExtraction:
     nlp = spacy.load("en_core_web_sm", parser=False, entity=False)

@@ -140,7 +122,7 @@ def count_terms_from_documents(self, seperate: bool = False, verbose: bool = Fal
             self.__term_counts = pd.Series(self.count_terms_from_document(self.corpus))
             return self.__term_counts
         # elif type(self.corpus) is list or type(self.corpus) is pd.Series:
-        elif isinstance(self.corpus, Iterable):
+        elif isinstance(self.corpus, collections.Iterable):
             if seperate:
                 term_counters = []
             else:

@@ -224,10 +206,10 @@ def term_extraction_decoration(self, *args, **kwargs):
     wiki = pd.read_pickle(PATH_TO_GENERAL_DOMAIN)
     pmc = pd.read_pickle(PATH_TO_TECHNICAL_DOMAIN)
     vocab = ["Cutaneous melanoma", "cancer", "secondary clusters", "bio"]
-    start()
+    # start()
     print(
         TermExtraction(pmc[:100]).count_terms_from_documents(
             seperate=True, verbose=True
         )
     )
-    end()
+    # end()
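One detail worth flagging in the hunk above: `collections.Iterable` was deprecated in Python 3.3 and removed in Python 3.10, so on modern interpreters the check must import from `collections.abc`. A minimal sketch of the same str-vs-iterable branching (`is_corpus_iterable` is a hypothetical helper, not pyate API):

```python
# collections.Iterable no longer exists on Python 3.10+; the supported
# spelling is collections.abc.Iterable.
from collections.abc import Iterable

def is_corpus_iterable(corpus):
    # Strings are themselves iterable, so treat str as a single
    # document first, mirroring the branching in
    # count_terms_from_documents.
    return not isinstance(corpus, str) and isinstance(corpus, Iterable)

assert not is_corpus_iterable("a single technical document")
assert is_corpus_iterable(["doc one", "doc two"])
```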

src/pyate/term_extractor.py (−4 lines)

@@ -1,11 +1,7 @@
 # c_value
 
-import time
 import math
 from typing import Mapping, Sequence
-
-import spacy
-import pickle
 import pandas as pd
 import numpy as np
 
src/pyate/weirdness.py (−6 lines)

@@ -1,12 +1,6 @@
 # weirdness.py
 
-import pickle
-import time
-import math
-import json
 from typing import Mapping
-
-import spacy
 import pandas as pd
 
 from .term_extraction import TermExtraction, add_term_extraction_method, Corpus

0 commit comments