Commit 0c87554

authored
Merge pull request #3 from BeastByteAI/ner-docs
ner-docs
2 parents 3e5057e + cf71d8b commit 0c87554

9 files changed: +131 -699 lines changed

.github/workflows/deploy.yml

Lines changed: 1 addition & 0 deletions
@@ -33,3 +33,4 @@ jobs:
 with:
   github_token: ${{ secrets.GH_ACCESS_TOKEN }}
   publish_dir: ./out
+  cname: skllm.beastbyte.ai

src/app/docs/ner/page.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
---
title: Named Entity Recognition
nextjs:
  metadata:
    title: Named Entity Recognition
    description: Learn about NER.
---

## Overview

{% callout title="Warning" type="warning" %}
Named Entity Recognition is an experimental feature and may be unstable. The API and functionality may change.
{% /callout %}

Named Entity Recognition (NER) is the process of locating and classifying named entities in a given text.

Currently, Scikit-LLM provides a single NER estimator, `Explainable NER`, which works only with the GPT family of models.

Example usage:

```python
from skllm.models.gpt.tagging.ner import GPTExplainableNER as NER

# Entity labels (uppercase) mapped to short natural-language descriptions.
entities = {
    "PERSON": "A name of an individual.",
    "ORGANIZATION": "A name of a company.",
    "DATE": "A specific time reference."
}

data = [
    "Tim Cook announced new Apple products in San Francisco on June 3, 2022.",
    "Elon Musk visited the Tesla factory in Austin on January 10, 2021.",
    "Mark Zuckerberg introduced Facebook Metaverse in Silicon Valley on May 5, 2023."
]

# display_predictions=True also renders the tagged output with highlighted entities.
ner = NER(entities=entities, display_predictions=True)
tagged = ner.fit_transform(data)
```

The model tags the entities and provides a short reasoning for each choice. If the `display_predictions` parameter is set to `True`, the model outputs are parsed automatically and presented in a human-readable form: each entity is highlighted, and its explanation is shown when hovering over the entity.

Example output:

==============================
{% html innerHTML="<style>:root{--font-size:16px}.entity{font-size:var(--font-size);padding:2px 4px;border-radius:4px;font-weight:700}@media (prefers-color-scheme:light){.entity-person{background-color:#add8e6;color:#000;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-person{background-color:#add8e6;color:#000;border-radius:4px;padding:2px 4px;font-weight:700}.entity-organization{background-color:#90ee90;color:#000;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-organization{background-color:#90ee90;color:#000;border-radius:4px;padding:2px 4px;font-weight:700}.entity-date{background-color:#f08080;color:#000;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-date{background-color:#f08080;color:#000;border-radius:4px;padding:2px 4px;font-weight:700}}@media (prefers-color-scheme:dark){.entity-person{background-color:#00008b;color:#fff;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-person{background-color:#00008b;color:#fff;border-radius:4px;padding:2px 4px;font-weight:700}.entity-organization{background-color:#006400;color:#fff;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-organization{background-color:#006400;color:#fff;border-radius:4px;padding:2px 4px;font-weight:700}.entity-date{background-color:#8b0000;color:#fff;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-date{background-color:#8b0000;color:#fff;border-radius:4px;padding:2px 4px;font-weight:700}}</style><div><style>.entity-legend-person-light{background-color:#add8e6;color:#000;padding:2px 4px;border-radius:4px;font-weight:700}.entity-legend-person-dark{background-color:#00008b;color:#fff;padding:2px 4px;border-radius:4px;font-weight:700}.entity-legend-person{cursor:pointer;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-organization-light{background-color:#90ee90;color:#000;padding:2px 
4px;border-radius:4px;font-weight:700}.entity-legend-organization-dark{background-color:#006400;color:#fff;padding:2px 4px;border-radius:4px;font-weight:700}.entity-legend-organization{cursor:pointer;border-radius:4px;padding:2px 4px;font-weight:700}.entity-legend-date-light{background-color:#f08080;color:#000;padding:2px 4px;border-radius:4px;font-weight:700}.entity-legend-date-dark{background-color:#8b0000;color:#fff;padding:2px 4px;border-radius:4px;font-weight:700}.entity-legend-date{cursor:pointer;border-radius:4px;padding:2px 4px;font-weight:700}</style>Entities:<span class='entity-legend-person' title='PERSON: A name of an individual.' style='margin-right:4px'>PERSON</span><span class='entity-legend-organization' title='ORGANIZATION: A name of a company.' style='margin-right:4px'>ORGANIZATION</span><span class='entity-legend-date' title='DATE: A specific time reference.' style='margin-right:4px'>DATE</span></div><br><span class='entity entity-person' title='PERSON: Tim Cook is the name of an individual, specifically the CEO of Apple.'>Tim Cook</span>announced new<span class='entity entity-organization' title='ORGANIZATION: Apple is the name of a company, specifically a well-known technology company.'>Apple</span>products in San Francisco on<span class='entity entity-date' title='DATE: June 3, 2022 is a specific time reference, indicating a particular date.'>June 3, 2022</span>.<br><span class='entity entity-person' title='PERSON: Elon Musk is a well-known individual, making it a clear example of a PERSON entity.'>Elon Musk</span>visited the<span class='entity entity-organization' title='ORGANIZATION: Tesla is a well-known company, making it a clear example of an ORGANIZATION entity.'>Tesla</span>factory in Austin on<span class='entity entity-date' title='DATE: January 10, 2021 is a specific date, making it a clear example of a DATE entity.'>January 10, 2021</span>.<br><span class='entity entity-person' title='PERSON: Mark Zuckerberg is the name of an 
individual, specifically the CEO of Facebook.'>Mark Zuckerberg</span>introduced<span class='entity entity-organization' title='ORGANIZATION: Facebook is the name of a company, specifically a social media giant.'>Facebook</span>Metaverse in Silicon Valley on<span class='entity entity-date' title='DATE: May 5, 2023 is a specific time reference, indicating a particular date.'>May 5, 2023</span>.<br>"%}
{% /html %}
==============================

The `display_predictions` functionality works in both Jupyter Notebook and plain Python scripts. When used outside Jupyter, a new HTML page is generated automatically and opened in a new browser window.

## Sparse vs Dense NER

Predictions can be generated in two modes: sparse and dense.

In dense mode, the model produces the complete tagged output directly. In sparse mode, the model produces only a list of entities, which is then mapped onto the text via regex.
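
The regex mapping step of sparse mode can be pictured roughly as follows. This is a minimal illustrative sketch, not Scikit-LLM's actual implementation: the function name `apply_sparse_tags`, the `(text, LABEL)` pair format, and the XML-like output are all assumptions made for the example.

```python
import re


def apply_sparse_tags(text, entities):
    """Wrap every occurrence of each entity string in an XML-like tag.

    `entities` is a list of (surface_text, LABEL) pairs, a hypothetical
    format chosen for this sketch.
    """
    if not entities:
        return text
    label_by_text = dict(entities)
    # Try longer strings first so that a longer entity wins over its prefix.
    alternatives = sorted(label_by_text, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(a) for a in alternatives))

    def wrap(match):
        label = label_by_text[match.group(0)]
        return f"<{label}>{match.group(0)}</{label}>"

    # A single pass over the text avoids re-tagging already inserted tags.
    return pattern.sub(wrap, text)


sentence = "Tim Cook announced new Apple products in San Francisco on June 3, 2022."
found = [("Tim Cook", "PERSON"), ("Apple", "ORGANIZATION"), ("June 3, 2022", "DATE")]
tagged_sentence = apply_sparse_tags(sentence, found)
print(tagged_sentence)
```

Because the tags are only inserted around existing substrings, removing them again restores the input text.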

In most scenarios, sparse mode is preferable for the following reasons:
- fewer output tokens (cheaper to use);
- strict validation: the output is guaranteed to be invertible and to contain only the specified entities;
- higher accuracy, especially with smaller models.

Dense mode should only be used when the following conditions are met:
- a larger model is used (e.g. gpt-4);
- the text is expected to contain multiple (distinct) instances of lexically ambiguous words.

For example, in the sentence "**Apple** is the favorite fruit of the CEO of **Apple**", the first and second occurrences of the word "Apple" should be classified as different entities, which is only possible in dense mode.
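
A small sketch of why a string-based mapping cannot separate the two mentions: a regex matches on surface form alone, so both occurrences of "Apple" are found and would receive the same label. The snippet uses only Python's `re` module and is not Scikit-LLM code; the `ORGANIZATION` tag format is an assumption for illustration.

```python
import re

sentence = "Apple is the favorite fruit of the CEO of Apple"

# Assumption for this sketch: a sparse prediction names the entity string,
# not its position, so a regex lookup finds every occurrence of that string.
spans = [m.span() for m in re.finditer(re.escape("Apple"), sentence)]
print(spans)

# Both matches end up tagged identically, including the fruit.
tagged_all = re.sub(re.escape("Apple"), "<ORGANIZATION>Apple</ORGANIZATION>", sentence)
print(tagged_all)
```

In dense mode the model writes the tags itself, so it can tag the last "Apple" and leave the first one alone.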

## API Reference

The following API reference lists only the parameters needed to initialize the estimator. The remaining methods follow the syntax of a scikit-learn transformer.

### GPTExplainableNER

```python
from skllm.models.gpt.tagging.ner import GPTExplainableNER
```

| **Parameter** | **Type** | **Description** |
| ------------- | -------- | ------------------------ |
| `entities` | `dict` | A dictionary of entities to recognize, with keys as **uppercase** entity names and values as descriptions. |
| `display_predictions` | `Optional[bool]` | Determines whether to display predictions, by default False. |
| `sparse_output` | `Optional[bool]` | Determines whether to generate a sparse representation of the predictions, by default True. |
| `model` | `Optional[str]` | A model to use, by default "gpt-4o". |
| `key` | `Optional[str]` | Estimator-specific API key; if None, retrieved from the global config, by default None. |
| `org` | `Optional[str]` | Estimator-specific ORG key; if None, retrieved from the global config, by default None. |
| `num_workers` | `Optional[int]` | Number of workers (threads) to use, by default 1. |

src/app/docs/quick-start/page.md

Lines changed: 0 additions & 92 deletions
This file was deleted.
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
---
title: Overview
nextjs:
  metadata:
    title: Overview
    description: Learn about text tagging.
---

In Scikit-LLM, tagging is any task that takes raw text and returns the same text with XML-like tags inserted.

For example, a sentiment analysis task could look as follows:

Input:
```text
I love my new phone, but I am disappointed with the battery life.
```

Output:
```xml
<positive>I love my new phone,</positive> <negative>but I am disappointed with the battery life.</negative>
```

Ideally, such a tagging process should be invertible, so that the original text can always be reconstructed from the tagged one. However, this is not always feasible and is therefore not a mandatory requirement.
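
Invertibility can be checked by removing the tags and comparing the result with the input. The snippet below is a minimal sketch using only Python's `re` module; the helper `strip_tags` is a hypothetical name and not part of Scikit-LLM.

```python
import re


def strip_tags(tagged_text):
    """Remove XML-like opening and closing tags, keeping the text between them."""
    return re.sub(r"</?[A-Za-z_][\w-]*>", "", tagged_text)


original = "I love my new phone, but I am disappointed with the battery life."
tagged = (
    "<positive>I love my new phone,</positive> "
    "<negative>but I am disappointed with the battery life.</negative>"
)

# The tagging is invertible if removing the tags restores the input exactly.
is_invertible = strip_tags(tagged) == original
print(is_invertible)
```

A tagger that rewrites, drops, or reorders words between the tags would fail this check.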

src/app/docs/use-cases-local-chatbot/page.md

Lines changed: 0 additions & 78 deletions
This file was deleted.
