
Commit 5e72ec2

Merge remote-tracking branch 'origin/main' into tests/text_splitter

2 parents: 33d0538 + 0e7d22b

202 files changed: +3081 -5242 lines


.semversioner/1.1.0.json

+58

@@ -0,0 +1,58 @@
+{
+  "changes": [
+    {
+      "description": "Make gleanings independent of encoding",
+      "type": "minor"
+    },
+    {
+      "description": "Remove DataShaper (first steps).",
+      "type": "minor"
+    },
+    {
+      "description": "Remove old pipeline runner.",
+      "type": "minor"
+    },
+    {
+      "description": "new search implemented as a new option for the api",
+      "type": "minor"
+    },
+    {
+      "description": "Fix gleanings loop check",
+      "type": "patch"
+    },
+    {
+      "description": "Implement cosmosdb storage option for cache and output",
+      "type": "patch"
+    },
+    {
+      "description": "Move extractor code to co-locate with operations.",
+      "type": "patch"
+    },
+    {
+      "description": "Remove config input models.",
+      "type": "patch"
+    },
+    {
+      "description": "Ruff update",
+      "type": "patch"
+    },
+    {
+      "description": "Simplify and streamline internal config.",
+      "type": "patch"
+    },
+    {
+      "description": "Simplify callbacks model.",
+      "type": "patch"
+    },
+    {
+      "description": "Streamline flows.",
+      "type": "patch"
+    },
+    {
+      "description": "fix instantiation of storage classes.",
+      "type": "patch"
+    }
+  ],
+  "created_at": "2025-01-07T20:25:57+00:00",
+  "version": "1.1.0"
+}

.semversioner/1.1.1.json

+14

@@ -0,0 +1,14 @@
+{
+  "changes": [
+    {
+      "description": "Fix a bug on creating community hierarchy for dynamic search",
+      "type": "patch"
+    },
+    {
+      "description": "Increase LOCAL_SEARCH_COMMUNITY_PROP to 15%",
+      "type": "patch"
+    }
+  ],
+  "created_at": "2025-01-08T21:53:16+00:00",
+  "version": "1.1.1"
+}

.semversioner/1.1.2.json

+10

@@ -0,0 +1,10 @@
+{
+  "changes": [
+    {
+      "description": "Basic Rag minor fix",
+      "type": "patch"
+    }
+  ],
+  "created_at": "2025-01-09T22:29:23+00:00",
+  "version": "1.2.0"
+}

.semversioner/next-release/patch-20241121202210026640.json

-4
This file was deleted.

.semversioner/next-release/patch-20241212190223784600.json

-4
This file was deleted.

.semversioner/next-release/patch-20241213181544864279.json

-4
This file was deleted.

.semversioner/next-release/patch-20241224192900934104.json

-4
This file was deleted.

.semversioner/next-release/patch-20241227225850465466.json

-4
This file was deleted.
@@ -0,0 +1,4 @@
+{
+  "type": "patch",
+  "description": "Set default rate limits."
+}
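
Each next-release change file holds a single type/description pair; at release time, semversioner folds them into a versioned file like the 1.1.0 JSON above and regenerates the changelog (see the CHANGELOG.md diff below). As a rough illustration of that roll-up, here is a minimal sketch that renders one changelog section from a release file; the function name is hypothetical, and the minor-before-patch ordering is inferred from the CHANGELOG.md diff in this commit:

```python
import json
from pathlib import Path

def render_changelog_section(release_file: Path) -> str:
    """Render one CHANGELOG.md section from a semversioner release JSON.

    Expects the shape shown in this commit's .semversioner/*.json files:
    {"version": ..., "created_at": ..., "changes": [{"description", "type"}]}.
    """
    release = json.loads(release_file.read_text())
    lines = [f"## {release['version']}", ""]
    # Mirror the CHANGELOG.md diff below: minor entries listed before patch entries.
    for kind in ("minor", "patch"):
        lines += [
            f"- {kind}: {change['description']}"
            for change in release["changes"]
            if change["type"] == kind
        ]
    return "\n".join(lines)

# Example: reproduce the "## 1.1.0" section added to CHANGELOG.md in this commit.
print(render_changelog_section(Path(".semversioner/1.1.0.json")))
```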

CHANGELOG.md

+25

@@ -1,6 +1,31 @@
 # Changelog
 Note: version releases in the 0.x.y range may introduce breaking changes.
 
+## 1.1.2
+
+- patch: Basic Rag minor fix
+
+## 1.1.1
+
+- patch: Fix a bug on creating community hierarchy for dynamic search
+- patch: Increase LOCAL_SEARCH_COMMUNITY_PROP to 15%
+
+## 1.1.0
+
+- minor: Make gleanings independent of encoding
+- minor: Remove DataShaper (first steps).
+- minor: Remove old pipeline runner.
+- minor: new search implemented as a new option for the api
+- patch: Fix gleanings loop check
+- patch: Implement cosmosdb storage option for cache and output
+- patch: Move extractor code to co-locate with operations.
+- patch: Remove config input models.
+- patch: Ruff update
+- patch: Simplify and streamline internal config.
+- patch: Simplify callbacks model.
+- patch: Streamline flows.
+- patch: fix instantiation of storage classes.
+
 ## 1.0.1
 
 - patch: Fix encoding model config parsing

dictionary.txt

-4

@@ -148,10 +148,6 @@ codebases
 # Microsoft
 MSRC
 
-# Broken Upstream
-# TODO FIX IN DATASHAPER
-Arrary
-
 # Prompt Inputs
 ABILA
 Abila

docs/developing.md

+1

@@ -51,6 +51,7 @@ Available scripts are:
 - `poetry run poe test_unit` - This will execute unit tests.
 - `poetry run poe test_integration` - This will execute integration tests.
 - `poetry run poe test_smoke` - This will execute smoke tests.
+- `poetry run poe test_verbs` - This will execute tests of the basic workflows.
 - `poetry run poe check` - This will perform a suite of static checks across the package, including:
   - formatting
   - documentation formatting

docs/examples_notebooks/index_migration.ipynb

+2-3

@@ -206,9 +206,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from datashaper import NoopVerbCallbacks\n",
-"\n",
 "from graphrag.cache.factory import create_cache\n",
+"from graphrag.callbacks.noop_workflow_callbacks import NoopWorkflowCallbacks\n",
 "from graphrag.index.flows.generate_text_embeddings import generate_text_embeddings\n",
 "\n",
 "# We only need to re-run the embeddings workflow, to ensure that embeddings for all required search fields are in place\n",
@@ -220,7 +219,7 @@
 "config = workflow.config\n",
 "text_embed = config.get(\"text_embed\", {})\n",
 "embedded_fields = config.get(\"embedded_fields\", {})\n",
-"callbacks = NoopVerbCallbacks()\n",
+"callbacks = NoopWorkflowCallbacks()\n",
 "cache = create_cache(pipeline_config.cache, PROJECT_DIRECTORY)\n",
 "\n",
 "await generate_text_embeddings(\n",

docs/index/architecture.md

+2-26

@@ -8,33 +8,9 @@ In order to support the GraphRAG system, the outputs of the indexing engine (in
 This model is designed to be an abstraction over the underlying data storage technology, and to provide a common interface for the GraphRAG system to interact with.
 In normal use-cases the outputs of the GraphRAG Indexer would be loaded into a database system, and the GraphRAG's Query Engine would interact with the database using the knowledge model data-store types.
 
-### DataShaper Workflows
-
-GraphRAG's Indexing Pipeline is built on top of our open-source library, [DataShaper](https://github.com/microsoft/datashaper).
-DataShaper is a data processing library that allows users to declaratively express data pipelines, schemas, and related assets using well-defined schemas.
-DataShaper has implementations in JavaScript and Python, and is designed to be extensible to other languages.
-
-One of the core resource types within DataShaper is a [Workflow](https://github.com/microsoft/datashaper/blob/main/javascript/schema/src/workflow/WorkflowSchema.ts).
-Workflows are expressed as sequences of steps, which we call [verbs](https://github.com/microsoft/datashaper/blob/main/javascript/schema/src/workflow/verbs.ts).
-Each step has a verb name and a configuration object.
-In DataShaper, these verbs model relational concepts such as SELECT, DROP, JOIN, etc.. Each verb transforms an input data table, and that table is passed down the pipeline.
-
-```mermaid
----
-title: Sample Workflow
----
-flowchart LR
-    input[Input Table] --> select[SELECT] --> join[JOIN] --> binarize[BINARIZE] --> output[Output Table]
-```
-
-### LLM-based Workflow Steps
-
-GraphRAG's Indexing Pipeline implements a handful of custom verbs on top of the standard, relational verbs that our DataShaper library provides. These verbs give us the ability to augment text documents with rich, structured data using the power of LLMs such as GPT-4. We utilize these verbs in our standard workflow to extract entities, relationships, claims, community structures, and community reports and summaries. This behavior is customizable and can be extended to support many kinds of AI-based data enrichment and extraction tasks.
-
-### Workflow Graphs
+### Workflows
 
 Because of the complexity of our data indexing tasks, we needed to be able to express our data pipeline as series of multiple, interdependent workflows.
-In the GraphRAG Indexing Pipeline, each workflow may define dependencies on other workflows, effectively forming a directed acyclic graph (DAG) of workflows, which is then used to schedule processing.
 
 ```mermaid
 ---
@@ -55,7 +31,7 @@ stateDiagram-v2
 The primary unit of communication between workflows, and between workflow steps is an instance of `pandas.DataFrame`.
 Although side-effects are possible, our goal is to be _data-centric_ and _table-centric_ in our approach to data processing.
 This allows us to easily reason about our data, and to leverage the power of dataframe-based ecosystems.
-Our underlying dataframe technology may change over time, but our primary goal is to support the DataShaper workflow schema while retaining single-machine ease of use and developer ergonomics.
+Our underlying dataframe technology may change over time, but our primary goal is to support the workflow schema while retaining single-machine ease of use and developer ergonomics.
 
 ### LLM Caching
 
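
The dataframe-centric contract described in the edited passage is easy to picture: every workflow is a function from tables to a table, so workflows compose by feeding one's output DataFrame into the next. A minimal sketch under that framing, with hypothetical workflow names and a toy chunker (not the actual graphrag API):

```python
import pandas as pd

def split_documents(documents: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical workflow: split each document into fixed-size text units."""
    rows = [
        {"doc_id": doc.id, "text": doc.text[i : i + 20]}
        for doc in documents.itertuples(index=False)
        for i in range(0, len(doc.text), 20)
    ]
    return pd.DataFrame(rows)

def count_units(text_units: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical downstream workflow that depends on split_documents' output."""
    return text_units.groupby("doc_id", as_index=False).agg(units=("text", "count"))

# Workflows communicate only through DataFrames, as the passage above describes.
docs = pd.DataFrame({"id": ["a"], "text": ["GraphRAG indexes documents as tables."]})
print(count_units(split_documents(docs)))
```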
