148 changes: 148 additions & 0 deletions eventbridge-scheduled-stepfunction-bedrock-kb-sync/README.md
@@ -0,0 +1,148 @@
# Bedrock Knowledge Base Synchronization Flow with EventBridge Scheduler

This pattern demonstrates an automated synchronization process for Amazon Bedrock Knowledge Bases using EventBridge Scheduler and Step Functions. The solution enables periodic synchronization of data sources, ensuring your Knowledge Base stays up-to-date with the latest content.

Learn more about this pattern at Serverless Land Patterns: https://serverlessland.com/patterns/eventbridge-scheduled-stepfunction-bedrock-kb-sync


Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the [AWS Pricing page](https://aws.amazon.com/pricing/) for details. You are responsible for any AWS costs incurred. No warranty is implied in this example.

## Architecture
![Architecture diagram](docs/images/KBSyncPipeline.jpg)

## Requirements

* [Create an AWS account](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html) if you do not already have one and log in. The IAM user that you use must have sufficient permissions to make necessary AWS service calls and manage AWS resources.
* [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html) installed and configured
* [Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) installed
* [AWS CDK](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html) installed

## Deployment Instructions

1. Create a new directory, navigate to that directory in a terminal and clone the GitHub repository:
```
git clone https://github.com/aws-samples/serverless-patterns
```
2. Change directory to the pattern directory:
```
cd serverless-patterns/eventbridge-scheduled-stepfunction-bedrock-kb-sync/cdk
```
3. Set up the local developer environment and dependencies:
```
make bootstrap-venv
source .venv/bin/activate
```
4. From the command line, configure AWS CDK:
```bash
cdk bootstrap
```
5. From the command line, use AWS CDK to deploy the AWS resources for the pattern:
```bash
cdk deploy --all
```
6. This command takes some time to run. After it completes successfully, the following stacks are deployed:
```
KbRoleStack
CommonLambdaLayerStack
OSSStack
KbSyncPipelineStack
KbInfraStack
```

## How it works

This pattern is a scheduled, serverless workflow that automates the synchronization of Amazon Bedrock Knowledge Bases using Amazon EventBridge Scheduler, AWS Step Functions, and Amazon Bedrock.

Key Components:

a) EventBridge Scheduler
- Runs every 15 minutes
- Triggers the Step Functions workflow
- Passes the Bedrock Knowledge Base ID as an input parameter
- Enables consistent and automated synchronization
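
As a rough illustration of the schedule described above, the sketch below builds the request parameters for the EventBridge Scheduler `CreateSchedule` API. The schedule name, the ARNs, and the `knowledgeBaseId` input key are placeholders chosen for this example and are not taken from the pattern's CDK code.

```python
import json


def build_schedule_request(knowledge_base_id, state_machine_arn, role_arn,
                           name="kb-sync-every-15-min"):
    """Build keyword arguments for EventBridge Scheduler's CreateSchedule API.

    The schedule name and the "knowledgeBaseId" input key are assumptions
    made for this sketch; the deployed stack may use different values.
    """
    return {
        "Name": name,
        # Rate expression matching the 15-minute interval described above.
        "ScheduleExpression": "rate(15 minutes)",
        # Disable the flexible window so the schedule fires at the set rate.
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": state_machine_arn,   # state machine to start
            "RoleArn": role_arn,        # role the scheduler assumes
            # JSON payload handed to the state machine as execution input.
            "Input": json.dumps({"knowledgeBaseId": knowledge_base_id}),
        },
    }


# Placeholder identifiers for illustration only.
request = build_schedule_request(
    knowledge_base_id="KB1234567890",
    state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:example",
    role_arn="arn:aws:iam::123456789012:role/example-scheduler-role",
)
print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("scheduler").create_schedule(**request)`.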

b) Step Functions Workflow

Main Flow:
- Receives the Knowledge Base ID from EventBridge
- Orchestrates the entire synchronization process
- Handles error scenarios and retries
- Manages parallel processing of multiple data sources

Step 1: Data Source Retrieval
- Queries all associated data sources for the given Knowledge Base ID
- Prepares the list for processing
- Validates data source configurations

Step 2: Map State for Parallel Processing
- Iterates through each data source
- Processes multiple data sources concurrently
- Manages state for each sync operation

Step 3: Synchronization Process (for each data source)
- Initiates the sync operation
- Monitors sync status
- Handles completion and failures
- Reports sync results

Step 4: Status Reporting
- Aggregates sync results
- Records success/failure metrics
- Generates summary reports
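
The steps above can be approximated in plain Python against the `bedrock-agent` API. This is only a sketch of the logic, not the state machine deployed by this pattern: it processes data sources one after another rather than in parallel as a Map state would, it reads only the first page of data sources, and the function name and polling interval are assumptions made for illustration.

```python
import time

# Ingestion job statuses after which polling can stop.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def sync_knowledge_base(client, knowledge_base_id, poll_seconds=30,
                        sleep=time.sleep):
    """Start an ingestion job per data source and wait for each to finish.

    `client` is expected to behave like boto3.client("bedrock-agent").
    Returns a dict mapping data source ID to its final job status.
    """
    # Step 1: retrieve the data sources attached to the knowledge base
    # (first page only; pagination via nextToken is omitted here).
    sources = client.list_data_sources(
        knowledgeBaseId=knowledge_base_id
    )["dataSourceSummaries"]

    results = {}
    # Step 2: iterate over the data sources (sequentially in this sketch).
    for source in sources:
        data_source_id = source["dataSourceId"]

        # Step 3: start the sync and poll until it reaches a terminal status.
        job = client.start_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
        )["ingestionJob"]
        while job["status"] not in TERMINAL_STATUSES:
            sleep(poll_seconds)
            job = client.get_ingestion_job(
                knowledgeBaseId=knowledge_base_id,
                dataSourceId=data_source_id,
                ingestionJobId=job["ingestionJobId"],
            )["ingestionJob"]

        # Step 4: record the outcome for the summary.
        results[data_source_id] = job["status"]
    return results
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, which is useful when trying out the flow without AWS credentials.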

## Testing

Step 1: Upload Sample Documents to S3
- Navigate to Amazon S3 in AWS Console
- Locate the bucket named kb-data-source-{account-id}
- Upload your sample documents to this bucket
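
As an alternative to the console, the upload could be scripted. The sketch below assumes the bucket naming shown above (`kb-data-source-{account-id}`) and a client that behaves like `boto3.client("s3")`; the helper names and the flat key layout (file name as object key) are illustrative choices, not part of the pattern.

```python
from pathlib import Path


def data_source_bucket(account_id):
    """Return the bucket name used by this pattern for a given account ID."""
    return f"kb-data-source-{account_id}"


def upload_documents(s3_client, account_id, folder):
    """Upload every file directly inside `folder` to the data source bucket.

    `s3_client` is expected to behave like boto3.client("s3").
    Returns the list of object keys that were uploaded.
    """
    bucket = data_source_bucket(account_id)
    uploaded = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # Use the file name as the object key (an assumption of this sketch).
            s3_client.upload_file(str(path), bucket, path.name)
            uploaded.append(path.name)
    return uploaded


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   upload_documents(boto3.client("s3"), "123456789012", "./sample-docs")
```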

Step 2: Wait for Scheduler Execution
- The EventBridge scheduler is configured to run every 15 minutes
- You can monitor the scheduler in EventBridge console
- Note: The next execution will occur at the next 15-minute interval

Step 3: Monitor Step Function Execution
- Navigate to AWS Step Functions console
- Find your state machine execution
- Monitor the workflow progress through different states
- Verify successful completion of all steps
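
The same check can be made programmatically. This minimal sketch reads the status of the most recent execution through the Step Functions `ListExecutions` API, which returns the newest execution first. The function name is hypothetical, and the state machine ARN must be looked up from the deployed stack.

```python
def latest_execution_status(sfn_client, state_machine_arn):
    """Return the status of the most recent execution, or None if there is none.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    """
    response = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        maxResults=1,  # only the newest execution is needed
    )
    executions = response["executions"]
    if not executions:
        return None
    # Typical values include RUNNING, SUCCEEDED, FAILED, TIMED_OUT, ABORTED.
    return executions[0]["status"]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   print(latest_execution_status(boto3.client("stepfunctions"), "<state-machine-arn>"))
```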

Step 4: Verify Sync Status in Bedrock
- Go to Amazon Bedrock console
- Navigate to Knowledge Bases
- Select your Knowledge Base
- Click on Data Sources
- Check the Sync History tab
- Verify the sync status shows as "Completed"
- Review sync details including:
  - Timestamp of sync
  - Number of documents processed
  - Any errors or warnings
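
The sync history can also be inspected through the `bedrock-agent` API. The sketch below fetches the most recent ingestion job for one data source and extracts the details listed above. Note that the API reports a finished job as `COMPLETE`, which may be worded differently in the console. The function name and the shape of the returned dictionary are assumptions made for this example.

```python
def latest_sync_summary(client, knowledge_base_id, data_source_id):
    """Summarize the most recent ingestion job for a data source.

    `client` is expected to behave like boto3.client("bedrock-agent").
    Returns None when no ingestion job has run yet.
    """
    response = client.list_ingestion_jobs(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        # Newest job first, so the first summary is the latest sync.
        sortBy={"attribute": "STARTED_AT", "order": "DESCENDING"},
        maxResults=1,
    )
    jobs = response["ingestionJobSummaries"]
    if not jobs:
        return None

    job = jobs[0]
    stats = job.get("statistics", {})
    return {
        "status": job["status"],
        "started_at": job.get("startedAt"),
        "documents_scanned": stats.get("numberOfDocumentsScanned"),
        "documents_failed": stats.get("numberOfDocumentsFailed"),
    }
```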


Step 5: Validation Points
- Confirm documents are indexed
- Check sync completion status
- Verify no errors in sync history
- Ensure all uploaded documents are processed

Troubleshooting

If the sync fails or documents aren't appearing:
- Check S3 bucket permissions
- Review Step Functions execution logs
- Verify document format compatibility
- Check Knowledge Base configuration

![KB Pipeline Architecture](docs/images/KBSyncPipeline.png)

## Delete stack

```bash
cdk destroy --all
```
----
Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.

SPDX-License-Identifier: MIT-0
184 changes: 184 additions & 0 deletions eventbridge-scheduled-stepfunction-bedrock-kb-sync/cdk/.gitignore
@@ -0,0 +1,184 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# UV
# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
#uv.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Ruff stuff:
.ruff_cache/

# PyPI configuration file
.pypirc

# CDK asset staging directory
.cdk.staging
cdk.out

# Misc
unittests.xml

.coverage
cov.xml