feat: Implement RAG feature #11
Merged
2 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| import json | ||
| import chromadb | ||
| import google.generativeai as genai | ||
| from sentence_transformers import SentenceTransformer | ||
|
|
||
| # --- Configuration --- | ||
| SCENARIOS_PATH = "./data/scenarios.json" | ||
| DB_PATH = "./data/vector_db" | ||
| COLLECTION_NAME = "talky_scenarios" | ||
| EMBEDDING_MODEL_NAME = "all-MiniLM-L12-v2" # SBERT model | ||
|
|
||
| # --- Global variables (singletons) --- | ||
| _db_client = None | ||
| _scenario_collection = None | ||
| _embedding_model = None | ||
|
|
||
| def get_embedding_model(): | ||
| """Load the embedding model once and reuse it (singleton).""" | ||
| global _embedding_model | ||
| if _embedding_model is None: | ||
| _embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME) | ||
| return _embedding_model | ||
|
|
||
| def get_db(): | ||
| """Initialize the ChromaDB client and collection once and reuse them.""" | ||
| global _db_client, _scenario_collection | ||
| if _db_client is None: | ||
| _db_client = chromadb.PersistentClient(path=DB_PATH) | ||
| _scenario_collection = _db_client.get_or_create_collection(name=COLLECTION_NAME) | ||
| return _scenario_collection | ||
|
|
||
| def build_database(): | ||
| """ | ||
| Read scenarios.json and build the vector database in ChromaDB. | ||
| Run this once initially, or again whenever the data is updated. | ||
| """ | ||
| collection = get_db() | ||
|
|
||
| # Skip the build if the DB already contains data, to avoid duplicates | ||
| if collection.count() > 0: | ||
| print(f"{collection.count()} entries already exist. Skipping the build.") | ||
| return | ||
|
|
||
| print("Reading scenarios.json and starting the database build...") | ||
|
|
||
| with open(SCENARIOS_PATH, "r", encoding="utf-8") as f: | ||
| scenarios = json.load(f) | ||
|
|
||
| model = get_embedding_model() | ||
|
|
||
| # Prepare the data | ||
| ids = [s["scenario_id"] for s in scenarios] | ||
| documents = [s["embedding_text"] for s in scenarios] | ||
| metadatas = [{"category": s["category"], "task": s["task"]} for s in scenarios] | ||
|
|
||
| # Generate embeddings with the SBERT model | ||
| embeddings = model.encode(documents, convert_to_numpy=True).tolist() | ||
|
|
||
| # Add to the DB | ||
| collection.add( | ||
| ids=ids, | ||
| embeddings=embeddings, | ||
| documents=[json.dumps(s["content"]) for s in scenarios], # store content as the document | ||
| metadatas=metadatas | ||
| ) | ||
|
|
||
| print(f"Database build complete! Added {collection.count()} scenarios in total.") | ||
|
|
||
| # Build the DB when this file is run directly | ||
| if __name__ == '__main__': | ||
| build_database() |
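
The build step above reads five fields from each record: `scenario_id`, `embedding_text`, `category`, `task`, and `content`. As a minimal sketch of that shape, the helper below splits a scenario list into the parallel lists that `collection.add` receives (minus the embeddings) and fails early when a field is missing. `prepare_records` is a hypothetical name, and the sample record is invented purely for illustration; neither is part of this PR. It uses only the standard library, so it runs without ChromaDB or the SBERT model.

```python
import json

REQUIRED_FIELDS = ("scenario_id", "embedding_text", "category", "task", "content")


def prepare_records(scenarios):
    """Split scenario dicts into ids, embedding texts, metadatas, and documents."""
    for index, scenario in enumerate(scenarios):
        missing = [field for field in REQUIRED_FIELDS if field not in scenario]
        if missing:
            raise ValueError(f"scenario at index {index} is missing fields: {missing}")

    ids = [s["scenario_id"] for s in scenarios]
    texts = [s["embedding_text"] for s in scenarios]  # the text that gets embedded
    metadatas = [{"category": s["category"], "task": s["task"]} for s in scenarios]
    # Assumption: ensure_ascii=False keeps non-ASCII text readable in the stored
    # document; the PR itself calls json.dumps with its defaults.
    documents = [json.dumps(s["content"], ensure_ascii=False) for s in scenarios]
    return ids, texts, metadatas, documents


# Invented sample record, for illustration only.
sample = [{
    "scenario_id": "cafe_order_01",
    "embedding_text": "ordering a drink at a cafe",
    "category": "daily_life",
    "task": "order a coffee",
    "content": {"dialogue": ["Hi, what can I get you?", "An iced latte, please."]},
}]

ids, texts, metadatas, documents = prepare_records(sample)
print(ids, metadatas)
```

Note that `build_database` returns early whenever `collection.count() > 0`, so as written, rerunning it after editing `scenarios.json` would not pick up the changes unless the collection is emptied first.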
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| import json | ||
| from rag.database import get_db, get_embedding_model | ||
|
|
||
| def retrieve_scenario(query_text: str): | ||
| """ | ||
| Take the user's input (query_text) and return the single most similar scenario. | ||
| """ | ||
| if not query_text.strip(): | ||
| return None | ||
|
|
||
| collection = get_db() | ||
| model = get_embedding_model() | ||
|
|
||
| # 1. Embed the query | ||
| query_embedding = model.encode(query_text, convert_to_numpy=True).tolist() | ||
|
|
||
| # 2. Query ChromaDB | ||
| results = collection.query( | ||
| query_embeddings=[query_embedding], | ||
| n_results=1 | ||
| ) | ||
|
|
||
| # 3. Parse and return the result | ||
| if results and results['ids'][0]: | ||
| scenario_id = results['ids'][0][0] | ||
| # Parse the content JSON string stored in the document back into an object | ||
| retrieved_content = json.loads(results['documents'][0][0]) | ||
| print(f"✅ RAG succeeded: retrieved scenario '{scenario_id}'.") | ||
| return retrieved_content | ||
| else: | ||
| print("🟡 RAG: no similar scenario found.") | ||
| return None |
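
Because the query asks for `n_results=1`, the nearest stored item comes back whenever the collection is non-empty, so the "no similar scenario" branch above is effectively reached only when nothing is stored. A minimal sketch of one way to add a relevance cutoff follows. `parse_top_hit` and `max_distance` are hypothetical names, not part of this PR; the `results` dict is hand-built to mimic the nested-list shape (one inner list per query) that the code above indexes, and the threshold values are arbitrary.

```python
import json
from typing import Any, Optional


def parse_top_hit(results: dict, max_distance: Optional[float] = None) -> Any:
    """Return the parsed content of the nearest hit, or None if no hit is acceptable."""
    if not results or not results.get("ids") or not results["ids"][0]:
        return None

    if max_distance is not None:
        distances = results.get("distances")
        # Apply the cutoff only when distances are present in the result.
        if distances and distances[0] and distances[0][0] > max_distance:
            return None

    # The document holds the scenario content serialized as a JSON string.
    return json.loads(results["documents"][0][0])


# Hand-built stand-in for a query result (all values are invented).
fake_results = {
    "ids": [["cafe_order_01"]],
    "documents": [['{"dialogue": ["Hi, what can I get you?"]}']],
    "distances": [[0.42]],
}

print(parse_top_hit(fake_results))                      # no cutoff: the hit is returned
print(parse_top_hit(fake_results, max_distance=0.3))    # hit is farther than the cutoff
print(parse_top_hit({"ids": [[]], "documents": [[]]}))  # empty result
```

A suitable cutoff depends on the distance metric the collection is configured with and on the embedding model, so any value would need to be tuned against real queries.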
Fixed: `example_dialogue_str` is now initialized to a default value (e.g., "없음", meaning "none") before the `if retrieved_scenario:` block to prevent a NameError.
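
The pattern described in this comment can be sketched as follows. The surrounding function is not shown in this view, so `build_prompt`, its parameters, and the prompt wording are assumptions made for illustration; only `retrieved_scenario` and `example_dialogue_str` come from the comment.

```python
import json


def build_prompt(user_input, retrieved_scenario=None):
    """Hypothetical prompt builder showing the default-before-branch pattern."""
    # Assign a default first so the name exists even when the branch is skipped.
    example_dialogue_str = "None"

    if retrieved_scenario:
        example_dialogue_str = json.dumps(retrieved_scenario, ensure_ascii=False, indent=2)

    return (
        f"User input: {user_input}\n"
        f"Example dialogue: {example_dialogue_str}"
    )


print(build_prompt("I want to practice ordering coffee"))
```

Without the default, a call in which no scenario is retrieved would reach the f-string with `example_dialogue_str` unbound. Inside a function that raises `UnboundLocalError`, which is a subclass of `NameError`.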