The relation extracted by the Chinese prompt template is incorrect. #756

Closed
2 tasks
yangxue-1 opened this issue Jul 29, 2024 · 3 comments
Labels
community_support Issue handled by community members

Comments

yangxue-1 commented Jul 29, 2024

Is there an existing issue for this?

  • I have searched the existing issues
  • I have checked #657 to validate if my issue is covered by community support

Describe the issue

Question 1:
After translating the content of the four prompt templates into Chinese, the relationships extracted by the model come back in the wrong format.

Example 1:

("entity"<|>"新闻单位"<|>"media"<|>"指从事新闻采集、编辑和发布的媒体机构,如报纸、杂志、广播电台、电视台等。")##
("entity"<|>"出版单位"<|>"media"<|>"负责书籍、期刊、电子书等出版发行的机构。")##
("entity"<|>"广播电台"<|>"media"<|>"通过无线电波传播声音信息的媒体平台,通常用于新闻播报、音乐播放和节目主持。")##
("entity"<|>"电视台"<|>"media"<|>"利用电视信号传输图像与声音信息的媒体平台,提供新闻、娱乐、教育等各类节目内容。")##
("relationship"<|>"新闻单位"<|>"进行道路交通安全教育"<|>"新闻单位有义务通过报道和宣传来促进公众对道路交通安全的认识和理解。"<|>10)##
("relationship"<|>"出版单位"<|>"进行道路交通安全教育"<|>"出版物可以通过文章、书籍等形式传播道路交通安全知识,提高公众意识。"<|>10)##
("relationship"<|>""<|>"进行道路交通安全教育"<|>"通过广播节目向听众普及交通安全法规和常识。"<|>10)##
("relationship"<|>"电视台"<|>"进行道路交通安全教育"<|>"电视节目可以制作专题、访谈等形式,深入讲解交通安全知识与案例分析。"<|>10)##

Example 2:
("entity"<|>"车辆专用的或者与其相类似的标志图案"<|>"objects"<|>"指用于特定用途(如警车、消防车等)的车辆标识图案。")##
("entity"<|>"警报器"<|>"objects"<|>"一种发出高音警报声音以引起注意或警示的设备。")##
("entity"<|>"标志灯具"<|>"objects"<|>"用于提供视觉警示,通常在夜间或低能见度条件下使用,具有特定颜色和闪烁模式的灯具。")##
("relationship"<|>"上述车辆"<|>"vehicles"<|>"指包括但不限于警车、消防车等执行特殊任务的机动车辆。它们可能配备有专用标志图案、警报器或标志灯具以辅助其功能。"<|>10)##
("legal clause No."<|>"第二十五条"<|>"规定了全国实行统一的道路交通信号,包括交通信号灯、交通标志、交通标线和交通警察的指挥,并强调这些设施应符合安全畅通的要求及国家标准。")##
("legal clause No."<|>"第五十四条"<|>"允许道路养护车辆、工程作业车在进行作业时不受常规交通标志、标线限制,但要求过往车辆和人员注意避让以确保安全。")##
("acts"<|>"洒水"<|>"指对路面进行湿润或清洁的活动,可能与特定的道路维护或清洁任务相关联。")
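The problems in the two examples above can be checked mechanically. Below is a minimal sketch, not GraphRAG's own parser: it assumes records are separated by `##`, fields by `<|>`, that an entity record has four fields (tag, name, type, description) and a relationship record five (tag, source, target, description, strength). The function names `parse_records` and `validate` are hypothetical, and the sample descriptions are abbreviated to `...`.

```python
TUPLE_DELIMITER = "<|>"   # assumed field separator, as seen in the output above
RECORD_DELIMITER = "##"   # assumed record separator, as seen in the output above


def parse_records(raw: str) -> list[list[str]]:
    """Split raw model output into records, each a list of unquoted fields."""
    records = []
    for chunk in raw.split(RECORD_DELIMITER):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Drop the surrounding parentheses, then split on the tuple delimiter.
        if chunk.startswith("(") and chunk.endswith(")"):
            chunk = chunk[1:-1]
        records.append([f.strip().strip('"') for f in chunk.split(TUPLE_DELIMITER)])
    return records


def validate(records: list[list[str]]) -> list[tuple[int, str]]:
    """Return (record index, message) pairs for records that break the assumed shape."""
    problems = []
    entity_names = {r[1] for r in records if r[0] == "entity" and len(r) == 4}
    for i, r in enumerate(records):
        tag = r[0]
        if tag == "entity":
            if len(r) != 4:
                problems.append((i, "entity record should have 4 fields"))
            elif not all(r[1:]):
                problems.append((i, "entity record has an empty field"))
        elif tag == "relationship":
            if len(r) != 5:
                problems.append((i, "relationship record should have 5 fields"))
                continue
            for role, name in (("source", r[1]), ("target", r[2])):
                if not name:
                    problems.append((i, f"{role} is empty"))
                elif name not in entity_names:
                    problems.append((i, f"{role} {name!r} is not an extracted entity"))
        else:
            problems.append((i, f"unknown record tag {tag!r}"))
    return problems


# Four records taken from the examples above, descriptions abbreviated.
SAMPLE = (
    '("entity"<|>"新闻单位"<|>"media"<|>"...")##\n'
    '("relationship"<|>"新闻单位"<|>"进行道路交通安全教育"<|>"..."<|>10)##\n'
    '("relationship"<|>""<|>"进行道路交通安全教育"<|>"..."<|>10)##\n'
    '("legal clause No."<|>"第二十五条"<|>"...")##\n'
)

records = parse_records(SAMPLE)
problems = validate(records)
for index, message in problems:
    print(index, message)
```

Run against the sample, this flags the relationship whose target is an action phrase rather than an entity, the relationship with an empty source, and the record that uses an entity type (`legal clause No.`) as its tag.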

Question 2:
Which entity types are supported? How do I change them or declare new ones?

Example:
("entity"<|>"五千元罚款"<|>"legal clause No."<|>"指根据法律规定的对违法行为进行经济惩罚的方式之一。")##
("entity"<|>"构成犯罪"<|>"legal clause No."<|>"指行为触犯了刑法,应承担刑事责任。")##

Steps to reproduce

No response

GraphRAG Config Used

# Paste your config here

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: qwen2
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  api_base: http://10.108.246.106:11434/v1
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ollama
    type: openai_embedding # or azure_openai_embedding
    model: qwen2
    api_base: http://10.108.246.106:8080
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  batch_size: 1 # the number of documents to send in a single request
  # batch_max_tokens: 8192 # the maximum number of tokens to send in a single request
    # target: required # or optional
  


chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,vehicles,objects,acts,events,documents,legal clause No.,media,standard,legal term,attributes]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Logs and screenshots

No response

Additional Information

  • GraphRAG Version: 0.1.1
  • Operating System: Ubuntu 22.04
  • Python Version: 3.10
  • Related Issues: qwen2:7b
yangxue-1 added the triage label (Default label assignment, indicates new issue needs reviewed by a maintainer) on Jul 29, 2024
@KylinMountain
Contributor

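Take another look at your example: is it possibly wrong? Why does every record look like this? Are you sure this is what is actually inside your prompt? Shouldn't it contain {tuple_delimiter} and {record_delimiter}? This looks more like text that has already been substituted at run time.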

("entity"<|>"新闻单位"<|>"media"<|>"指从事新闻采集、编辑和发布的媒体机构,如报纸、杂志、广播电台、电视台等。")##

@yangxue-1
Author

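> Take another look at your example: is it possibly wrong? Why does every record look like this? Are you sure this is what is actually inside your prompt? Shouldn't it contain {tuple_delimiter} and {record_delimiter}? This looks more like text that has already been substituted at run time.
>
> ("entity"<|>"新闻单位"<|>"media"<|>"指从事新闻采集、编辑和发布的媒体机构,如报纸、杂志、广播电台、电视台等。")##

Right, this is the answer qwen2 returned when queried. Its format differs somewhat from the format given in the prompt.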
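The distinction raised in this exchange, between a prompt template that still holds placeholders and text in which they have already been replaced, can be illustrated with a small sketch. It is an assumption-laden illustration, not GraphRAG's implementation: the placeholder spellings `{tuple_delimiter}` and `{record_delimiter}`, the default values `<|>` and `##`, the template line, and both helper functions are hypothetical and modelled only on the records quoted in this thread.

```python
# Hypothetical template line, shaped like the entity records quoted in this thread.
TEMPLATE_LINE = (
    '("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>'
    "{tuple_delimiter}<entity_description>){record_delimiter}"
)


def render(template: str, tuple_delimiter: str = "<|>", record_delimiter: str = "##") -> str:
    """Replace the two placeholders with concrete delimiters."""
    return (
        template.replace("{tuple_delimiter}", tuple_delimiter)
        .replace("{record_delimiter}", record_delimiter)
    )


def looks_already_rendered(text: str, tuple_delimiter: str = "<|>") -> bool:
    """Heuristic: no placeholder left, but the concrete delimiter is present."""
    return "{tuple_delimiter}" not in text and tuple_delimiter in text


rendered = render(TEMPLATE_LINE)
print(rendered)
print(looks_already_rendered(TEMPLATE_LINE), looks_already_rendered(rendered))
```

Text that passes `looks_already_rendered` is model output or a substituted prompt, which is how the records posted in this issue appear.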

@natoverse
Collaborator

Please see #657 and #696

natoverse closed this as not planned (won't fix, can't repro, duplicate, stale) on Jul 29, 2024
natoverse added the community_support label (Issue handled by community members) and removed the triage label (Default label assignment, indicates new issue needs reviewed by a maintainer) on Jul 29, 2024