MK-SQuIT: Synthesizing Questions using Iterative Template-Filling

MK-SQuIT is a synthetic dataset containing English and SPARQL query pairs. The question-query pairs are assembled with very little human intervention, sidestepping the tedious and costly process of hand-labeling data. A neural machine translation model can then be trained on such a dataset, allowing lay users to access information-rich knowledge graphs without an understanding of query syntax.

For further details, see our publication: MK-SQuIT: Synthesizing Questions using Iterative Template-filling

This repository contains all the tools needed to synthetically generate question/query pairs (Text2Sparql) from a given knowledge base:

  • Generation Pipeline: mk_squit/generation
  • Example Data: data
  • Generated Dataset: out
  • Baseline Model Finetuning and Evaluation: model
  • Metrics: mk_squit/metrics
  • Example Entity Resolver: mk_squit/entity_resolver

Data Generation

The code in this repository, while easily adapted, is engineered for a specific dataset (WikiData), query language (SPARQL), natural language (English), and set of question/query types. Generation of data can be reproduced with the following steps:

1. Saving Raw Data: By default, the pipeline expects entity data to be saved to files named "*-5k.json" and properties to "*-props.json".

python -m scripts.gather_wikidata --data-dir data

2. Preprocessing: The data must be cleaned and annotated before being fed into the pipeline.

python -m scripts.preprocess \
    --data-dir ./data \
    --ent-id *-5k.json \
    --prop-id *-props.json \
    --num-examples-to-generate 10

Several files are generated to handle critical roles:

  • *-5k.json -> *-5k-preprocessed.json Entities typically have a label and alternative labels. All labels are cleaned and grouped into a single listed field.
  • *-props.json -> *-props-preprocessed.json Similar to entities, property labels and alternative labels are grouped. Each label is then converted into a part-of-speech tag (POS-tag) for coherent mapping within a template (ex. "set in location" -> "VERB-NOUN"). Lastly, a type field is added in the format [domain]->, which must then be annotated with the target type to read [domain]->[type].
  • pos-examples.txt Samples of part-of-speech tags are sorted by number of occurrences within the raw data. This is an optional file used to determine which POS-tags are of importance for template generation.
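The label-grouping step described above can be pictured with a minimal sketch. The field names (`label`, `aliases`, `labels`) and the cleaning rules here are assumptions for illustration; the schema actually written by scripts.preprocess may differ.

```python
def group_labels(record):
    """Merge a record's label and aliases into one cleaned, de-duplicated list."""
    # Hypothetical field names; not taken from the repository.
    raw = [record.get("label", "")] + list(record.get("aliases", []))
    seen = set()
    labels = []
    for text in raw:
        cleaned = " ".join(text.split()).lower()  # collapse whitespace, lowercase
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            labels.append(cleaned)
    return {"id": record.get("id"), "labels": labels}


example = {"id": "P50", "label": "author", "aliases": ["Writer ", "creator", "author"]}
grouped = group_labels(example)
```

The primary label stays first in the list, so downstream code can still treat it as the preferred form.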

3. Annotation of Types: Each *-props-preprocessed.json file contains a list of JSON objects, one per property. Each property has a type field that must be annotated before proceeding to the next step. Ex: "type": "{domain}->" must be modified to "type": "{domain}->{type}".

The type specifies the general category the property falls into. Properties "location of" and "location at" could be categorized as "location" whereas "built during" and "created at" could be categorized as "time". Typing is subjective to a certain extent, but it allows the pipeline to string together much more coherent statements. For a list of types that we use, refer to the WH_LABEL_DICT within scripts/generate_type_list.py, which maps a type to a question prefix.

Modifying WH_LABEL_DICT within scripts/generate_type_list.py may be necessary if additional types are required. This dictionary maps a type to its question type. For example, asking about a type of "genre" would typically require "what" - "What is the genre of that movie?". Asking about a "person" would use "who" - "Who is that person?".
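As a rough sketch of how an annotated type can drive the question prefix, the snippet below completes a "{domain}->" field and looks up a question word. `EXAMPLE_WH_LABELS`, `annotate_type`, and `question_prefix` are hypothetical names, and the "where"/"when" entries are assumptions; the real WH_LABEL_DICT may be structured differently.

```python
# Hypothetical stand-in for a type -> question-prefix mapping;
# the real WH_LABEL_DICT in scripts/generate_type_list.py may differ.
EXAMPLE_WH_LABELS = {
    "person": "who",
    "genre": "what",
    "location": "where",  # assumed, not stated in this README
    "time": "when",       # assumed, not stated in this README
}


def annotate_type(prop, target_type):
    """Complete an unannotated '{domain}->' type field with a target type."""
    if not prop["type"].endswith("->"):
        raise ValueError(f"type already annotated: {prop['type']}")
    return {**prop, "type": prop["type"] + target_type}


def question_prefix(prop):
    """Look up the question word for the type on the right of '->'."""
    target_type = prop["type"].split("->", 1)[1]
    return EXAMPLE_WH_LABELS.get(target_type, "what")  # fall back to "what"


prop = {"label": "set in location", "type": "television_series->"}
annotated = annotate_type(prop, "location")
prefix = question_prefix(annotated)
```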

While the annotation of generic types and question prefixes requires some manual effort, it substantially improves the coherence of the generated queries.

4. Generate Type List: Consolidate property types, start domains, and type metadata into a type-list-autogenerated.json file.

python -m scripts.generate_type_list \
    --data-dir ./data \
    --prop-id *-props-preprocessed.json

The data is now ready to be fed into the pipeline. You should have the following files:

  • Entity data: *-5k-preprocessed.json
  • Property data: *-props-preprocessed.json
  • Part-of-Speech examples (optional): pos-examples.txt
  • Type metadata list: type-list-autogenerated.json

5. Generating the Questions and Queries:

Generating datasets with the code in its current form is very simple:

python -m mk_squit.generation.full_query_generator \
    --data-dir data \
    --prop-id *-props-preprocessed.json \
    --ent-id *-5k-preprocessed.json \
    --out-dir out

This code will generate a 100k training set and a 5k easy testing set in the out directory.

The code synthetically generates questions and queries using multiple layers of question/query templating. First, a baseline question template is generated from a Context-Free Grammar (CFG). Second, the baseline template is numbered according to the order of the predicates/arguments in the logical form of the template (the numbering functions are responsible for figuring this out). Third, the numbered template is ontologically typed so that when predicates and arguments (a.k.a. entities and properties) are sampled, their types do not conflict. Lastly, the predicates and arguments are sampled and inserted into the question template and into a SPARQL query template based on the numbered order of the items in the numbered and typed template. This process is explained more thoroughly in the paper.
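The typing and filling stages can be illustrated with a toy sketch. It skips the CFG and numbering stages entirely and is not the repository's implementation; the property IDs and labels mirror the Getica example in this README, while the `domain` and `type` values and all function names are assumptions.

```python
import random

# Toy typed properties: each maps a domain type to a target type.
# IDs and labels follow the README's Getica example; the types are assumed.
PROPERTIES = [
    {"id": "P50", "label": "creator", "domain": "literary_work", "type": "person"},
    {"id": "P2048", "label": "height", "domain": "person", "type": "number"},
]
ENTITIES = [{"label": "Getica", "type": "literary_work"}]


def sample_chain(entity, length, rng):
    """Sample properties so each one's domain matches the previous item's type."""
    chain, current = [], entity["type"]
    for _ in range(length):
        options = [p for p in PROPERTIES if p["domain"] == current]
        if not options:
            raise ValueError(f"no property with domain {current!r}")
        prop = rng.choice(options)
        chain.append(prop)
        current = prop["type"]
    return chain


def fill_templates(entity, chain):
    """Insert the sampled items into a question template and a SPARQL template."""
    *inner, last = chain
    owner = entity["label"]
    for prop in inner:
        owner = f"{owner}'s {prop['label']}"  # nest possessives in path order
    question = f"What is the {last['label']} of {owner}?"
    path = " / ".join(f"wdt:{p['id']}" for p in chain)
    query = f"SELECT ?end WHERE {{ [ {entity['label']} ] {path} ?end . }}"
    return question, query


chain = sample_chain(ENTITIES[0], 2, random.Random(0))
question, query = fill_templates(ENTITIES[0], chain)
```

Because each sampled property must accept the type produced by the previous step, the chain cannot pair, say, "height" directly with a literary work.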

6. Generating Test Hard dataset:

The Test Hard dataset is a variation of the Test Easy dataset that has deeper and fuzzier baseline template productions and an exclusive chemical domain. To generate this dataset, some code needs to be commented and uncommented. Look through generation/full_query_generator.py, generation/template_filler.py, and generation/template_generator.py and uncomment the sections of code annotated with a TEST_HARD comment. You might need to comment out some other code after uncommenting these sections; for example, in generation/template_generator.py you should comment out lines 86-88 and uncomment lines 89-91. After making these changes, run the same generation code as in step 5:

python -m mk_squit.generation.full_query_generator

Data Format

The generator produces files in the following format:

| english | sparql | unique hash |
| --- | --- | --- |
| What is the height of Getica's creator? | SELECT ?end WHERE { [ Getica ] wdt:P50 / wdt:P2048 ?end . } | 0ea54cd5187baf7239c8f2023ae33bb3001c5a49 |

Customization

Each stage can be modified:

1. Using a custom dataset

Difficulty: Low-Mid

The entities and predicates of the triplestore database are required. Each entity needs an entity type, ID, label, and any label aliases, and each predicate needs a predicate type, ID, label, and any label aliases. Using a typed database is critical, as the rules used to generate queries leverage semantic knowledge of entities and predicates. For this reason, some manual annotation of predicate types is required.
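One way to picture the minimum information a custom dataset must supply is the sketch below. The `Record` class, its field names, and the placeholder entity ID are hypothetical; the JSON layout the pipeline actually reads may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """An entity or predicate with the four pieces of information listed above."""
    id: str
    label: str
    type: str
    aliases: list = field(default_factory=list)


def missing_fields(record):
    """Return the names of required fields that are empty."""
    return [name for name in ("id", "label", "type") if not getattr(record, name)]


# "Q_PLACEHOLDER" is a made-up ID used only for illustration.
entity = Record(id="Q_PLACEHOLDER", label="Getica", type="literary_work")
predicate = Record(id="P50", label="author", type="", aliases=["creator"])
problems = missing_fields(predicate)
```

A check like this flags predicates whose type has not yet been annotated by hand.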

2. Choosing a different query language

Difficulty: Low

The exact syntax of the generated queries can be modified in mk_squit/generation/template_filler.py. The construct_query_pair() function and the fill_*_ent_query() functions would need modification to accommodate changes in syntax.

The code in this repository is designed to generate queries in SPARQL using some syntactic sugar:

SELECT ?end WHERE { [ Marcelo Bielsa ] wdt:P3448 / wdt:P31 ?end . }

Note that the entity labels have not been converted into their IDs. We leave this problem of entity resolution to downstream processes in the translation pipeline.
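A downstream resolver could, for instance, swap each bracketed label for an ID. The sketch below assumes a simple exact-match lookup table; `LABEL_TO_ID` and `resolve_entities` are hypothetical names, the ID is a placeholder rather than a real Wikidata ID, and this is not the resolver shipped in mk_squit/entity_resolver.

```python
import re

# Hypothetical label -> ID table; the ID is a placeholder, not a real Wikidata ID.
LABEL_TO_ID = {"marcelo bielsa": "Q_PLACEHOLDER"}

# Matches the "[ label ]" slots used in the generated queries.
ENTITY_SLOT = re.compile(r"\[ (.+?) \]")


def resolve_entities(query, label_to_id):
    """Replace each '[ label ]' slot with 'wd:<ID>'; leave unknown labels untouched."""
    def substitute(match):
        entity_id = label_to_id.get(match.group(1).strip().lower())
        return f"wd:{entity_id}" if entity_id else match.group(0)

    return ENTITY_SLOT.sub(substitute, query)


raw = "SELECT ?end WHERE { [ Marcelo Bielsa ] wdt:P3448 / wdt:P31 ?end . }"
resolved = resolve_entities(raw, LABEL_TO_ID)
```

Leaving unresolved slots in place makes failures easy to detect, since any remaining "[ ... ]" marks a label that still needs an ID.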

3. Choosing a different output natural language

Difficulty: High

This work leverages the natural predicate-argument structure of the English language to generate corresponding questions and queries. This method is perhaps generalizable to other languages that have a similar predicate-argument structure, but would be difficult to generalize to languages where that structure is less syntactically constrained. Almost all relevant code would be in mk_squit/generation/template_generator.py.

Consider generating questions in English and then translating them over to another natural language using an off-the-shelf machine translation model.

4. Designing new question/query types

Difficulty: Low-Mid

Question types are defined by context-free grammars (CFGs) located in mk_squit/generation/template_generator.py. The type_template() and number_*_ent() functions may need to be modified to accommodate novel semantic constructions.

Query types, which correspond to question types, are defined by the fill_*_query() functions in mk_squit/generation/template_filler.py. Novel semantic constructions may also need to be accommodated in construct_query_pair().
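To give a feel for how a CFG can define a question type, here is a toy grammar with a small recursive expander. The grammar, the `[prop]` and `[ent]` slot markers, and the `expand` function are all hypothetical and much simpler than what template_generator.py defines.

```python
import random

# Toy grammar (hypothetical): nonterminals map to lists of productions.
# The first production of each nonterminal is non-recursive.
GRAMMAR = {
    "S": [["WH", "is", "the", "CHAIN", "?"]],
    "WH": [["what"], ["who"]],
    "CHAIN": [["[prop]", "of", "[ent]"], ["[prop]", "of", "the", "CHAIN"]],
}


def expand(symbol, rng, depth=0, max_depth=3):
    """Recursively expand a symbol into a list of terminal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal token or slot marker
    productions = GRAMMAR[symbol]
    # Force the non-recursive production once the depth limit is reached.
    production = productions[0] if depth >= max_depth else rng.choice(productions)
    tokens = []
    for part in production:
        tokens.extend(expand(part, rng, depth + 1, max_depth))
    return tokens


template = " ".join(expand("S", random.Random(0)))
```

Adding a new question type in this toy setting would mean adding productions to the grammar, after which the numbering, typing, and filling stages would need to handle the new slot arrangement.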

The code in this repository implements three question types:

  1. single_entity: What was the nationality of Michael Jordan?

  2. multi_entity: Is Michael Jordan the friend of Roger Rabbit?

  3. count: How many sons does Michael Jordan have?

Citation

@article{mk-squit,
	title = {MK-SQuIT: Synthesizing Questions using Iterative Template-filling},
	author = {Benjamin A. Spiegel and Vincent Cheong and James E. Kaplan and Anthony Sanchez},
	journal = {arXiv preprint arXiv:2011.02566},
	year = {2020},
}