This repository has been archived by the owner on Nov 18, 2023. It is now read-only.

Pre release upgrade (#33)
* Upgrading to use Grakn commit 760c46dc1f19d572c2abaa61fc5a16bf4ced4312

* Upgrades kglib to use Grakn commit 760c46dc1f19d572c2abaa61fc5a16bf4ced4312. This mostly required syntax changes and handling the temporary lack of limits on query result length

* Minor changes to printing confusion matrices and properly passes attribute label values

* Improves READMEs

* Fix for label_extraction_test
jmsfltchr authored Jan 25, 2019
1 parent 9699652 commit f171590
Showing 11 changed files with 53 additions and 50 deletions.
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
# Research
This repository is the centre of all research projects conducted at Grakn Labs. In particular, its focus is on the integration of machine learning with the Grakn knowledge graph.

Our first project is on [*Knowledge Graph Convolutional Networks* (KGCNs)](/kglib/kgcn).
At present this repo contains one project: [*Knowledge Graph Convolutional Networks* (KGCNs)](/kglib/kgcn).

35 changes: 19 additions & 16 deletions kglib/kgcn/README.md
@@ -1,45 +1,48 @@
# Knowledge Graph Convolutional Networks (KGCNs)

This project introduces a novel model: the Knowledge Graph Convolutional Network. The principal idea of this work is to build a bridge between knowledge graphs and machine learning. KGCNs can be used to create vector representations or *embeddings* of any labelled set of Grakn concepts. As a result, a KGCN can be trained directly for the classification or regression of Concepts stored in Grakn. Future work will include building embeddings via unsupervised learning.![KGCN Process](readme_images/KGCN_process.png)
This project introduces a novel model: the *Knowledge Graph Convolutional Network* (KGCN). The principal idea of this work is to forge a bridge between knowledge graphs and machine learning, using [Grakn](https://github.com/graknlabs/grakn) as the knowledge graph. A KGCN can be used to create vector representations, *embeddings*, of any labelled set of Grakn Concepts via supervised learning. As a result, a KGCN can be trained directly for the classification or regression of Concepts stored in Grakn. Future work will include building embeddings via unsupervised learning.![KGCN Process](readme_images/KGCN_process.png)



## Methodology

The ideology behind this project is described [here](https://blog.grakn.ai/knowledge-graph-convolutional-networks-machine-learning-over-reasoned-knowledge-9eb5ce5e0f68). The principles of the implementation are based on [GraphSAGE](http://snap.stanford.edu/graphsage/), from the Stanford SNAP group, made to work over a **knowledge graph**. Instead of working on a typical property graph, a KGCN learns from the context of a *typed hypergraph*, Grakn. Additionally, it learns from facts deduced by Grakn's *automated logical reasoner*. From this point on some understanding of [Grakn's docs](http://dev.grakn.ai) is assumed.
The ideology behind this project is described [here](https://blog.grakn.ai/knowledge-graph-convolutional-networks-machine-learning-over-reasoned-knowledge-9eb5ce5e0f68). The principles of the implementation are based on [GraphSAGE](http://snap.stanford.edu/graphsage/), from the Stanford SNAP group, made to work over a **knowledge graph**. Instead of working on a typical property graph, a KGCN learns from the context of a *typed hypergraph*, **Grakn**. Additionally, it learns from facts deduced by Grakn's *automated logical reasoner*. From this point onwards some understanding of [Grakn's docs](http://dev.grakn.ai) is assumed.

#### How does a KGCN work?
#### How do KGCNs work?

The purpose of this method is to derive embeddings for a set of Concepts (and thereby directly learn to classify them). We start by querying Grakn to find a set of examples with labels. Following that, we gather data about the neighbourhood of each example Concept. We do this by considering their *k-hop* neighbours.
The purpose of this method is to derive embeddings for a set of Concepts (and thereby directly learn to classify them). We start by querying Grakn to find a set of labelled examples. Following that, we gather data about the neighbourhood of each example Concept. We do this by considering their *k-hop* neighbours.

![Screenshot 2019-01-24 at 19.00.31](readme_images/k-hop_neighbours.png)We retrieve the data concerning this neighbourhood from Grakn. This includes information on the *types*, *roles*, and *attribute* values of each neighbour encountered.
![k-hop neighbours](readme_images/k-hop_neighbours.png)We retrieve the data concerning this neighbourhood from Grakn. This information includes the *type hierarchy*, *roles*, and *attribute* values of each neighbouring Concept encountered.

To create embeddings, we build a network in TensorFlow that successively aggregates and combines features from the K hops until a 'summary' representation remains - an embedding. In our example these embeddings are directly optimised to perform multi-class classification via a single subsequent dense layer and softmax cross entropy.
To create embeddings, we build a network in TensorFlow that successively aggregates and combines features from the K hops until a 'summary' representation remains - an embedding. In our example these embeddings are directly optimised to perform multi-class classification. This is achieved by passing the embeddings to a single subsequent dense layer and determining loss via softmax cross entropy with the labels retrieved.

![Screenshot 2019-01-24 at 19.03.08](readme_images/aggregate_and_combine.png)
![Aggregation and Combination process](readme_images/aggregate_and_combine.png)
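
As a rough illustration of the aggregate-and-combine step described above, here is a minimal sketch in plain Python rather than TensorFlow. The mean aggregator, the concatenation-based combine step, and every function name are assumptions made for illustration and are not taken from kglib. Learned weights and non-linearities are left out entirely.

```python
from typing import List

Vector = List[float]


def aggregate(neighbours: List[Vector]) -> Vector:
    """Mean-pool a list of neighbour feature vectors (an assumed aggregator)."""
    count = len(neighbours)
    return [sum(column) / count for column in zip(*neighbours)]


def combine(own: Vector, aggregated: Vector) -> Vector:
    """Concatenate a concept's own features with its aggregated neighbourhood."""
    return own + aggregated


def embed_two_hops(target: Vector,
                   one_hop: List[Vector],
                   two_hop: List[List[Vector]]) -> Vector:
    """Fold a 2-hop neighbourhood inwards, outermost hop first."""
    # Summarise each 1-hop neighbour together with its own neighbours.
    summarised = [combine(features, aggregate(children))
                  for features, children in zip(one_hop, two_hop)]
    # Fold those summaries into the target concept to get its embedding.
    return combine(target, aggregate(summarised))


embedding = embed_two_hops(
    target=[1.0],
    one_hop=[[2.0], [4.0]],
    two_hop=[[[1.0], [3.0]], [[5.0], [7.0]]],
)
print(embedding)
```

Working from the outermost hop inwards means each pass removes one hop, so after K passes only the single summary vector for the target concept remains.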



## Example - CITES Animal Trade Data
## Usage by example - CITES Animal Trade Data

#### Quickstart
### Quickstart

**Requirements:**

- Python 3.6.3 or higher

- kglib installed from pip: `pip install --extra-index-url https://test.pypi.org/simple/ grakn-kglib`
- The `animaltrade` dataset from the latest release. This is a dataset that has been pre-loaded into Grakn v1.5 (so you don't have to run the data import yourself), with two keyspaces: `animaltrade_train` and `animaltrade_test`.
- The `grakn-animaltrade.zip` dataset from the [latest release](https://github.com/graknlabs/kglib/releases/latest). This is a dataset that has been pre-loaded into Grakn v1.5 (so you don't have to run the data import yourself), with two keyspaces: `animaltrade_train` and `animaltrade_test`.

**To use:**

- Prepare the data:

- If you already have an instance of Grakn running, make sure to stop it using `./grakn server stop`

- Download the pre-loaded Grakn distribution from the [latest release](https://github.com/graknlabs/kglib/releases/latest)

- Unzip the pre-loaded Grakn + dataset from the latest release, the location you store this in doesn't matter
- Unzip the distribution `unzip grakn-animaltrade.zip `, where you store this doesn't matter

- `cd` into the dataset and start Grakn: `./grakn server start`
- cd into the distribution `cd grakn-animaltrade`

- start Grakn `./grakn server start`

- Confirm that the training keyspace is present and contains data

@@ -61,12 +64,12 @@ To create embeddings, we build a network in TensorFlow that successively aggrega

The CITES dataset details exchanges of animal-based products between countries. In this example we aim to predict the value of `appendix` for a set of samples. This `appendix` can be thought of as the level of endangerment that a `traded-item` is subject to, where `1` represents the highest level of endangerment, and `3` the lowest.

The `main` function will:
The [main](examples/animal_trade/main.py) function will:

- Search Grakn for 30 concepts (with an labels) to use as the training set, 30 for the evaluation set, and 30 for the prediction set using queries such as:
- Search Grakn for 30 concepts (with attributes as labels) to use as the training set, 30 for the evaluation set, and 30 for the prediction set using queries such as (limiting the returned stream):

```
match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix 1; limit 30; get;
match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix 1; get;
```

This searches for an `exchange` between countries that has an `appendix` (endangerment level) of `1`, and finds the `traded-item` that was exchanged
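
To make the query change concrete, the sketch below fills the new single-placeholder template once per label value. `EXAMPLES_QUERY` and `ATTRIBUTE_VALUES` are copied from `main.py` in this commit, while `build_queries` is a hypothetical helper introduced only for this example.

```python
# Template and label values as they appear in main.py after this commit.
EXAMPLES_QUERY = ('match $e(exchanged-item: $traded-item) isa exchange, '
                  'has appendix $appendix; $appendix {}; get;')
ATTRIBUTE_VALUES = [1, 2, 3]


def build_queries(template, values):
    """Hypothetical helper: fill the single `{}` placeholder once per label value."""
    return {value: template.format(value) for value in values}


queries = build_queries(EXAMPLES_QUERY, ATTRIBUTE_VALUES)
for value, query in queries.items():
    print(value, query)
```

Because the template no longer carries a `limit` clause, the number of answers consumed has to be bounded by the caller instead.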
11 changes: 6 additions & 5 deletions kglib/kgcn/examples/animal_trade/main.py
@@ -48,12 +48,12 @@
flags.DEFINE_integer('max_training_steps', 10000, 'Max number of gradient steps to take during gradient descent')

# Sample selection params
EXAMPLES_QUERY = 'match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix {}; ' \
'limit {}; get;'
EXAMPLES_QUERY = 'match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix {}; get;'
LABEL_ATTRIBUTE_TYPE = 'appendix'
ATTRIBUTE_VALUES = [1, 2, 3]
EXAMPLE_CONCEPT_TYPE = 'traded-item'

NUM_PER_CLASS = 30
NUM_PER_CLASS = 10
POPULATION_SIZE_PER_CLASS = 1000

# Params for persisting to files
@@ -125,8 +125,9 @@ def main(modes=(TRAIN, EVAL, PREDICT)):
PREDICT: {'sample_size': NUM_PER_CLASS, 'population_size': POPULATION_SIZE_PER_CLASS},
}
concepts, labels = samp_mgmt.compile_labelled_concepts(EXAMPLES_QUERY, EXAMPLE_CONCEPT_TYPE,
LABEL_ATTRIBUTE_TYPE, transactions[TRAIN],
transactions[PREDICT], sampling_params)
LABEL_ATTRIBUTE_TYPE, ATTRIBUTE_VALUES,
transactions[TRAIN], transactions[PREDICT],
sampling_params)
prs.save_labelled_concepts(KEYSPACES, concepts, labels, SAVED_LABELS_PATH)

samp_mgmt.delete_all_labels_from_keyspaces(transactions, LABEL_ATTRIBUTE_TYPE)
6 changes: 3 additions & 3 deletions kglib/kgcn/examples/animal_trade/schema.gql
@@ -139,11 +139,11 @@ define
relates member-item,
relates taxonomic-group;

taxonomic-ranking sub rule,
taxonomic-ranking
when {
(super-taxon: $a, sub-taxon: $b) isa taxonomic-hierarchy;
(super-taxon: $b, sub-taxon: $c) isa taxonomic-hierarchy;
}
},
then {
(super-taxon: $a, sub-taxon: $c) isa taxonomic-hierarchy;
};
@@ -152,7 +152,7 @@ define
when {
(member-item: $a, taxonomic-group: $taxon) isa taxon-membership;
(sub-taxon: $taxon, super-taxon: $super) isa taxonomic-hierarchy;
}
},
then {
(member-item: $a, taxonomic-group: $super) isa taxon-membership;
};
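
For readers less familiar with Graql rules, the `taxonomic-ranking` rule above expresses transitivity: if `$a` is a super-taxon of `$b` and `$b` is a super-taxon of `$c`, then `$a` is a super-taxon of `$c`. The Python sketch below only illustrates which facts such a rule entails. It is an assumption-laden toy, with hypothetical names and sample taxa, and it does not reflect how Grakn's reasoner evaluates rules.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (super-taxon, sub-taxon)


def infer_taxonomic_hierarchy(facts: Set[Pair]) -> Set[Pair]:
    """Repeatedly apply the transitivity rule until no new pairs appear."""
    inferred = set(facts)
    while True:
        new_pairs = {(a, c)
                     for (a, b) in inferred
                     for (b2, c) in inferred
                     if b == b2} - inferred
        if not new_pairs:
            return inferred
        inferred |= new_pairs


stored = {('animalia', 'chordata'),
          ('chordata', 'mammalia'),
          ('mammalia', 'felidae')}
closure = infer_taxonomic_hierarchy(stored)
print(sorted(closure - stored))
```

The pairs printed are the ones that were never stored explicitly, which is the kind of deduced fact a KGCN can then pick up as neighbourhood context.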
16 changes: 8 additions & 8 deletions kglib/kgcn/management/samples.py
@@ -31,14 +31,13 @@ def query_for_random_samples_with_attribute(tx, query, example_var_name, attribu
labels = {}

for a in attribute_vals:
target_concept_query = query.format(a, population_size)
target_concept_query = query.format(a)

extractor = label_extraction.ConceptLabelExtractor(target_concept_query,
(example_var_name, collections.OrderedDict(
[(attribute_var_name, attribute_vals)])),
sampling_method=random.random_sample
)
concepts_with_labels = extractor(tx, sample_size_per_label)
sampling_method=random.random_sample)
concepts_with_labels = extractor(tx, sample_size_per_label, population_size)
if len(concepts_with_labels) == 0:
raise RuntimeError(f'Couldn\'t find any concepts to match target query "{target_concept_query}"')

@@ -51,12 +50,13 @@ def query_for_random_samples_with_attribute(tx, query, example_var_name, attribu
return concepts, labels


def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_name, train_and_eval_transaction,
predict_transaction, sampling_params):
def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_name, attribute_values,
train_and_eval_transaction, predict_transaction, sampling_params):
"""
Assumes the case that data is partitioned into 2 keyspaces, one for training and evaluation, and another for
prediction on unseen data (with labels). Therefore this function draws training and evaluation samples from the
same keyspace.
:param attribute_values: The label (attribute) values to draw samples for, e.g. `[1, 2, 3]`
:param samples_query: Query to use to select possible samples
:param concept_var_name: The variable used for the example concepts within the `samples_query`
:param attribute_var_name: The variable used for the samples' labels (attributes) within the `samples_query`
@@ -71,7 +71,7 @@ def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_nam
print(' for training and evaluation')
concepts_dicts, labels_dicts = \
query_for_random_samples_with_attribute(train_and_eval_transaction, samples_query,
concept_var_name, attribute_var_name, [1, 2, 3],
concept_var_name, attribute_var_name, attribute_values,
sampling_params['train']['sample_size'] +
sampling_params['eval']['sample_size'],
sampling_params['train']['population_size'] +
@@ -81,7 +81,7 @@ def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_nam
query_for_random_samples_with_attribute(predict_transaction,
samples_query,
concept_var_name,
attribute_var_name, [1, 2, 3],
attribute_var_name, attribute_values,
sampling_params['predict']['sample_size'],
sampling_params['predict']['population_size'])

4 changes: 0 additions & 4 deletions kglib/kgcn/models/downstream.py
@@ -142,8 +142,6 @@ def train(self, feed_dict):
print(f'\n-----')
print(f'Step {step}')
print(f'Loss: {loss_value:.2f}')
print(f'Confusion Matrix:')
print(confusion_matrix)
metrics.report_multiclass_metrics(labels_winners_values, predictions_class_winners_values)
print("========= Training Complete =========\n\n")

@@ -157,8 +155,6 @@ def eval(self, feed_dict):
self._labels_winners])

print(f'Loss: {loss_value:.2f}')
print(f'Confusion Matrix:')
print(confusion_matrix)
metrics.report_multiclass_metrics(labels_winners_values, predictions_class_winners_values)
print("========= Evaluation Complete =========\n\n")

2 changes: 1 addition & 1 deletion kglib/kgcn/neighbourhood/data/executor.py
@@ -37,7 +37,7 @@ class TraversalExecutor:
}

ATTRIBUTE_QUERY = {
'query': 'match $thing id {} has attribute $attribute; get $attribute;',
'query': 'match $thing id {}, has attribute $attribute; get $attribute;',
'variable': 'attribute'
}

2 changes: 1 addition & 1 deletion kglib/kgcn/neighbourhood/data/executor_test.py
@@ -120,7 +120,7 @@ def setUp(self):

class TestTraversalExecutorFromDateAttribute(BaseTestTraversalExecutor.TestTraversalExecutor):

query = "match $attribute isa date-started 2015-11-12T00:00; limit 1; get;"
query = "match $attribute isa date-started; $attribute 2015-11-12T00:00; get;"
var = 'attribute'
roles = ['has']
num_results = 1
12 changes: 7 additions & 5 deletions kglib/kgcn/test_data/schema.gql
@@ -19,17 +19,19 @@

define

name sub attribute datatype string;
name sub attribute,
datatype string;
job-title sub name;
date-started sub attribute datatype date;
date-started sub attribute,
datatype date;

ownership sub relationship,
relates owner,
relates property;

organisation sub entity,
plays member,
plays group,
plays organisational-group,
plays property,
plays owner,
plays party,
@@ -51,11 +53,11 @@ affiliation sub relationship,

membership sub affiliation,
relates member as party,
relates group as party;
relates organisational-group as party;

employment sub affiliation,
relates employee as member,
relates employer as group,
relates employer as organisational-group,
has job-title;

project sub entity,
6 changes: 3 additions & 3 deletions kglib/kgcn/use_cases/attribute_prediction/label_extraction.py
@@ -16,7 +16,7 @@
# specific language governing permissions and limitations
# under the License.
#

import itertools
import typing as typ
import kglib.kgcn.neighbourhood.data.sampling.random_sampling as random

@@ -29,9 +29,9 @@ def __init__(self, query: str, attribute_vars_config: typ.Tuple[str, typ.Mutable
self._attribute_vars_config = attribute_vars_config
self._query = query

def __call__(self, tx, sample_size):
def __call__(self, tx, sample_size, population_size):

response = tx.query(self._query)
response = itertools.islice(tx.query(self._query), population_size)
sampled_responses = self._sampling_method(response, sample_size)
owner_var = self._attribute_vars_config[0]
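
The change above moves the population cap out of the query string and into the client, using `itertools.islice` on the answer stream. The standalone sketch below shows that pattern under stated assumptions: `fake_answers` stands in for the iterator returned by `tx.query(...)`, and the standard library's `random.Random.sample` stands in for kglib's own sampling method, whose exact behaviour is not shown in this diff.

```python
import itertools
import random


def fake_answers():
    """Hypothetical stand-in for a lazy, unbounded stream of query answers."""
    index = 0
    while True:
        yield f'answer-{index}'
        index += 1


def sample_from_stream(stream, sample_size, population_size, seed=None):
    """Cap the stream at `population_size` items, then sample from that cap."""
    # islice stops pulling from the stream once the cap is reached.
    population = list(itertools.islice(stream, population_size))
    rng = random.Random(seed)
    # Guard against streams that yield fewer items than requested.
    return rng.sample(population, min(sample_size, len(population)))


samples = sample_from_stream(fake_answers(), sample_size=5, population_size=20, seed=0)
print(samples)
```

Slicing before sampling keeps memory bounded even when the query itself returns an arbitrarily long stream, at the cost of only ever sampling from the first `population_size` answers.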

Original file line number Diff line number Diff line change
@@ -85,19 +85,19 @@ def get(variable):
def test_output_format_as_expected(self):

concept_label_extractor = label_extraction.ConceptLabelExtractor(self._query, self._vars_config)
concepts_with_labels = concept_label_extractor(self._grakn_tx, 5)
concepts_with_labels = concept_label_extractor(self._grakn_tx, 5, 5)

expected_output = [(self._person_mock, {'age_var': [66], 'gender_var': [0, 1]})]
self.assertListEqual(expected_output, concepts_with_labels)

def test_get_called_for_each_attribute_variable(self):
concept_label_extractor = label_extraction.ConceptLabelExtractor(self._query, self._vars_config)
concept_label_extractor(self._grakn_tx, 5)
concept_label_extractor(self._grakn_tx, 5, 5)
self._answer_mock.get.assert_has_calls([mock.call('x'), mock.call('age_var'), mock.call('gender_var')])

def test_value_called_for_each_attribute(self):
concept_label_extractor = label_extraction.ConceptLabelExtractor(self._query, self._vars_config)
concept_label_extractor(self._grakn_tx, 5)
concept_label_extractor(self._grakn_tx, 5, 5)
self._mock_age.value.assert_called_once()
self._mock_gender.value.assert_called_once()
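
The tests above rely on `unittest.mock` call assertions. A minimal, self-contained sketch of the same pattern follows. `read_variables` and the mock's return values are hypothetical and are not part of kglib; only the `mock.Mock`, `mock.call`, and `assert_has_calls` usage mirrors the tests.

```python
from unittest import mock


def read_variables(answer, variables):
    """Hypothetical stand-in for code under test: look up each variable on an answer."""
    return [answer.get(variable) for variable in variables]


answer_mock = mock.Mock()
answer_mock.get.side_effect = lambda variable: f'concept-for-{variable}'

values = read_variables(answer_mock, ['x', 'age_var', 'gender_var'])

# Passes only if these calls occurred consecutively, in this order.
answer_mock.get.assert_has_calls(
    [mock.call('x'), mock.call('age_var'), mock.call('gender_var')])
```

If the call order were different, `assert_has_calls` would raise an `AssertionError`, which is what makes it suitable for checking that each attribute variable is read from the answer.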

