Pre release upgrade (#33)

* Upgrading to use Grakn commit 760c46dc1f19d572c2abaa61fc5a16bf4ced4312
* Upgrades kglib to use Grakn commit 760c46dc1f19d572c2abaa61fc5a16bf4ced4312, mostly requiring syntax changes and the temporary lack of limits to query result length
* Minor changes to printing confusion matrices and properly passes attribute label values
* Improves READMEs
* Fix for label_extraction_test
typedb · Jan 25, 2019 · f171590
1 parent 9699652
commit f171590
Showing 11 changed files with 53 additions and 50 deletions.
diff --git a/README.md b/README.md
@@ -1,4 +1,5 @@
 # Research
 This repository is the centre of all research projects conducted at Grakn Labs. In particular, its focus is on the integration of machine learning with the Grakn knowledge graph.
 
-Our first project is on [*Knowledge Graph Convolutional Networks* (KGCNs)](/kglib/kgcn).
+At present this repo contains one project: [*Knowledge Graph Convolutional Networks* (KGCNs)](/kglib/kgcn).
+
diff --git a/kglib/kgcn/README.md b/kglib/kgcn/README.md
@@ -1,45 +1,48 @@
 # Knowledge Graph Convolutional Networks (KGCNs)
 
-This project introduces a novel model: the Knowledge Graph Convolutional Network. The principal idea of this work is to build a bridge between knowledge graphs and machine learning. KGCNs can be used to create vector representations or *embeddings* of any labelled set of Grakn concepts. As a result, a KGCN can be trained directly for the classification or regression of Concepts stored in Grakn. Future work will include building embeddings via unsupervised learning.![KGCN Process](readme_images/KGCN_process.png)
+This project introduces a novel model: the *Knowledge Graph Convolutional Network* (KGCN). The principal idea of this work is to forge a bridge between knowledge graphs and machine learning, using [Grakn](https://github.com/graknlabs/grakn) as the knowledge graph. A KGCN can be used to create vector representations, *embeddings*, of any labelled set of Grakn Concepts via supervised learning. As a result, a KGCN can be trained directly for the classification or regression of Concepts stored in Grakn. Future work will include building embeddings via unsupervised learning.![KGCN Process](readme_images/KGCN_process.png)
 
 
 
 ## Methodology
 
-The ideology behind this project is described [here](https://blog.grakn.ai/knowledge-graph-convolutional-networks-machine-learning-over-reasoned-knowledge-9eb5ce5e0f68). The principles of the implementation are based on [GraphSAGE](http://snap.stanford.edu/graphsage/), from the Stanford SNAP group, made to work over a **knowledge graph**. Instead of working on a typical property graph, a KGCN learns from the context of a *typed hypergraph*, Grakn. Additionally, it learns from facts deduced by Grakn's *automated logical reasoner*. From this point on some understanding of [Grakn's docs](http://dev.grakn.ai) is assumed.
+The ideology behind this project is described [here](https://blog.grakn.ai/knowledge-graph-convolutional-networks-machine-learning-over-reasoned-knowledge-9eb5ce5e0f68). The principles of the implementation are based on [GraphSAGE](http://snap.stanford.edu/graphsage/), from the Stanford SNAP group, made to work over a **knowledge graph**. Instead of working on a typical property graph, a KGCN learns from the context of a *typed hypergraph*, **Grakn**. Additionally, it learns from facts deduced by Grakn's *automated logical reasoner*. From this point onwards some understanding of [Grakn's docs](http://dev.grakn.ai) is assumed.
 
-#### How does a KGCN work?
+#### How do KGCNs work?
 
-The purpose of this method is to derive embeddings for a set of Concepts (and thereby directly learn to classify them). We start by querying Grakn to find a set of examples with labels. Following that, we gather data about the neighbourhood of each example Concept. We do this by considering their *k-hop* neighbours.
+The purpose of this method is to derive embeddings for a set of Concepts (and thereby directly learn to classify them). We start by querying Grakn to find a set of labelled examples. Following that, we gather data about the neighbourhood of each example Concept. We do this by considering their *k-hop* neighbours.
 
-![Screenshot 2019-01-24 at 19.00.31](readme_images/k-hop_neighbours.png)We retrieve the data concerning this neighbourhood from Grakn. This includes information on the *types*, *roles*, and *attribute* values of each neighbour encountered.
+![k-hop neighbours](readme_images/k-hop_neighbours.png)We retrieve the data concerning this neighbourhood from Grakn. This information includes the *type hierarchy*, *roles*, and *attribute* values of each neighbouring Concept encountered.
 
-To create embeddings, we build a network in TensorFlow that successively aggregates and combines features from the K hops until a 'summary' representation remains - an embedding. In our example these embeddings are directly optimised to perform multi-class classification via a single subsequent dense layer and softmax cross entropy.
+To create embeddings, we build a network in TensorFlow that successively aggregates and combines features from the K hops until a 'summary' representation remains - an embedding. In our example these embeddings are directly optimised to perform multi-class classification. This is achieved by passing the embeddings to a single subsequent dense layer and determining loss via softmax cross entropy with the labels retrieved.
 
-![Screenshot 2019-01-24 at 19.03.08](readme_images/aggregate_and_combine.png)
+![Aggregation and Combination process](readme_images/aggregate_and_combine.png)
 
 
 
-## Example - CITES Animal Trade Data
+## Usage by example - CITES Animal Trade Data
 
-#### Quickstart
+### Quickstart
 
 **Requirements:**
 
 - Python 3.6.3 or higher
-
 - kglib installed from pip: `pip install --extra-index-url https://test.pypi.org/simple/ grakn-kglib`
-- The `animaltrade` dataset from the latest release. This is a dataset that has been pre-loaded into Grakn v1.5 (so you don't have to run the data import yourself), with two keyspaces: `animaltrade_train` and `animaltrade_test`.
+- The `grakn-animaltrade.zip` dataset from the [latest release](https://github.com/graknlabs/kglib/releases/latest). This is a dataset that has been pre-loaded into Grakn v1.5 (so you don't have to run the data import yourself), with two keyspaces: `animaltrade_train` and `animaltrade_test`.
 
 **To use:**
 
 - Prepare the data:
 
   - If you already have an instance of Grakn running, make sure to stop it using `./grakn server stop`
+
+  - Download the pre-loaded Grakn distribution from the [latest release](https://github.com/graknlabs/kglib/releases/latest)
 
-  - Unzip the pre-loaded Grakn + dataset from the latest release, the location you store this in doesn't matter
+  - Unzip the distribution with `unzip grakn-animaltrade.zip`; where you store it doesn't matter
 
-  - `cd` into the dataset and start Grakn: `./grakn server start`
+  - `cd` into the distribution: `cd grakn-animaltrade`
+
+  - Start Grakn: `./grakn server start`
 
   - Confirm that the training keyspace is present and contains data 
 
@@ -61,12 +64,12 @@ To create embeddings, we build a network in TensorFlow that successively aggrega
 
 The CITES dataset details exchanges of animal-based products between countries. In this example we aim to predict the value of `appendix` for a set of samples. This `appendix` can be thought of as the level of endangerment that a `traded-item` is subject to, where `1` represents the highest level of endangerment, and `3` the lowest.
 
-The `main` function will:
+The [main](examples/animal_trade/main.py) function will:
 
-- Search Grakn for 30 concepts (with an labels) to use as the training set, 30 for the evaluation set, and 30 for the prediction set using queries such as:
+- Search Grakn for 30 concepts (with attributes as labels) to use as the training set, 30 for the evaluation set, and 30 for the prediction set using queries such as (limiting the returned stream):
 
   ```
-  match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix 1; limit 30; get;
+  match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix 1; get;
   ```
 
   This searches for an `exchange` between countries that has an `appendix` (endangerment level) of `1`, and finds the `traded-item` that was exchanged
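
The README's methodology section describes aggregating neighbour features, combining them with a concept's own features, and then scoring the resulting embedding with a dense layer and softmax cross entropy. The following is a rough pure-Python sketch of that flow for a single hop. It is not the kglib implementation (which is built in TensorFlow with learned weights): every function name and all weights below are hypothetical, and mean aggregation with concatenation is just one GraphSAGE-style choice assumed here for illustration.

```python
import math


def mean_aggregate(neighbour_features):
    """Element-wise mean over a non-empty list of equal-length feature vectors."""
    n = len(neighbour_features)
    return [sum(column) / n for column in zip(*neighbour_features)]


def combine(own_features, aggregated):
    """Concatenate a concept's own features with its aggregated neighbourhood."""
    return list(own_features) + list(aggregated)


def dense(vector, weights, biases):
    """One dense layer; `weights` holds one row per output class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax_cross_entropy(logits, label_index):
    """Cross entropy between softmax(logits) and a one-hot label."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label_index]


# Toy 1-hop example: one concept with two neighbours and three classes.
# The weights are made up; a real model would learn them by gradient descent.
embedding = combine([1.0, 0.0], mean_aggregate([[0.0, 2.0], [2.0, 4.0]]))
logits = dense(embedding, weights=[[0.1] * 4, [0.2] * 4, [0.3] * 4], biases=[0.0] * 3)
loss = softmax_cross_entropy(logits, label_index=2)
print(embedding, loss)
```

For K hops the aggregate-and-combine step would be applied repeatedly, working inwards from the outermost neighbours, before the dense layer is applied once at the end.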

diff --git a/kglib/kgcn/examples/animal_trade/main.py b/kglib/kgcn/examples/animal_trade/main.py
@@ -48,12 +48,12 @@
 flags.DEFINE_integer('max_training_steps', 10000, 'Max number of gradient steps to take during gradient descent')
 
 # Sample selection params
-EXAMPLES_QUERY = 'match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix {}; ' \
-                 'limit {}; get;'
+EXAMPLES_QUERY = 'match $e(exchanged-item: $traded-item) isa exchange, has appendix $appendix; $appendix {}; get;'
 LABEL_ATTRIBUTE_TYPE = 'appendix'
+ATTRIBUTE_VALUES = [1, 2, 3]
 EXAMPLE_CONCEPT_TYPE = 'traded-item'
 
-NUM_PER_CLASS = 30
+NUM_PER_CLASS = 10
 POPULATION_SIZE_PER_CLASS = 1000
 
 # Params for persisting to files
@@ -125,8 +125,9 @@ def main(modes=(TRAIN, EVAL, PREDICT)):
                 PREDICT: {'sample_size': NUM_PER_CLASS, 'population_size': POPULATION_SIZE_PER_CLASS},
             }
             concepts, labels = samp_mgmt.compile_labelled_concepts(EXAMPLES_QUERY, EXAMPLE_CONCEPT_TYPE,
-                                                                   LABEL_ATTRIBUTE_TYPE, transactions[TRAIN],
-                                                                   transactions[PREDICT], sampling_params)
+                                                                   LABEL_ATTRIBUTE_TYPE, ATTRIBUTE_VALUES,
+                                                                   transactions[TRAIN], transactions[PREDICT],
+                                                                   sampling_params)
             prs.save_labelled_concepts(KEYSPACES, concepts, labels, SAVED_LABELS_PATH)
 
             samp_mgmt.delete_all_labels_from_keyspaces(transactions, LABEL_ATTRIBUTE_TYPE)
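
With `limit {}` dropped from `EXAMPLES_QUERY`, the template now takes a single placeholder: the attribute value. A small sketch of how one query per class could be produced from `ATTRIBUTE_VALUES`; the `build_queries` helper is hypothetical and only mirrors the `query.format(a)` loop in `samples.py`.

```python
EXAMPLES_QUERY = ('match $e(exchanged-item: $traded-item) isa exchange, '
                  'has appendix $appendix; $appendix {}; get;')
ATTRIBUTE_VALUES = [1, 2, 3]


def build_queries(template, attribute_values):
    """Hypothetical helper: build one query string per label value."""
    return {value: template.format(value) for value in attribute_values}


queries = build_queries(EXAMPLES_QUERY, ATTRIBUTE_VALUES)
for value, query in queries.items():
    print(value, query)
```

Since the query no longer bounds its own result size, the population cap has to be enforced by whoever consumes the answer stream.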

diff --git a/kglib/kgcn/examples/animal_trade/schema.gql b/kglib/kgcn/examples/animal_trade/schema.gql
@@ -139,11 +139,11 @@ define
             relates member-item,
             relates taxonomic-group;
 
-    taxonomic-ranking sub rule,
+    taxonomic-ranking
         when {
             (super-taxon: $a, sub-taxon: $b) isa taxonomic-hierarchy;
             (super-taxon: $b, sub-taxon: $c) isa taxonomic-hierarchy;
-        }
+        },
         then {
             (super-taxon: $a, sub-taxon: $c) isa taxonomic-hierarchy;
         };
@@ -152,7 +152,7 @@ define
         when {
             (member-item: $a, taxonomic-group: $taxon) isa taxon-membership;
             (sub-taxon: $taxon, super-taxon: $super) isa taxonomic-hierarchy;
-        }
+        },
         then {
             (member-item: $a, taxonomic-group: $super) isa taxon-membership;
         };
diff --git a/kglib/kgcn/management/samples.py b/kglib/kgcn/management/samples.py
@@ -31,14 +31,13 @@ def query_for_random_samples_with_attribute(tx, query, example_var_name, attribu
     labels = {}
 
     for a in attribute_vals:
-        target_concept_query = query.format(a, population_size)
+        target_concept_query = query.format(a)
 
         extractor = label_extraction.ConceptLabelExtractor(target_concept_query,
                                                            (example_var_name, collections.OrderedDict(
                                                                [(attribute_var_name, attribute_vals)])),
-                                                           sampling_method=random.random_sample
-                                                           )
-        concepts_with_labels = extractor(tx, sample_size_per_label)
+                                                           sampling_method=random.random_sample)
+        concepts_with_labels = extractor(tx, sample_size_per_label, population_size)
         if len(concepts_with_labels) == 0:
             raise RuntimeError(f'Couldn\'t find any concepts to match target query "{target_concept_query}"')
 
@@ -51,12 +50,13 @@ def query_for_random_samples_with_attribute(tx, query, example_var_name, attribu
     return concepts, labels
 
 
-def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_name, train_and_eval_transaction,
-                              predict_transaction, sampling_params):
+def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_name, attribute_values,
+                              train_and_eval_transaction, predict_transaction, sampling_params):
     """
     Assumes the case that data is partitioned into 2 keyspaces, one for training and evaluation, and another for
     prediction on unseen data (with labels). Therefore this function draws training and evaluation samples from the
     same keyspace.
+    :param attribute_values: The label (attribute) values to draw samples for, e.g. [1, 2, 3]
     :param samples_query: Query to use to select possible samples
     :param concept_var_name: The variable used for the example concepts within the `samples_query`
     :param attribute_var_name: The variable used for the samples' labels (attributes) within the `samples_query`
@@ -71,7 +71,7 @@ def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_nam
     print('    for training and evaluation')
     concepts_dicts, labels_dicts = \
         query_for_random_samples_with_attribute(train_and_eval_transaction, samples_query,
-                                                concept_var_name, attribute_var_name, [1, 2, 3],
+                                                concept_var_name, attribute_var_name, attribute_values,
                                                 sampling_params['train']['sample_size'] +
                                                 sampling_params['eval']['sample_size'],
                                                 sampling_params['train']['population_size'] +
@@ -81,7 +81,7 @@ def compile_labelled_concepts(samples_query, concept_var_name, attribute_var_nam
         query_for_random_samples_with_attribute(predict_transaction,
                                                 samples_query,
                                                 concept_var_name,
-                                                attribute_var_name, [1, 2, 3],
+                                                attribute_var_name, attribute_values,
                                                 sampling_params['predict']['sample_size'],
                                                 sampling_params['predict']['population_size'])
 

diff --git a/kglib/kgcn/models/downstream.py b/kglib/kgcn/models/downstream.py
@@ -142,8 +142,6 @@ def train(self, feed_dict):
                 print(f'\n-----')
                 print(f'Step {step}')
                 print(f'Loss: {loss_value:.2f}')
-                print(f'Confusion Matrix:')
-                print(confusion_matrix)
                 metrics.report_multiclass_metrics(labels_winners_values, predictions_class_winners_values)
         print("========= Training Complete =========\n\n")
 
@@ -157,8 +155,6 @@ def eval(self, feed_dict):
                  self._labels_winners])
 
         print(f'Loss: {loss_value:.2f}')
-        print(f'Confusion Matrix:')
-        print(confusion_matrix)
         metrics.report_multiclass_metrics(labels_winners_values, predictions_class_winners_values)
         print("========= Evaluation Complete =========\n\n")
 

diff --git a/kglib/kgcn/neighbourhood/data/executor.py b/kglib/kgcn/neighbourhood/data/executor.py
@@ -37,7 +37,7 @@ class TraversalExecutor:
     }
 
     ATTRIBUTE_QUERY = {
-        'query': 'match $thing id {} has attribute $attribute; get $attribute;',
+        'query': 'match $thing id {}, has attribute $attribute; get $attribute;',
         'variable': 'attribute'
     }
 

diff --git a/kglib/kgcn/neighbourhood/data/executor_test.py b/kglib/kgcn/neighbourhood/data/executor_test.py
@@ -120,7 +120,7 @@ def setUp(self):
 
 class TestTraversalExecutorFromDateAttribute(BaseTestTraversalExecutor.TestTraversalExecutor):
 
-    query = "match $attribute isa date-started 2015-11-12T00:00; limit 1; get;"
+    query = "match $attribute isa date-started; $attribute 2015-11-12T00:00; get;"
     var = 'attribute'
     roles = ['has']
     num_results = 1

diff --git a/kglib/kgcn/test_data/schema.gql b/kglib/kgcn/test_data/schema.gql
@@ -19,17 +19,19 @@
 
 define
 
-name sub attribute datatype string;
+name sub attribute,
+    datatype string;
 job-title sub name;
-date-started sub attribute datatype date;
+date-started sub attribute,
+    datatype date;
 
 ownership sub relationship,
     relates owner,
     relates property;
 
 organisation sub entity,
     plays member,
-    plays group,
+    plays organisational-group,
     plays property,
     plays owner,
     plays party,
@@ -51,11 +53,11 @@ affiliation sub relationship,
 
 membership sub affiliation,
     relates member as party,
-    relates group as party;
+    relates organisational-group as party;
 
 employment sub affiliation,
     relates employee as member,
-    relates employer as group,
+    relates employer as organisational-group,
     has job-title;
 
 project sub entity,

diff --git a/kglib/kgcn/use_cases/attribute_prediction/label_extraction.py b/kglib/kgcn/use_cases/attribute_prediction/label_extraction.py
@@ -16,7 +16,7 @@
 #  specific language governing permissions and limitations
 #  under the License.
 #
-
+import itertools
 import typing as typ
 import kglib.kgcn.neighbourhood.data.sampling.random_sampling as random
 
@@ -29,9 +29,9 @@ def __init__(self, query: str, attribute_vars_config: typ.Tuple[str, typ.Mutable
         self._attribute_vars_config = attribute_vars_config
         self._query = query
 
-    def __call__(self, tx, sample_size):
+    def __call__(self, tx, sample_size, population_size):
 
-        response = tx.query(self._query)
+        response = itertools.islice(tx.query(self._query), population_size)
         sampled_responses = self._sampling_method(response, sample_size)
         owner_var = self._attribute_vars_config[0]
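
The hunk above replaces the query-level `limit` with `itertools.islice` over the answer stream. A minimal sketch of that pattern, assuming a hypothetical `fake_query` generator in place of `tx.query` and the standard library's `random.sample` in place of kglib's `random_sample`:

```python
import itertools
import random


def fake_query(total):
    """Hypothetical stand-in for tx.query(...): lazily yields answers."""
    for i in range(total):
        yield f"answer-{i}"


def sample_from_stream(stream, sample_size, population_size, seed=0):
    # Cap the lazy stream at population_size items, as islice does in the diff.
    population = list(itertools.islice(stream, population_size))
    # Never ask for more samples than the capped population holds.
    k = min(sample_size, len(population))
    return random.Random(seed).sample(population, k)


samples = sample_from_stream(fake_query(10_000), sample_size=5, population_size=100)
print(samples)
```

Because `islice` stops pulling from the underlying iterator once the cap is reached, only the first `population_size` answers are ever materialised, which is what the removed `limit` clause used to guarantee on the server side.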
 

diff --git a/kglib/kgcn/use_cases/attribute_prediction/label_extraction_test.py b/kglib/kgcn/use_cases/attribute_prediction/label_extraction_test.py
@@ -85,19 +85,19 @@ def get(variable):
     def test_output_format_as_expected(self):
 
         concept_label_extractor = label_extraction.ConceptLabelExtractor(self._query, self._vars_config)
-        concepts_with_labels = concept_label_extractor(self._grakn_tx, 5)
+        concepts_with_labels = concept_label_extractor(self._grakn_tx, 5, 5)
 
         expected_output = [(self._person_mock, {'age_var': [66], 'gender_var': [0, 1]})]
         self.assertListEqual(expected_output, concepts_with_labels)
 
     def test_get_called_for_each_attribute_variable(self):
         concept_label_extractor = label_extraction.ConceptLabelExtractor(self._query, self._vars_config)
-        concept_label_extractor(self._grakn_tx, 5)
+        concept_label_extractor(self._grakn_tx, 5, 5)
         self._answer_mock.get.assert_has_calls([mock.call('x'), mock.call('age_var'), mock.call('gender_var')])
 
     def test_value_called_for_each_attribute(self):
         concept_label_extractor = label_extraction.ConceptLabelExtractor(self._query, self._vars_config)
-        concept_label_extractor(self._grakn_tx, 5)
+        concept_label_extractor(self._grakn_tx, 5, 5)
         self._mock_age.value.assert_called_once()
         self._mock_gender.value.assert_called_once()