Allow TSS expressions to reference a full path through a metadata tree (possibly attached to every element of a phylogenetic tree) #26

Open
BenStoever opened this issue Aug 17, 2018 · 1 comment

BenStoever commented Aug 17, 2018

As mentioned in the wiki, TSS is planned to support expressions that format elements of a tree depending on the values of annotations, using a syntax similar to attribute selectors in CSS. I think that is a great feature. It would make it possible to achieve formats like those supported by, e.g., TreeGraph 2, such as automatically setting distance values or colors from annotations.

To make this really useful, I think it should also be possible to reference the source annotation in richly annotated trees, e.g., in NeXML format. A single identifier may not be sufficient to reference an annotation in a "metadata tree" attached, e.g., to a tree node or branch. I'm not sure whether this has already been considered, but I thought I would write down my thoughts here and ask for feedback.

Why is a single predicate or string key not enough?

Rich annotations in NeXML or phyloXML can be nested, creating a "metadata tree" that is attached to each element of a phylogenetic tree (e.g., a node, a branch or the tree as a whole). Within such metadata trees, the same predicate or identifier may be used multiple times underneath different parent elements to describe different things, or may even be used multiple times on the same level to describe lists of data. The following arbitrary example illustrates this.

<?xml version="1.0" ?>
<nexml xmlns="http://www.nexml.org/2009" version="0.9"
		xmlns:nex="http://www.nexml.org/2009" xmlns:a="http://example.org/annotations/"
		xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	...
	<trees id="treeGroup" about="#treeGroup" otus="undefinedOTUs1">
		<tree id="tree" about="#tree" xsi:type="nex:FloatTree">
			...
			<node id="node2" about="#node2" label="Some angiosperm">
				<!-- Leaf size -->
				<meta id="node2meta1" xsi:type="nex:ResourceMeta" rel="a:leafSize">
					<meta id="node2meta2" xsi:type="nex:ResourceMeta" rel="a:measurements">
						<meta id="node2meta3" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">197</meta>
						<meta id="node2meta4" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">213</meta>
						<meta id="node2meta5" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">155</meta>
						<meta id="node2meta6" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">202</meta>
					</meta>
					<meta id="node2meta7" xsi:type="nex:LiteralMeta" property="a:averageSize" datatype="xsd:double">191.75</meta>
				</meta>
				
				<!-- Flower size -->
				<meta id="node2meta8" xsi:type="nex:ResourceMeta" rel="a:flowerSize">
					<meta id="node2meta9" xsi:type="nex:ResourceMeta" rel="a:measurements">
						<meta id="node2meta10" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">32</meta>
						<meta id="node2meta11" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">25</meta>
						<meta id="node2meta12" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">47</meta>
						<meta id="node2meta13" xsi:type="nex:LiteralMeta" property="a:size" datatype="xsd:double">28</meta>
					</meta>
					<meta id="node2meta14" xsi:type="nex:LiteralMeta" property="a:averageSize" datatype="xsd:double">33</meta>
				</meta>
			</node>
			...
		</tree>
	</trees>
</nexml>

The node (and all other terminal nodes in the document) has metadata attached that describes the leaf and flower sizes of collected samples. In both cases a metadata subtree is attached (a:leafSize and a:flowerSize) that contains a list of size measurements (a:measurements) and an average size element (a:averageSize). (This is just an arbitrary example that came to my mind. There are probably more reasonable use cases.)

An expression like the following, which sets the font size from the average measured size, would be ambiguous because a:averageSize occurs under both a:leafSize and a:flowerSize:

node {
	font-size: calc(a:averageSize * 0.1em);
}
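
The ambiguity can be made concrete with a small Python sketch. The nested `(predicate, content)` list used here is only an assumed in-memory model of the metadata tree of node2, and `find_all` is a hypothetical helper, not part of TSS, TreeGraph 2 or JPhyloIO.

```python
# Assumed model: each metadata element is a (predicate, content) pair, where
# content is either a literal value or a list of child elements.
node2_meta = [
    ("a:leafSize", [
        ("a:measurements", [("a:size", 197.0), ("a:size", 213.0),
                            ("a:size", 155.0), ("a:size", 202.0)]),
        ("a:averageSize", 191.75),
    ]),
    ("a:flowerSize", [
        ("a:measurements", [("a:size", 32.0), ("a:size", 25.0),
                            ("a:size", 47.0), ("a:size", 28.0)]),
        ("a:averageSize", 33.0),
    ]),
]


def find_all(meta, predicate):
    """Collect every literal value stored under `predicate`, at any depth."""
    hits = []
    for pred, content in meta:
        if isinstance(content, list):
            hits.extend(find_all(content, predicate))
        elif pred == predicate:
            hits.append(content)
    return hits


matches = find_all(node2_meta, "a:averageSize")
print(matches)  # [191.75, 33.0]
```

A bare predicate yields two candidates (the leaf and the flower average), so the expression cannot decide which one should drive the font size.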

What could a solution look like?

When refactoring the metadata model of TreeGraph 2 to match the NeXML model more closely, we found "metadata paths" to be good references to a single annotation. (This figure from a currently unpublished chapter on TreeGraph 2 illustrates the concept of "metadata paths" through "metadata trees" visually using another example.)

The whole "metadata path" would be required to unambiguously reference an annotation:

node {
	font-size: calc(0.7em + metadataValue(currentNode, 'a:flowerSize', 'a:averageSize') * 0.01em);
}

I introduce a function metadataValue() here that takes the element of the phylogenetic tree as its first argument and a list of predicates as additional arguments to reference an attached value. For the first parameter, a set of keywords (e.g., currentNode, currentBranch, ...) could be defined. This way, any concrete attachment can be referenced, and node formats could also depend on branch annotations and vice versa.

Multiple equal predicates on the same level

To really model all possible cases, it must also be possible to select between multiple identical predicates used on the same level. In TreeGraph we solved this by adding an index to each predicate that defines the position on the current level. (Note that the index references the nth element with that predicate and not the position in the list. Intermediate elements with other predicates will not contribute to that index. This makes references more stable against edits of the metadata tree.)

node {
	font-size: calc(0.7em + metadataValue(currentNode, 'a:flowerSize', 'a:measurements', 'a:size'[2]) * 0.01em);
}

This expression references the third measurement of the flower size (indices start with 0), which is 47. (It should be discussed whether indices should start with 0 or 1. Maybe there is a comparable case in CSS, e.g., column numbers.) The expression above is short for metadataValue(currentNode, 'a:flowerSize'[0], 'a:measurements'[0], 'a:size'[2]); 0 would be the default index.
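
To illustrate the path and index semantics described above, here is a minimal Python sketch. It is not the TreeGraph 2 or JPhyloIO implementation. The nested `(predicate, content)` list is an assumed model, `metadata_value` is a hypothetical name, and the `a:note` element is added only to show that elements with other predicates do not shift the index.

```python
# Assumed model: a metadata tree is a list of (predicate, content) pairs;
# content is a literal value or a nested list of such pairs.
flower_meta = [
    ("a:flowerSize", [
        ("a:measurements", [
            ("a:size", 32.0),
            ("a:size", 25.0),
            ("a:note", "hypothetical extra element, not in the example above"),
            ("a:size", 47.0),
            ("a:size", 28.0),
        ]),
        ("a:averageSize", 33.0),
    ]),
]


def metadata_value(meta, *path):
    """Follow `path` through a metadata tree and return the literal at its end.

    Each step is either a predicate string (index 0 implied) or a
    (predicate, index) pair selecting the nth element with that predicate.
    Returns None if any step cannot be resolved.
    """
    current = meta
    for step in path:
        predicate, index = (step, 0) if isinstance(step, str) else step
        if not isinstance(current, list):
            return None  # tried to descend into a literal value
        # Only elements with the same predicate count towards the index.
        same = [content for pred, content in current if pred == predicate]
        if index < 0 or index >= len(same):
            return None
        current = same[index]
    # A path that ends on a subtree does not denote a single value.
    return None if isinstance(current, list) else current


third = metadata_value(flower_meta, "a:flowerSize", "a:measurements", ("a:size", 2))
```

Here `third` is 47.0 even though the `a:note` element sits between the second and third `a:size` entries, which is the stability against edits mentioned above.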

How to deal with formats that do not support rich metadata?

Since TSS should be independent of the concrete tree format (virtual DOM), and predicates as in the example above are used in NeXML and to some extent in phyloXML but not, e.g., in Newick, there should also be a way to reference metadata there. In JPhyloIO, our library for reading and writing phylogenetic data and metadata from and to different formats through one common interface, we solved that problem by additionally allowing alternative string IDs of Newick hot comments (as used, e.g., by BEAST, MrBayes or TreeGraph) to be specified.

Consider the following tree in Newick format, where the metadata might alternatively be available in hot comments:

( ... ("Some angiosperm"[&averageLeafSize=191.75, averageFlowerSize=33]:8.3, ... ));

A TSS that is able to format both trees might look like this:

node {
	font-size: calc(0.7em + metadataValue(currentNode, 'averageFlowerSize') * 0.01em);
	font-size: calc(0.7em + metadataValue(currentNode, 'a:flowerSize', 'a:averageSize') * 0.01em);
}

Usually only one expression will give a result and will be used. In cases where both expressions produce a result (not possible in this concrete example in any format I'm aware of), the second would override the first one, as in CSS.
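
The "last resolvable declaration wins" behaviour could be sketched in Python as follows. This is an assumption about how a TSS engine might evaluate the two declarations, not a description of an existing one. The plain dictionaries are a deliberately simplified stand-in for the metadata of the two formats (they cannot represent repeated predicates), and all function names are hypothetical.

```python
def cascade(declarations, node_meta):
    """Evaluate declarations in order; a later one that resolves overrides earlier ones."""
    result = None
    for resolve in declarations:
        value = resolve(node_meta)
        if value is not None:
            result = value
    return result


# Simplified, assumed views of the same node read from the two formats.
newick_node = {"averageLeafSize": 191.75, "averageFlowerSize": 33.0}
nexml_node = {"a:flowerSize": {"a:averageSize": 33.0}}


def by_key(meta):
    """Flat string ID, as used in Newick hot comments."""
    return meta.get("averageFlowerSize")


def by_path(meta):
    """Two-step metadata path, as used for NeXML."""
    return meta.get("a:flowerSize", {}).get("a:averageSize")


def font_size_em(meta):
    """Mirror calc(0.7em + value * 0.01em); None if nothing resolves."""
    value = cascade([by_key, by_path], meta)
    return None if value is None else 0.7 + value * 0.01
```

With this sketch, both `font_size_em(newick_node)` and `font_size_em(nexml_node)` evaluate to roughly 1.03, while a node without either annotation yields `None` and would keep its inherited font size.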

Further ideas

While I described above my ideas for applying the concepts of NeXML, TreeGraph 2 and JPhyloIO to TSS expressions, there are other cases that the "metadata paths" as we use them in TreeGraph 2 would not cover.

Referencing depending on values of sibling or child elements of the metadata tree

<?xml version="1.0" ?>
<nexml xmlns="http://www.nexml.org/2009" version="0.9"
		xmlns:nex="http://www.nexml.org/2009" xmlns:a="http://example.org/annotations/"
		xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
	...
	<trees id="treeGroup" about="#treeGroup" otus="undefinedOTUs1">
		<tree id="tree" about="#tree" xsi:type="nex:FloatTree">
			...
			<node id="node2" about="#node2" label="Some angiosperm">
				...
			
				<!-- Flower size measurement in a population in Europe -->
				<meta id="node2meta1" xsi:type="nex:ResourceMeta" rel="a:collection">
					<meta id="node2meta2" xsi:type="nex:LiteralMeta" property="a:location" datatype="xsd:string">Europe</meta>
					<meta id="node2meta3" xsi:type="nex:LiteralMeta" property="a:averageSize" datatype="xsd:double">28</meta>
				</meta>
				
				<!-- Flower size measurement in a population in Africa -->
				<meta id="node2meta4" xsi:type="nex:ResourceMeta" rel="a:collection">
					<meta id="node2meta5" xsi:type="nex:LiteralMeta" property="a:location" datatype="xsd:string">Africa</meta>
					<meta id="node2meta6" xsi:type="nex:LiteralMeta" property="a:averageSize" datatype="xsd:double">32</meta>
				</meta>
			</node>
			...
		</tree>
	</trees>
</nexml>

If the measurement of the African population should be used to set the font size, the following would be possible:

node {
	font-size: calc(0.7em + metadataValue(currentNode, 'a:collection'[1], 'a:averageSize') * 0.01em);
}

However, this would only work for documents that contain the collections in exactly that order with no other entries in between them. It would be better to provide a way to reference the annotation based on its sibling element a:location='Africa'. Such more complex cases are probably not the focus right now and may never be, but I wanted to note already that such use cases may exist. (As a workaround when TreeGraph 2 is used, its feature for calculating annotations from each other, which also allows conditional expressions, could be used to preprocess a tree before applying TSS. Similar functionality probably exists in other systems.)
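
A selection by sibling value might work roughly like the following Python sketch. The `(predicate, content)` list is again an assumed model of the metadata tree, and `value_where` is a hypothetical helper that no current TSS or TreeGraph 2 function corresponds to.

```python
# Assumed model of the two a:collection subtrees attached to node2.
collections = [
    ("a:collection", [("a:location", "Europe"), ("a:averageSize", 28.0)]),
    ("a:collection", [("a:location", "Africa"), ("a:averageSize", 32.0)]),
]


def value_where(meta, parent, sibling, sibling_value, target):
    """Return `target` from the first `parent` whose `sibling` child equals `sibling_value`."""
    for pred, children in meta:
        if pred != parent or not isinstance(children, list):
            continue
        if (sibling, sibling_value) in children:
            for child_pred, child_value in children:
                if child_pred == target:
                    return child_value
    return None


african = value_where(collections, "a:collection", "a:location", "Africa", "a:averageSize")
```

Unlike 'a:collection'[1], this lookup returns 32.0 regardless of the order in which the collections appear in the document.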

Referencing values by IDs

It may also make sense in some cases to select metadata values directly by their ID, as is also possible in CSS. A problem here would be that not all formats have IDs and that IDs may change when an application processes the file.

Accessing the average African flower size might also be done using the ID of the meta tag:

node {
	font-size: calc(0.7em + metadataValue(currentNode, #node2meta6) * 0.01em);
}
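
For NeXML input, an ID-based lookup can be sketched with Python's standard `xml.etree.ElementTree` module. The document string below is a trimmed copy of the example above (the elided parts are simply left out), and `metadata_by_id` is a hypothetical helper name, not an existing API.

```python
import xml.etree.ElementTree as ET

NEXML = """<nexml xmlns="http://www.nexml.org/2009" version="0.9"
    xmlns:nex="http://www.nexml.org/2009" xmlns:a="http://example.org/annotations/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <trees id="treeGroup" about="#treeGroup" otus="undefinedOTUs1">
    <tree id="tree" about="#tree" xsi:type="nex:FloatTree">
      <node id="node2" about="#node2" label="Some angiosperm">
        <meta id="node2meta4" xsi:type="nex:ResourceMeta" rel="a:collection">
          <meta id="node2meta5" xsi:type="nex:LiteralMeta" property="a:location" datatype="xsd:string">Africa</meta>
          <meta id="node2meta6" xsi:type="nex:LiteralMeta" property="a:averageSize" datatype="xsd:double">32</meta>
        </meta>
      </node>
    </tree>
  </trees>
</nexml>"""

NS = "{http://www.nexml.org/2009}"  # ElementTree's notation for namespaced tags


def metadata_by_id(element, meta_id):
    """Return the text of the meta element with the given id below `element`, or None."""
    for meta in element.iter(NS + "meta"):
        if meta.get("id") == meta_id:
            return meta.text
    return None


root = ET.fromstring(NEXML)
node2 = next(n for n in root.iter(NS + "node") if n.get("id") == "node2")
average = float(metadata_by_id(node2, "node2meta6"))
```

The search is restricted to the subtree of the current node, which matches the currentNode argument in the expression above. Since IDs are unique within a document, a real implementation could also look them up globally.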

Purpose of this

I just wanted to document my ideas for this problem here, to make the solutions we already worked out for TreeGraph 2 and JPhyloIO available and to illustrate their possible uses in TSS. Some of these may become part of TSS or serve as the basis for further discussion.

Feedback on this (and possibly also on TreeGraph's upcoming new metadata model and the abstraction of JPhyloIO over different formats) is of course very welcome.

BenStoever commented Aug 17, 2018

I just realized that I started the last post talking about CSS attribute selectors and did not use them afterwards. Of course, the described principles could be used in the same way for these:

.node[metadataValue(currentNode, 'a:flowerSize', 'a:averageSize') >= 30] {
	color: red;
}
