Tutorial: Sort and layout power ups. #552

Merged
merged 8 commits into from
Feb 1, 2024
6 changes: 3 additions & 3 deletions docs/conf.py
@@ -20,13 +20,13 @@
# -- Project information -----------------------------------------------------

project = u'odgi'
copyright = '2020-2023, *Guarracino A., *Heumos S., Nahnsen S., Prins P., Garrison E. Revision v0.8.2-1fa78aa'
copyright = '2020-2024, *Guarracino A., *Heumos S., Nahnsen S., Prins P., Garrison E. Revision v0.8.4-a19163ea'
author = u'*Andrea Guarracino, *Simon Heumos, Sven Nahnsen, Pjotr Prins, Erik Garrison'

# The short X.Y version
version = 'v0.8.2'
version = 'v0.8.4'
# The full version, including alpha/beta/rc tags
release = '1fa78aa'
release = 'a19163ea'


# -- General configuration ---------------------------------------------------
Binary file added docs/img/DRB1-3123_sorted.U1000.png
Binary file added docs/img/DRB1-3123_sorted.j10000.png
Binary file added docs/img/DRB1-3123_sorted.x2.png
Binary file modified docs/img/DRB1-3123_sorting_layouting.png
2 changes: 1 addition & 1 deletion docs/index.rst
@@ -79,7 +79,7 @@ Core Functionalities
:target: rst/tutorials/extract_selected_loci.html

.. |sorting_layouting| image:: img/DRB1-3123_sorting_layouting.png
:target: rst/tutorials/sorting_layouting.html
:target: rst/tutorials/sort_layout.html

.. |navigating_and_annotating_graphs| image:: img/nav_welcome.png
:target: rst/tutorials/navigating_and_annotating_graphs.html
5 changes: 2 additions & 3 deletions docs/rst/commands/odgi_sort.rst
@@ -49,9 +49,8 @@ order:
force-directed graph drawing algorithm minimizes the graph’s energy
function or stress level. It applies stochastic gradient descent
(SGD) to move a single pair of nodes at a time. The path index is
used to pick the terms to move stochastically. If ran with 1 thread
only, the resulting order of the graph is deterministic. The seed is
adjustable.
used to pick the terms to move stochastically. For more details about
the algorithm, please take a look at https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2.

Sorting the paths in a graph may refine the sorting process. For the
users’ convenience, it is possible to specify a whole pipeline of sorts
6 changes: 3 additions & 3 deletions docs/rst/multiqc.rst
@@ -38,7 +38,7 @@ To see the full statistics in YAML format of the graph, execute:

.. code-block:: bash

odgi stats -i DRB1-3123.gfa.og -m
odgi stats -i DRB1-3123.gfa.og -m -sgdl

This prints the following YAML to stdout:

@@ -89,7 +89,7 @@ Let's save the statistics this time:

.. code-block:: bash

odgi stats -i DRB1-3123.gfa.og -m > DRB1-3123.gfa.og.stats.yaml
odgi stats -i DRB1-3123.gfa.og -m -sgdl > DRB1-3123.gfa.og.stats.yaml

.. note::

@@ -167,7 +167,7 @@ Assuming, we have several graphs, of which we want to compare the statistics fro
.. code-block:: bash

odgi build -g LPA.gfa -o LPA.gfa.og
odgi stats -i LPA.gfa.og -y > LPA.gfa.og.stats.yaml
odgi stats -i LPA.gfa.og -m -sgdl > LPA.gfa.og.stats.yaml
odgi viz -i LPA.gfa.og -o LPA.gfa.og.viz_mqc.png
odgi layout -i LPA.gfa.og -o LPA.gfa.og.lay
odgi draw -i LPA.gfa.og -c LPA.gfa.og.lay -p LPA.gfa.og.lay.draw_mqc.png -w 10 -C
6 changes: 3 additions & 3 deletions docs/rst/quick_start.rst
@@ -18,12 +18,12 @@ version 1 (`GFAv1 <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8006571/#FN8>`_)
Build graph from GFA
----------------------------

Assuming that your current working directory is the root of the ``odgi`` project, to construct an ``odgi`` file from a
``GFA`` file, execute:
To construct an ``odgi`` file from a ``GFA`` file, execute:

.. code-block:: bash

odgi build -g test/DRB1-3123.gfa -o DRB1-3123.og
wget https://raw.githubusercontent.com/pangenome/odgi/master/test/DRB1-3123.gfa
odgi build -g DRB1-3123.gfa -o DRB1-3123.og

The command creates a file called ``DRB1-3123.og``, which contains the input graph in ``odgi`` format.

2 changes: 1 addition & 1 deletion docs/rst/tutorials/exploratory_analysis.rst
@@ -76,7 +76,7 @@ Color with respect to the node position

This is a linearized visualization, but the pangenome graphs are not linear when the embedded genomes present structural
variation. However, a graph can be optimized for being better visualized in 1-Dimension by sorting its nodes properly
(see the :ref:`sorting-layouting` tutorial for more information).
(see the :ref:`sort-layout` tutorial for more information).

To color the bars with respect to the node position in each path, execute:

112 changes: 108 additions & 4 deletions docs/rst/tutorials/sort_layout.rst
@@ -1,4 +1,4 @@
.. _sorting-layouting:
.. _sort-layout:

###############
Sort and Layout
@@ -16,6 +16,8 @@ a 1D and 2D layout to simplify these complex regions.
This tutorial shows how to sort and visualize a graph in 1D. It explains how to generate a 2D layout of a graph, and how
to take a look at the calculated layout using static and interactive tools.

For more details about the applied algorithm, please take a look at https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2.

.. Pangenome graphs embed linear pangenomic sequences as paths in
.. the graph, but to our knowledge, no algorithm takes into account this biological information in the sorting. Moreover,
.. existing 2D layout methods struggle to deal with large graphs. ``odgi`` implements a new layout algorithm to simplify a pangenome
@@ -39,12 +41,12 @@ to take a look at the calculated layout using static and interactive tools.
Build the unsorted DRB1-3123 graph
----------------------------------

Assuming that your current working directory is the root of the ``odgi`` project, to construct an ``odgi`` graph from the
``DRB1-3123`` dataset in ``GFA`` format, execute:
To construct an ``odgi`` graph from the ``DRB1-3123`` dataset in ``GFA`` format, execute:

.. code-block:: bash

odgi build -g test/DRB1-3123_unsorted.gfa -o DRB1-3123_unsorted.og
wget https://raw.githubusercontent.com/pangenome/odgi/master/test/DRB1-3123_unsorted.gfa
odgi build -g DRB1-3123_unsorted.gfa -o DRB1-3123_unsorted.og

The command creates a file called ``DRB1-3123_unsorted.og``, which contains the input graph in ``odgi`` format. This graph contains
12 ALT sequences of the `HLA-DRB1 gene <https://www.ncbi.nlm.nih.gov/gene/3123>`_ from the GRCh38 reference genome.
@@ -129,6 +131,22 @@ nodes.

.. note::
The PG-SGD is not deterministic, because of its `Hogwild! <https://papers.nips.cc/paper/2011/hash/218a0aefd1d1a4be65601cc6ddc1520e-Abstract.html>`_ approach.
For more details about the applied algorithm, please take a look at https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2.

.. note::
The 1D PG-SGD implementation comes with a large number of tunable parameters. Based on our experience applying it to hundreds of graphs, the current
defaults work well for most graphs. However, if you feel the sorting did not work well enough, there are two key parameters you can tune:

| **-G, --path-sgd-min-term-updates-paths**\ =\ *N*: The minimum number of terms to be
updated before a new path-guided
linear 1D SGD iteration with adjusted
learning rate eta starts, expressed as
a multiple of total path steps (default: 1.0).
| **-x, --path-sgd-iter-max**\ =\ *N*: The maximum number of iterations for path-guided linear 1D SGD model (default: 100).

Increasing both can lead to a better sorted graph. For example, one can start optimizing by setting **-x, --path-sgd-iter-max**\ =\ *200*.
For more parameter details, please take a look at :ref:`odgi sort`.
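
As a minimal sketch of the suggestion above, the sort from this tutorial can be rerun with a higher iteration cap. The flags are the ones already used in this tutorial; the output file name is an assumption chosen for illustration, and the command only runs when ``odgi`` and the input graph are present.

```shell
# Assumed file names; the input graph is the one built earlier in this tutorial.
in_graph="DRB1-3123_unsorted.og"
iter_max=200
# Hypothetical output name, following the naming pattern used in this tutorial.
out_graph="DRB1-3123_sorted.x${iter_max}.og"

# Print the command so it can be inspected before running it.
echo "odgi sort -i ${in_graph} --threads 2 -P -Y -x ${iter_max} -o ${out_graph}"

# Run it only when odgi and the input graph are available.
if command -v odgi >/dev/null 2>&1 && [ -f "${in_graph}" ]; then
    odgi sort -i "${in_graph}" --threads 2 -P -Y -x "${iter_max}" -o "${out_graph}"
fi
```

The resulting graph can then be checked with the same ``odgi stats`` and ``odgi viz`` calls used elsewhere in this tutorial.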

.. To reproduce the visualization below, the sorted graph can be found under ``test/DRB1-3123_sorted.og``.

@@ -169,6 +187,73 @@ This prints to stdout:

Compared to before, these metrics show that the goodness of the sorting of the graph improved significantly.

--------------------------------------------
Playing around with the 1D PG-SGD parameters
--------------------------------------------

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What happens if the maximum number of iterations is very low?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

odgi sort -i DRB1-3123_unsorted.og --threads 2 -P -Y -x 2 -o DRB1-3123_sorted.x2.og
odgi viz -i DRB1-3123_sorted.x2.og -o DRB1-3123_sorted.x2.png

.. image:: /img/DRB1-3123_sorted.x2.png

The graph appears very complex and is hardly human-readable. That's because, in total, the number of node position updates
was only two times the number of total path steps, instead of one hundred times the number of total path steps, which is the current default.
For very complex graphs, one may have to increase this number even further.
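
The relationship described above can be sketched with a little shell arithmetic. The path step count is a hypothetical placeholder, not the real value for DRB1-3123, and the formula (iterations times multiplier times total path steps) is inferred from the explanation in this tutorial rather than taken from the ``odgi`` source.

```shell
# Hypothetical number of total path steps; replace with the value for your graph.
total_path_steps=20000
# The -G default of 1.0, written as an integer for shell arithmetic.
min_term_updates_multiplier=1

# Minimum node position updates = iterations * multiplier * total path steps.
updates_x2=$((2 * min_term_updates_multiplier * total_path_steps))
updates_default=$((100 * min_term_updates_multiplier * total_path_steps))

echo "-x 2:   ${updates_x2} updates"
echo "-x 100: ${updates_default} updates"
```

Under these assumptions, the ``-x 2`` run performs fifty times fewer updates than a run with the default of 100 iterations.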

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What happens if the minimum number of term updates is very high?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

odgi sort -i DRB1-3123_unsorted.og --threads 2 -P -Y -U 1000 -o DRB1-3123_sorted.U1000.og
odgi viz -i DRB1-3123_sorted.U1000.og -o DRB1-3123_sorted.U1000.png

.. image:: /img/DRB1-3123_sorted.U1000.png

The graph lost its complexity and is now linear. Compared to the 1D visualization using the default parameters, it is hard
to spot any differences. So let's take a look at the metrics:

.. code-block:: bash

odgi stats -i DRB1-3123_sorted.U1000.og -s -d -l -g

This prints to stdout:

.. code-block:: bash

#mean_links_length
path in_node_space in_nucleotide_space num_links_considered num_gap_links_not_penalized
all_paths 1.00361 8.30677 21870 15195
#sum_of_path_node_distances
path in_node_space in_nucleotide_space nodes nucleotides num_penalties num_penalties_different_orientation
all_paths 3.23238 3.73489 21882 163416 3750 1

We were actually able to improve the metrics compared to using the default parameters. However, the runtime increased from under 1 second to ~30 seconds.
So one needs to be careful with such a parameter. For very large graphs, this additional runtime would be a huge
waste compared to the gains in linearity.
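
To check this trade-off on your own machine, the two runs can be timed side by side. This is a rough sketch: ``timed_sort`` is a hypothetical helper, not part of ``odgi``, the output file names are assumptions, and the timing has one-second resolution, so the default run may report 0 seconds.

```shell
# Hypothetical helper: time one odgi sort run; extra flags are passed through unchanged.
timed_sort() {
    label="$1"
    shift
    out="DRB1-3123_sorted.${label}.og"
    start=$(date +%s)
    odgi sort -i DRB1-3123_unsorted.og --threads 2 -P -Y "$@" -o "${out}"
    end=$(date +%s)
    echo "${label}: $((end - start)) s"
}

# Run the comparison only when odgi and the input graph are available.
if command -v odgi >/dev/null 2>&1 && [ -f DRB1-3123_unsorted.og ]; then
    timed_sort default
    timed_sort U1000 -U 1000
else
    echo "odgi or input graph not found; skipping timing run"
fi
```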

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What happens if the threshold of the maximum distance of two nodes is very high?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

odgi sort -i DRB1-3123_unsorted.og --threads 2 -P -Y -j 10000 -o DRB1-3123_sorted.j10000.og
odgi viz -i DRB1-3123_sorted.j10000.og -o DRB1-3123_sorted.j10000.png

.. image:: /img/DRB1-3123_sorted.j10000.png

The graph appears very complex and is hardly human-readable. That's because the iterations are terminated as soon as the
expected distance of two nodes, the nucleotide distance given by two randomly chosen path steps, is as close as 10000.
Naturally, this happens very soon.

=========================================================
1D reference-guided grooming and reference-guided sorting
=========================================================
@@ -267,6 +352,8 @@ We can clearly observe, that the path positions of the two reference now define
2D layout
=========

The 2D PG-SGD layout algorithm is described in https://www.biorxiv.org/content/10.1101/2023.09.22.558964v2.

-----------------------------------------
2D layout of the unsorted DRB1-3123 graph
-----------------------------------------
@@ -277,6 +364,23 @@ We want to have a 2D layout of our DRB1-3123 graph:

odgi layout -i DRB1-3123_unsorted.og -o DRB1-3123_unsorted.og.lay -P --threads 2

.. note::
The 2D PG-SGD implementation comes with a large number of tunable parameters. Based on our experience applying it to hundreds of graphs, the current
defaults work well for most graphs. However, if you feel the resulting 2D layout is not of good enough quality, there are two key parameters you can tune:

| **-G, --path-sgd-min-term-updates-paths**\ =\ *N*: Minimum number of terms N to be
updated before a new path-guided 2D
SGD iteration with adjusted learning
rate eta starts, expressed as a
multiple of total path length
(default: 10).
| **-x, --path-sgd-iter-max**\ =\ *N*: The maximum number of iterations N for
the path-guided 2D SGD model (default:
30).

Increasing both can lead to a better graph layout. For example, one can start optimizing by setting **-x, --path-sgd-iter-max**\ =\ *100*.
For more parameter details please take a look at :ref:`odgi layout`.
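
A minimal sketch of that suggestion, reusing the ``odgi layout`` call from this tutorial with a higher iteration cap. The output file name is an assumption chosen for illustration, and the command only runs when ``odgi`` and the input graph are present.

```shell
# Assumed file names; the input graph is the one built earlier in this tutorial.
in_graph="DRB1-3123_unsorted.og"
iter_max=100
# Hypothetical output name for the layout computed with more iterations.
out_layout="DRB1-3123_unsorted.x${iter_max}.og.lay"

# Print the command so it can be inspected before running it.
echo "odgi layout -i ${in_graph} -o ${out_layout} -x ${iter_max} -P --threads 2"

# Run it only when odgi and the input graph are available.
if command -v odgi >/dev/null 2>&1 && [ -f "${in_graph}" ]; then
    odgi layout -i "${in_graph}" -o "${out_layout}" -x "${iter_max}" -P --threads 2
fi
```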

--------------------------------------------
Drawing the 2D layout of the DRB1-3123 graph
--------------------------------------------