diff --git a/doc/tutorial/02_layout/index.md b/doc/tutorial/02_layout/index.md
new file mode 100644
index 00000000..6fba8379
--- /dev/null
+++ b/doc/tutorial/02_layout/index.md
@@ -0,0 +1,282 @@
+# Create a layout with Raptor {#tutorial_layout}
+
+You will learn how to construct a Raptor layout of large collections of nucleotide sequences.
+
+\attention
+This tutorial is only relevant if you want to use the advanced option `--hibf` of `raptor index`.
+You can skip this chapter if you want to use Raptor with the default IBF.
+
+\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
+[Interleaved Bloom Filter (IBF)](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
+\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}
+
+[TOC]
+
+# IBF vs HIBF
+
+Raptor works with the Interleaved Bloom Filter (IBF) by default. A new feature is the Hierarchical Interleaved Bloom
+Filter (HIBF) (raptor::hierarchical_interleaved_bloom_filter). It uses a more space-saving method of storing the bins:
+it distinguishes between the user bins, which reflect the individual samples as before, and the so-called technical
+bins, which may combine several user bins. This is especially useful when the samples differ greatly in size.
+
+To use the HIBF, a layout must be created.
+
+# Create a Layout of the HIBF
+
+To realise this distinction between user bins and technical bins, a layout must be computed before creating an index.
+For this purpose we have developed our own tool [Chopper](https://www.seqan.de/apps/chopper.html) and integrated it into
+Raptor. So you can simply call `raptor layout` without having to install Chopper separately.
+
+## General Idea & main parameters
+
+\image html hibf.svg width=40%
+
+The figure above shows how the user bins are stored in the technical bins. The resulting tree represents the layout.
+The first step is to estimate the number of (representative) k-mers per user bin by computing
+[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
+sketches are stored in a directory and are used when computing an HIBF layout. We go into more detail later in
+\ref HLL. The HIBF layout tries to minimize the disk space consumption of the resulting index. The space is estimated
+using a k-mer count per user bin, which represents the potential density in a technical bin of an Interleaved Bloom
+Filter.
+
+Using all default values a first call will look like:
+
+```bash
+raptor layout --input-file all_bin_paths.txt --tmax 64
+```
+
+The `input-file` looks exactly as in our previous calls of `raptor index`; it contains all the paths of our database
+files.
+
+The parameter `--tmax` limits the number of technical bins on each level of the HIBF. Choosing a good \f$t_{max}\f$ is
+not trivial. The smaller \f$t_{max}\f$, the more levels the layout needs to represent the data. This results in a higher
+space consumption of the index. While querying each individual level is cheap, querying many levels might also lead to
+an increased runtime. A good \f$t_{max}\f$ is usually the square root of the number of user bins rounded to the next
+multiple of `64`. Note that your \f$t_{max}\f$ will be rounded to the next multiple of 64 anyway.
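
Following this rule of thumb, here is a quick shell sketch (the `1024` user bins are an illustrative assumption):

```bash
# tmax rule of thumb: sqrt(number of user bins), rounded up to a multiple of 64.
user_bins=1024
root=$(awk -v n="$user_bins" 'BEGIN { print int(sqrt(n)) }')
tmax=$(( (root + 63) / 64 * 64 ))
echo "$tmax"
```

For `1024` user bins this prints `64`, matching the value used in the calls below.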
+
+\note
+At the expense of a longer runtime, you can enable the statistic mode that determines the best \f$t_{max}\f$ using the
+option `--determine-best-tmax`.
+When this flag is set, the program will compute multiple layouts for \f$t_{max}\f$ in `[64, 128, 256, ..., tmax]` as
+well as `tmax = sqrt(number of user bins)`. The layout algorithm itself only optimizes the space consumption. When
+determining the best layout, we additionally keep track of the average number of queries needed to traverse each layout.
+This query cost is taken into account when determining the best \f$t_{max}\f$ for your data.
+Note that the option `--tmax` serves as upper bound. Once the layout quality starts dropping, the computation is
+stopped. To compute all layouts up to `--tmax` regardless of the layout quality, pass the flag `--force-all-binnings`.
+This flag is ignored if `--determine-best-tmax` is not set.
+
+We then get the resulting layout (default: `binning.out`) as an output file, which we then pass to Raptor to create the
+index. You can change this default with `--output-filename`.
+
+\note
+Raptor also has a help page, which can be accessed as usual by typing `raptor layout -h` or `raptor layout --help`.
+
+\assignment{Assignment 1: Create a first layout}
+Let's take the bigger example from the introduction \ref tutorial_first_steps and create a layout for it.
+```console
+$ tree -L 2 example_data
+example_data
+├── 1024
+│ ├── bins
+│ └── reads
+└── 64
+ ├── bins
+ └── reads
+```
+Use the data in the `1024` folder.
+
+\hint
+First we need a file listing the paths of all FASTA files. You can generate it with:
+```bash
+seq -f "1024/bins/bin_%04g.fasta" 0 1 1023 > all_bin_paths.txt
+```
+\endhint
+
+Then first determine the best tmax value and calculate a layout with default values and this tmax.
+\endassignment
+
+\solution
+Your `all_bin_paths.txt` should look like:
+```txt
+1024/bins/bin_0000.fasta
+1024/bins/bin_0001.fasta
+1024/bins/bin_0002.fasta
+...
+1024/bins/bin_1021.fasta
+1024/bins/bin_1022.fasta
+1024/bins/bin_1023.fasta
+```
+
+\note
+Sometimes it is better to use absolute paths instead.
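
One way to generate such an absolute-path list (a sketch that simply prefixes the working directory; the file name `all_bin_paths_abs.txt` is our own choice):

```bash
# Build the relative list as before, then prefix every line with $PWD.
seq -f "1024/bins/bin_%04g.fasta" 0 1 1023 > all_bin_paths.txt
while read -r rel; do
  printf '%s/%s\n' "$PWD" "$rel"
done < all_bin_paths.txt > all_bin_paths_abs.txt
```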
+
+And you should have run:
+```bash
+raptor layout --input-file all_bin_paths.txt --determine-best-tmax --tmax 64
+```
+With the output:
+```bash
+## ### Parameters ###
+## number of user bins = 1023
+## number of hash functions = 2
+## false positive rate = 0.05
+## ### Notation ###
+## X-IBF = An IBF with X number of bins.
+## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level.
+## ### Column Description ###
+## tmax : The maximum number of technical bin on each level
+## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF
+## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF
+## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF
+## (l*m)_tmax : Computed by l_tmax * m_tmax
+## size : The expected total size of an tmax-HIBF
+# tmax c_tmax l_tmax m_tmax (l*m)_tmax size
+64 1.00 2.00 1.00 2.00 12.8MiB
+# Best t_max (regarding expected query runtime): 64
+```
+And afterwards:
+```bash
+raptor layout --input-file all_bin_paths.txt --tmax 64
+```
+Your directory should look like this:
+```bash
+$ ls
+1024/ all_bin_paths.txt chopper_sketch.count mini/
+64/ binning.out chopper_sketch_sketches/
+```
+
+\note
+We will use this mini-example in the following, both with further parameters and then for `raptor index --hibf`.
+Therefore, we recommend not deleting the files including the built indexes.
+
+\endsolution
+
+## Additional parameters
+
+To create an index, and thus a layout, each sample of the data set is chopped up into its k-mers. These k-mers are
+passed through a number of hash functions, and each hash value sets one bit to `1` in the sample's so-called bin of the
+Bloom Filter. This means that a k-mer from sample `i` sets `j` bits in bin `i` when `j` hash functions are used.
+When searching for a query, its k-mers are passed through the same hash functions, and we check in which bins all
+addressed bits are set to `1`. This can also produce false positives, so the result only indicates that the query is
+probably part of a sample.
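
As a toy illustration of this marking (a sketch with a made-up polynomial hash and a single 16-bit bin; Raptor's real hashing differs):

```bash
# One bin of a toy Bloom Filter: every k-mer sets j=2 bits via two derived hashes.
bits=$(awk 'BEGIN {
  n = split("ACGT CGTA AACG", kmers, " ")
  for (i = 1; i <= n; i++) {
    h = 0
    for (c = 1; c <= length(kmers[i]); c++)
      h = (h * 31 + index("ACGT", substr(kmers[i], c, 1))) % 1024
    bit[h % 16] = 1            # "hash function" 1
    bit[int(h / 16) % 16] = 1  # "hash function" 2
  }
  for (i = 0; i < 16; i++) printf "%d", bit[i] ? 1 : 0
}')
echo "$bits"
```

Each of the three k-mers marks two bits, and a query k-mer would be considered present only if both of its bits are `1`.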
+
+This principle also applies to the Hierarchical Interleaved Bloom Filter, except that the bins are then stored even more
+efficiently as described above and this is described by the layout. This means that you already have to know some
+parameters for the layout, which you would otherwise specify in the index:
+
+With `--kmer-size` you can specify the length of the k-mers, which should be long enough to avoid random hits.
+Using multiple hash functions can further reduce the false positive rate (`--num-hash-functions`). We found the
+[Bloom Filter Calculator](https://hur.st/bloomfilter/) useful for estimating whether more hash functions would help.
+As it is not ours, we do not guarantee its accuracy.
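
For a rough feeling of this trade-off, the textbook approximation of a Bloom Filter's false positive rate with `n` elements, `m` bits, and `k` hash functions is \f$(1 - e^{-kn/m})^k\f$. A sketch with made-up numbers (this is the generic formula, not Raptor's internal size computation):

```bash
n=1000000   # k-mers inserted into the bin (made up)
m=8000000   # bits in the bin (made up)
k=3         # number of hash functions
fpr=$(awk -v n="$n" -v m="$m" -v k="$k" 'BEGIN { printf "%.4f", (1 - exp(-k * n / m))^k }')
echo "$fpr"
```

For these numbers the estimate is about `0.0306`, i.e. roughly a 3% chance of a random hit per k-mer.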
+
+Each Bloom Filter has a bit vector length that, across all Bloom Filters, gives the size of the Interleaved Bloom
+Filter, which we can specify in the IBF case. Since the HIBF calculates the size of the index itself, it is no longer
+possible to specify a size here. But we can offer the option to name the desired false positive rate with
+`--false-positive-rate`.
+
+\note These parameters must be set identically for `raptor index`.
+
+A call could then look like this:
+```bash
+raptor layout --input-file all_bin_paths.txt \
+ --tmax 64 \
+ --kmer-size 17 \
+ --num-hash-functions 4 \
+ --false-positive-rate 0.25 \
+ --output-filename binning.layout
+```
+
+### Parallelization
+
+Raptor supports parallelization. By specifying `--threads`, for example, the k-mer hashes are processed simultaneously.
+
+
+\assignment{Assignment 2: Create a more specific layout}
+Now let's run the above example with more parameters.
+
+Use the same `all_bin_paths.txt` and create a `binning2.layout`. Use a k-mer size of `16`, `3` hash functions, a
+false positive rate of `0.1`, and `2` threads.
+\endassignment
+
+\solution
+And you should have run:
+```bash
+raptor layout --input-file all_bin_paths.txt \
+ --tmax 64 \
+ --kmer-size 16 \
+ --num-hash-functions 3 \
+ --false-positive-rate 0.1 \
+ --threads 2 \
+ --output-filename binning2.layout
+```
+Your directory should look like this:
+```bash
+$ ls
+1024/ all_bin_paths.txt binning2.layout chopper_sketch_sketches/
+64/ binning.out chopper_sketch.count mini/
+```
+\endsolution
+
+### HyperLogLog sketch {#HLL}
+
+The first step is to estimate the number of (representative) k-mers per user bin by computing
+[HyperLogLog (HLL) sketches](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) of the input data. These HLL
+sketches are stored in a directory and will be used in computing an HIBF layout.
+
+We will also give a short explanation of the HLL sketches here to explain the possible parameters, whereby each bin is
+sketched individually.
+
+\note
+Most parameters are advanced and only need to be changed if the calculation takes significantly too long or the memory
+usage is too high.
+
+So the question is: how many distinct elements does our multiset contain?
+An exact count needs more storage space and runtime than the HLL estimate.
+To estimate it, we compute (binary) 64-bit hash values of the data, which are uniformly distributed over the space of
+possible hash values. Scanning these hashed values, we can estimate how many distinct elements we have seen so far just
+by looking at leading zeros: for an element whose hash has `p` leading zeros, we estimate that about \f$2^p\f$ distinct
+elements have been seen. We then simply keep the maximum, estimating \f$2^{p_{max}}\f$ distinct elements overall.
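
A toy version of this leading-zero trick (a sketch on 8-bit values instead of 64-bit hashes; not Chopper's implementation):

```bash
# Estimate distinct elements as 2^p_max, where p is the number of leading zero
# bits of each 8-bit "hash" value.
estimate=$(printf '%s\n' 200 37 9 141 | awk '
  { p = 0
    for (i = 7; i >= 0; i--) {
      if (int($1 / 2^i) % 2 == 1) break
      p++
    }
    if (p > pmax) pmax = p
  }
  END { print 2^pmax }')
echo "$estimate"
```

Here `9` is `00001001` in binary, so \f$p_{max} = 4\f$ and the estimate is \f$2^4 = 16\f$.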
+
+
+
+However, if we are unlucky and come across a single hash value with an exceptionally long run of leading zeros, then
+\f$p_{max}\f$ vastly overestimates the number of distinct elements, no matter how many are actually present.
+To make the estimate robust, the hash values are divided into `m` parts and a \f$p_{max}\f$ is computed for each part.
+The total estimate is then derived from the *harmonic mean* of these per-part values.
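
A sketch of this combination step (simplified; the real HLL estimator additionally applies a bias-correction factor):

```bash
# Harmonic mean n / sum(1/x_i) of made-up per-part estimates; a single outlier
# (here the 32) influences it far less than it would an arithmetic mean.
hmean=$(printf '%s\n' 16 32 8 16 | awk '{ s += 1 / $1; n++ } END { printf "%.1f", n / s }')
echo "$hmean"
```

The arithmetic mean of these values would be `18.0`, while the harmonic mean stays close to the smaller estimates.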
+
+We can influence this `m` with `--sketch-bits`. `m` must be a power of two so that the `64` bits can be divided evenly,
+so `--sketch-bits` sets a `b` with \f$m = 2^b\f$.
+
+If we choose our `b` (and thus `m`) very large, we need more memory but get higher accuracy (storage consumption grows
+exponentially in `b`). In addition, computing the layout can take longer with a high `b`. If we have many user bins and
+observe a long runtime, it is worth choosing a somewhat smaller `b`.
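
Since \f$m = 2^b\f$, the number of parts doubles with every increment of `b`, a quick sketch:

```bash
# The number of parts m grows exponentially in b.
table=$(for b in 8 10 12 14; do echo "b=$b -> m=$((1 << b)) parts"; done)
echo "$table"
```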
+
+#### Advanced options for HLL sketches
+
+The following options should only be touched if the computation takes a long time.
+
+We have implemented an additional preprocessing step that arranges the user bins into technical bins even better with
+regard to the similarities of the input data. It can be switched off with the flag `--skip-similarity-preprocessing`
+if it costs too much runtime.
+
+With `--max-rearrangement-ratio` you can further influence a part of this preprocessing (a value between `0` and `1`).
+Setting it to `1` switches the rearrangement off. Setting it to a very small value increases runtime and memory usage,
+while a value close to `1` allows only little rearranging, which could hurt the layout quality. In our benchmarks we
+did not observe a large influence either way, so we recommend using this value only for fine-tuning.
+
+One last observation about these advanced options: If you expect hardly any similarity in the data set, then the
+similarity preprocessing makes very little difference.
+
+### Another advanced Option: alpha
+
+You should only touch the parameter `--alpha` if you have understood very well how the layout works and you are
+dissatisfied with the resulting index, e.g. there is still a lot of free RAM but the index is very slow.
+
+The layout algorithm optimizes the space consumption of the resulting HIBF but currently has no means of optimizing the
+runtime for querying such an HIBF. In general, the ratio of merged bins and split bins influences the query time because
+a merged bin always triggers another search on a lower level. To influence this ratio, alpha can be used.
+
+Alpha is a parameter for weighting the storage space calculation of the lower-level IBFs. It functions as a lower-level
+penalty, i.e. if alpha is large, the DP algorithm tries to avoid lower levels, which at the same time leads to the
+top-level IBF becoming somewhat larger. This improves query times but leads to a bigger index.
diff --git a/doc/tutorial/03_index/index.md b/doc/tutorial/03_index/index.md
index 87c0c881..b04d67ce 100644
--- a/doc/tutorial/03_index/index.md
+++ b/doc/tutorial/03_index/index.md
@@ -2,7 +2,7 @@
You will learn how to construct a Raptor index of large collections of nucleotide sequences.
-\tutorial_head{Easy, 30 min, \ref tutorial_first_steps,
+\tutorial_head{Easy, 30 min, \ref tutorial_first_steps \, for using HIBF: \ref tutorial_layout,
[Interleaved Bloom Filter (IBF)](https://docs.seqan.de/seqan/3-master-user/classseqan3_1_1interleaved__bloom__filter.html#details)\,
\ref raptor::hierarchical_interleaved_bloom_filter "Hierarchical Interleaved Bloom Filter (HIBF)"}
@@ -41,7 +41,7 @@ raptor build --size 1k --output raptor.index all_bin_paths.txt
```
\assignment{Assignment 1: Create example files and index them}
-Copy and paste the following FASTA files to some location, e.g. the tmp directory:
+Copy and paste the following FASTA files to some location, e.g. a tmp directory:
mini_1.fasta:
```fasta
@@ -192,7 +192,7 @@ tmp$ ls -la
... 107B 14 Nov 16:11 mini_1.fasta
... 107B 14 Nov 16:11 mini_2.fasta
... 107B 14 Nov 16:11 mini_3.fasta
-... 8,0M 14 Nov 16:21 minimiser_raptor.index
+... 8,0M 14 Nov 16:21 minimiser.index
... 256B 14 Nov 16:20 precomputed_minimisers/
... 19B 10 Nov 16:40 query.fasta
... 1,2K 14 Nov 16:15 raptor.index
@@ -237,43 +237,83 @@ does not need any special parameters.
## IBF vs HIBF {#hibf}
-/todo This is a placeholder section and needs more information including the chopper layout.
-
Raptor works with the Interleaved Bloom Filter by default. A new feature is the Hierarchical Interleaved Bloom Filter
-(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf` can be used. This uses a more
+(HIBF) (raptor::hierarchical_interleaved_bloom_filter), which can be used with `--hibf`. This uses a more
space-saving method of storing the bins. It distinguishes between the user bins, which reflect the individual samples as
before, and the so-called technical bins, which throw some bins together. This is especially useful when there are
samples of very different sizes.
+
+To use the HIBF, a layout must be created before creating an index. We have written an extra tutorial for this
+\ref tutorial_layout.
+
+### HIBF indexing with the use of the layout
+
+The layout file replaces the `all_bin_paths.txt` and is passed with the HIBF parameter instead, e.g. `--hibf binning.out`.
+
Since the HIBF calculates the size of the index itself, it is no longer possible to specify a size here. But we can
-offer the option to name the desired false positive rate with `--fpr`.
+offer the option to name the desired false positive rate with `--fpr`. Thus, for example, a call looks like this:
```bash
-raptor build --kmer 19 --hash 3 --hibf --fpr 0.1 --output raptor.index all_bin_paths.txt
+raptor build --hibf binning.layout \
+ --kmer 16 \
+ --window 20 \
+ --hash 3 \
+ --fpr 0.25 \
+ --threads 2 \
+ --output hibf.index
+```
+
+\assignment{Assignment 4: A default HIBF}
+With our small example we cannot see the advantages of the HIBF, and certainly not the effect of changing the
+parameters. So let's not use the small example from above, but the one from the introduction:
+
+```console
+$ tree -L 2 example_data
+example_data
+├── 1024
+│ ├── bins
+│ └── reads
+└── 64
+ ├── bins
+ └── reads
```
+Use the data in the `1024` folder and the two layouts we created in the \ref tutorial_layout tutorial:
+`binning.out` and `binning2.layout`.
+Let's use the HIBF with the default parameters and call the new indexes `hibf.index` and `hibf2.index`.
\note
-For a detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the
-`raptor::hierarchical_interleaved_bloom_filter` API.
+Your k-mer size, number of hash functions, and false positive rate must be the same as in the layout.
-\assignment{Assignment 4: HIBF}
-Lets use the HIBF for our small example. Thus run the example above with a false positive rate of 0.05 and call the new
-index `hibf.index`.
-As our example is small, we will keep the kmer size of 4
\hint
-...
+For the second layout we used a k-mer size of `16`, `3` hash functions, and a false positive rate of `0.1`.
\endhint
\endassignment
\solution
You should have run:
```bash
-raptor build --kmer 4 --hibf --fpr 0.05 --output hibf.index all_paths.txt
+raptor build --hibf binning.out --output hibf.index
+raptor build --hibf binning2.layout --kmer 16 --hash 3 --fpr 0.1 --output hibf2.index
```
-/todo Currently not working: `[Error] The list of input files cannot be empty.`
Your directory should look like this:
```bash
-tmp$ ls -la
-...
+$ ls -la
+... 384B 12 Dez 13:44 ./
+... 544B 12 Dez 12:14 ../
+... 128B 23 Sep 2020 1024/
+... 128B 23 Sep 2020 64/
+... 25K 12 Dez 12:52 all_bin_paths.txt
+... 37K 12 Dez 12:56 binning.out
+... 37K 12 Dez 13:26 binning2.layout
+... 55K 12 Dez 13:26 chopper_sketch.count
+... 32K 12 Dez 12:52 chopper_sketch_sketches/
+... 13M 12 Dez 13:44 hibf.index
+... 7,2M 12 Dez 13:44 hibf2.index
+... 64B 8 Dez 16:24 mini/
```
\endsolution
+
+\note
+For a more detailed explanation of the Hierarchical Interleaved Bloom Filter (HIBF), please refer to the
+`raptor::hierarchical_interleaved_bloom_filter` API.
diff --git a/doc/tutorial/04_search/index.md b/doc/tutorial/04_search/index.md
index b2f0d54b..3cd5e9b8 100644
--- a/doc/tutorial/04_search/index.md
+++ b/doc/tutorial/04_search/index.md
@@ -124,8 +124,6 @@ and apply a thresholding step to determine the membership of a query in a bin.
The term representative indicates that the k-mer content could be transformed by a function which reduces its size and
distribution, e.g. using minimizers.
-\todo Mehr auf das image eingehen.
-
There are also a few parameters for the search, which we will now explain.
Instead of an exact search, we can also allow errors in the query, this we achieve with a positive value of `--error`.
@@ -351,18 +349,31 @@ If no minimizers are used, our thresholding ensures that the (H)IBF gives no fal
there are few false negatives, because the minimizer compression is not lossless.
\assignment{Assignment 4: Search with hibf.}
-We want to use the `hibf.index` from the index tutorial assignment again and use ...
+We want to use the `hibf.index` from the index tutorial assignment again.
-Lets search now with ..., with creating a `search6.output`.
+Let's now search for the queries in `1024/reads/mini.fastq`, writing the results to `search6.output`.
\endassignment
\solution
You should have run
```bash
-raptor search --index hibf.index --query query.fasta --output search6.output
+raptor search --hibf --index hibf.index --query 1024/reads/mini.fastq --output search6.output
```
Your `search6.output` should look like:
```text
+#0 1024/bins/bin_0712.fasta
+#1 1024/bins/bin_0406.fasta
+...
+#1021 1024/bins/bin_0533.fasta
+#1022 1024/bins/bin_0624.fasta
+#QUERY_NAME USER_BINS
+0
+1
...
+1047555
+1047556
```
+\note
+You can also run the search with the second HIBF index (`hibf2.index`) and you will see that the results differ
+slightly (`diff search6.output search7.output`).
\endsolution