Autocycler compress
The `autocycler compress` command uses your input assemblies to build a compacted De Bruijn graph (a.k.a. a unitig graph). This serves a few purposes. First, since the input assemblies are all quite similar (they are alternative assemblies of the same genome), the graph structure compresses them into a much smaller file. Second, the graph remembers the path of each input sequence, which means that the input assemblies can be fully recovered from this graph (see Autocycler decompress). Third, the graph structure provides a convenient way to manipulate the sequences in later steps, e.g. trimming overlap with Autocycler trim or building a consensus with Autocycler resolve.
autocycler compress -i assemblies -a autocycler_out
This command takes an input directory named `assemblies`, which contains the input assemblies in FASTA format. It will create the `autocycler_out` directory, which will contain the graph as a file named `input_assemblies.gfa`.
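To illustrate why similar assemblies compress well in a k-mer based graph, here is a minimal Python sketch. It is not Autocycler's implementation: it ignores reverse complements and unitig compaction, and the sequences and function names are made up for illustration. It only compares the total number of k-mer occurrences with the number of distinct k-mers, which is roughly what a De Bruijn graph has to store.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a linear sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_sharing(assemblies, k):
    """Return (total k-mer occurrences, number of distinct k-mers) over all assemblies."""
    total = sum(max(len(seq) - k + 1, 0) for seq in assemblies)
    distinct = set()
    for seq in assemblies:
        distinct |= kmers(seq, k)
    return total, len(distinct)


# Made-up toy data: two identical "assemblies" plus one with a single substitution.
reference = "ACGTTGCATCAGGCTTAACGGATCCGTA"
variant = reference[:14] + "G" + reference[15:]  # T -> G at position 14 (0-based)
total, distinct = kmer_sharing([reference, reference, variant], k=5)
print(f"{total} k-mer occurrences, {distinct} distinct k-mers")
```

In this toy case, 72 k-mer occurrences collapse to 29 distinct k-mers, because the three sequences share everything except the k-mers that span the substitution.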
Usage: autocycler compress [OPTIONS] --assemblies_dir <ASSEMBLIES_DIR> --autocycler_dir <AUTOCYCLER_DIR>
Options:
-i, --assemblies_dir <ASSEMBLIES_DIR> Directory containing input assemblies (required)
-a, --autocycler_dir <AUTOCYCLER_DIR> Autocycler directory to be created (required)
--kmer <KMER> K-mer size for De Bruijn graph [default: 51]
--max_contigs <MAX_CONTIGS> refuse to run if mean contigs per assembly exceeds this value
[default: 25]
-t, --threads <THREADS> Number of CPU threads [default: 8]
-h, --help Print help
-V, --version Print version
- While different k-mer sizes will produce different unitig graphs, the final result of Autocycler is not sensitive to this parameter, so there is usually no need to change it from the default value.
- Since Autocycler decompress can be used to fully restore the input assemblies (with a couple caveats, see the Autocycler decompress page), you can delete the input assemblies after running Autocycler compress.
- The `--max_contigs` option exists to catch obviously bad input data. If the mean number of contigs per input assembly exceeds this value (default of 25), Autocycler compress will refuse to run and display an error message. For example, if you give Autocycler 10 input assemblies with a total of 1000 contigs, that is an average of 100 contigs per assembly, which almost certainly means that they are fragmented or contaminated and thus not appropriate for Autocycler.
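The `--max_contigs` check described above can be approximated before running the tool. The following is a rough Python sketch, assuming that one FASTA header line (starting with `>`) corresponds to one contig. The function names are hypothetical and this is not Autocycler's own code.

```python
def count_contigs(fasta_text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def mean_contigs_per_assembly(contig_counts):
    """Mean number of contigs over all input assemblies."""
    return sum(contig_counts) / len(contig_counts)


def exceeds_max_contigs(contig_counts, max_contigs=25):
    """True if the mean contig count is above the threshold (default 25)."""
    return mean_contigs_per_assembly(contig_counts) > max_contigs


# The example from the text: 10 assemblies with 1000 contigs in total.
fragmented = [100] * 10
print(mean_contigs_per_assembly(fragmented), exceeds_max_contigs(fragmented))
```

To apply it to real files, read each FASTA file in the assemblies directory as text and pass it to `count_contigs`.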
To create a simple example for Autocycler's documentation, I will use the following sequences: `a.fasta`, `b.fasta`, `c.fasta` and `d.fasta`. These are alternative assemblies of a toy genome, where the true genome has two circular sequences: 300 bp and 100 bp. This same toy example is continued on the pages for subsequent steps.
None of the four input assemblies are exactly correct. They contain the following errors:
- `a.fasta` errors:
  - 6-bp deletion in contig `a1`
  - 18-bp inversion in contig `a1`
  - 1-bp insertion in contig `a2`
- `b.fasta` errors:
  - contigs `b1` and `b2` failed to circularise cleanly (contain excess sequence creating start-end overlaps)
  - 1-bp substitution in contig `b2`
- `c.fasta` errors:
  - two close 1-bp substitutions in contig `c1`
  - does not contain a contig for the smaller sequence in the genome
- `d.fasta` errors:
  - contigs `d1` and `d2` failed to circularise cleanly (missing sequence creating start-end gaps)
  - 1-bp substitution in contig `d1`
  - contains an additional erroneous contig
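To make the "start-end overlap" error concrete, the sketch below finds the longest exact overlap between the end and the start of a contig. It is illustration only: the function name and the sequence are made up, it requires an exact match, and it is not how Autocycler itself detects or trims overlaps.

```python
def start_end_overlap(seq):
    """Length of the longest proper suffix of seq that exactly equals a prefix of seq."""
    for n in range(len(seq) - 1, 0, -1):
        if seq[:n] == seq[-n:]:
            return n
    return 0


# Made-up circular sequence, written out linearly with 3 bp of excess sequence:
# the final "ACG" repeats the first three bases.
true_circle = "ACGTTGCA"
with_excess = true_circle + true_circle[:3]
overlap = start_end_overlap(with_excess)
print(overlap, with_excess[:len(with_excess) - overlap])
```

Removing the overlapping bases from the end recovers the original circular sequence in this toy case.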
Running `autocycler compress` on all four input assemblies produces a unitig graph. Note that since these toy sequences are unrealistically small, I used `--kmer 13` to reduce the k-mer size.
The unitig graph contains structures for two main reasons. First, repeats collapse together, similar to what occurs in a short-read genome assembly. This is because most genomes will contain repeats longer than the k-mer size used to build the graph, and these repeats will be resolved in a later step. The second reason for graph structure is the fact that the input assemblies are not exactly the same, and the differences create bubbles.
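The way a sequence difference produces a bubble can be seen in a small, uncompacted De Bruijn graph. The sketch below is a simplification under stated assumptions: it uses the forward strand only, does not merge nodes into unitigs, and uses made-up sequences and function names. It builds k-mer nodes joined by (k-1)-base overlaps, then reports where the graph forks and rejoins.

```python
from collections import defaultdict


def de_bruijn_edges(sequences, k):
    """Build successor and predecessor maps between consecutive k-mers."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k):
            a, b = seq[i:i + k], seq[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)
    return succ, pred


def branch_points(succ, pred):
    """Return (forks, joins): nodes with more than one successor or predecessor."""
    forks = sorted(node for node, nxt in succ.items() if len(nxt) > 1)
    joins = sorted(node for node, prv in pred.items() if len(prv) > 1)
    return forks, joins


# Made-up toy data: two sequences that differ by a single substitution.
seq_1 = "ACGTTGCATCAGGCTTAACGGATCCGTA"
seq_2 = seq_1[:14] + "G" + seq_1[15:]  # T -> G at position 14 (0-based)
succ, pred = de_bruijn_edges([seq_1, seq_2], k=5)
forks, joins = branch_points(succ, pred)
print("fork at", forks, "rejoin at", joins)
```

The single substitution opens one bubble: the graph forks at the last k-mer before the difference and rejoins at the first k-mer after it, with k nodes on each branch in between.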
Each of the eight input contigs follows a path (saved as a path line in the GFA file) through this graph.
When looking at the graph, you might wonder why there are simple linear paths which could in principle be merged to simplify the graph's structure. These breaks provide places for each input contig to start and end. The paths shown above highlight this for this example graph.
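The per-contig paths can be read back from the GFA file with a few lines of Python. This sketch assumes the paths are stored as standard GFA 1 `P` records (tab-separated: record type, path name, comma-separated segment names each followed by `+` or `-`, then overlaps). The toy GFA text and the function name are invented for illustration and are not taken from a real `input_assemblies.gfa`.

```python
def parse_gfa_paths(gfa_text):
    """Map each path name to a list of (segment_name, orientation) steps."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "P" or len(fields) < 3:
            continue  # skip headers, segments, links and malformed lines
        steps = [(step[:-1], step[-1]) for step in fields[2].split(",") if step]
        paths[fields[1]] = steps
    return paths


# Made-up toy GFA: three segments and two paths that share segment 1.
toy_gfa = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGTT",
    "S\t2\tGGA",
    "S\t3\tTTC",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t-\t0M",
    "P\tcontig_a\t1+,2+\t*",
    "P\tcontig_b\t1+,3-\t*",
])
paths = parse_gfa_paths(toy_gfa)
for name, steps in paths.items():
    print(name, steps)
```

Each path starts and ends on a segment boundary, which is consistent with the breaks in otherwise linear stretches of the graph described above.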
- Step 1: Autocycler subsample
- Step 2: Generating input assemblies
- Step 3: Autocycler compress
- Step 4: Autocycler cluster
- Step 5: Autocycler trim
- Step 6: Autocycler resolve
- Step 7: Autocycler combine