
Autocycler compress

Ryan Wick edited this page Jan 20, 2025 · 26 revisions

Basics

The autocycler compress command uses your input assemblies to build a compacted De Bruijn graph (a.k.a. a unitig graph). This serves a few purposes. First, since the input assemblies are all quite similar (they are alternative assemblies of the same genome), the graph structure compresses them into a much smaller file. Second, the graph remembers the path of each input sequence, which means that the input assemblies can be fully recovered from this graph (see Autocycler decompress). Third, the graph structure provides a convenient way to manipulate the sequences in later steps, e.g. trimming overlap with Autocycler trim or building a consensus with Autocycler resolve.
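To make the compression idea concrete, here is a minimal Python sketch that counts how many distinct k-mers a set of near-identical sequences contains. It is not Autocycler's implementation (which builds a compacted graph, handles both strands and stores a path for each input sequence). The function name, the toy sequences and the small k are all assumptions made for illustration.

```python
def distinct_kmers(seqs, k):
    """Return the set of distinct k-mers across all sequences (forward strand, linear)."""
    kmers = set()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers

# Three hypothetical "assemblies" of the same toy sequence; the third has one substitution.
genome = "ACGTTGCAAGGCTTACCGATAGGCTA"
assemblies = [genome, genome, genome[:10] + "T" + genome[11:]]

k = 5  # tiny k for a tiny example; Autocycler's default is 51
total = sum(len(s) - k + 1 for s in assemblies)
unique = len(distinct_kmers(assemblies, k))
print(f"{total} k-mers in total, {unique} distinct")
```

Because the sequences are nearly identical, most k-mers are shared, so the number of distinct k-mers grows far more slowly than the total amount of input sequence.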

Example command

autocycler compress -i assemblies -a autocycler_out

This command takes an input directory named assemblies, which contains the input assemblies in FASTA format. It will create the autocycler_out directory, which will contain the graph as a file named input_assemblies.gfa.

Full usage

Usage: autocycler compress [OPTIONS] --assemblies_dir <ASSEMBLIES_DIR> --autocycler_dir <AUTOCYCLER_DIR>

Options:
  -i, --assemblies_dir <ASSEMBLIES_DIR>  Directory containing input assemblies (required)
  -a, --autocycler_dir <AUTOCYCLER_DIR>  Autocycler directory to be created (required)
      --kmer <KMER>                      K-mer size for De Bruijn graph [default: 51]
      --max_contigs <MAX_CONTIGS>        refuse to run if mean contigs per assembly exceeds this value
                                         [default: 25]
  -t, --threads <THREADS>                Number of CPU threads [default: 8]
  -h, --help                             Print help
  -V, --version                          Print version

Notes

  • While different k-mer sizes will produce different unitig graphs, the final result of Autocycler is not sensitive to this parameter, so there is usually no need to change it from the default value.
  • Since Autocycler decompress can be used to fully restore the input assemblies (with a couple caveats, see the Autocycler decompress page), you can delete the input assemblies after running Autocycler compress.
  • The --max_contigs option exists to catch obviously bad input data. If the mean number of contigs per input assembly exceeds this value (default of 25), Autocycler compress will refuse to run and display an error message. For example, if you give Autocycler 10 input assemblies with a total of 1000 contigs, that is an average of 100 contigs per assembly, which almost certainly means that they are fragmented or contaminated and thus not appropriate for Autocycler.
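The --max_contigs check described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Autocycler's code: the function names are hypothetical, and contigs are counted simply as FASTA header lines.

```python
def count_contigs(fasta_text):
    """Count contigs in FASTA text by counting header lines (those starting with '>')."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

def mean_contigs(contig_counts):
    """Mean number of contigs per assembly, given one count per input assembly."""
    return sum(contig_counts) / len(contig_counts)

def passes_max_contigs(contig_counts, max_contigs=25):
    """Return True if the mean does not exceed max_contigs (25 is the documented default)."""
    return mean_contigs(contig_counts) <= max_contigs

# The example from the note: 10 assemblies with a total of 1000 contigs.
counts = [100] * 10
print(mean_contigs(counts), passes_max_contigs(counts))
```

Note that the check is on the mean, so a single fragmented assembly among many clean ones will not necessarily trigger it.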

Toy example

To create a simple example for Autocycler's documentation, I will use the following sequences: a.fasta, b.fasta, c.fasta and d.fasta. These are alternative assemblies of a toy genome, where the true genome has two circular sequences: 300 bp and 100 bp. This same toy example is continued on the pages for subsequent steps.

None of the four input assemblies are exactly correct. They contain the following errors:

  • a.fasta errors:
    • 6-bp deletion in contig a1
    • 18-bp inversion in contig a1
    • 1-bp insertion in contig a2
  • b.fasta errors:
    • contigs b1 and b2 failed to circularise cleanly (contain excess sequence creating start-end overlaps)
    • 1-bp substitution in contig b2
  • c.fasta errors:
    • two close 1-bp substitutions in contig c1
    • does not contain a contig for the smaller sequence in the genome
  • d.fasta errors:
    • contigs d1 and d2 failed to circularise cleanly (missing sequence creating start-end gaps)
    • 1-bp substitution in contig d1
    • contains an additional erroneous contig

Running autocycler compress on all four input assemblies produces a unitig graph:

[Figure: autocycler compress]

Note that since these toy sequences are unrealistically small, I used --kmer 13 to reduce the k-mer size.

The unitig graph contains structures for two main reasons. First, repeats collapse together, similar to what occurs in a short-read genome assembly. This happens because most genomes contain repeats longer than the k-mer size used to build the graph; these repeats will be resolved in a later step. Second, the input assemblies are not exactly the same, and their differences create bubbles.
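Both sources of graph structure can be seen at the k-mer level. The following Python sketch uses two made-up sequences (not the toy assemblies from this page) and a very small k. A k-mer that occurs more than once within a sequence corresponds to a collapsed repeat, and k-mers private to one sequence correspond to one side of a bubble. It only inspects k-mers and does not build a graph.

```python
from collections import Counter

def kmers(seq, k):
    """List every k-mer in seq (forward strand, linear sequence)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

k = 4
# Hypothetical sequences: 'GATTACA' occurs twice in each (a repeat longer than k),
# and alt differs from ref by a single substitution between the two repeat copies.
ref = "CCGATTACATTGGATTACAGC"
alt = "CCGATTACATAGGATTACAGC"

# K-mers seen more than once in ref would collapse into shared nodes.
counts = Counter(kmers(ref, k))
repeated = {kmer for kmer, n in counts.items() if n > 1}

# K-mers private to one sequence form the two branches of a bubble.
only_ref = set(kmers(ref, k)) - set(kmers(alt, k))
only_alt = set(kmers(alt, k)) - set(kmers(ref, k))

print("collapsed repeat k-mers:", sorted(repeated))
print("bubble branch (ref):", sorted(only_ref))
print("bubble branch (alt):", sorted(only_alt))
```

In this example the single substitution affects exactly k k-mers in each sequence, which is why even a 1-bp difference produces a visible bubble.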

[Figure: unitig graph features]

Each of the eight input contigs follows a path (saved as a path line in the GFA file) through this graph:

[Figure: input sequences paths]
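For readers who want to inspect these paths programmatically, here is a minimal Python sketch that reads path lines from GFA text. It assumes standard GFA version 1 path lines (tab-separated, starting with P, with a comma-separated list of oriented segment names in the third field). The function name and the example snippet are made up and do not show Autocycler's actual output.

```python
def parse_gfa_paths(gfa_text):
    """Return {path_name: [(segment_name, orientation), ...]} from GFA 'P' lines."""
    paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "P" or len(fields) < 3:
            continue  # skip header, segment, link and malformed lines
        name, segments = fields[1], fields[2]
        # Each step is a segment name followed by '+' or '-'.
        paths[name] = [(s[:-1], s[-1]) for s in segments.split(",") if s]
    return paths

# Made-up GFA snippet; segment and path names are illustrative only.
example = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGTACGTACGTA",
    "S\t2\tTTGCAAGGCTTAC",
    "L\t1\t+\t2\t-\t0M",
    "P\tcontig_x\t1+,2-\t*",
])
for name, steps in parse_gfa_paths(example).items():
    print(name, steps)
```

To try this on a real graph, the text of input_assemblies.gfa could be read from the Autocycler directory and passed to the same function, assuming its path lines follow the layout described above.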

When looking at the graph, you might wonder why there are simple linear paths which could in principle be merged to simplify the graph's structure. These breaks provide places for each input contig to start and end. The paths shown above highlight this for this example graph.
