Metrics
During an assembly, Autocycler generates a number of metric-containing YAML files. These can be saved to a TSV file using Autocycler table or viewed directly.
This page contains information on all of Autocycler's metrics. See the Autocycler table page for a more concise list of key metrics (the default fields for the autocycler table command).
It can be difficult to generalise about what constitutes 'good' or 'bad' values for many of these metrics, because they depend on the genome being assembled. However, if you are performing many assemblies of the same species, then outlier values could be red flags. For example, imagine that you performed an Autocycler assembly on each of 100 S. aureus genomes, and most of these had a compressed_unitig_count of approximately 2000–4000 unitigs, but one genome produced 10000 unitigs. That might indicate a problem with that genome's data.
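The outlier check described above can be sketched in a few lines of Python. This is only an illustration, not part of Autocycler: the function name, the example values and the threshold of 3 are all assumptions, and the values would in practice come from the compressed_unitig_count metric of each assembly.

```python
from statistics import median

def flag_outliers(values, threshold=3.0):
    """Return the indices of values that lie far from the median.

    Distance is measured in units of the median absolute deviation (MAD).
    The threshold is an arbitrary choice for illustration.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median, so flag any that differ.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if abs(v - med) / mad > threshold]

# Hypothetical compressed_unitig_count values from assemblies of one species.
unitig_counts = [2100, 2500, 3900, 3200, 2800, 10000, 3500]
print(flag_outliers(unitig_counts))
```

A median-based measure is used here because a single extreme value would inflate a mean and standard deviation, which could hide the very outlier being looked for.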
These metrics are created by Autocycler subsample and can be found in subsample.yaml (one file for each assembly):
- input_read_count: the number of reads in the input read set (positive integer).
- input_read_bases: the number of bases in the input read set (positive integer).
- input_read_n50: the N50 read length for the input read set (positive integer). Reads of this length and above contain 50% of the bases in the read set.
- output_reads: the same details (count, bases and N50) for each of the output read sets.
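To make the N50 definition concrete, here is a minimal Python sketch that computes the count, bases and N50 for a list of read lengths. It is not Autocycler's own code: the function name and the example lengths are hypothetical, and the handling of ties is an assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Reads of the returned length and above contain at least half of all bases.
    Returns 0 for an empty collection.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest read to the shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical read lengths, for illustration only.
read_lengths = [100, 200, 300, 400, 500]
read_count = len(read_lengths)   # analogous to input_read_count
read_bases = sum(read_lengths)   # analogous to input_read_bases
read_n50 = n50(read_lengths)     # analogous to input_read_n50
print(read_count, read_bases, read_n50)
```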
These metrics are created by Autocycler compress and can be found in input_assemblies.yaml (one file for each assembly):
- input_assemblies_count: the number of input assemblies used to build Autocycler's compacted De Bruijn graph (positive integer).
- input_assemblies_total_contigs: the total number of contigs in all input assemblies (positive integer).
- input_assemblies_total_length: the sum of the lengths of contigs in all input assemblies (positive integer).
- compressed_unitig_count: the number of unitigs in Autocycler's compacted De Bruijn graph (positive integer).
- compressed_unitig_total_length: the sum of the lengths of unitigs in Autocycler's compacted De Bruijn graph (positive integer).
- input_assembly_details: details for each of the input assemblies:
  - filename: the input assembly filename (string).
  - contigs: details for each of the assembly's contigs:
    - name: the contig's name, i.e. text in the header up to the first space (string).
    - description: the contig's description, i.e. text in the header after the first space (string).
    - length: the number of bases in the contig (positive integer).
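The relationship between the per-assembly details and the three input_assemblies_* totals can be sketched as follows. This is an illustration under assumptions: it treats input_assembly_details as a list of mappings that has already been loaded from the YAML file, and the function name, filenames and contig values are hypothetical.

```python
def summarise_input_assemblies(details):
    """Recompute the input assembly totals from the per-assembly details."""
    # Flatten the contig lengths across every input assembly.
    contig_lengths = [c["length"] for a in details for c in a["contigs"]]
    return {
        "input_assemblies_count": len(details),
        "input_assemblies_total_contigs": len(contig_lengths),
        "input_assemblies_total_length": sum(contig_lengths),
    }

# Hypothetical details for two input assemblies (assumed layout).
details = [
    {"filename": "assembly_a.fasta",
     "contigs": [{"name": "contig_1", "description": "circular=true", "length": 2800000},
                 {"name": "contig_2", "description": "", "length": 25000}]},
    {"filename": "assembly_b.fasta",
     "contigs": [{"name": "contig_1", "description": "", "length": 2801000}]},
]
summary = summarise_input_assemblies(details)
print(summary)
```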
These metrics are created by Autocycler cluster and can be found in clustering/clustering.yaml (one file for each assembly):
- pass_cluster_count: the number of clusters which passed Autocycler cluster's QC (positive integer). This should ideally match the number of sequences in the genome. For example, if a genome has one chromosome and two plasmids, then an ideal value would be 3.
- fail_cluster_count: the number of clusters which failed Autocycler cluster's QC (positive integer). Lower is better, but having some QC-fail clusters is normal and not a cause for concern.
- pass_contig_count: the number of contigs in all of the QC-pass clusters (positive integer). This should ideally be close to the input assembly count times the pass cluster count, i.e. each input assembly produced one contig for each QC-pass cluster. However, it is often smaller, especially if the genome contains small plasmids (which are often omitted in long-read assemblies).
- fail_contig_count: the number of contigs in all of the QC-fail clusters (positive integer).
- pass_contig_fraction: the fraction of contigs which ended up in a QC-pass cluster (floating point from 0–1). Ideally, this value should be high (close to one), which indicates that the input assemblies were high quality and consistent.
- fail_contig_fraction: the fraction of contigs which ended up in a QC-fail cluster (floating point from 0–1). This value and the previous value sum to 1. Ideally, this value should be low (close to zero).
- cluster_balance_score: a value indicating how balanced the clustering was (floating point from 0–1). Ideally, this value should be high (close to one). A perfect score (1) indicates that each input assembly contributed one contig to each cluster. A low score indicates that the input assemblies were uneven: contributing no contigs to some clusters and/or contributing multiple contigs to the same cluster.
- cluster_tightness_score: a value indicating how tight the clustering was (floating point from 0–1). Ideally, this value should be high (close to one), which indicates that the sequences in each cluster are very similar to each other. A lower score indicates that some clusters have diverging sequences. See this Desmos plot for the relationship between a cluster's distance and its tightness score. This metric contains the mean tightness score across all clusters.
- overall_clustering_score: a mean of the previous two scores: balance and tightness (floating point from 0–1). Ideally, this value should be high (close to one), which indicates that the input assemblies were consistent and clustered well.
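The arithmetic relating these clustering metrics can be sketched as below. This is a hedged illustration rather than Autocycler's implementation: the function name and input values are hypothetical, and the use of an arithmetic mean for the overall score is an assumption (the text above only says 'a mean').

```python
def clustering_summary(input_assemblies, pass_clusters,
                       pass_contigs, fail_contigs, balance, tightness):
    """Derive the fraction and overall-score metrics from the counts and scores."""
    total = pass_contigs + fail_contigs
    return {
        # Ideal case: every input assembly gives one contig per QC-pass cluster.
        "ideal_pass_contig_count": input_assemblies * pass_clusters,
        "pass_contig_fraction": pass_contigs / total if total else 0.0,
        "fail_contig_fraction": fail_contigs / total if total else 0.0,
        # Assumption: arithmetic mean of the balance and tightness scores.
        "overall_clustering_score": (balance + tightness) / 2,
    }

# Hypothetical values: 16 input assemblies, a chromosome plus two plasmids.
result = clustering_summary(input_assemblies=16, pass_clusters=3,
                            pass_contigs=45, fail_contigs=5,
                            balance=0.9, tightness=0.8)
print(result)
```

In this made-up example the ideal pass contig count would be 48, so a pass_contig_count of 45 suggests that a few input assemblies missed one of the sequences, which the text above notes is common for small plasmids.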
These metrics are created by Autocycler cluster and can be found in clustering/qc*/cluster_*/1_untrimmed.yaml (one file for each cluster):
- untrimmed_cluster_size: the number of sequences in the cluster (positive integer). Ideally, this will be close to the number of input assemblies (i.e. each input assembly contributes one sequence to the cluster).
- untrimmed_cluster_lengths: the lengths of all sequences in the cluster (list of positive integers). Ideally, these will all be close to each other, but since no trimming has yet occurred, there may be outliers due to overlaps.
- untrimmed_cluster_median: the median sequence length for the cluster before trimming (positive integer).
- untrimmed_cluster_mad: the median absolute deviation of sequence length for the cluster before trimming (positive integer). Lower is better, as this indicates consistent sequence lengths within the cluster.
- untrimmed_cluster_distance: the maximum pairwise distance between sequences for the cluster. Lower is better, as this indicates tighter clusters.
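As a worked example of the median and median absolute deviation (MAD) of cluster sequence lengths, consider the following Python sketch. The function name and lengths are hypothetical, and how Autocycler rounds these values to integers (e.g. for an even number of sequences) is not covered here.

```python
from statistics import median

def median_and_mad(lengths):
    """Return the median length and the median absolute deviation from it."""
    med = median(lengths)
    mad = median(abs(x - med) for x in lengths)
    return med, mad

# Hypothetical untrimmed sequence lengths; the last one is inflated by overlap.
cluster_lengths = [5000, 5010, 4990, 5005, 7500]
cluster_median, cluster_mad = median_and_mad(cluster_lengths)
print(cluster_median, cluster_mad)
```

The single long sequence barely affects either value, which is why a low MAD is a useful sign of consistent lengths even when an untrimmed cluster contains an overlap-inflated outlier.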
These metrics are created by Autocycler trim and can be found in clustering/qc*/cluster_*/2_trimmed.yaml (one file for each cluster):
- trimmed_cluster_size: the number of sequences in the cluster (positive integer). Ideally, this will be close to the number of input assemblies (i.e. each input assembly contributes one sequence to the cluster), but since Autocycler trim discards outlier sequences, it will often be lower than the untrimmed_cluster_size metric.
- trimmed_cluster_lengths: the lengths of all sequences in the cluster (list of positive integers). Ideally, these will all be close to each other.
- trimmed_cluster_median: the median sequence length for the cluster after trimming (positive integer).
- trimmed_cluster_mad: the median absolute deviation of sequence length for the cluster after trimming (positive integer). Lower is better, as this indicates consistent sequence lengths within the cluster. Due to trimming and discarding of outliers, this will likely be lower than untrimmed_cluster_mad.
These metrics are created by Autocycler combine and can be found in consensus_assembly.yaml (one file for each assembly):
- consensus_assembly_bases: the total number of bases in the consensus assembly (positive integer).
- consensus_assembly_unitigs: the total number of unitigs in the consensus assembly (positive integer). Ideally, this will have the same value as pass_cluster_count, and if so, the following metric will be 'true'.
- consensus_assembly_fully_resolved: whether or not each cluster has resolved to a single unitig (boolean). Will be 'true' if all clusters consist of only one unitig, 'false' if any of the clusters have more than one unitig.
- consensus_assembly_clusters: details for each of the clusters:
  - length: the number of bases in the cluster (positive integer).
  - unitigs: the number of unitigs in the cluster (positive integer), ideally 1.
  - topology: the large-scale structure of the cluster. Will have one of the following values: 'circular' (one unitig with a circularising link), 'linear_blunt_blunt' (one unitig with two blunt ends, i.e. no links), 'linear_blunt_hairpin' (one unitig with a hairpin link on one end), 'linear_hairpin_hairpin' (one unitig with hairpin links on both ends), 'fragmented' (more than one unitig), 'other' (one unitig with unusual links, e.g. both circularising and hairpin). See the Linear sequences page for more info.
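The way the per-cluster details relate to the assembly-level consensus metrics can be sketched as follows. This is an illustration under assumptions: it treats consensus_assembly_clusters as a list of mappings already loaded from the YAML file, assumes the assembly-level totals are simple sums over clusters, and uses a hypothetical function name and example values.

```python
def summarise_consensus(clusters):
    """Derive assembly-level consensus metrics from per-cluster details."""
    return {
        # Assumption: assembly totals are sums over the clusters.
        "consensus_assembly_bases": sum(c["length"] for c in clusters),
        "consensus_assembly_unitigs": sum(c["unitigs"] for c in clusters),
        # Fully resolved only if every cluster is a single unitig.
        "consensus_assembly_fully_resolved": all(c["unitigs"] == 1 for c in clusters),
    }

# Hypothetical clusters: a chromosome, a plasmid and an unresolved plasmid.
clusters = [
    {"length": 2800000, "unitigs": 1, "topology": "circular"},
    {"length": 25000, "unitigs": 1, "topology": "circular"},
    {"length": 4200, "unitigs": 3, "topology": "fragmented"},
]
consensus = summarise_consensus(clusters)
print(consensus)
```

In this made-up example the third cluster has three unitigs, so the unitig total (5) exceeds the cluster count (3) and the assembly is not fully resolved.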