ADAS:Advanced Database Search for Long Sequences

This crate (currently in development) is designed for searching long sequence databases, especially biological sequences. The core idea is from innovative applications of Minimizer, MinHash, Coreset and hierarchical navigable small world graphs (HNSW) algorithms.

Schematic overview of ADAS

Here we simply describe the algorithm:

build HNSW graph for all sequences as a database, the underlying algorithm will be Order MinHash, an LSH for Edit distance.
Search new sequences against the pre-built database. The same Order MinHash algorithm will be used.
For closest sequences to query seqeunces (e.g., top 100), seed-chain-exend and adaptive banded dynamic progrmming will be performed.

Install

Install from bioconda

conda install -c bioconda -c conda-forge adas
adas-build -h
adas-search -h

compile from source

wget https://github.com/jianshu93/adas/releases/download/v0.1.2/adas-Linux-x86-64_v0.1.2.zip
unzip adas-Linux-x86-64_v0.1.1.zip
chmod a+x ./adas-*
./adas-build -h

### Install Rust first if you do not have it
### For Linux
git clone https://github.com/jianshu93/adas
cd adas
cargo build --release
cd target/release


### for MacOS
### Install homebrew first here: https://brew.sh
brew install gcc
git clone https://github.com/jianshu93/adas
cd adas
CC="$(brew --prefix)/bin/gcc-14" CXX="$(brew --prefix)/bin/g++-14" cargo build --release
cd target/release

Usage

build HNSW database

adas-build -h

 ************** initializing logger *****************

MinHash sketching and Hierarchical Navigable Small World Graphs (HNSW) building for Long Sequences

Usage: adas-build [OPTIONS] --input <FASTA_FILE>

Options:
  -i, --input <FASTA_FILE>                    Input FASTA file
  -k, --kmer-size <KMER_SIZE>                 Size of k-mers, must be ≤14 [default: 8]
  -s, --sketch-size <SKETCH_SIZE>             Size of the sketch [default: 512]
  -t, --threads <THREADS>                     Number of threads for sketching [default: 1]
      --hnsw-capacity <HNSW_CAPACITY>         HNSW capacity parameter [default: 50000000]
      --hnsw-ef <HNSW_EF>                     HNSW ef parameter [default: 1600]
      --max_nb_connection <HNSW_MAX_NB_CONN>  HNSW max_nb_conn parameter [default: 256]
      --scale_modify_f <scale_modify>         scale modification factor in HNSW or HubNSW, must be in [0.2,1] [default: 1.0]
  -h, --help                                  Print help
  -V, --version                               Print version

Search pre-built HNSW database

adas-search -h

 ************** initializing logger *****************

Search Query Sequences against Pre-built Hierarchical Navigable Small World Graphs (HNSW) Index

Usage: adas-search [OPTIONS] --input <FASTA_FILE> --nbng <NB_SEARCH_ANSWERS> --hnsw <DATADIR>

Options:
  -i, --input <FASTA_FILE>        Input FASTA file
  -n, --nbng <NB_SEARCH_ANSWERS>  Number of search answers [default: 128]
  -b, --hnsw <DATADIR>            directory contains pre-built HNSW database files
  -t, --threads <THREADS>         Number of threads for sketching [default: 1]
  -h, --help                      Print help
  -V, --version                   Print version

Insert new sequences into HNSW database

./adas-insert -h
 ************** initializing logger *****************

Insert into Pre-built Hierarchical Navigable Small World Graphs (HNSW) Index

Usage: adas-insert [OPTIONS] --input <FASTA_FILE> --hnsw <DATADIR>

Options:
  -i, --input <FASTA_FILE>  Input FASTA file
  -b, --hnsw <DATADIR>      directory contains pre-built HNSW database files
  -t, --threads <THREADS>   Number of threads for sketching [default: 1]
  -h, --help                Print help
  -V, --version             Print version

Approximate alignment via chaining

adas-chain -h
Long Reads Alignment via Anchor Chaining

Usage: adas-chain [OPTIONS] --reference <REFERENCE_FASTA> --query <QUERY_FASTA> --output <OUTPUT_PATH>

Options:
  -r, --reference <REFERENCE_FASTA>  Reference FASTA file
  -q, --query <QUERY_FASTA>          Query FASTA file
  -t, --threads <THREADS>            Number of threads (default 1) [default: 1]
  -o, --output <OUTPUT_PATH>         Output path to write the results
  -h, --help                         Print help
  -V, --version                      Print version

Extrac closest seqeunces (or neighbors) for each sequence in a pre-build database

 ************** initializing logger *****************

Extract K Nearest Neighbors (K-NN) from HNSW graph, printing actual sequence IDs.

Usage: adas-knn [OPTIONS] --hnsw <DATADIR> --output <OUTPUT_PATH>

Options:
  -b, --hnsw <DATADIR>             Directory containing pre-built HNSW database files
  -o, --output <OUTPUT_PATH>       Output path to write the neighbor list (sequence IDs)
  -n, --k-nearest-neighbors <KNN>  Number of k-nearest-neighbors to extract [default: 32]
  -h, --help                       Print help
  -V, --version                    Print version

use real-world data

### build graph database from sequences, output in current folder (5 files)
./target/release/adas-build -i ./data/SAR11_cluster_centroid.fa -k 8 -s 128 -t 8 --max_nb_connection 128 --hnsw-ef 800 --scale_modify_f 0.25

### search query against per-built sequence database
./target/release/adas-search -i ./data/query.fasta -b . -n 50

### Insert new sequences into pre-built graph database, e.g., when there are new sequences to be added to the database. Current graph database files will be updated in current folder
./target/release/adas-insert -i ./data/test_16S_SAR11.fa -b . -t 8 

### extrac nearest sequences for each seqeunce in the database. distnance is Jaccard distance
./target/release/adas-knn -b . -n 32 -o adas.knn.txt

### Perform read alignment/overlap via seed-chain-extension, as in minimap2 (default overlap)
./target/release/adas-chain -q ./data/query.fasta -r ./data/SAR11_cluster_centroid.fa -t 8 -o chain.paf

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
benchmarking		benchmarking
data		data
src		src
.DS_Store		.DS_Store
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE-MIT		LICENSE-MIT
README.md		README.md
adas.jpg		adas.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ADAS:Advanced Database Search for Long Sequences

Schematic overview of ADAS

Install

Install from bioconda

compile from source

Usage

use real-world data

About

Releases 3

Packages

Contributors 2

Languages

License

jianshu93/adas

Folders and files

Latest commit

History

Repository files navigation

ADAS:Advanced Database Search for Long Sequences

Schematic overview of ADAS

Install

Install from bioconda

compile from source

Usage

use real-world data

About

Resources

License

Stars

Watchers

Forks

Releases 3

Packages 0

Contributors 2

Languages

Packages