Conference paper. Year: 2022

Toward Optimal Fingerprint Indexing for Large Scale Genomics

Abstract

Motivation. To keep up with the scale of genomic databases, several methods rely on locality-sensitive hashing to efficiently find potential matches within large genome collections. Existing solutions rely on MinHash or HyperLogLog fingerprints and require reading the whole index to perform a query. Such solutions cannot be considered scalable as the number of documents to index grows. Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming the state of the art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of HyperMinHash fingerprints into (h,m)-HMH fingerprints, which can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure based on inverted indexes that can index any kind of fingerprint and provides optimal queries, namely linear in the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than the state of the art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.
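
The inverted-index idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the NIQKI implementation: it assumes a plain partitioned MinHash (one truncated minimum hash per bucket) in place of (h,m)-HMH fingerprints, and the names `sketch`, `InvertedIndex`, `buckets`, and `bits` are hypothetical. What it shows is that a query only visits the posting lists matching its own fingerprints, so the work is driven by the number of hits rather than by the size of the whole index.

```python
import hashlib
from collections import defaultdict


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, buckets=16, bits=8):
    """Partitioned MinHash: one small fingerprint per bucket, None if empty."""
    mins = [None] * buckets
    for i in range(len(seq) - k + 1):
        h = hash64(seq[i:i + k])
        pos, value = h % buckets, h // buckets
        if mins[pos] is None or value < mins[pos]:
            mins[pos] = value
    # Keep only the lowest `bits` bits of each minimum as the fingerprint.
    return [None if v is None else v % (1 << bits) for v in mins]


class InvertedIndex:
    """For each bucket position, map a fingerprint value to the genomes having it."""

    def __init__(self, buckets=16):
        self.postings = [defaultdict(list) for _ in range(buckets)]
        self.names = []

    def add(self, name, fingerprints):
        genome_id = len(self.names)
        self.names.append(name)
        for pos, fp in enumerate(fingerprints):
            if fp is not None:
                self.postings[pos][fp].append(genome_id)

    def query(self, fingerprints, min_hits=1):
        """Count shared fingerprints per genome, touching only matching lists."""
        hits = defaultdict(int)
        for pos, fp in enumerate(fingerprints):
            if fp is None:
                continue
            for genome_id in self.postings[pos].get(fp, ()):
                hits[genome_id] += 1
        return {self.names[g]: c for g, c in hits.items() if c >= min_hits}


# Toy usage with made-up sequences.
index = InvertedIndex(buckets=16)
index.add("genome_1", sketch("ACGTACGTTGCAAGCTTAGGCTAACGTTAGC"))
index.add("genome_2", sketch("GGGCCCATATATCGCGATTTAGGACCA"))
matches = index.query(sketch("ACGTACGTTGCAAGCTTAGGCTAACGTTAGC"))
```

The number of shared fingerprints returned by `query` can then be turned into a similarity estimate by dividing by the number of buckets. That normalization is a common convention for MinHash-style sketches and is an assumption here, not a detail taken from the abstract.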
Main file
LIPIcs-WABI-2022-25.pdf (1.28 MB)
Origin: Publication funded by an institution

Dates and versions

hal-04121819, version 1 (08-06-2023)

License

Attribution

Identifiers

Cite

Clément Agret, Bastien Cazaux, Antoine Limasset. Toward Optimal Fingerprint Indexing for Large Scale Genomics. WABI 2022 - 22nd International Workshop on Algorithms in Bioinformatics, Sep 2022, Potsdam, Germany. pp.25:1-25:15, ⟨10.4230/LIPIcs.WABI.2022.25⟩. ⟨hal-04121819⟩