
Research Project: MaizeGDB: Enabling Access to Basic, Translational, and Applied Research Information

Location: Corn Insects and Crop Genetics Research

Title: ABRIDGE: An ultra-compression software for SAM alignment files

Author
BANERJEE, SAGNIK - Iowa State University
Andorf, Carson

Submitted to: bioRxiv
Publication Type: Pre-print Publication
Publication Acceptance Date: 1/4/2022
Publication Date: 1/5/2022
Citation: Banerjee, S., Andorf, C.M. 2022. ABRIDGE: An ultra-compression software for SAM alignment files. bioRxiv. https://doi.org/10.1101/2022.01.04.474935.
DOI: https://doi.org/10.1101/2022.01.04.474935

Interpretive Summary: High-throughput sequencing has become an essential tool for understanding a wide range of biological problems. These large sets of sequences enable whole-genome assembly and provide insights into how genes are expressed. However, the influx of sequence data has caused an increase in storage demands and a need for tools to make data transfer and data handling less cumbersome. To assist in this endeavor, we developed ABRIDGE, a compression tool for sequence alignments offering a wide range of options. Benchmarked against similar software, ABRIDGE had the best compression ratio with the lowest space demand and fastest file transmission speeds. Central to the software is a novel algorithm that retains non-redundant information. ABRIDGE can be adopted with existing workflows and pipelines, facilitating research across a wide variety of domains.

Technical Abstract: Advances in technology have enabled sequencing machines to produce vast amounts of genetic data, increasing storage demands. Most genomic software uses read alignments for purposes such as transcriptome assembly and gene count estimation. Herein we present ABRIDGE, a state-of-the-art compressor for SAM alignment files that offers users both lossless and lossy compression options. This reference-based file compressor achieves the best compression ratio among the compression software it was benchmarked against, ensuring lower space demand and faster file transmission. Central to the software is a novel algorithm that retains only non-redundant information. This approach has allowed ABRIDGE to achieve a compression ratio 16% higher than the second-best compressor for RNA-Seq reads and over 35% higher for DNA-Seq reads. ABRIDGE also lets users randomly access specific locations without decompressing the entire file. ABRIDGE is distributed under the MIT license and can be obtained from GitHub, Conda, and Docker Hub. We anticipate that the user community will adopt ABRIDGE within their existing pipelines, encouraging further research in this domain.
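
The abstract does not spell out ABRIDGE's algorithm, so the following is only a minimal sketch of the general idea behind reference-based compression: when the reference sequence is available at decompression time, an aligned read can be stored as its position plus the bases that differ from the reference. The function names, the record layout, and the toy sequences are hypothetical, and the sketch assumes ungapped alignments (no insertions, deletions, or clipping); it is not ABRIDGE's actual format.

```python
from typing import List, Tuple

# Hypothetical record layout (an assumption for illustration, not ABRIDGE's format):
# (0-based start position, read length, list of (offset, read base) mismatches)
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, pos: int, read: str) -> Encoded:
    """Keep only what the reference cannot supply: the start position,
    the read length, and the bases that differ from the reference."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return pos, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read by copying the reference window and
    re-applying the stored mismatches."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Toy data (made up for this example).
reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
read = "CGTTAGCAGATAG"  # aligns at 0-based position 5 with one mismatch

encoded = encode_read(reference, 5, read)
restored = decode_read(reference, encoded)
```

In this toy case the 13-base read reduces to a position, a length, and a single mismatch, and decoding restores the read exactly. A real SAM compressor also has to handle CIGAR operations, quality scores, read names, and optional tags, which this sketch leaves out.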