dbcanlight

Aug 7, 2023 · 1 min read

Dbcanlight is a lightweight rewrite of run_dbcan, a widely used CAZyme annotation tool. It uses pyhmmer, a Cython binding to HMMER3, instead of the HMMER3 command-line suite as the search backend, which improves multithreading performance. It also removes a run_dbcan limitation that required large sequence files to be split manually beforehand.
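To illustrate why manual splitting becomes unnecessary, here is a minimal sketch of how a tool could stream a large FASTA file in fixed-size batches so that only one batch is held in memory at a time. This is not dbcanlight's actual code: the function names `read_fasta` and `batched` and the `batch_size` parameter are hypothetical, and the search step itself is left out.

```python
from io import StringIO
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record type for this sketch: (header without '>', sequence)
Record = Tuple[str, str]


def read_fasta(handle: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs lazily from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def batched(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# Tiny in-memory stand-in for a large protein FASTA file (made-up sequences).
demo = StringIO(">seq1\nMKT\nAAA\n>seq2\nMPL\n>seq3\nMGG\n")
batch_sizes = [len(batch) for batch in batched(read_fasta(demo), batch_size=2)]
print(batch_sizes)
```

In a real pipeline, each batch would be handed to the search backend in turn, so the input file never needs to be split on disk.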

The main program, dbcanlight, comprises three modules: build, search, and conclude. The build module downloads the required databases from the dbCAN website; the search module searches against the protein HMM, substrate HMM, or DIAMOND databases and reports the hits separately; and the conclude module gathers all the results and provides a summary.
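The three-module layout described above maps naturally onto command-line subcommands. The following is only a sketch of that structure using Python's standard `argparse`, not dbcanlight's real interface: the program name `dbcanlight-sketch`, the `-i/--input` option, the `results_dir` argument, and the `diamond` mode value are assumptions made for illustration. Refer to the project's GitHub page for the actual options.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser with build, search and conclude subcommands."""
    parser = argparse.ArgumentParser(prog="dbcanlight-sketch")
    subparsers = parser.add_subparsers(dest="module", required=True)

    # build: download the required databases (no options in this sketch)
    subparsers.add_parser("build", help="download the required databases")

    # search: run one search mode against an input protein file
    search = subparsers.add_parser("search", help="search a protein FASTA file")
    search.add_argument("-i", "--input", required=True, help="protein FASTA file (assumed option)")
    search.add_argument(
        "-m",
        "--mode",
        choices=["cazyme", "sub", "diamond"],  # "diamond" is an assumed value
        default="cazyme",
        help="which database type to search",
    )

    # conclude: summarize the per-mode results (argument name is assumed)
    conclude = subparsers.add_parser("conclude", help="summarize search results")
    conclude.add_argument("results_dir", help="directory holding search outputs")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs as-is.
args = make_parser().parse_args(["search", "-i", "proteins.faa", "-m", "sub"])
print(args.module, args.input, args.mode)
```

Keeping each module as its own subcommand lets the search step be run once per mode and the summary step be run afterwards over all outputs.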

We benchmarked dbcanlight on a protein FASTA file containing 14,574 sequences. Three rounds of tests were run in CAZyme and substrate detection modes (--tools hmmer dbcansub in run_dbcan; -m cazyme and -m sub in dbcanlight). The tests show that dbcanlight is approximately 3x faster than run_dbcan while using an acceptable 2 GB of RAM.

[Figure: performance]

If you’re interested in dbcanlight, please refer to the GitHub page for more details.

Cheng-Hung Tsai
Authors
I am a Bioinformatician with a PhD from UC Riverside (Dr. Jason Stajich Lab), specializing in the intersection of software development and large-scale genomics. My work focuses on building efficient UNIX/Python tools for genomics and metagenomics applications. I bring a unique perspective to the dry lab, having spent my early career at the bench mastering protein purification and molecular biology. I am passionate about creating user-friendly, scalable tools that empower researchers to turn raw sequencing data into biological discovery.