Weizhong Li, Adam Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, Volume 22, Issue 13, July 2006, Pages 1658–1659, https://doi.org/10.1093/bioinformatics/btl158

Since its release, cd-hit has been used by many groups, including Uniprot (Apweiler et al., 2004) and PDB (Bourne et al., 2004), in various research fields. In our group, we applied it to generate non-redundant protein datasets to reduce the database search efforts and to improve the homology detection sensitivity (Li et al., 2002a).

The cd-hit package was written in C++ and was tested on Linux systems. It is distributed as an open source package and can be run on almost all systems that support C++ with little or no modification.

Here, we present several new programs based on the cd-hit algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA database and cd-hit-est-2d compares two nucleotide datasets. The common advantages of these programs are ultrahigh speed and the ability to handle huge databases.

The algorithm behind cd-hit is short word filtering, which can determine that the similarity between two sequences is below a certain value without performing an actual sequence alignment. This algorithm is not limited to protein sequence clustering; it can also be applied to many other analyses that involve a large amount of sequence comparisons.

Motivation: In 2001 and 2002, we published two papers (Bioinformatics, 17, 282–283, Bioinformatics, 18, 77–82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering. Here we present several new programs using the same algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST.

We implemented this idea using an index table. For instance, the total number of possible pentapeptides is only 21^5 (each position has 21 possibilities, 20 amino acids plus ‘X’), and such an index table requires only 4 million entries, which just matches the RAM size of current computers. Index tables maximize the speed of short word counting. Details regarding how to choose an appropriate short word are documented in the cd-hit user's guide.
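
The index table idea can be illustrated with a minimal Python sketch. It is not the cd-hit source code: the alphabet ordering, the base-21 encoding and the names `word_index` and `count_words` are assumptions made for illustration. Each word of length k is mapped to an integer between 0 and 21**k - 1, so counting a word is a single array access; for pentapeptides the table has 21**5 = 4,084,101 entries.

```python
# Hypothetical alphabet order: 20 standard amino acids plus 'X' for anything else.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
CODE = {aa: i for i, aa in enumerate(ALPHABET)}
X_CODE = CODE["X"]


def word_index(word):
    """Encode a short word as a base-21 integer (assumed encoding)."""
    index = 0
    for residue in word:
        # Unknown residues are treated as 'X'.
        index = index * len(ALPHABET) + CODE.get(residue, X_CODE)
    return index


def count_words(sequence, k=5):
    """Return a table with 21**k slots holding the count of each k-word."""
    table = [0] * (len(ALPHABET) ** k)
    for start in range(len(sequence) - k + 1):
        table[word_index(sequence[start:start + k])] += 1
    return table
```

A direct-address table like this trades memory for speed: no hashing or searching is needed, which is why the table size matters when choosing the word length.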

The original cd-hit program clusters a protein database, and its variant, cd-hit-est, clusters a DNA/RNA database. For eukaryotic genes, long introns can cause long gaps in sequence alignments, which significantly reduces the efficiency of short word filtering. So, practically, cd-hit-est can be applied only for non-intron-containing sequences, such as ESTs.

Program cd-hit-2d compares two protein databases and identifies similar sequences between them above a certain threshold. Cd-hit-est-2d works for two DNA/RNA databases. For the same reason that we mentioned earlier, cd-hit-est-2d is a practical choice only for non-intron-containing sequences.

Funding to pay the Open Access publication charges was provided by the institutional funds of Burnham Institute for Medical Research.

In recent years, the amount of biological sequence data has been growing explosively, which has imposed growing difficulties on analyzing them. The complexity of many sequence analyses is of the order of n^2, where n is the number of sequences to be considered. One such example is protein sequence clustering, which groups similar proteins into clusters based on their sequence similarities. To address this computationally challenging problem, we developed a novel method and published a program, cd-hit (Li et al., 2001, 2002a), which can efficiently handle huge databases. For example, it takes only a few hours to cluster the NCBI-nr with ∼3 million proteins on a single high-end workstation.

The clustering algorithm in both cd-hit and cd-hit-est is a greedy incremental clustering algorithm. Briefly, sequences are first sorted in order of decreasing length. The longest sequence becomes the representative of the first cluster. Then, each remaining sequence is compared with the representatives of existing clusters. If the similarity with any representative is above a given threshold, it is grouped into that cluster. Otherwise, a new cluster is defined with that sequence as the representative. For each sequence comparison, short word filtering is applied to the sequences to confirm whether the similarity is below the clustering threshold. If this cannot be confirmed, an actual sequence alignment is performed.
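
The greedy incremental procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cd-hit implementation: the `identity` function is a crude stand-in for a real sequence alignment (it is built on `difflib.SequenceMatcher` and divides by the length of the shorter sequence), the short word filter is omitted, and the function names and the 0.9 default threshold are hypothetical.

```python
from difflib import SequenceMatcher


def identity(rep, seq):
    """Stand-in for alignment identity: matched residues / shorter length."""
    blocks = SequenceMatcher(None, rep, seq, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / min(len(rep), len(seq))


def greedy_cluster(sequences, threshold=0.9, similarity=identity):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    clusters = []
    # Longest sequences are processed first, so they become representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        for rep, members in clusters:
            if similarity(rep, seq) >= threshold:
                members.append(seq)  # joins the first cluster that is similar enough
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters
```

In this sketch a sequence joins the first matching cluster it encounters, and each new sequence is compared only with representatives, never with other cluster members.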

In addition to the programs described above, the package contains several utility tools. Some tools help analyze, sort and format the clustering results. One script runs clustering in parallel mode by distributing jobs on a computer cluster (details can be found in the user's guide). The cd-hit package will be under regular maintenance and further development, which will focus on the efficiency at low sequence similarity thresholds. We are also open to adding new functionalities as suggested by users.

Given two databases, db1 and db2, cd-hit-2d or cd-hit-est-2d works in a straightforward way. Sequences in db1 are first sorted in order of decreasing length. Then, each sequence in db2 is compared with db1 from the top (the longest one), and if the similarity to any one in db1 is above the threshold, this sequence is attached to the matched one in db1. At the end of the comparison, the program reports matches between db1 and db2 and also outputs a list of sequences in db2 that are not similar to any sequence in db1.
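
The two-database comparison can be sketched in Python in the same spirit. The code below is an illustration, not the cd-hit-2d implementation: `identity` is an assumed stand-in for a real alignment, the short word filter is left out, and `compare_2d` and its return format are hypothetical.

```python
from difflib import SequenceMatcher


def identity(ref, query):
    """Stand-in for alignment identity: matched residues / shorter length."""
    blocks = SequenceMatcher(None, ref, query, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / min(len(ref), len(query))


def compare_2d(db1, db2, threshold=0.9, similarity=identity):
    """Attach each db2 sequence to the first similar db1 sequence, longest first."""
    refs = sorted(db1, key=len, reverse=True)
    matches = {}    # db1 sequence -> list of db2 sequences attached to it
    unmatched = []  # db2 sequences with no similar sequence in db1
    for query in db2:
        for ref in refs:
            if similarity(ref, query) >= threshold:
                matches.setdefault(ref, []).append(query)
                break
        else:
            unmatched.append(query)
    return matches, unmatched
```

For example, with `db1 = ["MKTAYIAKQR", "GGGGGWWWWWGG"]` and `db2 = ["MKTAYIAKQ", "PPPPPPP", "GGGGWWWW"]`, the first and third db2 sequences are attached to a db1 entry and `"PPPPPPP"` is reported as unmatched.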

Some example runs are listed in Table 1. All these examples were performed on a Linux workstation with dual 3.0 GHz Xeon processors and 4 GB RAM. The programs used only one processor. For example, cd-hit took <8 h to cluster the NCBI-nr with more than 3.2 million proteins at 90% sequence identity level. Cd-hit-est-2d took a similar amount of time to identify the matches above 95% identity in both strands between human ESTs with ∼6 million sequences and ∼30 thousand human mRNAs.

The details of the algorithm for short word filtering were described in our earlier papers (Li et al., 2001, 2002a). In short, the minimum number of identical short substrings, called ‘words’, such as dipeptides, tripeptides and so on, shared by two proteins is a function of their sequence similarity. We calculated this function by analytical and large-scale statistical analyses. Therefore, we can effectively estimate that the similarity of two sequences is below a certain threshold by simple word counting and without an actual sequence alignment. For nucleotide sequences, we can also obtain such a short word requirement by a similar combination of analytical and statistical analyses.
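
To make the word counting argument concrete, here is a minimal Python sketch based on a simplified worst-case bound; it is an assumption for illustration, not the function derived in the earlier papers. In a gap-free alignment of length L with m mismatches, each mismatch can destroy at most k of the L - k + 1 words of length k, so at least L - k + 1 - k*m words must be shared. If two sequences share fewer words than this bound at the threshold identity, they can be rejected without alignment. The function names and the use of the shorter sequence length as L are hypothetical choices.

```python
from collections import Counter


def min_shared_words(length, k, identity_percent):
    """Worst-case lower bound on shared k-words for a gap-free alignment."""
    matches = -(-identity_percent * length // 100)  # integer ceiling of p * L
    mismatches = length - matches
    # Each mismatch can destroy at most k overlapping words.
    return max(0, (length - k + 1) - k * mismatches)


def shared_words(a, b, k):
    """Count k-words common to both sequences, respecting multiplicity."""
    words_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    words_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((words_a & words_b).values())


def surely_below_threshold(rep, seq, k, identity_percent):
    """True if word counting alone rules out the threshold (under this bound)."""
    length = min(len(rep), len(seq))
    required = min_shared_words(length, k, identity_percent)
    return shared_words(rep, seq, k) < required
```

Under this bound, two sequences of length 100 at 90% identity must share at least 46 pentapeptides or 79 dipeptides. When `surely_below_threshold` returns False, nothing is decided and an actual alignment is still needed.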

Many options and functions were implemented for the users to control the clustering or comparing process. For example, a useful function is the incremental clustering that offers not only a higher speed but also a stable clustering structure for regularly updated databases. The full set of options is described in the documentation for the program.
