Tuesday, October 6, 2020

Andrew Dalke: Faster BitBound ChEMBL search by using more C

This is part of a series of essays I started writing a week ago, in which I try a few different approaches to implementing cheminformatics fingerprint similarity search.

Last Thursday I developed a program which used the BitBound algorithm to do an in-memory search of the uncompressed chembl_27.fps.gz fingerprint file from ChEMBL, containing RDKit Morgan fingerprints.

Yesterday I reorganized the in-memory data structure and developed a new Python/C extension function to speed up linear search by a factor of 7.

In today's essay, I'll merge the two so the BitBound algorithm uses the new extension function.

If you want to jump right to the code:

  1. Download and uncompress chembl_27.fps.gz.
  2. If you don't already have it, download and run popc.py to generate the _popc library. This is unchanged from yesterday's essay.
  3. Download chembl_bitbound_search_in_C.py.
  4. Run python chembl_bitbound_search_in_C.py --times to do the default caffeine search and see the internal timings.
  5. Run python chembl_bitbound_search_in_C.py --help to see the command-line help, then try your own queries, including with different thresholds.
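
For example, a search of your own structures at a tighter threshold might look like this (my_queries.smi is a placeholder file, and the --threshold spelling follows the earlier essays in this series, so check the --help output for the exact option names):

% python chembl_bitbound_search_in_C.py --queries my_queries.smi --threshold 0.8 --times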

Merge fingerprints in each bin into a single bytestring

In the BitBound essay from last week I sorted the fingerprints by popcount, resulting in 2049 bins. Each bin contains a list of (id, fingerprint) pairs. Here's the original code to create those bins:

# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]
for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1]) 
    target_popcount_bins[popcount].append(id_fp_pair)
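
As a reminder, byte_popcount_256() comes from the _popc extension built by popc.py; it returns the number of on bits in a 256-byte (2048-bit) fingerprint. A quick sanity check, assuming the _popc.lib import style from the earlier essays:

from _popc.lib import byte_popcount_256

assert byte_popcount_256(b"\x00" * 256) == 0     # no bits set
assert byte_popcount_256(b"\xff" * 256) == 2048  # all bits set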

In yesterday's essay, I reorganized the data to make it easier and faster to access from C. More specifically, I merged the Python list of fingerprint bytestrings into a single bytestring. This required putting the identifiers into one list and the fingerprints into another, then joining the fingerprints together.

I'll do the same for the (id, fingerprint) pairs in the target popcount bins. One thing to watch out for is empty bins, since the zip(*values) technique returns a single empty list, rather than the two empty lists I need. But that's a simple check.
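
Here's a minimal illustration of that behavior, using made-up identifiers and one-byte fingerprints:

pairs = [("CHEMBL1", b"\x01"), ("CHEMBL2", b"\x02")]
ids, fps = zip(*pairs)  # ("CHEMBL1", "CHEMBL2") and (b"\x01", b"\x02")
list(zip(*[]))          # [] - a single empty list, so the two-element
                        #      unpacking would fail for an empty bin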

The new conversion code also needs to keep track of the largest number of fingerprints in any of the bins, so the output arrays passed to the C extension can be sized to prevent overflow. (In yesterday's program those arrays needed to be large enough to handle all of the target fingerprints, so binning reduces the amount of space needed.)

The following code uses the same loading step from last week, then adds code to transform the target_popcount_bins so each bin contains a list of ids and the merged fingerprint bytestring.

# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]
for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1])
    target_popcount_bins[popcount].append(id_fp_pair)

# Split the ids and fingerprints into parallel lists then merge the fingerprints
max_bin_size = 0
for popcount, id_fp_pairs in enumerate(target_popcount_bins):
    if id_fp_pairs:
        target_ids, target_fingerprints = list(zip(*id_fp_pairs))
        target_popcount_bins[popcount] = (target_ids, b"".join(target_fingerprints))
        max_bin_size = max(max_bin_size, len(target_ids))
    else:
        target_popcount_bins[popcount] = ((), b"")
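
To make the new layout concrete, here is how one of the transformed bins is accessed (the popcount of 42 is an arbitrary example):

# Each non-empty bin unpacks into a tuple of ids and one bytestring.
target_ids, target_fingerprints = target_popcount_bins[42]
# target_fingerprints holds len(target_ids) * 256 bytes, since each
# 2048-bit fingerprint takes 256 bytes.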

Searching the bins

Last week's BitBound implementation figured out the number of bits in the query fingerprint, determined the min/max BitBound popcounts, then tested each fingerprint in each in-bounds bin for sufficient similarity. Here's the key part of the implementation, which I'll use as a starting point:

has_match = False
# Determine the BitBound popcount limits
min_popcount = 0
max_popcount = 2048
query_popcount = byte_popcount_256(query_fp)
if threshold > 0.0:
    min_popcount = math.ceil(query_popcount * threshold)
    max_popcount = math.floor(query_popcount / threshold)
    if max_popcount > 2048:
        max_popcount = 2048
            
# For each popcount within the bounds, check each of the targets.
for targets in target_popcount_bins[min_popcount:max_popcount+1]:
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")
        
# Print something in case there were no matches.
if not has_match:
    print(f"{query_id}\t*\t*")

In the new algorithm, the BitBound limits are computed the same way, so that part of the search hasn't changed:

has_match = False
# Determine the BitBound popcount limits
min_popcount = 0
max_popcount = 2048
query_popcount = byte_popcount_256(query_fp)
if threshold > 0.0:
    min_popcount = math.ceil(query_popcount * threshold)
    max_popcount = math.floor(query_popcount / threshold)
    if max_popcount > 2048:
        max_popcount = 2048
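
To get a feel for how much the bounds prune, here's a worked example with made-up numbers, for a query with 43 bits set at threshold 0.7:

import math
query_popcount = 43   # hypothetical query popcount
threshold = 0.7
min_popcount = math.ceil(query_popcount * threshold)   # ceil(30.1) = 31
max_popcount = math.floor(query_popcount / threshold)  # floor(61.4) = 61
# Only bins 31 through 61 are searched, instead of all 2049.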

The search code is somewhat different. Each bin now holds a tuple of ids and a merged fingerprint bytestring instead of a list of (id, fingerprint) pairs, and I call threshold_tanimoto_search() for the performance-critical part. It returns the number of sufficiently similar fingerprints, so I use the same code as yesterday to print the match information:

# For each popcount within the bounds, check each of the targets.
for target_ids, target_fingerprints in target_popcount_bins[min_popcount:max_popcount+1]:
    # Do the Tanimoto search.
    num_hits = threshold_tanimoto_search(
        query_fp, threshold,
        len(target_ids), target_fingerprints,
        hit_indices, hit_scores)
    if num_hits:
        # Print the hits to stdout, one per line.
        has_match = True
        for i in range(num_hits):
            hit_idx = hit_indices[i]
            score = hit_scores[i]
            target_id = target_ids[hit_idx]
            print(f"{query_id}\t{target_id}\t{score:.3f}")

Lastly, if none of the bins have any matches, print a no match output line (just like the original code):

# Print something in case there were no matches.
if not has_match:
    print(f"{query_id}\t*\t*")

Validation and Timings

I'll do my now-usual timing of nearly 1,000 structures from the Wikipedia Chemical Structure Explorer, and check that the linear and BitBound in-memory searches give the same answers. I'll run the two programs like this:

% python chembl_in_memory_search_in_C.py --times --queries wikipedia_1000.smi | sort > old_sorted.out
load: 7.22 search: 65.06 #queries: 994 #/sec: 15.28
% python chembl_bitbound_search_in_C.py --times --queries wikipedia_1000.smi | sort > new_sorted.out
load: 8.93 search: 16.58 #queries: 994 #/sec: 59.95
% cmp old_sorted.out new_sorted.out

You can see both programs gave the same output (once sorted, because the BitBound search changes the search order), and the new program is nearly 4x faster. In fact, the performance is roughly 40% that of chemfp's simsearch, so it's doing pretty well!

chemfp advertisement

There are still ways to make the BitBound implementation faster - I've implemented them in chemfp, which is available right now for commercial and no-cost academic licensing.

Now, a 2x speedup may not sound like much, but bear in mind that part of the reason this simple search code is so fast is that it's specialized for the 2048-bit RDKit Morgan2 fingerprints from ChEMBL. If you want to support other fingerprint lengths, you might start with a more general-purpose Tanimoto calculation - only to find that the general-purpose version is also slower. If you want to support other fingerprint parameters, you'll need to write the code to handle those. K-nearest search using BitBound is not so simple, and neither is Tversky search. If you want to support Open Babel or OEChem, you'll need the code for that. If you want to improve fingerprint load time, you'll need to implement that code. And if you want APIs to work with fingerprints and collections of fingerprints, you'll need to write that code too.

Or, you can try out chemfp. To install chemfp under the Chemfp Base License Agreement, use the following:

python -m pip install chemfp -i https://chemfp.com/packages/

then send me an email to request an evaluation license key.


