Thursday, October 1, 2020

Andrew Dalke: Simple BitBound ChEMBL similarity search

In yesterday's essay I changed the scan-based Tanimoto search to an in-memory search and showed that, after a start-up cost of about 5 seconds, I was able to do about 2 searches per second against the 1.9 million ChEMBL fingerprints.

I ended by pointing out how chemfp was over 100x faster.

How is chemfp so much faster? One clear reason is the search code is all implemented in C. A comparison test like "score >= threshold" can be done in one CPU cycle, but the equivalent in Python likely takes thousands of CPU cycles. Another is Python's function call overhead - I suspect it takes longer to set up the call to byte_tanimoto_256() and convert the result to Python than it does to compute the actual Tanimoto.

The chemfp paper covers all of the techniques I used, and reviews the other ones I know about. In this essay I will implement the BitBound algorithm, which is a simple technique that can often reduce the number of fingerprints to test.

If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:

  1. Download chembl_27.fps.gz.
  2. Download the updated popc.py.
  3. Run python popc.py to generate the _popc module.
  4. Download chembl_bitbound_search.py.
  5. Run python chembl_bitbound_search.py --times to do the default caffeine search and see the internal timings.
  6. Run python chembl_bitbound_search.py --help to see the command-line help, then try your own queries, including with different thresholds.

The new program averages about 0.3 seconds per query with the default threshold of 0.7, and with an initial load time of about 8 seconds.

BitBound algorithm

Swamidass and Baldi pointed out that if a query fingerprint has a bits set, and the goal is to find target fingerprints which have a Tanimoto score of at least T to the query, then we can set bounds on the number of bits b set in any matching target fingerprint. More specifically: a*T ≤ b ≤ a/T.

(They generalize this calculation to Tversky similarity, and also describe a search method to improve a nearest-neighbor search.)

What this means is, if the number of bits in the target fingerprints can be pre-computed then those fingerprints which are out-of-bounds for a given query don't need to be considered.
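
To get a sense of how much this can prune, here's a small made-up example of my own, using a threshold of 0.5 (which happens to be exactly representable as a double; the general case is trickier, as discussed below). A query with 500 bits set only needs to look at targets with 250 to 1,000 bits set, which is 751 of the 2,049 possible popcounts:

import math

# My own example: query popcount 500, threshold 0.5.
# Only targets with 250 to 1000 bits set can possibly reach the threshold.
a, T = 500, 0.5
print(math.ceil(a * T), math.floor(a / T))  # -> 250 1000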

There are a few ways to implement this. One is to add the pre-computed popcount to each record. However, this still means testing every record. Another is to sort the records into different bins, one for each popcount. Then the search need only test the bins which might have a similarity match. I'll implement this second approach.

Compute the fingerprint popcount

I don't yet have a way to compute the popcount of each of the target fingerprints. I had decided to store the fingerprints as byte strings and used cffi to write popc.py, a program which generates the _popc module with the Tanimoto similarity and cosine similarity functions hard-coded for 2048-bit fingerprints. I'll extend that module so it also implements byte_popcount_256(), which computes the popcount of a 256-byte byte string.

The implementation is a straight-forward simplification of the similarity functions, so I'll present it here without further description (see popc.py for the full implementation):

static int
byte_popcount_256(const unsigned char *fp) {
    int num_words = 2048 / 64;
    int popcnt = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp_64 = (uint64_t *) fp;

    for (int i=0; i<num_words; i++) {
        popcnt += __builtin_popcountll(fp_64[i]);
    }
    return popcnt;
}
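
As a quick check of my own (not part of the program), and assuming popc.py has been run so the generated _popc extension module is importable as in the earlier essays, the new function can be called from Python like this:

from _popc import lib

# All-zero and all-one 256-byte fingerprints, plus one with a single bit set.
assert lib.byte_popcount_256(b"\x00" * 256) == 0
assert lib.byte_popcount_256(b"\xff" * 256) == 2048
assert lib.byte_popcount_256(b"\x01" + b"\x00" * 255) == 1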

Pre-compute the target fingerprints and put into bins

Just like in yesterday's essay, I'll first read the fingerprints into memory and then search them. However, this time I'll organize the fingerprints by popcount. There are 2049 possible values (from 0 bits set to all 2048 bits set), so I'll create a list of 2049 empty lists. For each fingerprint record, I'll compute the fingerprint popcount and append it to the appropriate list. Here's what that code looks like:

# Load targets into popcount bins, ordered by the fingerprint popcount.
# Each bin contains a list of (id, fingerprint) pairs.
t1 = time.time()

# Create 2049 bins, each with an empty list, ordered by popcount.
target_popcount_bins = [[] for i in range(2049)]

for id_fp_pair in read_fingerprints(open(args.targets)):
    # Place each record into the right bin.
    popcount = byte_popcount_256(id_fp_pair[1])
    target_popcount_bins[popcount].append(id_fp_pair)

t2 = time.time()
load_time = t2-t1
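
As a side note, a quick way to check that the loading step did something sensible is to summarize the bins. This sketch is my own and is not part of chembl_bitbound_search.py:

# Sanity check (not in the program): summarize the popcount bins.
num_fingerprints = sum(len(popcount_bin) for popcount_bin in target_popcount_bins)
num_nonempty = sum(1 for popcount_bin in target_popcount_bins if popcount_bin)
biggest = max(range(len(target_popcount_bins)), key=lambda i: len(target_popcount_bins[i]))
print(f"Loaded {num_fingerprints} fingerprints into {num_nonempty} non-empty bins "
      f"(largest: popcount {biggest} with {len(target_popcount_bins[biggest])} fingerprints) "
      f"in {load_time:.2f} s")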

Determine the bounds

It seems easy to compute the bit bounds with something like:

query_popcount = byte_popcount_256(query_fp)
min_popcount = query_popcount * threshold
max_popcount = query_popcount / threshold

and then use the integer popcount values between min_popcount and max_popcount. The lowest popcount value is the smallest integer greater than or equal to min_popcount, and the highest popcount value is the largest integer no greater than max_popcount. You might think this would work:

# This is wrong!
import math
query_popcount = byte_popcount_256(query_fp)
min_popcount = math.ceil(query_popcount * threshold)
max_popcount = math.floor(query_popcount / threshold)
for popcount in range(min_popcount, max_popcount+1):
   ...

In practice, this is tricky because the calculations will use IEEE 754 doubles. Some operations which should ideally result in an integer value ... don't. Here is an example where the min_popcount would be wrong:

>>> threshold = 0.55
>>> query_popcount = 1580
>>> threshold * query_popcount
869.0000000000001
>>> import math
>>> math.ceil(threshold * query_popcount)
870

and here's an example where the max_popcount would be wrong:

>>> threshold = 0.55
>>> query_popcount = 396
>>> query_popcount / threshold
719.9999999999999
>>> math.floor(query_popcount / threshold)
719

Solution #1: change by epsilon

One solution, which will work for normal threshold values, is to nudge the ratio by a small value: subtract it before the ceiling call and add it before the floor call.

>>> import math
>>> EPSILON = 1e-10
>>> threshold = 0.55
>>> threshold * 1580
869.0000000000001
>>> math.ceil(threshold * 1580 - EPSILON)
869
>>> 396 / threshold
719.9999999999999
>>> math.floor(396 / threshold)
719
>>> math.floor(396 / threshold + EPSILON)
720

This works because 1) it's chemically unreasonable to specify a threshold with more than 3 significant digits, and 2) the difference between two distinct fingerprint scores cannot be less than 1/2048² (about 2.4e-7), which is far larger than EPSILON, so the nudge cannot move a bound past a genuinely different value.
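
Putting the two adjustments together, a bounds helper based on this epsilon approach might look like the following sketch. The name get_bitbound_range() is my own; the final program instead uses the fractions-based approach described next:

import math

EPSILON = 1e-10  # far smaller than 1/2048**2, the smallest possible gap between scores

def get_bitbound_range(query_popcount, threshold):
    # Return (min_popcount, max_popcount) for a float threshold, nudging by
    # EPSILON so products which should be exact integers round the right way.
    if threshold <= 0.0:
        return 0, 2048
    min_popcount = math.ceil(query_popcount * threshold - EPSILON)
    max_popcount = math.floor(query_popcount / threshold + EPSILON)
    return min_popcount, min(max_popcount, 2048)

# The two problem cases from above now give the exact answers:
#   get_bitbound_range(1580, 0.55) -> (869, 2048)
#   get_bitbound_range(396, 0.55)  -> (218, 720)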

Solution #2: use rationals

I think a better solution is to do the calculation using rationals, so the operations can be done with integers. (This is especially important if you want to handle the Tversky BitBounds; in chemfp I gave up trying to get IEEE 754 math to work correctly and ended up multiplying alpha and beta values by 10,000 to work with integers.)

Since I'm using Python, I can use Python's built-in fractions module, which implements a rational data type:

>>> import math, fractions
>>> threshold = fractions.Fraction("0.55")
>>> math.ceil(threshold * 1580)
869
>>> math.floor(396 / threshold)
720

(I actually used the fractions module to find the examples where the simple application of IEEE 754 doubles causes problems.)

Only a few minor changes are needed to use a fraction for the threshold, starting with the argparse --threshold parameter:

parser.add_argument("--threshold", "-t", type=fractions.Fraction, default=fractions.Fraction("0.7"),
                        help="minimum threshold (default: 0.7)")

For performance reasons, I keep track of both the fraction threshold (only used to find the popcount bounds) and the floating point value (used when checking the Tanimoto score):

fraction_threshold = args.threshold
threshold = float(fraction_threshold)

The popcount bounds calculation, when using a fraction threshold, is the straight-forward, expected one:

min_popcount = 0
max_popcount = 2048
for query_id, query_fp in query_reader:
    ...
    query_popcount = byte_popcount_256(query_fp)
    if threshold > 0.0:
        min_popcount = math.ceil(query_popcount * fraction_threshold)
        max_popcount = math.floor(query_popcount / fraction_threshold)
        if max_popcount > 2048:
            max_popcount = 2048

The check for threshold > 0.0 is to prevent divide-by-zero errors. Of course, if the threshold is 0.0 then everything will match, and there will be a lot of overhead just printing the results.
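
As a quick standalone check of my own that the fraction-based bounds avoid the earlier problems, here are the two problematic cases again:

import math, fractions

threshold = fractions.Fraction("0.55")
for query_popcount in (1580, 396):
    min_popcount = math.ceil(query_popcount * threshold)
    max_popcount = min(math.floor(query_popcount / threshold), 2048)
    print(query_popcount, min_popcount, max_popcount)
# Output:
# 1580 869 2048
# 396 218 720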

Search the appropriate bins

Now that I have the bounds I can get the bins that are in range, and for each bin check each of the target fingerprints:

for targets in target_popcount_bins[min_popcount:max_popcount+1]:
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")

Does it work?

The final code is in chembl_bitbound_search.py. But does it work?

I used chembl_in_memory_search.py from yesterday's essay to process the first 1,000 entries of the SMILES file downloaded from the Wikipedia Chemical Structure Explorer, then processed the same entries with chembl_bitbound_search.py, and compared the two. The two programs test the fingerprints in a different order, so the outputs cannot be compared directly. Instead, I compared the sorted outputs to each other (warning and error messages are excluded from the following):

% python chembl_in_memory_search.py --times --queries wikipedia_1000.smi | sort > old.txt
load: 5.27 search: 491.73 #queries: 994 #/sec: 2.02
% python chembl_bitbound_search.py --times --queries wikipedia_1000.smi | sort > new.txt
load: 8.10 search: 297.42 #queries: 994 #/sec: 3.34
% cmp old.txt new.txt

They matched, and the search time dropped by almost 40%. So that's a good first test. I'll also run with a higher threshold to see how the overall performance changes:

% python chembl_bitbound_search.py --times --queries wikipedia_1000.smi --threshold 0.9 > /dev/null
load: 5.53 search: 81.74 #queries: 994 #/sec: 12.16

Nice speedup!

A more sensitive test would be to construct fingerprints with bit patterns that trigger edge cases in the code, and to add instrumentation to ensure the BitBound calculation is not accidentally too wide. That's too much for this essay.

Can improve the Tanimoto calculation performance

If the target fingerprints are sorted by popcount then we know that all fingerprints in a given bin have the same popcount. We also know the number of bits in the query fingerprint. This can be used to improve the search performance by reducing the amount of work needed to compute the Tanimoto.

The heart of the current code, in byte_tanimoto_256(), is:

for (int i=0; i<num_words; i++) {
    intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
    union_popcount += __builtin_popcountll(fp1_64[i] | fp2_64[i]);
}
return ((double) intersect_popcount) / union_popcount;

If the query and target popcounts are known then this reduces to:

for (int i=0; i<num_words; i++) {
    intersect_popcount += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
}
return ((double) intersect_popcount) /
        (query_popcount + target_popcount - intersect_popcount);

I'll leave that as an exercise for the student.
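
The Python side of that change is small, since the bin index is the target popcount. Here is a sketch of how the loop might pass both popcounts down, where byte_tanimoto_256_popcounts() is a hypothetical name for the new C function; it is not in the current _popc module:

# Sketch only: byte_tanimoto_256_popcounts() is a hypothetical function.
for target_popcount in range(min_popcount, max_popcount+1):
    for target_id, target_fp in target_popcount_bins[target_popcount]:
        score = byte_tanimoto_256_popcounts(
            query_fp, target_fp, query_popcount, target_popcount)
        if score >= threshold:
            print(f"{query_id}\t{target_id}\t{score:.3f}")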

Can skip the division in the Tanimoto calculation

Another possible speedup is to avoid the division in the Tanimoto calculation altogether. If you work out the math you'll find that the target fingerprint will pass the test if the intersection popcount C satisfies:

C ≥ threshold * (query_popcount + target_popcount) / (1 + threshold)

Since C is an integer, that means C must be at least the ceiling of the right-hand side.

This calculation is another place where it's easier to calculate using rationals than with IEEE 754 doubles.
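
Here is a sketch of that rearrangement in pure Python, assuming the threshold is already a Fraction as in the program above. The real benefit would come from an intersection-only popcount function in C, which the _popc module does not currently have:

import math
from fractions import Fraction

def min_intersect_popcount(query_popcount, target_popcount, fraction_threshold):
    # Smallest integer C with C / (query_popcount + target_popcount - C) >= threshold,
    # computed exactly with rationals.
    return math.ceil(fraction_threshold * (query_popcount + target_popcount)
                     / (1 + fraction_threshold))

def byte_intersect_popcount_256(fp1, fp2):
    # Pure-Python intersection popcount for two 256-byte fingerprints.
    return bin(int.from_bytes(fp1, "big") & int.from_bytes(fp2, "big")).count("1")

# Example: with a 0.7 threshold, a 43-bit query and a 50-bit target must
# have at least ceil(0.7 * 93 / 1.7) = 39 bits in common to pass.
print(min_intersect_popcount(43, 50, Fraction("0.7")))  # -> 39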

chemfp advertisement

I've been trying to make the point that while it's easy to implement a similarity search program, there are a lot of complicated details to make a fast similarity search program.

You could go ahead and spend the days or weeks to implement these features yourself, or you could install and test chemfp (under the Base License Agreement) by doing:

python -m pip install chemfp -i https://chemfp.com/packages/

The Base License Agreement does not allow you to create FPB files or do an in-memory search with more than 50,000 fingerprints. If you are interested in those features, take a look at the other licensing options then contact me to ask for an evaluation license key.


