Monday, October 5, 2020

Andrew Dalke: Faster in-memory ChEMBL search by using more C

This is part of a series of essays I started writing a week ago, in which I use a few different approaches to implement cheminformatics fingerprint similarity search.

In my previous essay, from last Friday, I developed a program to do a similarity search of the uncompressed copy of chembl_27.fps.gz from ChEMBL, which contains RDKit Morgan fingerprints. That version used RDKit's BulkTanimotoSimilarity for the Tanimoto calculation. Profiling showed that 80% of the time was spent in the two lines of Python that search for scores at or above a given threshold. I concluded that I had to push more work into C.

That's what I'll do in today's essay, starting with the chembl_in_memory_search.py program from the initial in-memory ChEMBL search I developed earlier last week.

If you want to jump right to the code:

  1. Download and uncompress chembl_27.fps.gz.
  2. Download the updated popc.py.
  3. Run python popc.py to (re)generate the _popc module.
  4. Download chembl_in_memory_search_in_C.py.
  5. Run python chembl_in_memory_search_in_C.py --times to do the default caffeine search and see the internal timings.
  6. Run python chembl_in_memory_search_in_C.py --help to see the command-line help, then try your own queries, including with different thresholds.

Merge target fingerprints into a single byte string

Data representation and performance go hand-in-hand. The original in-memory search code (chembl_in_memory_search.py) stored each fingerprint as a Python byte string, called a function written in a C extension to compute the Tanimoto similarity between the query and target fingerprints, and got the score back. Last Friday's version used RDKit's BulkTanimotoSimilarity() to compute the scores at the C++ level, but still required building and processing a Python list.

The approach I'll use today is to merge all of the target fingerprints into a single byte string. Each 2048-bit fingerprint is 256 bytes long, which makes it very easy to get a fingerprint by index. Storing the fingerprints in one string also saves memory, because each Python byte string has about 33 bytes of overhead, which is used to store information that Python needs internally:

>>> import sys
>>> sys.getsizeof(b"")
33
>>> sys.getsizeof(b"a")
34

That's an overhead of about 12% for each 256-byte fingerprint. If the fingerprints are instead stored in a single byte string then we save over 60 MB across the roughly 1.9 million fingerprints in ChEMBL.
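To make the by-index lookup concrete: since every fingerprint occupies exactly 256 bytes, target fingerprint i is a simple slice of the merged byte string. (A sketch; target_fingerprints is the merged string built in the loading code below.)

fp_size = 256   # 2048 bits / 8 bits per byte
fp_i = target_fingerprints[i * fp_size : (i + 1) * fp_size]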

Load the target fingerprints into a byte string, and get the ids

The existing fingerprint reader iterates over the (id, fingerprint byte string) pairs. I'll separate those into a list of ids and a parallel list of byte strings using the same zip(*values) technique I described on Friday, then merge the list of byte strings into a single byte string (the new part is the final join, marked with a comment):

# Load targets as (id, fingerprint) pairs into a list of
# target ids and a block of target fingerprints
t1 = time.time()
target_ids, target_fingerprints = list(zip(*read_fingerprints(open(args.targets))))
num_targets = len(target_fingerprints)
target_fingerprints = b"".join(target_fingerprints)    # Merge into a single byte string
t2 = time.time()
load_time = t2-t1
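For context, here is a minimal sketch of what the read_fingerprints() reader from the earlier essay does, assuming the FPS format's "hex fingerprint, tab, id" record lines. The real function lives in the earlier program and handles more details, so treat this as an illustration:

# Hypothetical reconstruction of the FPS reader from the earlier essay.
# Each record line is "<hex-encoded fingerprint><TAB><id>"; lines starting
# with "#" are FPS header lines.
def read_fingerprints(infile):
    for line in infile:
        if line.startswith("#"):
            continue  # skip the FPS header
        hex_fp, _, fp_id = line.rstrip("\n").partition("\t")
        yield fp_id, bytes.fromhex(hex_fp)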

Put results in a hit list

In the original chembl_in_memory_search.py program I printed each hit as soon as I found a fingerprint with a Tanimoto score at or above the threshold. I could let the C code print the results to stdout, but I'm going to keep that part of the code in Python. That means I need some way to pass the hit information from C back to Python.

In this series of ChEMBL fingerprint search implementations, I'm trying to keep things simple. The Tanimoto calculation, for example, is hard-coded for 2048-bit fingerprints, and the implementation isn't concerned about hardware issues like word alignment.

I'll keep things simple with the hit list as well. If there are N target fingerprints then the hit list cannot contain more than N entries, so I can allocate space for N hits up front, have the C function fill in the values it finds, and return the number of hits found.

At the C level, the hit list will be two parallel arrays:

int *hit_indices;   /* the indices of a matching target fingerprint */
double *hit_scores; /* the corresponding score (must be >= the search threshold) */

I need some way to allocate these arrays so that C code can access them. I've been using the cffi package, via the program popc.py, to turn the C functions into a Python extension module named _popc. The cffi-generated module also exports an ffi object whose new() function allocates C arrays from Python:

from _popc import ffi
num_targets = ... the number of fingerprints ...
# Allocate one possible slot per target; ffi.new() zero-initializes them.
hit_indices = ffi.new("int[]", num_targets)
hit_scores = ffi.new("double[]", num_targets)
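(One cffi detail worth remembering: the memory behind an ffi.new() allocation is owned by the returned cdata object and is freed when that object is garbage collected, so the hit_indices and hit_scores variables must stay alive for as long as the C code might use them.)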

threshold_tanimoto_search()

I then modified popc.py to include the function threshold_tanimoto_search(). It re-uses the byte_tanimoto_256() function described last week. I think the C implementation is a nearly direct mapping of the Python implementation, except that I store the hits in the two arrays rather than print them out. I also think it's understandable enough that I won't provide extra commentary.

/* Find all fingerprints in a sequential block of 2048-bit fingerprints */
/* with a Tanimoto similarity of at least 'threshold' to the query fingerprint. */
/* Return the number of hits. Hit results are in *hit_indices and *hit_scores. */
/* Assumes there is enough space in hit_indices and hit_scores! */
int
threshold_tanimoto_search(unsigned char *query_fp, double threshold,
                          int num_targets, unsigned char *target_fps,
                          int *hit_indices, double *hit_scores) {
    int target_idx, num_hits = 0;
    double score;
    for (target_idx = 0; target_idx<num_targets; target_idx++) {
        score = byte_tanimoto_256(query_fp, target_fps + target_idx*256);
        if (score >= threshold) {
            hit_indices[num_hits] = target_idx;
            hit_scores[num_hits] = score;
            num_hits++;
        }
    }
    return num_hits;
}
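For reference, here's a minimal sketch of what that byte_tanimoto_256() helper might look like, assuming the fingerprint bytes can be read as 64-bit words and using the GCC/Clang popcount builtin (the -mpopcnt flag in the compile line below lets it become a single POPCNT instruction). The real version is in popc.py and may differ in details:

#include <stdint.h>

/* Sketch: Tanimoto similarity between two 2048-bit (256-byte) */
/* fingerprints, processed as 32 64-bit words. */
static double
byte_tanimoto_256(const unsigned char *fp1, const unsigned char *fp2) {
    const uint64_t *words1 = (const uint64_t *) fp1;
    const uint64_t *words2 = (const uint64_t *) fp2;
    int i, intersect_popcount = 0, union_popcount = 0;
    for (i = 0; i < 2048/64; i++) {
        intersect_popcount += __builtin_popcountll(words1[i] & words2[i]);
        union_popcount += __builtin_popcountll(words1[i] | words2[i]);
    }
    if (union_popcount == 0) {
        return 0.0;  /* both fingerprints are all zeros */
    }
    return intersect_popcount / (double) union_popcount;
}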

(The final popc.py includes the configuration needed to have cffi include the extension code so the new C function can be called from Python. See the program for details.)
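For those new to cffi, the build configuration presumably looks roughly like the following sketch, where C_SOURCE stands in for the string holding the C code shown above; the names here are placeholders, and the actual popc.py may organize things differently:

# Hypothetical sketch of the cffi build setup; C_SOURCE is a placeholder
# for the string containing the C code.
from cffi import FFI

ffibuilder = FFI()

# Declare the signatures that Python code will be allowed to call.
ffibuilder.cdef("""
double byte_tanimoto_256(unsigned char *fp1, unsigned char *fp2);
int threshold_tanimoto_search(unsigned char *query_fp, double threshold,
                              int num_targets, unsigned char *target_fps,
                              int *hit_indices, double *hit_scores);
""")

# Compile the C source into the _popc extension module, enabling POPCNT.
ffibuilder.set_source("_popc", C_SOURCE, extra_compile_args=["-mpopcnt"])

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)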

I've updated popc.py, which means I need to re-run it in order to generate the updated _popc extension.

% python popc.py
generating ./_popc.c
the current directory is '/Users/dalke/demo'
running build_ext
building '_popc' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g
 -fwrapv -O3 -Wall -I/Users/dalke/venvs/py38-2020-8/include
 -I/Users/dalke/local-3.8/include/python3.8 -c _popc.c -o ./_popc.o -mpopcnt
gcc -bundle -undefined dynamic_lookup ./_popc.o -o ./_popc.cpython-38-darwin.so

Updated ChEMBL similarity search

The function call at the Python level looks like this:

num_hits = threshold_tanimoto_search(
               query_fp, threshold,
               num_targets, target_fingerprints,
               hit_indices, hit_scores)
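(I'm assuming here that threshold_tanimoto_search was bound from the compiled module's lib object, which is where cffi places the C functions:)

from _popc import ffi, lib

threshold_tanimoto_search = lib.threshold_tanimoto_search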

The parameters are:

  • query_fp - the query fingerprint as a 256-byte bytestring;
  • threshold - the minimum similarity threshold as a Python float between 0.0 and 1.0;
  • num_targets - the number of target fingerprints;
  • target_fingerprints - the num_targets target fingerprints, each 256-bytes long, concatenated into a single byte string;
  • hit_indices - an int* array which will contain the indices of the fingerprints which are sufficiently similar to the query fingerprint;
  • hit_scores - a double* array which will contain the corresponding scores;

and the function returns an integer, which is the number of hits found.

That pushes the entire search into C, so what's needed is a driver in Python to do a search for each query and report the hits:

    # Do a threshold search for each query.
    t1 = time.time()
    num_queries = 0
    for query_id, query_fp in query_reader:
        num_queries += 1
        
        # Do the Tanimoto search.
        num_hits = threshold_tanimoto_search(
            query_fp, threshold,
            num_targets, target_fingerprints,
            hit_indices, hit_scores)
        if num_hits:
            # Print the hits to stdout, one per line.
            for i in range(num_hits):
                hit_idx = hit_indices[i]
                score = hit_scores[i]
                target_id = target_ids[hit_idx]
                print(f"{query_id}\t{target_id}\t{score:.3f}")
        else:
            # Print something in case there were no matches.
            print(f"{query_id}\t*\t*")

Does it work?

The final version is available as chembl_in_memory_search_in_C.py.

The C-based similarity search should give identical results to the original Python-based similarity search. Does it? As queries I'll use the first 1,000 structures from the SMILES data I downloaded from the Wikipedia Chemical Structure Explorer. (An earlier essay describes how I converted that data into a SMILES file.) The following output omits some error and warning messages.

% python chembl_in_memory_search.py --queries wikipedia_1000.smi --times > old_1000.out
load: 5.79 search: 530.20 #queries: 994 #/sec: 1.87
% python chembl_in_memory_search_in_C.py --queries wikipedia_1000.smi --times > new_1000.out
load: 7.19 search: 70.40 #queries: 994 #/sec: 14.12
% cmp old_1000.out new_1000.out

Since cmp produced no output, the two output files are byte-for-byte identical.

Now compare the two timings: the C implementation is over 7x faster! (Take the numbers with a grain of salt; I ran the programs on my laptop, and a video was playing in the background when I ran the Python-based implementation.)

Not bad for under 20 lines of C code.

chemfp advertising

As a reminder, one of my goals in this series is to show that it's easy to implement fingerprint similarity search, but hard to implement high-performance similarity search. I hope these details will help convince people to purchase a chemfp license.

To give something concrete, the above C implementation is much faster than the Python version, but chemfp's simsearch is another 10x or so faster still, using the same input files:

% simsearch --queries wikipedia_1000.smi chembl_27.fps --threshold 0.7 --times > chemfp_1000.out
open 5.86 read 0.26 search 6.87 output 0.02 total 13.01

simsearch's output isn't in the same format, so the files can't be compared directly: each output line contains the hit count, the query id, and then alternating target id and score fields. A (long) one-liner to transform the output is:

grep -v '^#' chemfp_1000.out | \
  awk 'BEGIN{FS="\t"} \
  $1==0 {printf("%s\t*\t*\n", $2)} \
  $1>0 {for(i=3; i<NF; i+=2) {printf("%s\t%s\t%.3f\n", $2, $i, $(i+1))}}' | \
  sort > chembl_1000_sorted.out

A comparison shows 4 mismatches where the last digit was off by 1, all caused by the extra rounding step in the one-liner, which converts simsearch's 5-digit output scores to the 3 decimal places I've been using in the simple programs.

More information about chemfp can be found on the chemfp home page.


