Tuesday, September 29, 2020

Andrew Dalke: Simple k-NN FPS Tanimoto and cosine similarity search

Yesterday I developed a simple program to search chembl_27.fps.gz for records with a Tanimoto similarity of at least 0.7 to caffeine. I started by mentioning the 1986 paper by Willett, Winterman, and Bawden, Implementation of nearest-neighbor searching in an online chemical structure search system. As you can tell from the title, that paper does a k-nearest neighbor search (a k-NN search, also called a top-N search), and it compares a Tanimoto similarity search to a cosine similarity search. So really, I'm only halfway there. In this essay I'll go the other half and implement the nearest neighbor search. As before, the code will do almost no error handling, and will only work with the uncompressed chembl_27.fps in the local directory.

If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:

  1. Download chembl_27.fps.gz.
  2. Download popc.py.
  3. Run python popc.py to generate the _popc module.
  4. Download chembl_knn_search.py.
  5. Run python chembl_knn_search.py to do the default caffeine search.
  6. Run python chembl_knn_search.py --help to see the commandline help, then try your own queries.

Priority queues and heaps

If you need to find the largest element in an unordered list then you can start by assuming the first item is the largest, then test each remaining item. If it's larger than the currently known largest value then use it as the new largest known value. Once done, you have the largest element. Here's an example of what that algorithm looks like:

data = [5, 8, 5, 9, 8, 7, 4]
largest = data[0]
for value in data[1:]:
    if value > largest:
        largest = value 

What if you want to find the two largest elements? It's not hard, though it is tedious and error-prone, to extend the algorithm:

data = [5, 8, 5, 9, 8, 7, 4]
largest = data[0]
second_largest = data[1]
if second_largest > largest:
    largest, second_largest = second_largest, largest

for value in data[2:]:
    if value > second_largest:
        if value > largest:
            second_largest, largest = largest, value
        else:
            second_largest = value

Extending this sort of manual bookkeeping to the k largest elements rapidly becomes infeasible. Fortunately, this is a well-studied problem: the go-to solution is a priority queue, often implemented as a heap.

Python's heapq module

Python doesn't implement a heap data type. Instead, the heapq module implements the heap operations on top of the built-in list type. More specifically, it implements a min heap, which makes it fast and easy to get the smallest element rather than the largest; the smallest item is always the first element of the list. Here's an example where I add the elements 5, 8, 4, 7, and 3 to the heap; you can see that heap[0] is always the smallest, though the ordering of the other elements is less obvious:

>>> import heapq
>>> heap = []
>>> heapq.heappush(heap, 5)
>>> heap
[5]
>>> heapq.heappush(heap, 8)
>>> heap
[5, 8]
>>> heapq.heappush(heap, 4)
>>> heap
[4, 8, 5]
>>> heapq.heappush(heap, 7)
>>> heap
[4, 7, 5, 8]
>>> heapq.heappush(heap, 3)
>>> heap
[3, 4, 5, 8, 7]

Next, I'll iteratively remove the smallest item from the list, and show that the results are from smallest to largest:

>>> heapq.heappop(heap)
3
>>> heap
[4, 7, 5, 8]
>>> heapq.heappop(heap)
4
>>> heap
[5, 7, 8]
>>> heapq.heappop(heap)
5
>>> heap
[7, 8]
>>> heapq.heappop(heap)
7
>>> heap
[8]
>>> heapq.heappop(heap)
8
>>> heap
[]

Top-n largest elements in a list

Python's heapq functions can be used to find the largest k elements of a list. Initialize the heap with the first k values of the input. For each of the remaining values, if the value is larger than the current smallest value in the heap, then remove that smallest value and add the new, larger value. (Python's heapreplace() does this in one step and is more efficient than a heappop() followed by a heappush().)

For example, the following program prints 8 9, which are the two largest values in the data list:

import heapq
data = [5, 8, 5, 9, 8, 7, 4]

# Initialize the heap with the first two elements
heap = []
heapq.heappush(heap, data[0])
heapq.heappush(heap, data[1])

for value in data[2:]:
    # If a bigger value is found, remove the smallest value
    # in the heap and add the new value.
    if value > heap[0]:
        heapq.heapreplace(heap, value)

print(heapq.heappop(heap), heapq.heappop(heap))
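
As an aside, the heapq module also provides nlargest(), which wraps this find-the-largest-n pattern into a single call:

>>> import heapq
>>> heapq.nlargest(2, [5, 8, 5, 9, 8, 7, 4])
[9, 8]

I'll stick with the explicit heap operations for the rest of this essay because the k-NN search needs to track the current threshold and stop early once it reaches 1.0.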

Reorganization

A nearest neighbor search is a bit more complicated than a threshold search because there are two stages: 1) filling the heap with k elements, and 2) adding larger elements to the heap.

To help focus on the nearest neighbor search I've reorganized and modified yesterday's example. There are now two functions to help process the inputs:

  • get_query_fp(smiles) - parse the SMILES string into an RDKit molecule, generate the Morgan fingerprint with the appropriate values, and return the fingerprint as a byte string;
  • read_fingerprints(infile) - parse the FPS-formatted input file. Skip the header section and return an iterator of the (id, fingerprint) pairs, with the fingerprint as a hex-decoded byte string.

Here they are:

import heapq
from _popc.lib import byte_tanimoto_256
    
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs

# Convert the SMILES into Morgan fingerprint byte string
def get_query_fp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise SystemExit(f"Cannot parse SMILES {smiles!r}")

    query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
        mol, radius=2, nBits=2048, useChirality=0,
        useBondTypes=1, useFeatures=0)

    return DataStructs.BitVectToBinaryText(query_rd_fp)

# Read (id, byte fingerprint) pairs from a file-like object
# containing FPS-formatted fingerprints.
def read_fingerprints(infile):
    # Skip the header
    for line in infile:
        if line[:1] != "#":
            break
    else:
        # Reached the end of file. Stop.
        return

    # Process the first fingerprint line
    target_hex_fp, target_id = line.split()
    assert len(target_hex_fp) == 512, "Hard-coded for 2048-bit fingerprints"
    yield target_id, bytes.fromhex(target_hex_fp)

    # Process the rest of the lines
    for line in infile:
        target_hex_fp, target_id = line.split()
        yield target_id, bytes.fromhex(target_hex_fp)
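
If you want a quick check that read_fingerprints() works, something along these lines should print the first three ids along with their fingerprint sizes and popcounts (assuming chembl_27.fps is in the current directory):

from itertools import islice

# Show the id, fingerprint length in bytes, and popcount for the
# first three records in the file.
with open("chembl_27.fps") as infile:
    for target_id, target_fp in islice(read_fingerprints(infile), 3):
        popcount = sum(bin(byte).count("1") for byte in target_fp)
        print(target_id, len(target_fp), popcount)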

Find the 10 most similar fingerprints (Tanimoto)

I'll walk through the main() function, which does the search. First is some initialization code, which converts the query SMILES into a fingerprint and sets up the reader for the target fingerprints:

def main():
    ## Input parameters
    smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    k = 10
    target_filename = "chembl_27.fps"

    query_fp = get_query_fp(smiles)

    with open("chembl_27.fps") as infile:
        reader = read_fingerprints(infile)

Initialize the heap with k elements

Next I'll initialize the heap with the first k elements from the input file. (The zip() stops when the shorter of range(k) and reader is exhausted, so this reads at most k elements.)

        # Get up to the first k elements and add them to the heap
        heap = []
        for _, (target_id, target_fp) in zip(range(k), reader):
            score = byte_tanimoto_256(query_fp, target_fp)
            heapq.heappush(heap, (score, target_id))

You'll see that I added a 2-element tuple to the heap. Because of the way Python compares tuples, these heap entries are compared element-wise: first by score, with ties broken by comparing the ChEMBL ids ASCIIbetically. This tends to bias the results toward newer ChEMBL records. Another option might be to use the record's position in the input, with a preference either for earlier-occurring or later-occurring records. Since the records are mostly in ASCIIbetical order already, the above code behaves much like a preference for records which occur later in the file.
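
To make that tie-breaking concrete, here's how Python compares two such tuples (the ids are only illustrative):

>>> (0.647, "CHEMBL286680") < (0.647, "CHEMBL417654")
True
>>> (0.618, "CHEMBL26455") < (0.647, "CHEMBL286680")
True

With equal scores, the entry with the ASCIIbetically smaller id is treated as the smaller tuple, so it is the one removed first when a better match comes along.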

Figure out the minimum threshold

Now that I have (up to) k elements in the heap, I know that any new match must have a score higher than the smallest score in the heap. I'll keep track of that score as the threshold. Also, if the smallest score is already 1.0 then it's impossible to find a higher score, so I can stop the search:

        # Any new match must be larger than the score of the smallest heap element.
        threshold = heap[0][0]

        # No need for a further search if all elements match exactly.
        if threshold == 1.0:
            pass
        else:

Search for more similar matches

Here's the heart of the algorithm. Compare each fingerprint. If it's larger than the current smallest score in the heap then remove the smallest term in the heap and add the new one. This might set a new, higher threshold, and if that new threshold is 1.0 then we can stop:

            # Test the rest of the fingerprints
            for target_id, target_fp in reader:
                score = byte_tanimoto_256(query_fp, target_fp)
                if score > threshold:
                    heapq.heapreplace(heap, (score, target_id))
                    # Update the threshold and see if we can stop
                    threshold = heap[0][0]
                    if threshold == 1.0:
                        break

Get the hits

The final step is to print the list of matches, sorted from largest to smallest. What I'll do is heappop() each item into a list of hits. Those come out from smallest to largest, so I'll reverse the list before printing the results (with the score shown to 3 decimal places):

    # Reverse the list because heappop() returns the items from smallest to largest.
    hits = [heapq.heappop(heap) for _ in range(len(heap))]
    hits.reverse()

    # Print the results from largest to smallest
    for score, target_id in hits:
        print(f"{target_id}\t{score:.3f}")

__main__

Lastly, I need some way to call the main() function, which I'll do in the typical Python way by only calling it if this code is being run as a program:

if __name__ == "__main__":
    main()

The 10 most similar matches (Tanimoto)

Now for what you've been waiting for - the 10 most similar matches to caffeine, based on Tanimoto similarity:

CHEMBL113        1.000
CHEMBL1232048    0.710
CHEMBL446784     0.677
CHEMBL2058173    0.667
CHEMBL1738791    0.667
CHEMBL417654     0.647
CHEMBL286680     0.647
CHEMBL2058174    0.647
CHEMBL1200569    0.641
CHEMBL26455      0.618

Find the 10 most similar fingerprints (cosine)

I'm not done yet. The Willett, Winterman, and Bawden paper compared the results of a Tanimoto similarity search to the results of a cosine similarity search. I want to re-implement that sort of comparison.

cosine similarity

Since I decided not to use RDKit (which implements CosineSimilarity()) for my search, I need to implement it in my _popc module. For binary fingerprints, the cosine similarity is the number of bits in common divided by the square root of the product of the two popcounts. I'll show just the new C code here; download popc.py for the full details:

static double
byte_cosine_256(const unsigned char *fp1, const unsigned char *fp2) {
    int num_words = 2048 / 64;
    int dot_product = 0, popcnt1 = 0, popcnt2 = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp1_64 = (uint64_t *) fp1, *fp2_64 = (uint64_t *) fp2;

    for (int i=0; i<num_words; i++) {
        dot_product += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
        popcnt1 += __builtin_popcountll(fp1_64[i]);
        popcnt2 += __builtin_popcountll(fp2_64[i]);
    }
    if (popcnt1 == 0 || popcnt2 == 0) {
        return 0.0;
    }
    return (dot_product / sqrt((double) (popcnt1 * popcnt2)));
}

You'll need to (re)generate the shared library by running python popc.py.

Check the cosine implementation

If you think I wrote this correctly in one go, then thank you, but your estimation of my programming skills is too high. I used both RDKit and my own Python code to double- and triple-check the implementation. Here's my check code, which generates a large number of random fingerprints, computes the cosine similarity in three different ways, and compares the results:

from _popc.lib import byte_cosine_256
from rdkit import DataStructs
from chemfp import bitops

import random

N = 2048

bitnos = range(N)

def near(x, y):
    return abs(x-y) < 0.000000001

for i in range(10000):
    n1 = random.randrange(N+1)
    n2 = random.randrange(N+1)
    bitlist1 = random.sample(bitnos, n1)
    bitlist2 = random.sample(bitnos, n2)

    rd1 = DataStructs.ExplicitBitVect(2048)
    rd1.SetBitsFromList(bitlist1)
    
    rd2 = DataStructs.ExplicitBitVect(2048)
    rd2.SetBitsFromList(bitlist2)

    rd_score = DataStructs.CosineSimilarity(rd1, rd2)

    bytes1 = bitops.byte_from_bitlist(bitlist1, 2048)
    bytes2 = bitops.byte_from_bitlist(bitlist2, 2048)

    popc_score = byte_cosine_256(bytes1, bytes2)

    dot_product = len((set(bitlist1).intersection(bitlist2)))
    if n1 and n2:
        expected_score = dot_product / (n1 * n2) ** 0.5
    else:
        expected_score = 0.0
    
    if not near(popc_score, rd_score) or not near(popc_score, expected_score):
        bitlist1.sort()
        bitlist2.sort()
        print(bitlist1)
        print(bitlist2)
        print("dot_product:", dot_product)
        print(rd_score)
        print(popc_score)
        print(expected_score)
        raise AssertionError

The 10 most similar matches (cosine)

Only a few changes are needed to change the code to use cosine similarity instead of Tanimoto similarity. In short, import the new function:

from _popc.lib import byte_tanimoto_256, byte_cosine_256

and change the two occurrences of:

score = byte_tanimoto_256(query_fp, target_fp)

to:

score = byte_cosine_256(query_fp, target_fp)

With that in place, the results are:

CHEMBL113        1.000
CHEMBL1232048    0.832
CHEMBL446784     0.808
CHEMBL2058173    0.803
CHEMBL1738791    0.803
CHEMBL1200569    0.801
CHEMBL417654     0.790
CHEMBL286680     0.790
CHEMBL2058174    0.790
CHEMBL26455      0.767

That list is almost identical to the Tanimoto results. This shouldn't be unexpected. Quoting Willett, Winterman, and Bawden's 1986 paper:

The difference between the two types of coefficient had not been apparent in the earlier property prediction experiments,10 where the data sets were fairly homogeneous in character, with many of them being sets of analogues; in a typical structure file, conversely, a wide range of structural types will be present, and noncommon fragments will play a much larger role in discriminating between structures when a ranking method is used.

HTML output with images

It's a bit difficult to compare the Tanimoto and cosine results as plain text. I want to see the structures as well. The images can be embedded directly from ChEMBL's servers, so I'll add an option for that.

That proved to be complicated and uninteresting, so I'm not going to review it here. Instead, I took all of the above code and created a program to do a k nearest neighbor search of the ChEMBL data set.

The final code

I went a bit overboard, which isn't that unusual for me. I added command-line argument handling (using Python's argparse module) so you can specify the input SMILES string and the value of k, select Tanimoto or cosine search, and choose text or HTML output.

It's entirely too complicated to show here. Instead, you can download popc.py (use python popc.py to have cffi generate the _popc module) and chembl_knn_search.py and look at all the implementation details.

For your edification, here's the side-by-side comparison of the k=20 nearest neighbor search for caffeine.

Comparison of k=20 Tanimoto and cosine search for caffeine
(the structure images, embedded from ChEMBL's servers in the HTML output, are omitted here)

Tanimoto                      cosine
ChEMBL ID        score        ChEMBL ID        score
CHEMBL113        1.000        CHEMBL113        1.000
CHEMBL1232048    0.710        CHEMBL1232048    0.832
CHEMBL446784     0.677        CHEMBL446784     0.808
CHEMBL2058173    0.667        CHEMBL2058173    0.803
CHEMBL1738791    0.667        CHEMBL1738791    0.803
CHEMBL417654     0.647        CHEMBL1200569    0.801
CHEMBL286680     0.647        CHEMBL417654     0.790
CHEMBL2058174    0.647        CHEMBL286680     0.790
CHEMBL1200569    0.641        CHEMBL2058174    0.790
CHEMBL26455      0.618        CHEMBL26455      0.767
CHEMBL304319     0.611        CHEMBL304319     0.766
CHEMBL25096      0.611        CHEMBL25096      0.766
CHEMBL699        0.600        CHEMBL67109      0.755
CHEMBL66011      0.600        CHEMBL419617     0.755
CHEMBL23903      0.600        CHEMBL307326     0.755
CHEMBL67109      0.595        CHEMBL302725     0.755
CHEMBL419617     0.595        CHEMBL282038     0.755
CHEMBL307326     0.595        CHEMBL25427      0.755
CHEMBL302725     0.595        CHEMBL699        0.754
CHEMBL282038     0.595        CHEMBL66011      0.754

Conclusion

An efficient nearest neighbor search is harder than a threshold search because managing a priority queue requires some thought. With Python we have the advantage of a built-in module that someone else has already developed and debugged, and of packages like cffi which make it easy to pull in C code for performance.

Things were a bit different 30+ years ago, when Willett, Winterman, and Bawden were first doing this sort of study. While priority queues were well known, using one may have required pulling out a textbook. Or perhaps they didn't use a priority queue at all? Another option is to simply compute all of the similarity scores, sort the results, and display the largest. That might be slower and more memory intensive for large data sets, but it's easier to implement.
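
For comparison, here's a minimal sketch of that sort-everything approach, reusing get_query_fp(), read_fingerprints(), and byte_tanimoto_256() from earlier (the function name is mine, not something from the final program):

# Sort-everything alternative: score every record, sort, keep the top k.
def knn_by_sorting(smiles, k=10, target_filename="chembl_27.fps"):
    query_fp = get_query_fp(smiles)
    with open(target_filename) as infile:
        scores = [
            (byte_tanimoto_256(query_fp, target_fp), target_id)
            for target_id, target_fp in read_fingerprints(infile)
        ]
    # Sorting every (score, id) pair is the memory- and time-intensive
    # step that the heap-based search avoids.
    scores.sort(reverse=True)
    return scores[:k]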

I tried reading the 1986 paper and its citations (and some of their citations) to find this detail, but none explicitly said how they implemented ranking.


