Yesterday I developed a simple program to search chembl_27.fps.gz for records with a Tanimoto similarity of at least 0.7 to caffeine. I started by mentioning the 1986 paper by Willett, Winterman, and Bawden, Implementation of nearest-neighbor searching in an online chemical structure search system. As you can tell from the title, that paper does a k-nearest neighbor search (a k-NN search, also called a top-N search), and compares a Tanimoto similarity search to a cosine similarity search. So really, I'm only halfway there. In this essay I'll go the other half and implement the nearest neighbor search. As before, this code will do almost no error handling, and will only work for the uncompressed chembl_27.fps in the local directory.
If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:
- Download chembl_27.fps.gz.
- Download popc.py.
- Run python popc.py to generate the _popc module.
- Download chembl_knn_search.py.
- Run python chembl_knn_search.py to do the default caffeine search.
- Run python chembl_knn_search.py --help to see the command-line help, then try your own queries.
Priority queues and heaps
If you need to find the largest element in an unordered list then you can start by assuming the first item is the largest, then test each remaining item. If it's larger than the currently known largest value then use it as the new largest known value. Once done, you have the largest element. Here's an example of what that algorithm looks like:
data = [5, 8, 5, 9, 8, 7, 4]
largest = data[0]
for value in data[1:]:
    if value > largest:
        largest = value
What if you want to find the two largest elements? It's not hard, though it is tedious and error-prone, to extend the algorithm:
data = [5, 8, 5, 9, 8, 7, 4]
largest = data[0]
second_largest = data[1]
if second_largest > largest:
    largest, second_largest = second_largest, largest
for value in data[2:]:
    if value > second_largest:
        if value > largest:
            second_largest, largest = largest, value
        else:
            second_largest = value
Extending a manual implementation along these lines rapidly becomes infeasible. Fortunately, this is a well-known problem, and the go-to solution is a priority queue, often implemented as a heap.
Python's heapq module
Python doesn't implement a heap data type. Instead, the heapq module implements the heap operations using the list data type. More specifically, it implements a min heap, which in practice makes it faster and easier to get the smallest element rather than the largest; the smallest item is always the first element of the list. Here's an example where I add the elements 5, 8, 4, 7, and 3 to the heap; you can see that heap[0] is always the smallest, though the ordering of the other elements is less obvious:
>>> import heapq
>>> heap = []
>>> heapq.heappush(heap, 5)
>>> heap
[5]
>>> heapq.heappush(heap, 8)
>>> heap
[5, 8]
>>> heapq.heappush(heap, 4)
>>> heap
[4, 8, 5]
>>> heapq.heappush(heap, 7)
>>> heap
[4, 7, 5, 8]
>>> heapq.heappush(heap, 3)
>>> heap
[3, 4, 5, 8, 7]
Next, I'll iteratively remove the smallest item from the list, and show that the results are from smallest to largest:
>>> heapq.heappop(heap)
3
>>> heap
[4, 7, 5, 8]
>>> heapq.heappop(heap)
4
>>> heap
[5, 7, 8]
>>> heapq.heappop(heap)
5
>>> heap
[7, 8]
>>> heapq.heappop(heap)
7
>>> heap
[8]
>>> heapq.heappop(heap)
8
>>> heap
[]
Top-n largest elements in a list
Python's heapq functions can be used to find the k largest elements of a list. Initialize the heap with the first k values of the input. For each of the remaining values in the input, if the value is larger than the current smallest value in the heap, then remove that smallest value and add the new, larger input value. (Python's heapreplace() is more efficient than a heappop() followed by a heappush().)
For example, the following program prints 8 9, which are the two largest values in the data list:
import heapq

data = [5, 8, 5, 9, 8, 7, 4]

# Initialize the heap with the first two elements
heap = []
heapq.heappush(heap, data[0])
heapq.heappush(heap, data[1])

for value in data[2:]:
    # If a bigger value is found, remove the smallest value
    # in the heap and add the new value.
    if value > heap[0]:
        heapq.heapreplace(heap, value)

print(heapq.heappop(heap), heapq.heappop(heap))
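As an aside, the heapq module already packages this pattern as heapq.nlargest(), which returns the n largest items as a list sorted from largest to smallest:

import heapq

data = [5, 8, 5, 9, 8, 7, 4]
# One-call equivalent of the loop above.
print(heapq.nlargest(2, data))  # prints [9, 8]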
Reorganization
A nearest neighbor search is a bit more complicated than a threshold search because there are two stages: 1) filling the heap with k elements, and 2) adding larger elements to the heap.
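Here's a minimal sketch of that two-stage pattern, using plain numbers instead of (score, id) pairs; the real search later in this essay follows the same shape:

import heapq

def k_largest(values, k):
    it = iter(values)
    # Stage 1: fill the heap with (up to) the first k values.
    heap = [value for _, value in zip(range(k), it)]
    heapq.heapify(heap)
    # Stage 2: whenever a larger value shows up, evict the smallest.
    for value in it:
        if value > heap[0]:
            heapq.heapreplace(heap, value)
    return sorted(heap, reverse=True)

print(k_largest([5, 8, 5, 9, 8, 7, 4], 3))  # prints [9, 8, 8]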
To help focus on the nearest neighbor search I've reorganized and modified yesterday's example. There are now two functions to help process the inputs:
- get_query_fp(smiles) - parse the SMILES string into an RDKit molecule, generate the Morgan fingerprint with the appropriate values, and return the fingerprint as a byte string;
- read_fingerprints(infile) - parse the FPS-formatted input file. Skip the header section and return an iterator of the (id, fingerprint) pairs, with the fingerprint as a hex-decoded byte string.
Here they are:
import heapq
from _popc.lib import byte_tanimoto_256
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
# Convert the SMILES into a Morgan fingerprint byte string
def get_query_fp(smiles):
mol = Chem.MolFromSmiles(smiles)
if mol is None:
raise SystemExit(f"Cannot parse SMILES {smiles!r}")
query_rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
mol, radius=2, nBits=2048, useChirality=0,
useBondTypes=1, useFeatures=0)
return DataStructs.BitVectToBinaryText(query_rd_fp)
# Read (id, byte fingerprint) from a file-like object
# containing FPS-formatted fingerprints.
def read_fingerprints(infile):
# Skip the header
for line in infile:
if line[:1] != "#":
break
else:
# Reached the end of file. Stop.
return
# Process the first fingerprint line
target_hex_fp, target_id = line.split()
assert len(target_hex_fp) == 512, "Hard-coded for 2048-bit fingerprints"
yield target_id, bytes.fromhex(target_hex_fp)
# Process the rest of the lines
for line in infile:
target_hex_fp, target_id = line.split()
yield target_id, bytes.fromhex(target_hex_fp)
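To show how read_fingerprints() is meant to be used, here's a quick sanity check with a hand-made, two-record input (the hex strings and ids are made up; any 512 hex characters will do):

import io

# Two hypothetical 2048-bit fingerprints as 512-character hex strings.
fake_fp1 = "ff" * 256
fake_fp2 = "0f" * 256
text = (
    "#FPS1\n"
    "#num_bits=2048\n"
    f"{fake_fp1}\tFAKE1\n"
    f"{fake_fp2}\tFAKE2\n"
)
for target_id, target_fp in read_fingerprints(io.StringIO(text)):
    print(target_id, len(target_fp))  # prints "FAKE1 256" then "FAKE2 256"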
Find the 10 most similar fingerprints (Tanimoto)
I'll walk through the main function, which does the search. First is some initialization code, which converts the query SMILES into a fingerprint and sets up the target fingerprint reader:
def main():
    ## Input parameters
    smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    k = 10
    target_filename = "chembl_27.fps"

    query_fp = get_query_fp(smiles)

    with open(target_filename) as infile:
        reader = read_fingerprints(infile)
Initialize the heap with k elements
Next I'll initialize the heap with the first k elements from the input file. (The zip() stops when the first of range(k) or reader finishes, so this will read at most k elements.)
        # Get up to the first k elements and add them to the heap
        heap = []
        for _, (target_id, target_fp) in zip(range(k), reader):
            score = byte_tanimoto_256(query_fp, target_fp)
            heapq.heappush(heap, (score, target_id))
You'll see that I added a 2-element tuple to the heap. Because of the way Python compares tuples, these heap terms will be compared element-wise: first on the score, with ties broken by comparing the ChEMBL ids ASCIIbetically. This will tend to bias the results toward newer ChEMBL records. Another option might be to use the record's position in the input, with a preference for either earlier-occurring or later-occurring records. Since the records are already mostly in ASCIIbetical order, the above code behaves much like a preference for records which occur later in the file.
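If you did want position-based tie-breaking, here's a sketch of what the heap entries might look like instead (a hypothetical variant, not the code used in the rest of this essay). Negating the position makes later records compare as smaller, so they are the ones evicted when scores tie:

# Hypothetical variant: use (score, -position, id) entries so that
# ties on score prefer records which occur earlier in the file.
heap = []
for position, (target_id, target_fp) in zip(range(k), reader):
    score = byte_tanimoto_256(query_fp, target_fp)
    heapq.heappush(heap, (score, -position, target_id))
# The rest of the search would need to keep counting positions and
# to push/unpack 3-element tuples in the same way.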
Figure out the minimum threshold
Now that I have (up to) k elements in the heap, I know that any new match must have a score larger than the smallest score in the heap, so I'll keep track of that score. Also, if the smallest record has a score of 1.0 then it's impossible to find a higher score, so I can stop the search:
        # Any new match must be larger than the score of the smallest heap element.
        threshold = heap[0][0]

        # No need for a further search if all elements match exactly.
        if threshold == 1.0:
            pass
        else:
Search for more similar matches
Here's the heart of the algorithm. Compute the similarity score for each remaining fingerprint. If the score is larger than the current smallest score in the heap, then remove the smallest term in the heap and add the new one. This might set a new, higher threshold, and if that new threshold is 1.0 then we can stop:
            # Test the rest of the fingerprints
            for target_id, target_fp in reader:
                score = byte_tanimoto_256(query_fp, target_fp)
                if score > threshold:
                    heapq.heapreplace(heap, (score, target_id))

                    # Update the threshold and see if we can stop
                    threshold = heap[0][0]
                    if threshold == 1.0:
                        break
Get the hits
The final step is to print the list of matches, sorted from largest to smallest. What I'll do is heappop() each item into a list of hits. Those will be sorted from smallest to largest, so I'll reverse them, before printing the results (with the score to 3 decimal places):
        # Reverse the list because heappop() returns the items from smallest to largest.
        hits = [heapq.heappop(heap) for _ in range(len(heap))]
        hits.reverse()

        # Print the results from largest to smallest
        for score, target_id in hits:
            print(f"{target_id}\t{score:.3f}")
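As an aside, since the heap is just a Python list of (score, id) tuples, an equivalent way to get the same ordering is to sort a copy of it directly; a small sketch:

# Equivalent to the heappop()/reverse() approach above: sort the
# heap's contents, largest (score, id) first.
hits = sorted(heap, reverse=True)
for score, target_id in hits:
    print(f"{target_id}\t{score:.3f}")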
__main__
Lastly, I need some way to call the main() function, which I'll do in the typical Python way by only calling it if this code is being run as a program:
if __name__ == "__main__":
    main()
The 10 most similar matches (Tanimoto)
Now for what you've been waiting for - the 10 most similar matches to caffeine, based on Tanimoto similarity:
CHEMBL113	1.000
CHEMBL1232048	0.710
CHEMBL446784	0.677
CHEMBL2058173	0.667
CHEMBL1738791	0.667
CHEMBL417654	0.647
CHEMBL286680	0.647
CHEMBL2058174	0.647
CHEMBL1200569	0.641
CHEMBL26455	0.618
Find the 10 most similar fingerprints (cosine)
I'm not done yet. The Willett, Winterman, and Bawden paper compared the results of a Tanimoto similarity search to the results of a cosine similarity search. I want to re-implement that sort of comparison.
Cosine similarity
Since I decided to not use RDKit (which implements CosineSimilarity()) for my search, I'll need to implement it in my _popc module. I'll show just the new C code here and have you download popc.py for the full details:
static double
byte_cosine_256(const unsigned char *fp1, const unsigned char *fp2) {
    int num_words = 2048 / 64;
    int dot_product = 0, popcnt1 = 0, popcnt2 = 0;

    /* Interpret as 64-bit integers and assume possible mis-alignment is okay. */
    uint64_t *fp1_64 = (uint64_t *) fp1, *fp2_64 = (uint64_t *) fp2;

    for (int i=0; i<num_words; i++) {
        dot_product += __builtin_popcountll(fp1_64[i] & fp2_64[i]);
        popcnt1 += __builtin_popcountll(fp1_64[i]);
        popcnt2 += __builtin_popcountll(fp2_64[i]);
    }
    if (popcnt1 == 0 || popcnt2 == 0) {
        return 0.0;
    }
    return (dot_product / sqrt((double) (popcnt1 * popcnt2)));
}
You'll need to (re)generate the shared library by running python popc.py.
Check the cosine implementation
If you think I wrote this correctly in one go, then thank you, but your estimation of my programming skills is too high. I used both RDKit and my own Python code to double- and triple-check the implementation. Here's my check code, which generates a large number of random fingerprints, computes the cosine similarity in three different ways, and compares the results:
from _popc.lib import byte_cosine_256
from rdkit import DataStructs
from chemfp import bitops
import random

N = 2048
bitnos = range(N)

def near(x, y):
    return abs(x-y) < 0.000000001

for i in range(10000):
    n1 = random.randrange(N+1)
    n2 = random.randrange(N+1)
    bitlist1 = random.sample(bitnos, n1)
    bitlist2 = random.sample(bitnos, n2)

    rd1 = DataStructs.ExplicitBitVect(2048)
    rd1.SetBitsFromList(bitlist1)
    rd2 = DataStructs.ExplicitBitVect(2048)
    rd2.SetBitsFromList(bitlist2)
    rd_score = DataStructs.CosineSimilarity(rd1, rd2)

    bytes1 = bitops.byte_from_bitlist(bitlist1, 2048)
    bytes2 = bitops.byte_from_bitlist(bitlist2, 2048)
    popc_score = byte_cosine_256(bytes1, bytes2)

    dot_product = len(set(bitlist1).intersection(bitlist2))
    if n1 and n2:
        expected_score = dot_product / (n1 * n2) ** 0.5
    else:
        expected_score = 0.0

    if not near(popc_score, rd_score) or not near(popc_score, expected_score):
        bitlist1.sort()
        bitlist2.sort()
        print(bitlist1)
        print(bitlist2)
        print("dot_product:", dot_product)
        print(rd_score)
        print(popc_score)
        print(expected_score)
        raise AssertionError
The 10 most similar matches (cosine)
Only a few changes are needed to use cosine similarity instead of Tanimoto similarity. In short, import the new function:
from _popc.lib import byte_tanimoto_256, byte_cosine_256
and change the two occurrences of:
score = byte_tanimoto_256(query_fp, target_fp)
to:
score = byte_cosine_256(query_fp, target_fp)
With that in place, the results are:
CHEMBL113	1.000
CHEMBL1232048	0.832
CHEMBL446784	0.808
CHEMBL2058173	0.803
CHEMBL1738791	0.803
CHEMBL1200569	0.801
CHEMBL417654	0.790
CHEMBL286680	0.790
CHEMBL2058174	0.790
CHEMBL26455	0.767
That list is almost identical to the Tanimoto results. This shouldn't be unexpected. Quoting Willett, Winterman, and Bawden's 1986 paper:
The difference between the two types of coefficient had not been apparent in the earlier property prediction experiments,10 where the data sets were fairly homogeneous in character, with many of them being sets of analogues; in a typical structure file, conversely, a wide range of structural types will be present, and noncommon fragments will play a much larger role in discriminating between structures when a ranking method is used.
HTML output with images
It's a bit difficult to compare the Tanimoto and cosine results. I want to see the structures as well. We can embed the images directly from ChEMBL's servers, so I'll add an option for that.
That proved to be complicated and uninteresting, so I'm not going to review it here. Instead, I took all of the above code and created a program to do a k nearest neighbor search of the ChEMBL data set.
The final code
I went a bit overboard, which isn't that unusual for me. I added command-line argument support (using Python's argparse module) so you can specify the input SMILES string and the value of k, select Tanimoto or cosine search, and choose text or HTML output.
It's entirely too complicated to show here. Instead, you can download popc.py (use python popc.py to have cffi generate the _popc module) and chembl_knn_search.py and look at all the implementation details.
For your edification, here's the side-by-side comparison of the k=20 nearest neighbor search for caffeine.
[Table: side-by-side comparison of the 20 nearest neighbors of caffeine by Tanimoto and cosine similarity, with structure images.]
Conclusion
An efficient nearest neighbor search is harder than a threshold search because managing a priority queue requires some thought. With Python we have the advantage of a built-in module that someone else has already developed and debugged, along with packages like cffi which make it easy to pull in C code for performance.
Things were a bit different 30+ years ago, when Willett, Winterman, and Bawden were first doing this sort of study. While priority queues were well known, using one may have required pulling out a textbook. Or perhaps they didn't even use a priority queue? Another option is to simply compute all of the similarity scores, sort the results, and display the largest. This might be slower and more memory intensive for large data sets, but it's easier to implement.
I tried reading the 1986 paper and its citations (and some of their citations) to find this detail, but none explicitly said how they implemented ranking.
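To make that sort-everything alternative concrete, here's a minimal sketch which reuses the helper functions defined earlier in this essay (it assumes query_fp and k are set up as in main()):

# Brute-force alternative: score every record, sort, keep the top k.
scores = []
with open("chembl_27.fps") as infile:
    for target_id, target_fp in read_fingerprints(infile):
        scores.append((byte_tanimoto_256(query_fp, target_fp), target_id))
scores.sort(reverse=True)
for score, target_id in scores[:k]:
    print(f"{target_id}\t{score:.3f}")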