Chemfp uses 64-bit IEEE 754 floating point (a.k.a. double, binary64, or float64) to represent cheminformatics fingerprint similarity scores.
Some fingerprint search systems instead use 32-bit IEEE 754 floating point (a.k.a. single, binary32, or float32), usually to reduce the amount of memory needed, and sometimes also for the higher performance of 32-bit floating-point operations.
I chose to represent everything in binary64 for a simple reason: chemfp is designed to work with Python. Python's native float data type is binary64, and I find that mixing 32-bit and 64-bit floats results in subtle conversion problems that are frustrating to figure out.
(Consider that the binary32 representation closest to 0.6 has a value slightly larger than 0.6, while the corresponding binary64 representation's value is slightly smaller than 0.6.)
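That 0.6 behavior is easy to see directly with Python's struct module; a quick sketch:

```python
import struct

# Round-trip 0.6 through binary32 and compare it with Python's binary64 value.
f32_of_0_6 = struct.unpack("f", struct.pack("f", 0.6))[0]
print(f"{f32_of_0_6:.17f}")  # 0.60000002384185791 -- slightly above 0.6
print(f"{0.6:.17f}")         # 0.59999999999999998 -- slightly below 0.6
```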
Other than possible conversion problems, is there any reason to NOT use a binary32 value?
In this essay I'll show that binary32 isn't able to distinguish some Tanimoto scores for fingerprints which are 4142 bits or longer, though in practice this is essentially never a concern.
Real-world Tanimoto scores and binary32
As a brief background, most fingerprints are binary bitstrings of length around 1,000 bits. Let ‖fp‖ be the number of on-bits in a fingerprint bitstring. The most common way to compare two fingerprints is the Jaccard similarity, which in cheminformatics is referred to as the Tanimoto. It is computed as ‖fp1&fp2‖/‖fp1|fp2‖, and ranges between 0.0 and 1.0. In chemfp, 0/0 = 0.0.
A simple examination shows that 64 bits is overkill. I can't recall ever reading a paper using over 2¹⁴ = 16,384 bits. This means the numerator and denominator of every Tanimoto score each require at most 14 bits, so both should easily fit in 32 bits.
It seems, then, like a binary32 value should work. However, some of those bits are used to store values that aren't needed for Tanimoto scores. The sign bit isn't needed because a Tanimoto score can never be negative. A further 8 bits are reserved for 254 possible exponents, but only 7 or so of those exponents are needed.
In the worst case, we only have 24 bits of significand precision, which leaves 12 bits each for the numerator and denominator, corresponding to a maximum fingerprint length of 2¹² = 4096 bits.
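The 24-bit significand limit is easy to demonstrate: 2²⁴ + 1 is the first integer that binary32 cannot represent exactly. A quick check:

```python
import struct

def as_f32(x):
    # Round a Python float (binary64) to the nearest binary32 value.
    return struct.unpack("f", struct.pack("f", x))[0]

print(as_f32(2.0**24) == 2**24)          # True: 16,777,216 is exact
print(as_f32(2.0**24 + 1) == 2**24 + 1)  # False: it rounds back to 2**24
```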
But surely we can do a bit better than 4096, because the values under 0.1 use a different exponent. Perhaps we can scale up to 4096 / 0.9 ≈ 4551 bits? Furthermore, some of those Tanimoto ratios are not in reduced form; that is, 1/2 = 2/4 = 4/8 = 0.5. Perhaps that gives some more breathing room?
When are 32-bit IEEE 754 floats not enough?
What is the smallest fingerprint length with two distinct possible scores that cannot be distinguished using binary32? The brute-force solution is to enumerate all ratios and report when there are duplicates, as in the following:
# Find the smallest fingerprint length where two different Tanimoto
# scores have the same 32-bit float representation.
import struct

# Convert the value to the 4-byte/32-bit representation.
pack_f32 = struct.Struct("f").pack

# Mapping from packed_score (float-as-bytes) to (value, numerator, denominator)
seen_packed_scores = {}

# Try each ratio
for denominator in range(1, 5000):
    for numerator in range(1, denominator):
        # This uses Python's native 64-bit float
        score = numerator / denominator
        # Get the 4-byte representation of the score
        packed_score = pack_f32(score)
        # Check if that packed score was seen before
        prev = seen_packed_scores.get(packed_score, None)
        if prev is None:
            # Not seen, so record it in case I see it again.
            seen_packed_scores[packed_score] = (score, numerator, denominator)
            continue
        # The packed score can be the same if the scores are the same.
        # (4/8 and 5/10 give the same scores)
        prev_score, prev_numerator, prev_denominator = prev
        if score != prev_score:
            print(f"{score} = {numerator}/{denominator} and {prev_numerator}/{prev_denominator}")
            raise SystemExit(1)
print("No duplicates found.")
This takes about 10 seconds before it reports:
0.5111057460164172 = 2117/4142 and 2094/4097
That means that binary32 cannot distinguish between the Tanimoto scores 2117/4142 and 2094/4097, so cannot be used to represent all possible Tanimoto scores for fingerprints that are 4142 bits long or longer.
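That specific pair is easy to verify; a quick check:

```python
import struct

pack_f32 = struct.Struct("f").pack

a = 2117 / 4142
b = 2094 / 4097
print(a == b)                      # False: binary64 keeps the scores distinct
print(pack_f32(a) == pack_f32(b))  # True: binary32 collapses them
```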
Don't worry!
On the other hand, do you really need to represent all possible Tanimoto scores? Most people don't work with fingerprints beyond about 4096 bits, and most fingerprints are sparse, with an on-bit density of less than 10%. Even if the fingerprint type is 8192 bits long and has a maximum bit density of 20%, the Tanimoto denominator is at most 2 × 0.2 × 8192 ≈ 3277, which is below the 4142 limit.
I honestly don't think that real-world work will run into this issue.
To be sure, it is possible to get Tanimoto scores in this range. The densest fingerprints I know are the RDKit fingerprints. The following program uses chemfp's interface to RDKit to find the ChEMBL records with a successively increasing number of bits set, using 16384-bit fingerprints:
import chemfp
from chemfp import bitops

fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=16384")
max_num_bits = 0
for id, fp in fptype.read_molecule_fingerprints("chembl_27.sdf.gz", errors="ignore"):
    num_bits = bitops.byte_popcount(fp)
    if num_bits > max_num_bits:
        print(id, num_bits)
        max_num_bits = num_bits
The output until I stopped it manually is:
CHEMBL153534 819
CHEMBL440060 1244
CHEMBL440245 1290
CHEMBL440249 2656
CHEMBL504077 3546
CHEMBL501923 3600
CHEMBL501238 3648
CHEMBL441925 3783
CHEMBL439119 5981
CHEMBL438241 5999
CHEMBL1207820 6007
CHEMBL411986 6036
CHEMBL263314 6069
CHEMBL268814 6124
CHEMBL404974 6141
CHEMBL405910 6196
CHEMBL410076 7382
^C
A spot-check with CHEMBL411986 and CHEMBL1207820, using the SQLite3 file from the ChEMBL distribution to get the corresponding SMILES, is:
import sqlite3
import chemfp
from chemfp import bitops

db = sqlite3.connect("chembl_27.db")
(smiles1, id1), (smiles2, id2) = db.execute("""
SELECT canonical_smiles, chembl_id FROM compound_structures, chembl_id_lookup
 WHERE chembl_id_lookup.chembl_id IN ("CHEMBL411986", "CHEMBL1207820")
   AND chembl_id_lookup.entity_id = compound_structures.molregno""")

fptype = chemfp.get_fingerprint_type("RDKit-Fingerprint fpSize=16384")
fp1 = fptype.parse_molecule_fingerprint(smiles1, "smistring")
fp2 = fptype.parse_molecule_fingerprint(smiles2, "smistring")
print(f"{id1} has {bitops.byte_popcount(fp1)} bits set.")
print(f"{id2} has {bitops.byte_popcount(fp2)} bits set.")
print(f"intersection has {bitops.byte_intersect_popcount(fp1, fp2)} bits set.")
print(f"union has {bitops.byte_popcount(bitops.byte_union(fp1, fp2))} bits set.")
print(f"Tanimoto({id1}, {id2}) = {bitops.byte_tanimoto(fp1, fp2)}.")
The above program reports:
CHEMBL1207820 has 6007 bits set.
CHEMBL411986 has 6036 bits set.
intersection has 5744 bits set.
union has 6299 bits set.
Tanimoto(CHEMBL1207820, CHEMBL411986) = 0.9118907763137006.
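Dividing the reported counts reproduces the score and, more to the point, shows that the denominator is above the 4142 threshold found earlier; a quick check:

```python
# Intersection and union popcounts from the spot-check above.
intersection, union = 5744, 6299
print(intersection / union)  # 0.9118907763137006
print(union > 4142)          # True: in the range where binary32 can collide
```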
This confirms there may be an issue with some fingerprints. Even with 8192-bit fingerprints the Tanimoto is still 4767/5115, whose denominator is in the range that causes problems.
However, even supposing I could find a real-world pair of fingerprints whose actual Tanimoto scores were different but whose binary32 scores were the same, the real question is: does that make a difference?
I think not. After all, fingerprints do not give an accurate enough representation of a molecule to justify four digits of precision.
So, don't worry! (Or use a tool like chemfp which doesn't use binary32 floats. ;) ).
Farey sequence
The Python program was easy to write, but a bit slow, and it used a lot of memory to store all of the distinct values. That's why I called it the brute-force solution.
To be sure, it only took a few seconds, which was more than fast enough to answer the question posed. But what if performance was more critical?
The usual solutions are to switch to PyPy (a faster implementation of the Python language), or to switch to a language like C which usually has less performance overhead.
However, C has no built-in dictionary type, making a direct translation difficult. I could use C++ or another more sophisticated language; instead, I'll switch algorithms.
Twenty-some-odd years ago, John MacCuish pointed out to me that the possible Tanimoto scores for a given length N were the same as the terms in the Farey sequence for N. This was part of the paper he co-authored describing issues with ties in proximity and clustering compounds.
The Farey sequence for N is the ordered sequence of distinct values between 0 and 1 where each value is a rational whose denominator, in lowest terms, is at most N. There is a simple algorithm to generate the reduced rationals that make up that sequence. The following is a shortened version of the Python example in the Wikipedia entry, along with its output for N=4:
def farey_sequence(n):
    (a, b, c, d) = (0, 1, 1, n)
    yield (0, 1)
    while c <= n:
        k = (n + b) // d
        (a, b, c, d) = (c, d, k * c - a, k * d - b)
        yield a, b
>>> list(farey_sequence(4))
[(0, 1), (1, 4), (1, 3), (1, 2), (2, 3), (3, 4), (1, 1)]
That's pretty easy to write in C. The following, farey_check.c, takes a single command-line value N and reports the first pair of ratios which cannot be distinguished using a C float (which is a binary32):
/* Call this "farey_check.c" */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int a, b, c, d;
    int next_a, next_b, next_c, next_d;
    int k;
    long n;
    float prev_score, score;
    char *endptr;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s N\n", argv[0]);
        exit(1);
    }
    n = strtol(argv[1], &endptr, 10);
    if (*endptr != '\0' || n < 1 || n > 10000) {
        fprintf(stderr, "N must be an integer between 1 and 10,000\n");
        exit(1);
    }

    /* Modified from https://en.wikipedia.org/wiki/Farey_sequence#Next_term */
    prev_score = 0.0;
    a = 0, b = 1, c = 1, d = n;
    while (c <= n) {
        /* Get the next value in the Farey sequence */
        k = (n + b) / d;
        next_a = c;
        next_b = d;
        next_c = k * c - a;
        next_d = k * d - b;
        /* Compute the ratio */
        score = ((float) next_a) / next_b;
        /* Check if a 32-bit float can distinguish this value */
        /* from the previous. If not, report both ratios and stop. */
        if (score == prev_score) {
            printf("%f = %d/%d and %d/%d\n", score, a, b, next_a, next_b);
            break;
        }
        prev_score = score;
        a = next_a;
        b = next_b;
        c = next_c;
        d = next_d;
    }
    return 0;
}
I'll compile and run it:
% cc -O2 farey_check.c -o farey_check
% ./farey_check 5000
0.500100 = 2499/4997 and 2498/4995
% time ./farey_check 5000
0.500100 = 2499/4997 and 2498/4995
0.056u 0.002s 0:00.06 83.3% 0+0k 0+0io 0pf+0w
It took about 60 milliseconds to find a failure case.
However, that ratio uses larger denominators than the one I found earlier, which was:
0.5111057460164172 = 2117/4142 and 2094/4097
The problem is that the farey_check program finds a failure case for the given value of N, rather than the minimum value of N which has a failure case.
Find minimum N using a Farey sequence
The obvious approach is to test every value of N; that took about 80 seconds to run. While C is significantly faster than Python, the Farey sequence for N+1 includes all of the terms in the Farey sequence for N, so most of the work is repeated, which really slows things down.
Instead, I'll use a binary search.
BEWARE! Binary search is notoriously tricky to implement correctly. What I usually do is either use Python's bisect module directly (which contains several binary search variants), or look to its implementation for guidance. Even then, be aware that one of the steps, noted below, may overflow a C integer.
Here's the full search code, which does a binary search between 1 and 10,000 to find the minimum value of N with a failure case, and to report that failure case:
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if two consecutive terms in the Farey sequence for n have the */
/* same 32-bit float representation, else return 0. */
/* Modified from https://en.wikipedia.org/wiki/Farey_sequence#Next_term */
int farey_check(int n, int *a1, int *b1, int *a2, int *b2) {
    int a, b, c, d;
    int next_a, next_b, next_c, next_d;
    int k;
    float score, prev_score;

    prev_score = 0.0;
    a = 0, b = 1, c = 1, d = n;
    while (c <= n) {
        /* Get the next value in the Farey sequence */
        k = (n + b) / d;
        next_a = c;
        next_b = d;
        next_c = k * c - a;
        next_d = k * d - b;
        /* Compute the ratio */
        score = ((float) next_a) / next_b;
        /* Check if a 32-bit float can distinguish this value from */
        /* the previous. If not, record the ratios and return 1. */
        if (score == prev_score) {
            *a1 = a;
            *b1 = b;
            *a2 = next_a;
            *b2 = next_b;
            return 1;
        }
        prev_score = score;
        a = next_a;
        b = next_b;
        c = next_c;
        d = next_d;
    }
    /* No failure case found. */
    return 0;
}

int main(int argc, char *argv[]) {
    int lo = 1;
    int hi = 10000;
    int mid;
    int a1, b1, a2, b2;

    /* bisect_left from Python's bisect module */
    while (lo < hi) {
        mid = (lo + hi) / 2; /* NOTE: Not valid if lo + hi > MAXINT */
        if (farey_check(mid, &a1, &b1, &a2, &b2) < 1) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    /* Re-run with lo to get the ratios for the failure case. */
    /* (The last farey_check call may not have been for lo itself.) */
    farey_check(lo, &a1, &b1, &a2, &b2);
    printf("N: %d\n", lo);
    printf("%f = %d/%d and %d/%d\n", ((float) a1)/b1, a1, b1, a2, b2);
    return 0;
}
This takes about 0.5 seconds to report:
N: 4142
0.511106 = 2094/4097 and 2117/4142
which is about 20x faster than the original Python program, and it uses constant memory.
It also took several times longer to write. ;)
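The boundary can also be double-checked from Python using the same next-term recurrence; a sketch that takes a few seconds to run:

```python
import struct

pack_f32 = struct.Struct("f").pack

def farey_has_collision(n):
    # True if two consecutive terms in the Farey sequence for n
    # have the same binary32 representation.
    a, b, c, d = 0, 1, 1, n
    prev = pack_f32(0.0)
    while c <= n:
        k = (n + b) // d
        a, b, c, d = c, d, k * c - a, k * d - b
        packed = pack_f32(a / b)
        if packed == prev:
            return True
        prev = packed
    return False

print(farey_has_collision(4141))  # False
print(farey_has_collision(4142))  # True
```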
Tversky similarity
A secondary reason that chemfp uses binary64 is to store Tversky similarity search results in the same data structure as Tanimoto similarity search results.
Tversky similarity uses parameters alpha and beta to give different weights to the shared and the unshared bits. If alpha = beta = 1.0 then it's identical to the Tanimoto similarity.
Suppose N=1024 bits and alpha = 0.15 and beta = 0.85. The possible Tversky scores still lie in a Farey sequence, but in this case the Farey sequence for N×20, because alpha = 3/20 and beta = 17/20. And not all of the terms in that Farey sequence correspond to an achievable Tversky score.
I think it's easier to store the results in a binary64 than to figure out if binary32 has sufficient precision to store the results of a given similarity method, or to worry about the possible consequences if it does not.
Advertisement
Download chemfp and try it out using:
python -m pip install chemfp -i https://chemfp.com/packages/
No license key needed to use it with the pre-built ChEMBL fingerprints.
Industrial and no-cost academic licensing available. Contact me for a license key to evaluate chemfp with your data sets and ask about pricing.