In the previous two essays I showed how to search chembl_27.fps to find records whose fingerprints are similar to a query fingerprint, then how to implement a nearest-neighbor search and replace Tanimoto similarity with cosine similarity. The final program took about 5 seconds.
In this essay I'll show how to increase the search performance by shifting more of the work to a load step. This sort of precomputation can be useful if the load step is done once, with the extra overhead shared over multiple searches.
If you want to skip all the discussion, here are the instructions to use the final code, which assumes you have a gcc or clang C compiler and that RDKit's Python bindings are installed:
- Download chembl_27.fps.gz.
- Download popc.py.
- Run

    python popc.py

  to generate the _popc module.
- Download chembl_in_memory_search.py.
- Run

    python chembl_in_memory_search.py --times

  to do the default caffeine search and see the internal timings.
- Run

    python chembl_in_memory_search.py --help

  to see the command-line help, then try your own queries, including from a SMILES or FPS file.
The new program averages about 0.5 seconds per query, with an initial 4-5 second load time.
Measuring FPS parsing overhead
I profiled the program and found that most of the time was spent in parsing the FPS file, not in computing the Tanimoto. To profile, I wrapped the entire program in a function (the profiler I'm using works on functions), marked it with the @profile decorator, which tells kernprof which function(s) to profile, and profiled the result with:
% kernprof -l ssimsearch_bytes.py
then viewed the profiler results with
% python -m line_profiler ssimsearch_bytes.py.lprof
Timer unit: 1e-06 s

Total time: 13.0334 s
File: ssimsearch_bytes.py
Function: main at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           @profile
     2                                           def main():
     3         1       6859.0   6859.0      0.1      from _popc.lib import byte_tanimoto_256
     4
     5         1     224418.0 224418.0      1.7      from rdkit import Chem
     6         1          6.0      6.0      0.0      from rdkit import DataStructs
     7         1       5629.0   5629.0      0.0      from rdkit.Chem import rdMolDescriptors
     8
     9         1          1.0      1.0      0.0      smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    10         1          1.0      1.0      0.0      threshold = 0.7
    11
    12         1      11504.0  11504.0      0.1      mol = Chem.MolFromSmiles(smiles)
    13         1          2.0      2.0      0.0      if mol is None:
    14                                                   raise SystemExit(f"Cannot parse SMILES {smiles!r}")
    15
    16         2         98.0     49.0      0.0      rd_query_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
    17         1          1.0      1.0      0.0          mol, radius=2, nBits=2048, useChirality=0,
    18         1          0.0      0.0      0.0          useBondTypes=1, useFeatures=0)
    19         1         15.0     15.0      0.0      query_fp = DataStructs.BitVectToBinaryText(rd_query_fp)
    20
    21
    22         1         97.0     97.0      0.0      infile = open("chembl_27.fps")
    23         7          8.0      1.1      0.0      for i in range(6):
    24         6         32.0      5.3      0.0          line = infile.readline()
    25         1          3.0      3.0      0.0      assert line.startswith("#date")
    26
    27   1941411    3356477.0      1.7     25.8      for line in infile:
    28   1941410    3207597.0      1.7     24.6          target_hex_fp, target_id = line.split()
    29   1941410    2605370.0      1.3     20.0          target_fp = bytes.fromhex(target_hex_fp)
    30
    31   1941410    2071065.0      1.1     15.9          score = byte_tanimoto_256(query_fp, target_fp)
    32   1941410    1544099.0      0.8     11.8          if score >= threshold:
    33         2         98.0     49.0      0.0              print(f"{target_id}\t{score}")
Lines 27-29 are all part of parsing the FPS file, and together they take about 70% of the search time. The actual Tanimoto calculation (line 31) and score test (line 32) together take about 28% of the time.
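By the way, if you haven't used kernprof before, the setup looks roughly like the following sketch. The try/except fallback is my own addition: when run under kernprof -l, the profile decorator is injected as a builtin, and the fallback keeps the script runnable under plain python as well.

    try:
        profile  # defined as a builtin when run under "kernprof -l"
    except NameError:
        # Not running under kernprof, so use a do-nothing decorator.
        def profile(func):
            return func

    @profile
    def main():
        ...  # the entire search program, wrapped in one function

    main()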
Multiple queries
What if you want to search for multiple queries? One implementation might do the full search of the first query, then of the second, and so on. For N queries this would take about 5×N seconds.
The above profile suggests that if you could load the fingerprint data into a more easily processed form, where the pre-processing was done once, then the overall performance would be faster. This may be useful in a web service where extra work can be done during startup in order to improve the per-request performance.
I changed the main loop (lines 27-33 in the above) into two stages: stage 1 loads all of the fingerprints into memory, and stage 2 searches those fingerprints. I also added timing measurements for each stage:
import time

t1 = time.time()

# Stage 1: parse the FPS records and load the fingerprints into memory.
targets = []
for line in infile:
    target_hex_fp, target_id = line.split()
    target_fp = bytes.fromhex(target_hex_fp)
    targets.append( (target_id, target_fp) )

t2 = time.time()

# Stage 2: search the in-memory fingerprints.
for target_id, target_fp in targets:
    score = byte_tanimoto_256(query_fp, target_fp)
    if score >= threshold:
        print(f"{target_id}\t{score}")

t3 = time.time()
print(f"Load: {t2-t1:.2f} Search: {t3-t2:.2f}")
This reports:
Load: 4.64 Search: 0.79
What multiple queries?
Previously I supported a hard-coded SMILES string, or a user-defined SMILES string on the command-line. If there are going to be multiple queries, what are they, and where do they come from?
I decided I was going to support either a SMILES file or an FPS file as input, based on whether or not the given query filename ends with .smi. I already have a way to read_fingerprints() from an FPS file, and I have a way to compute the RDKit Morgan fingerprint given a SMILES, so all I need is a way to read a SMILES file that yields the same (id, fingerprint) pairs as read_fingerprints().
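As a reminder, here is a minimal sketch of what such an FPS reader can look like. The real read_fingerprints() comes from the earlier essays; this version simply skips every header line rather than checking them:

    def read_fingerprints(infile):
        for line in infile:
            # Skip the FPS header lines, which start with "#"
            if line.startswith("#"):
                continue
            # Each record line is <hex fingerprint> <whitespace> <id>
            hex_fp, id = line.split()
            yield id, bytes.fromhex(hex_fp)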
SMILES file
There are two main types of SMILES files: Daylight-like and CSV-like. A Daylight SMILES file contains one record per line. Each line contains a SMILES followed by a whitespace character, with the rest of the line used as the molecule title/id.
A CSV-like SMILES file contains a SMILES followed by a space or tab separator character, followed by a molecule title/id, followed optionally by more separator-delimited fields. In addition, the first line may contain column headers.
I'm going to support Daylight-like SMILES files.
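For example, a two-record Daylight-like SMILES file might look like this (the caffeine SMILES is the one used earlier; the benzoic acid record is my own illustration):

    CN1C=NC2=C1C(=O)N(C(=O)N2C)C caffeine
    OC(=O)c1ccccc1 benzoic acid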
SMILES file parser
It's pretty easy to read either SMILES file variant, though it gets more complex if you want full configurability. In short: for every line in the file, split on the first whitespace. The first field is the SMILES and the second is the identifier. Convert the SMILES to a fingerprint, or skip the record if the SMILES could not be processed. That gives the needed (id, fingerprint) pairs:
# Read a SMILES file, convert the SMILES to a fingerprint,
# and yield (id, fingerprint) pairs for each parsable record.
def read_structure_fingerprints_from_smiles_file(infile):
    for lineno, line in enumerate(infile, 1):
        # Expecting the format: <SMILES> + whitespace + <title> + "\n"
        smiles, id = line[:-1].split(None, 1)
        try:
            # Get the corresponding RDKit Morgan fingerprint
            fp = get_query_fp(smiles)
        except ValueError as err:
            # Skip records which can't be parsed
            sys.stderr.write(f"ERROR: {err}: Skipping line {lineno}\n")
            continue
        yield id, fp
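The get_query_fp() helper isn't shown in this essay. A sketch of it, using the same Morgan fingerprint parameters as the profiled code above, might be:

    from rdkit import Chem, DataStructs
    from rdkit.Chem import rdMolDescriptors

    def get_query_fp(smiles):
        # Parse the SMILES, reporting failures as a ValueError
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Cannot parse SMILES {smiles!r}")
        # Compute the 2048-bit Morgan fingerprint and return it as bytes
        rd_fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(
            mol, radius=2, nBits=2048, useChirality=0,
            useBondTypes=1, useFeatures=0)
        return DataStructs.BitVectToBinaryText(rd_fp)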
Reporting search results
Since there may be multiple queries, each with 0 or more hits, I need some way to report which query matches which targets. I decided to use a simple three-column tab-separated format, with the query id in the first column, the target id in the second, and the score in the third, like the following:
Amygdalin	CHEMBL2303964	0.796
Amygdalin	CHEMBL2303958	0.700
Boron nitride	*	*
Bicarbonate	CHEMBL1353	0.875
Bicarbonate	CHEMBL2106975	0.875
Benzoic acid	CHEMBL541	1.000
A problem with this format is there's no obvious way to say that a query has no hits. Perhaps those queries should be omitted from the output? What I decided was to have one output line for that case, with * for both the target id and the score. You can see that in the Boron nitride line, above.
A search implementation
Assuming targets is a list of (target id, target fingerprint) pairs, and query_reader is an iterator of (query id, query fingerprint) pairs (either from read_fingerprints() or read_structure_fingerprints_from_smiles_file()), the following shows the structure of the main search loop and reporting:
for query_id, query_fp in query_reader:
    has_match = False
    # Check each of the targets
    for target_id, target_fp in targets:
        score = similarity_function(query_fp, target_fp)
        # If it's similar enough, print details.
        if score >= threshold:
            has_match = True
            print(f"{query_id}\t{target_id}\t{score:.3f}")
    # Print something in case there were no matches.
    if not has_match:
        print(f"{query_id}\t*\t*")
(The final code includes a count of the number of queries, which is used for reporting.)
Command-line options with argparse
There are several Python packages for parsing command-line arguments. I use argparse because it's part of Python's standard library. Many people are fond of the third-party package click.
I decided I wanted to allow a single query SMILES input or input from either a SMILES file or an FPS file. I also wanted to be able to change the threshold, specify the location of the chembl_27.fps file (in case it isn't in the current directory), and support a --times option to report the load and search times.
Here's what the program's --help output looks like:
usage: chembl_in_memory_search.py [-h] [--threshold THRESHOLD]
                                  [--query SMILES | --queries FILENAME]
                                  [--times]
                                  [FILENAME]

in-memory ChEMBL fingerprint search

positional arguments:
  FILENAME              location of chembl_27.fps

optional arguments:
  -h, --help            show this help message and exit
  --threshold THRESHOLD, -t THRESHOLD
                        minimum threshold (default: 0.7)
  --query SMILES        query smiles
  --queries FILENAME    input query file (SMILES or FPS)
  --times               print timing information to stderr
This sort of configuration is pretty easy to do with argparse. The hardest part is that the single-query SMILES search and the file-based search are mutually exclusive, meaning they can't both be specified at the same time.
parser = argparse.ArgumentParser(
    description = "in-memory ChEMBL fingerprint search")

parser.add_argument("--threshold", "-t", type=float, default=0.7,
                    help="minimum threshold (default: 0.7)")

g = parser.add_mutually_exclusive_group()
g.add_argument("--query", metavar="SMILES",
               help="query smiles")
g.add_argument("--queries", metavar="FILENAME",
               help="input query file (SMILES or FPS)")

parser.add_argument("--times", action="store_true",
                    help="print timing information to stderr")
parser.add_argument("targets", metavar="FILENAME",
                    default="chembl_27.fps", nargs="?",
                    help="location of chembl_27.fps")
An argparse.ArgumentParser() instance (like parser in the above) is usually used something like this:
args = parser.parse_args()
print("Threshold:", args.threshold)
That is, parse_args() parses the command line and returns an argparse.Namespace() instance with the parsed command-line parameters accessible as attributes.
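As a quick check of the mutually exclusive group, specifying both query options causes argparse to exit with an error. Roughly (I've abbreviated the usage line):

    % python chembl_in_memory_search.py --query CCO --queries wikipedia_10.smi
    usage: chembl_in_memory_search.py [-h] [--threshold THRESHOLD] ...
    chembl_in_memory_search.py: error: argument --queries: not allowed with argument --query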
Setting up the query reader
The last tricky part to explain is how to handle both --query SMILES and --queries FILENAME. I decided to have a function which returns a query reader, that is, something which can be iterated over to get the successive (query id, query fingerprint) pairs. I have all the parts - I just need to hook them together correctly. And, if neither option is specified, I'll have the program search for caffeine, with a warning that that's what it will do.
# Helper function to figure out how the queries are specified
def open_query_reader(args):
    if args.query is None:
        if args.queries is None:
            # Default search
            sys.stderr.write("No query specified, using caffeine.\n")
            query_fp = get_query_fp("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
            return [("caffeine", query_fp)]
        elif args.queries.endswith(".smi"):
            # If --queries ends with ".smi"
            return read_structure_fingerprints_from_smiles_file(open(args.queries))
        else:
            # Otherwise, assume it's an FPS file
            return read_fingerprints(open(args.queries))
    else:
        # User specified a --query SMILES
        try:
            query_fp = get_query_fp(args.query)
        except ValueError as err:
            raise SystemExit(f"Cannot parse --query SMILES {args.query!r}")
        return [("Query", query_fp)]
chembl_in_memory_search.py
I've walked through all of the important parts, though there are a few missing connector pieces (sketched below). Rather than show the full program here, you can download chembl_in_memory_search.py for yourself. You'll need to download and run popc.py in order to generate the _popc Python module. (See Monday's essay for more details.)
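Here is a rough sketch of how those connector pieces might fit together, using the functions defined above (parser, open_query_reader(), read_fingerprints(), and byte_tanimoto_256()). This is my sketch, not the downloadable program, which differs in details:

    import sys
    import time

    def main():
        args = parser.parse_args()          # the parser configured earlier
        query_reader = open_query_reader(args)

        # Stage 1: load all of the target fingerprints into memory.
        t1 = time.time()
        with open(args.targets) as infile:
            targets = list(read_fingerprints(infile))
        t2 = time.time()

        # Stage 2: search the in-memory fingerprints with each query.
        num_queries = 0
        for query_id, query_fp in query_reader:
            num_queries += 1
            has_match = False
            for target_id, target_fp in targets:
                score = byte_tanimoto_256(query_fp, target_fp)
                if score >= args.threshold:
                    has_match = True
                    print(f"{query_id}\t{target_id}\t{score:.3f}")
            if not has_match:
                print(f"{query_id}\t*\t*")
        t3 = time.time()

        # Report the per-stage times, in the style of the --times output.
        if args.times:
            sys.stderr.write(
                f"load: {t2-t1:.2f} search: {t3-t2:.2f} "
                f"#queries: {num_queries} #/sec: {num_queries/(t3-t2):.2f}\n")

    if __name__ == "__main__":
        main()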
Search Wikipedia structures
The fun Wikipedia Chemical Structure Explorer is an interactive tool to explore the chemical structures scraped from Wikipedia. The structures can be downloaded as a SMILES file. It currently contains 17,846 structures, but it's not actually a SMILES file because the id and SMILES fields are swapped. I'll extract the first 10 to use for my timings, using awk to swap the two fields (the -F\t option configures awk to use the tab character as the field separator):
% head -10 smiles.txt | awk '-F\t' '{print $2, $1}' > wikipedia_10.smi
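The same transformation in Python, if you prefer it over awk (a sketch, assuming the download is named smiles.txt as in the awk example):

    # Swap the two tab-separated fields of the first 10 records,
    # writing a Daylight-like SMILES file.
    from itertools import islice

    with open("smiles.txt") as infile, open("wikipedia_10.smi", "w") as outfile:
        for line in islice(infile, 10):
            fields = line.rstrip("\n").split("\t")
            outfile.write(f"{fields[1]} {fields[0]}\n")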
Many if not most of the Wikipedia structures are in ChEMBL, along with similar structures, so I'll use a high similarity threshold in order to limit the number of lines in the output:
% python chembl_in_memory_search.py --queries wikipedia_10.smi --threshold 0.9
Ammonia	CHEMBL1160819	1.000
Aspirin	CHEMBL25	1.000
Acetylene	CHEMBL116336	1.000
Adenosine triphosphate	CHEMBL14249	1.000
Adenosine triphosphate	CHEMBL339385	0.942
Adenosine triphosphate	CHEMBL437508	0.942
Adenosine triphosphate	CHEMBL490984	1.000
Adenosine triphosphate	CHEMBL591905	0.942
Adenosine triphosphate	CHEMBL407938	0.942
Adenosine triphosphate	CHEMBL14830	0.981
Adenosine triphosphate	CHEMBL1331383	0.942
Adenosine triphosphate	CHEMBL1626485	0.942
Adenosine triphosphate	CHEMBL1610462	0.925
Adenosine triphosphate	CHEMBL1162201	0.942
Adenosine triphosphate	CHEMBL4298329	0.942
Ampicillin	CHEMBL174	1.000
Ampicillin	CHEMBL453388	0.978
Ampicillin	CHEMBL1078033	0.978
Ampicillin	CHEMBL1473908	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Chemistry of ascorbic acid	CHEMBL196	1.000
Chemistry of ascorbic acid	CHEMBL486293	1.000
Chemistry of ascorbic acid	CHEMBL1161421	1.000
Amphetamine	CHEMBL405	1.000
Amphetamine	CHEMBL19393	1.000
Amphetamine	CHEMBL612	1.000
Amphetamine	CHEMBL287386	0.950
Amphetamine	CHEMBL554211	0.950
Amphetamine	CHEMBL1788294	0.950
Aspartame	CHEMBL171679	1.000
Aspartame	CHEMBL312245	1.000
Aspartame	CHEMBL2110238	1.000
Amoxicillin	CHEMBL1082	1.000
Amoxicillin	CHEMBL1367635	1.000
Amoxicillin	CHEMBL1433952	1.000
If I add the --times option, then I see I'm doing about 2 searches per second:
load: 4.57 search: 5.01 #queries: 10 #/sec: 2.00
The equivalent chemfp search
Chemfp is a high-performance cheminformatics similarity search package I sell. Its command-line is similar to the chembl_in_memory_search.py program I developed above. I'll try it out:
% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 0.00 read 0.00 search 8.55 output 0.00 total 8.56
This means it takes about 9 seconds to search the file, which is about the same time as the simple Python code I wrote!
--scan vs. --memory searches
That was unexpected. The problem is that chemfp has two search modes. The --scan mode searches the file chunk by chunk (like Monday's essay), while --memory loads the data set into memory before searching.
Years ago, the tradeoff point between the two modes was about 20 queries. Now I see it's much smaller! If I ask it to do an in-memory search then it takes only 5.7 seconds total:
% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fps --memory > /dev/null
open 5.60 read 0.00 search 0.07 output 0.00 total 5.67
This means it took 5.6 seconds to read the fingerprints and organize them for fast search, and the search itself took only 0.07 seconds.
Looks like it's time to re-tune that parameter!
chemfp's FPB format
One reason I didn't notice this until now is that I tend to work with chemfp's FPB format, which is optimized for fast loading. All of the FPS readers take over 4 seconds to parse the file and put the data into a format more appropriate for fast searches.
The FPB format organizes the fingerprint data so it can be loaded quickly. In fact, if the file is uncompressed it can be memory-mapped and used directly - for some searches chemfp doesn't even need to load the full data set!
First, I'll convert the FPS file to FPB format:
% fpcat chembl_27.fps -o chembl_27.fpb
I'll then do the same search with 10 Wikipedia queries using a warm disk cache:
% simsearch --queries wikipedia_10.smi --threshold 0.9 --times chembl_27.fpb > /dev/null open 0.00 read 0.00 search 0.22 output 0.00 total 0.23
That's about 22 milliseconds per query, with very little load time needed. This is fast enough to include a ChEMBL similarity search in a scripting environment.
chemfp with 1,000 Wikipedia queries
I then tried with nearly 1,000 Wikipedia queries (RDKit could not parse a couple of the structures; I've omitted RDKit's warning and error messages), using chemfp with both the FPS and FPB formats:
% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fps > /dev/null
open 5.92 read 0.26 search 2.58 output 0.01 total 8.77
% simsearch --queries wikipedia_1000.smi --threshold 0.9 --times chembl_27.fpb > /dev/null
open 0.02 read 0.25 search 2.77 output 0.02 total 3.07

With 1,000 queries the read time is now noticeable; that includes the time for RDKit to parse the SMILES strings and generate the Morgan fingerprints. In case you are curious, the performance is over 100x faster than the simple in-memory program I developed for this essay:
% python chembl_in_memory_search.py --queries wikipedia_1000.smi --threshold 0.9 --times > /dev/null
load: 4.69 search: 499.65 #queries: 994 #/sec: 1.99
chemfp advertisement
You can install and test chemfp (under the Base License Agreement) by doing:
python -m pip install chemfp -i https://chemfp.com/packages/

The Base License Agreement does not allow you to create FPB files or do an in-memory search with more than 50,000 fingerprints. If you are interested in those features, take a look at the other licensing options, then contact me to ask for an evaluation license key.