Thursday, September 3, 2020

Andrew Dalke: What's up with chemfp 1.x?

Background: chemfp is a Python package for high-performance cheminformatics fingerprint similarity search. There are two development tracks. Chemfp 1.x is the no-cost/open source version, which only supports Python 2.7, and chemfp 3.x is the more advanced and capable version which supports Python 2.7 and Python 3.6+. Try out chemfp now!

The chemfp 1.x track is still being maintained and updated under a no-cost/open source license. It only supports Python 2.7, which is no longer supported by the core Python developers, so you might easily wonder "why?" and "who will use it?"

To summarize: I hope it will be used in future benchmarking and I expect Python 2.7 will be available for several more years.

Reference benchmarking tool

I expect chemfp 1.x's primary use will be as a solid baseline for high-performance similarity search. I was and still am annoyed by papers which compare other algorithms with a far-from-optimal baseline brute-force or BitBound implementation which the authors wrote themselves. If a new algorithm is reported as 10x faster than a baseline, and that baseline is 1/100th the performance that the machine is capable of, are the paper's results meaningful? I don't think so.

The thing is, it's very easy to write a Tanimoto similarity search tool. Peter Willett told me that after their paper came out, he very soon heard that it was implemented at one pharma in an afternoon. Which is entirely reasonable.
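To give a sense of how little code a working (but slow) search needs, here is a minimal sketch of a brute-force Tanimoto threshold search. It is not chemfp's implementation. The function names, the hex-string fingerprint representation, and the choice to score two all-zero fingerprints as 0.0 are all assumptions made for this illustration.

```python
def popcount(value):
    """Count the 1 bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two fingerprints stored as integers."""
    union = popcount(fp1 | fp2)
    if union == 0:
        # Assumed convention for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return popcount(fp1 & fp2) / float(union)


def threshold_search(query_hex, targets, threshold=0.7):
    """Return (id, score) pairs with score >= threshold, best first.

    `targets` is an iterable of (id, hex_fingerprint) pairs.
    """
    query = int(query_hex, 16)
    hits = []
    for target_id, target_hex in targets:
        # Every target is decoded and compared: no pruning, no precomputation.
        score = tanimoto(query, int(target_hex, 16))
        if score >= threshold:
            hits.append((target_id, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Toy 8-bit fingerprints, made up for the example.
targets = [("A", "0f"), ("B", "ff"), ("C", "f0")]
hits = threshold_search("0f", targets, threshold=0.5)
print(hits)
```

A sketch like this is correct, but it re-parses every fingerprint for every query and counts bits through a string conversion, which is exactly the sort of home-grown baseline that makes a new algorithm look better than it is.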

My observation, however, is that it's hard to write a fast search tool. As I pointed out in my chemfp paper, there is a 10-fold performance difference between the first chemfp release and the current one, and that early implementation was already significantly faster than one published baseline.

Now, the latest version of chemfp 1.x is not as fast as chemfp 3.x. In particular the 166-bit MACCS key search and 1024-bit and 2048-bit searches are significantly faster in 3.x on modern hardware. But in many cases 1.x is only a bit slower than 3.x, and in the worst case only about 2x slower, so 1.x is certainly a worthy baseline.

Chemfp 1.x limitations

Performance is only one factor when choosing a tool.

32-bit fingerprint arena indices

Chemfp 1.x uses a 32-bit signed integer for indexing, which limits the fingerprint arena size to 2^31-1 bytes. That's about 16M 2048-bit fingerprints or 38M PubChem fingerprints. While this is larger than ChEMBL and many in-house collections, there are ~100M records in PubChem, and one of my customers uses chemfp to search well over 1 billion fingerprints.

Python 2.7 only

Another limitation is that chemfp 1.x only runs under Python 2.7, and Python 2.7 reached its official "end of life" on 1 January 2020. That means that no new bug reports, fixes, or changes will be made to Python 2, and Python 2 is no longer supported.

That should be qualified as: no longer supported by the core Python developers. Other vendors still support Python 2.7 as part of their long-term support agreements. For example, Red Hat will support Python 2.7 in RHEL 8 until June 2024.

I expect other vendors, including Debian and Ubuntu, will also support Python 2.7 for several years to come. And of course there will still be the option of compiling Python 2.7 from source. Chemfp does not require any other package, so it will continue to be installable even after PyPI (eventually?) drops Python 2.7 support.

I further qualify "eventually" […]

Thank you for your dedication! I hope you've enjoyed it. Have you considered switching to chemfp 3.x? Or a hybrid solution using chemfp 3.x to generate fingerprints and chemfp 1.x for search?

My goal now is to maintain chemfp 1.x mostly for benchmarking purposes, so you won't see much in the way of other sorts of improvements.

In particular, I no longer have a way to test chemfp 1.x against the chemistry toolkits. My testing environment, which included various versions of each of the toolkits, was too hard-coded to my old machine. Nowadays we can use Docker for those sorts of portability issues, but that wasn't available 8 years ago, and it would be rather a nuisance to set up given the current goals. (I do test chemfp's similarity search functions using Docker.)

Can you port chemfp 1.x yourself?

Yes, absolutely. chemfp 1.x is distributed under the MIT license. You are free to port it to Python 3.x, add Unicode support, change the indexing from 32- to 64-bit, and ensure it works with the more recent toolkits.

Bear in mind that it took a couple of months for me to do that. Now, I wanted to support Python 2.7 and Python 3.5+, which made things more complicated, but on the other hand, I wrote the code in the first place.

Even if you do all that, chemfp 3.x has other features, including:

  • Faster similarity search;
  • The FPB binary format for fast fingerprint data loading;
  • Tversky search;
  • Support for ZStandard compression;
  • "Web-enabled" APIs to work with strings;
  • Updates to handle new toolkit file formats and fingerprints;
  • A portable "toolkit" API to work with those toolkits.

If you are interested in writing your own tool, I think you should look to the chemfp code for ideas about how to implement fast search, but otherwise develop your own tool for your own needs.

Are you interested in trying out chemfp? Go to the chemfp home page, download it, see the available licensing terms, and view the extensive documentation.


