Sunday, November 21, 2021

Anarcat: mbsync vs OfflineIMAP

After recovering from my latest email crash (previously, previously), I had to figure out which tool I should be using. I had many options but I figured I would start with a popular one (mbsync).

But I also evaluated OfflineIMAP which was resurrected from the Python 2 apocalypse, and because I had used it before, for a long time.

Read on for the details.

Benchmark setup

All programs were tested against a Dovecot 1:2.3.13+dfsg1-2 server, running Debian bullseye.

The client is a Purism 13v4 laptop with a Samsung SSD 970 EVO 1TB NVMe drive.

The server is a custom build with a AMD Ryzen 5 2600 CPU, and a RAID-1 array made of two NVMe drives (Intel SSDPEKNW010T8 and WDC WDS100T2B0C).

The mail spool I am testing against has almost 400k messages and takes 13GB of disk space:

$ notmuch count --exclude=false
372758
$ du -sh --exclude xapian Maildir
13G Maildir

The baseline we are comparing against is SMD (syncmaildir) which performs the sync in about 7-8 seconds locally (3.5 seconds for each push/pull command) and about 10-12 seconds remotely.

Anything close to that or better is good enough. I do not have recent numbers for a SMD full sync baseline, but the setup documentation mentions 20 minutes for a full sync. That was a few years ago, and the spool has obviously grown since then, so that is not a reliable baseline.

A baseline for a full sync might be also set with rsync, which copies files at nearly 40MB/s, or 317Mb/s!

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
 12,647,814,731 100%   37.85MB/s    0:05:18 (xfr#394981, to-chk=0/395815)    
72.38user 106.10system 5:19.59elapsed 55%CPU (0avgtext+0avgdata 15988maxresident)k
8816inputs+26305112outputs (0major+50953minor)pagefaults 0swaps

That is 5 minutes to transfer the entire spool. Incremental syncs are obviously pretty fast too:

anarcat@angela:tmp(main)$ time rsync -a --info=progress2 --exclude xapian  shell.anarc.at:Maildir/ Maildir/
              0   0%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/395815)    
1.42user 0.81system 0:03.31elapsed 67%CPU (0avgtext+0avgdata 14100maxresident)k
120inputs+0outputs (3major+12709minor)pagefaults 0swaps

As an extra curiosity, here's the performance with tar, pretty similar with rsync, minus incremental which I cannot be bothered to figure out right now:

anarcat@angela:tmp(main)$ time ssh shell.anarc.at tar --exclude xapian -cf - Maildir/ | pv -s 13G | tar xf - 
56.68user 58.86system 5:17.08elapsed 36%CPU (0avgtext+0avgdata 8764maxresident)k
0inputs+0outputs (0major+7266minor)pagefaults 0swaps
12,1GiO 0:05:17 [39,0MiB/s] [===================================================================> ] 92%

Interesting that rsync manages to almost beat a plain tar on file transfer, I'm actually surprised by how fast it performs here, considering there are many little files to transfer.

Finally, note that when I migrated to SMD, I wrote a small performance comparison that could be interesting here. It show SMD to be faster than OfflineIMAP, but not as much as we see here. In fact, it looks like OfflineIMAP slowed down significantly since I last did tests on it, but this could be due to my larger mail spool.

mbsync

The isync (AKA mbsync) project is written in C and supports syncing Maildir and IMAP folders, with possibly multiple replicas. I haven't tested this but I suspect it might be possible to sync between two IMAP servers as well. It supports partial mirorrs, message flags, full folder support, and "trash" functionality.

Complex configuration file

I am using this .mbsyncrc configuration file:

SyncState *
Sync New ReNew Flags

IMAPAccount anarcat
Host imap.anarc.at
User anarcat
PassCmd "pass imap.anarc.at"
SSLType IMAPS
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPStore anarcat-remote
Account anarcat

MaildirStore anarcat-local
# Maildir/top/sub/sub
#SubFolders Verbatim
# Maildir/.top.sub.sub
SubFolders Maildir++
# Maildir/top/.sub/.sub
# SubFolders legacy
# The trailing "/" is important
#Path ~/Maildir-mbsync/
Inbox ~/Maildir-mbsync/

Channel anarcat
# AKA Far, convert when all clients are 1.4+
Master :anarcat-remote:
# AKA Near
Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
Patterns *
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both

Long gone are the days where I would spend a long time reading a manual page to figure out the meaning of every option. If that's your thing, you might like this one. But I'm more of a "EXAMPLES section" kind of person now, and I somehow couldn't find a sample file on the website. I started from the Arch wiki one but it's actually not great because it's made for Gmail (which is not a usual Dovecot server). So a sample config file in the manpage would be a great addition. Thankfully, the Debian packages ships one in /usr/share/doc/isync/examples/mbsyncrc.sample but I only found that after I wrote my configuration. It was still useful and I recommend people take a look if they want to understand the syntax.

Also, that syntax is a little overly complicated. For example, Far needs colons, like:

Far :anarcat-remote:

Why? That seems just too complicated. I also found that sections are not clearly identified: IMAPAccount and Channel mark section beginnings, for example. There are also weird ordering issues: the SyncState option needs to be before IMAPAccount, presumably because it's global.

Using a more standard format like .INI or TOML would improve things greatly here.

Stellar performance

The full sync of my 300,000+ messages (12GB) took 56 minutes and 6 seconds, which is impressive.

It's not quite "line rate": the resulting mail spool was 12GB (which is a problem, see below), which turns out to be about 29Mbit/s and therefore not maxing the gigabit link. That said, it's possible disks were the bottleneck on either end, although I'd be surprised by that: both are NVMe drives, so that should easily saturate a gigabit link. And in fact, a remote backup of the mail spool achieves much faster transfer rate on disks, so I do not believe the server is to blame here:

anarcat@marcos:~$ tar fc - Maildir | pv -s 13G > Maildir.tar
15,0GiO 0:01:57 [ 131MiB/s] [===================================] 115%

That's 131Mibyyte per second, vastly faster than the gigabit link. The client has similar performance:

anarcat@angela:~(main)$ tar fc - Maildir | pv -s 17G > Maildir.tar
16,2GiO 0:02:22 [ 116MiB/s] [==================================] 95%

So yes, those disks should be able to saturate a gigabit link, they are not the bottleneck.

The incremental runs are roughly 2 seconds, which is even more impressive:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.015       0.052       1.930       2.029       2.105       
user        0.660       0.040       0.592       0.661       0.722       
sys         0.338       0.033       0.268       0.341       0.387    

Those tests were performed with isync 1.3.0-2.2 on Debian bullseye. Tests with a newer isync release failed because of a corrupted message that triggered bug 999804. Running 1.4.3 under valgrind works around the bug, but adds a 50% performance cost, the full sync running in 1h35m.

A test with 1.4.3 yielded many crashes and stability issues (see below). Performance is fairly similar, considering that the new sync included the register folder with 4000 messages:

120.74user 213.19system 59:47.69elapsed 9%CPU (0avgtext+0avgdata 105420maxresident)k
29128inputs+28284376outputs (0major+45711minor)pagefaults 0swaps

That is ~13GB in ~60 minutes, which gives us 28.3Mbps, fairly close to the 1.3 benchmark. Incrementals are also pretty similar, again considering the double-connect cost:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.500       0.087       2.340       2.491       2.629       
user        0.718       0.037       0.679       0.711       0.793       
sys         0.322       0.024       0.284       0.320       0.365

Those tests were all done on a Gigabit link, but what happens on a slower link? My server uplink is surprisingly slow: 25/6 Mbps. There mbsync doesn't far as well, and is worse than the SMD baseline:

1.80user 0.54system 0:35.70elapsed 6%CPU (0avgtext+0avgdata 25984maxresident)k
0inputs+0outputs (0major+19315minor)pagefaults 0swaps

That's 35 seconds for a sync, which is an order of magnitude slower than SMD.

TODO: multitime

Great user interface

Compared to OfflineIMAP and (ahem) SMD, the mbsync UI is kind of neat:

anarcat@angela:~(main)$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 1/2  B: 204/205  F: +0/0 *0/0 #0/0  N: +1/2 *0/0 #0/0

(Note that nice switch away from slavery too.)

The display is minimalist, and yet informative. It's not obvious from the start what it does, but the manpage is instructive at least on that, and you get used to it:

This represents the cumulative progress over channels, boxes, and messages affected on the far and near side, respectively. The message counts represent added messages, messages with updated flags, and trashed messages, respectively. No attempt is made to calculate the totals in advance, so they grow over time as more information is gathered. (Emphasis mine).

In other words:

  • C 2/2: channels done/total (2 done out of 2)
  • B 204/205: mailboxes done/total (204 out of 205)
  • F: changes on the far side
  • N: +1/2 *0/0 #0/0: changes on the "near" side:
    • +1/2: one out of two messages downloaded
    • *0/0: no flag changed
    • #0/0: no message deleted

I like it. It does not, unfortunately, show up when you run it in systemd, which is a bit annoying as I like to see mail traffic in the logs.

Interoperability issue

In my notmuch setup, I have bound key S to "mark spam", which basically assigns the tag spam to the message and removes a bunch of others. Then I have a notmuch-purge script which moves that message to the spam folder, for training purposes. It basically does this:

notmuch search --output=files --format=text0 "$search_spam" \
    | xargs -r -0 mv -t "$HOME/Maildir/${PREFIX}junk/cur/"

This method, which worked fine in SMD (and also OfflineIMAP) created this error on sync:

Maildir error: duplicate UID 37578.

And indeed, there are now two messages with that UID in the mailbox:

anarcat@angela:~(main)$ find Maildir/.junk/ -name '*U=37578*'
Maildir/.junk/cur/1637427889.134334_2.angela,U=37578:2,S
Maildir/.junk/cur/1637348602.2492889_221804.angela,U=37578:2,S

This is actually a known limitation or, as mbsync(1) calls it, a "RECOMMENDATION":

When using the more efficient default UID mapping scheme, it is important that the MUA renames files when moving them between Maildir fold ers. Mutt always does that, while mu4e needs to be configured to do it:

(setq mu4e-change-filenames-when-moving t)

So it seems I would need to fix my script. It's unclear how the paths should be renamed, which is unfortunate, because I would need to change my script to adapt to mbsync, but I can't tell how just from reading the above.

Fortunately, someone else already fixed that issue: afew, a notmuch tagging script (much puns, such hurt), has a move mode that can rename files correctly, specifically designed to deal with mbsync. I had already been told about afew, but it's one more reason to standardize my notmuch hooks on that project, it looks like.

(A manual fix is actually to rename the file to remove the U= field: mbsync will generate a new one and then sync correctly.)

Stability issues

The newer release in Debian bookworm (currently at 1.4.3) has stability issues on full sync. I filed bug 999804 in Debian about this, which lead to a thread on the upstream mailing list. I have found at least three distinct crashes that could be double-free bugs "which might be exploitable in the worst case", not a reassuring prospect.

The thing is: mbsync is really fast, but the downside of that is that it's written in C, and with that comes a whole set of security issues. The Debian security tracker has only three CVEs on isync, but the above issues show there could be many more.

Reading the source code certainly did not make me very comfortable with trusting it with arbitrary data. I have considered sandboxing it with systemd (below) but having systemd run as a --user process makes that difficult. I have also considered using an apparmor profile but that is not trivial because we need to allow SSH and only some parts of it...

Thankfully, upstream has been diligent at addressing the issues I have found providing a patch within a few days which did fix the sync issues, including on a full sync from scratch.

Automation with systemd

The Arch wiki has instructions on how to setup mbsync as a systemd service. It suggests using the --verbose (-V) flag which is a little intense here, as it outputs 1444 lines of messages.

I have used the following .service file:

[Unit]
Description=Mailbox synchronization service
ConditionHost=!marcos
Wants=network-online.target
After=network-online.target
Before=notmuch-new.service

[Service]
Type=oneshot
ExecStart=/usr/bin/mbsync -a
Nice=10
IOSchedulingClass=idle
NoNewPrivileges=true

[Install]
WantedBy=default.target

And the following .timer:

[Unit]
Description=Mailbox synchronization timer
ConditionHost=!marcos

[Timer]
OnBootSec=2m
OnUnitActiveSec=5m
Unit=mbsync.service

[Install]
WantedBy=timers.target

Notice how we marked the .service to Wants notmuch, so that notmuch new gets triggered. This is notmuch-new.service at the time of writing:

[Unit]
Description=notmuch new
After=mbsync.service

[Service]
Type=oneshot
Nice=10
ExecStart=/usr/bin/notmuch new

[Install]
WantedBy=mbsync.service

TODO: an improvement would be to use IMAP notify but neither imapnotify nor goimapnotify seem to be packaged in Debian.

Password-less setup

The sample file suggests this should work:

IMAPStore remote
Tunnel "ssh -q host.remote.com /usr/sbin/imapd"

Add BatchMode, restrict to IdentitiesOnly, provide a password-less key just for this, add compression (-C), find the Dovecot imap binary, and you get this:

IMAPAccount anarcat-tunnel
Tunnel "ssh -o BatchMode=yes -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519_mbsync -o HostKeyAlias=shell.anarc.at -C anarcat@imap.anarc.at /usr/lib/dovecot/imap"

And it actually seems to work:

$ mbsync -a
Notice: Master/Slave are deprecated; use Far/Near instead.
C: 0/2  B: 0/1  F: +0/0 *0/0 #0/0  N: +0/0 *0/0 #0/0imap(anarcat): Error: net_connect_unix(/run/dovecot/stats-writer) failed: Permission denied
C: 2/2  B: 205/205  F: +0/0 *0/0 #0/0  N: +1/1 *3/3 #0/0imap(anarcat)<1611280><90uUOuyElmEQlhgAFjQyWQ>: Info: Logged out in=10808 out=15396642 deleted=0 expunged=0 trashed=0 hdr_count=0 hdr_bytes=0 body_count=1 body_bytes=8087

It's a bit noisy, however. dovecot/imap doesn't have a "usage" to speak of, but even the source code doesn't hint at a way to disable that Error message, so that's unfortunate. That socket is owned by root:dovecot so presumably Dovecot runs the imap process as $user:dovecot, which we can't do here. Oh well?

Interestingly, the SSH setup is not faster than IMAP.

With IMAP:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.367       0.065       2.220       2.376       2.458       
user        0.793       0.047       0.731       0.776       0.871       
sys         0.426       0.040       0.364       0.434       0.476

With SSH:

===> multitime results
1: mbsync -a
            Mean        Std.Dev.    Min         Median      Max
real        2.515       0.088       2.274       2.532       2.594       
user        0.753       0.043       0.645       0.766       0.804       
sys         0.328       0.045       0.212       0.340       0.393

Basically: 200ms slower. Tolerable.

SMD migration procedure

First, disable SMD on all clients:

systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer

Run notmuch-sync-flagged and notmuch dump everywhere.

Do one last smd-pull and smd-push by hand everywhere.

Make a backup on both the client and the server:

cp -al Maildir Maildir-bak
tar fc - Maildir | pv -s 13G > Maildir.tar
notmuch dump > notmuch.dump

Then, the critical section begins.

Run a full sync on the client, with the Inbox pointing at Maildir-mbsync to avoid overwriting the current Maildir:

mbsync -a

This should take an hour on a gigabit link, much longer on remote. When that is done, move the mbsync Maildir in place:

mv Maildir Maildir-smd
mv Maildir-mbsync Maildir

Fix the .mbsyncrc to write there:

sed -i 's#Inbox ~/Maildir-mbsync/#Inbox ~/Maildir/#' .mbsyncrc

Enable full syncing by switching the Create and Expunge parameters to Both. Consider enabling Remove Both as well. Change to Sync All.

Do a manual run:

mbsync -a

Rebuild the notmuch database, as it will re-index all the messages, so it and will need to do a lot of work:

notmuch new

This should take around 25 minutes. Then restore the tags from backup:

pv notmuch.dump | notmuch restore

We're doing a notmuch new from scratch because it's faster than doing it on top of the existing DB, surprisingly. The fresh indexing took:

nov 19 15:08:54 angela notmuch[2521117]: Processed 384679 total files in 23m 41s (270 files/sec.).
nov 19 15:08:54 angela notmuch[2521117]: Added 372610 new messages to the database.

While a reindexing on top of an existing database was going twice as slow, at about 120 files/sec.

If all goes well, we're basically almost done. Try the same through systemd and watch the logs:

systemctl --user start mbsync

If all is still well, enable the systemd service:

systemctl --user enable mbsync.timer notmuch-new.service
systemctl --user start mbsync.timer

Remove the smd* jobs from systemd and fix notmuch-new.service.

During the migration, notmuch helpfully told me the full list of those lost messages:

[...]
Warning: cannot apply tags to missing message: CAN6gO7_QgCaiDFvpG3AXHi6fW12qaN286+2a7ERQ2CQtzjSEPw@mail.gmail.com
Warning: cannot apply tags to missing message: CAPTU9Wmp0yAmaxO+qo8CegzRQZhCP853TWQ_Ne-YF94MDUZ+Dw@mail.gmail.com
Warning: cannot apply tags to missing message: F5086003-2917-4659-B7D2-66C62FCD4128@gmail.com
[...]
Warning: cannot apply tags to missing message: mailman.2.1316793601.53477.sage-members@mailman.sage.org
Warning: cannot apply tags to missing message: mailman.7.1317646801.26891.outages-discussion@outages.org
Warning: cannot apply tags to missing message: notmuch-sha1-000458df6e48d4857187a000d643ac971deeef47
Warning: cannot apply tags to missing message: notmuch-sha1-0079d8e0c3340e6f88c66f4c49fca758ea71d06d
Warning: cannot apply tags to missing message: notmuch-sha1-0194baa4cfb6d39bc9e4d8c049adaccaa777467d
Warning: cannot apply tags to missing message: notmuch-sha1-02aede494fc3f9e9f060cfd7c044d6d724ad287c
Warning: cannot apply tags to missing message: notmuch-sha1-06606c625d3b3445420e737afd9a245ae66e5562
Warning: cannot apply tags to missing message: notmuch-sha1-0747b020f7551415b9bf5059c58e0a637ba53b13
[...]

As detailed in the crash report, all of those were actually innocuous and could be ignored.

New workstation setup

The above was how I migrated to mbsync on my first workstation. The work on the second one was more streamlined, especially since the corruption on mailboxes was fixed:

  1. install isync, with the patch:

    dpkg -i isync_1.4.3-1.1~_amd64.deb
    
  2. copy all files over from previous workstation to avoid a full resync (optional):

    rsync -a --info=progress2 angela:Maildir/ Maildir-mbsync/
    
  3. rename all files to match new hostname (optional):

    find Maildir-mbsync/ -type f -name '*.angela,*' -print0 |  rename -0 's/\.angela,/\.curie,/'
    
  4. backup the notmuch directory

    notmuch dump | pv > notmuch.dump
    
  5. trash the notmuch database:

    rm -rf Maildir-mbsync/.notmuch/xapian/
    
  6. move old files aside, if present

    mv Maildir Maildir-smd
    
  7. move new files in place:

    mv Maildir-mbsync Maildir
    
  8. create the SSH key:

    ssh-keygen -t ed25519 -f .ssh/id_ed25519_mbsync
    cat .ssh/id_ed25519_mbsync.pub
    
  9. add to .ssh/authorized_keys on the server, like this:

    command="/usr/lib/dovecot/imap",restrict ssh-ed25519 AAAAC...
    
  10. run a test sync, with Create slave, no Expunge or Remove statements:

    mbsync register
    
  11. if that works well, try with all mailboxes:

    mbsync -a
    
  12. if that works well, enable Create both, Expunge both and Remove both, and try again:

    mbsync register
    mbsync -a
    
  13. reindex the notmuch database

    notmuch new pv notmuch.dump | notmuch restore

OfflineIMAP

I've used OfflineIMAP for a long time before switching to SMD. I don't exactly remember why I started using it, but I do remember it was painfully slow, and would sometimes crash.

It also kind of died in a fire when Python 2 stop being maintained. The main author moved on to a different project, imapfw which could serve as a framework to build IMAP clients, but never seemed to get back to the OfflineIMAP feature set. Thankfully, a new team of volunteers ported OfflineIMAP to Python 3 and we can now test that new version to see if it is an improvement over mbsync.

Crash on full sync

The first thing we notice when doing a full sync is this crash:

Copy message from RemoteAnarcat:junk:
 ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')
Thread 'Copy message from RemoteAnarcat:junk' terminated with exception:
Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/imaputil.py", line 406, in utf7m_decode
    for c in binary.decode():
AttributeError: 'memoryview' object has no attribute 'decode'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/share/offlineimap3/offlineimap/threadutil.py", line 146, in run
    Thread.run(self)
  File "/usr/lib/python3.9/threading.py", line 892, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)
AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')


Last 1 debug messages logged for Copy message from RemoteAnarcat:junk prior to exception:
thread: Register new thread 'Copy message from RemoteAnarcat:junk' (account 'Anarcat')
ERROR: Exceptions occurred during the run!
ERROR: Copying message 30624 [acc: Anarcat]
  decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 802, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 342, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 908, in _fetch_from_imap
    ndata1 = self.parser['8bit-RFC'].parsebytes(data[0][1])
  File "/usr/lib/python3.9/email/parser.py", line 123, in parsebytes
    return self.parser.parsestr(text, headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.9/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.9/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 298, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 385, in _parsegen
    for retval in self._parsegen():
  File "/usr/lib/python3.9/email/feedparser.py", line 256, in _parsegen
    if self._cur.get_content_type() == 'message/delivery-status':
  File "/usr/lib/python3.9/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.9/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.9/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 601, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.9/email/headerregistry.py", line 196, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.9/email/headerregistry.py", line 445, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2675, in parse_content_type_header
    ctype.append(parse_mime_parameters(value[1:]))
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2569, in parse_mime_parameters
    token, value = get_parameter(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2492, in get_parameter
    token, value = get_value(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 2403, in get_value
    token, value = get_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1294, in get_quoted_string
    token, value = get_bare_quoted_string(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1223, in get_bare_quoted_string
    token, value = get_encoded_word(value)
  File "/usr/lib/python3.9/email/_header_value_parser.py", line 1064, in get_encoded_word
    text, charset, lang, defects = _ew.decode('=?' + tok + '?=')
  File "/usr/lib/python3.9/email/_encoded_words.py", line 181, in decode
    string = bstring.decode(charset)

Folder junk [acc: Anarcat]:
 Copy message UID 30626 (29008/49310) RemoteAnarcat:junk -> LocalAnarcat:junk
Command exited with non-zero status 100
5252.91user 535.86system 3:21:00elapsed 47%CPU (0avgtext+0avgdata 846304maxresident)k
96344inputs+26563792outputs (1189major+2155815minor)pagefaults 0swaps

That only transferred about 8GB of mail, which gives us a transfer rate of 5.3Mbit/s, more than 5 times slower than mbsync. This bug is possibly limited to the bullseye version of offlineimap3 (the lovely 0.0~git20210225.1e7ef9e+dfsg-4), while the current sid version (the equally gorgeous 0.0~git20211018.e64c254+dfsg-1) seems unaffected.

Tolerable performance

The new release still crashes, except it does so at the very end, which is an improvement, I guess, since the mails do get transferred:

 *** Finished account 'Anarcat' in 511:12
ERROR: Exceptions occurred during the run!
ERROR: Exception parsing message with ID (<20190619152034.BFB8810E07A@marcos.anarc.at>) from imaplib (response type: bytes).
 AttributeError: decoding with 'X-EUC-TW' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(

ERROR: Exception parsing message with ID (<40A270DB.9090609@alternatives.ca>) from imaplib (response type: bytes).
 AttributeError: decoding with 'x-mac-roman' codec failed (AttributeError: 'memoryview' object has no attribute 'decode')

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 910, in _fetch_from_imap
    raise OfflineImapError(

ERROR: IMAP server 'RemoteAnarcat' does not have a message with UID '32686'

Traceback:
  File "/usr/share/offlineimap3/offlineimap/folder/Base.py", line 810, in copymessageto
    message = self.getmessage(uid)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 343, in getmessage
    data = self._fetch_from_imap(str(uid), self.retrycount)
  File "/usr/share/offlineimap3/offlineimap/folder/IMAP.py", line 889, in _fetch_from_imap
    raise OfflineImapError(reason, severity)

Command exited with non-zero status 1
8273.52user 983.80system 8:31:12elapsed 30%CPU (0avgtext+0avgdata 841936maxresident)k
56376inputs+43247608outputs (811major+4972914minor)pagefaults 0swaps
"offlineimap  -o " took 8 hours 31 mins 15 secs

This is 8h31m for transferring 12G, which is around 3.1Mbit/s. That is nine times slower than mbsync, almost an order of magnitude!

Now that we have a full sync, we can test incremental synchronization. That is also much slower:

===> multitime results
1: sh -c "offlineimap -o || true"
            Mean        Std.Dev.    Min         Median      Max
real        24.639      0.513       23.946      24.526      25.708      
user        23.912      0.473       23.404      23.795      24.947      
sys         1.743       0.105       1.607       1.729       2.002

That is also an order of magnitude slower than mbsync, and significantly slower than what you'd expect from a sync process. ~30 seconds is long enough to make me impatient and distracted; 3 seconds, less so: I can wait and see the results almost immediately.

Integrity check

That said: this is still on a gigabit link. It's technically possible that OfflineIMAP performs better than mbsync over a slow link, but I Haven't tested that theory.

The OfflineIMAP mail spool is missing quite a few messages as well:

anarcat@angela:~(main)$ find Maildir-offlineimap -type f -type f -a \! -name '.*' | wc -l 
381463
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l 
385247

... although that's probably all either new messages or the register folder, so OfflineIMAP might actually be in a better position there. But digging in more, it seems like the actual per-folder diff is fairly similar to mbsync: a few messages missing here and there. Considering OfflineIMAP's instability and poor performance, I have not looked any deeper in those discrepancies.

Other projects to evaluate

Those are all the options I have considered, in alphabetical order

  • doveadm-sync: requires dovecot on both ends, can tunnel over SSH, may have performance issues in incremental sync
  • fdm: fetchmail replacement, IMAP/POP3/stdin/Maildir/mbox,NNTP support, SOCKS support (for Tor), complex rules for delivering to specific mailboxes, adding headers, piping to commands, etc. discarded because no (real) support for keeping mail on the server, and written in C
  • getmail: fetchmail replacement, IMAP/POP3 support, supports incremental runs, classification rules, Python
  • interimap: syncs two IMAP servers, apparently faster than doveadm and offlineimap, but requires running an IMAP server locally
  • isync/mbsync: TLS client certs and SSH tunnels, fast, incremental, IMAP/POP/Maildir support, multiple mailbox, trash and recursion support, and generally has good words from multiple Debian and notmuch people (Arch tutorial), written in C, review above
  • mail-sync: notify support, happens over any piped transport (e.g. ssh), diff/patch system, requires binary on both ends, mentions UUCP in the manpage, mentions rsmtp which is a nice name for rsendmail. not evaluated because it seems awfully complex to setup
  • nncp: treat the local spool as another mail server, not really compatible with my "multiple clients" setup
  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful (IIRC), ways to tunnel over SSH, review above

Most projects were not evaluated due to lack of time.

Conclusion

I'm now using mbsync to sync my mail. I'm a little disappointed by the synchronisation times over the slow link, but I guess that's on par for the course if we use IMAP. We are bound by the network speed much more than with custom protocols. I'm also worried about the C implementation and the crashes I have witnessed, but I am encouraged by the fast upstream response.

Time will tell if I will stick with that setup. I'm certainly curious about the promises of interimap and mail-sync, but I have ran out of time on this project.



from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...