My syncmaildir (SMD) setup failed me one too many times (previously, previously). In an attempt to migrate to an alternative mail synchronization tool, I looked into using my IMAP server again, and found out my mail spool was in a pretty bad shape. I'm comparing mbsync and offlineimap in the next post.
The latest crash
On Monday, SMD just started failing with this error:
nov 15 16:12:19 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:12:22 angela systemd[2305]: smd-pull.service: Succeeded.
nov 15 16:12:22 angela systemd[2305]: Finished pull emails with syncmaildir.
nov 15 16:14:08 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:14:11 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:14:11 angela systemd[2305]: Failed to start pull emails with syncmaildir.
nov 15 16:16:14 angela systemd[2305]: Starting pull emails with syncmaildir...
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Unable to get any data from the other endpoint.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: This problem may be transient, please retry.
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Hint: did you correctly setup the SERVERNAME variable
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: on your client? Did you add an entry for it in your ssh
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: configuration file?
nov 15 16:16:17 angela smd-pull[27178]: smd-client: ERROR: Network error
nov 15 16:16:17 angela smd-pull[27188]: register: smd-client@localhost: TAGS: error::context(handshake) probable-cause(network) human-intervention(avoidable) suggested-actions(retry)
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
nov 15 16:16:17 angela systemd[2305]: smd-pull.service: Failed with result 'exit-code'.
nov 15 16:16:17 angela systemd[2305]: Failed to start pull emails with syncmaildir.
What is frustrating is that there's actually no network error here. Running the command by hand I did see a different message, but now I have lost it in my backlog. It had something to do with a filename being too long, and I gave up debugging after a while. This happened suddenly too, which added to the confusion.
In a fit of rage I started this blog post and experimenting with alternatives, which led me down a lot of rabbit holes.
Reviewing my previous mail crash documentation, it seems most solution involve talking to an IMAP server, so I figured I would just do that. Wanting to try something new, i gave isync (AKA mbsync
) a try. Oh dear, I did not expect how much trouble just talking to my IMAP server would be. It's not isync's fault, for what it's worth, but it's the primary tool I used to debug things, and served me well in that regard.
mailbox corruption
The first thing I found out is that certain messages in the IMAP spool were corrupted. mbsync
would stop on a FETCH
command and Dovecot would give me those errors on the server side:
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Maildir filename has wrong W value, renamed the file from /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S to /home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495:2,S
nov 16 15:31:27 marcos dovecot[3621800]: imap(anarcat)<3630489><wAmSzO3QZtfAqAB1>: Error: Mailbox junk: Deleting corrupted cache record uid=1582: UID 1582: Broken virtual size in mailbox junk: read(/home/anarcat/Maildir/.junk/cur/1454623938.M101164P22216.marcos,S=2495,W=2578:2,S): FETCH BODY[] got too little data: 2540 vs 2578
and:
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=mail stream)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: Deleting corrupted cache record uid=19288: UID 19288: Broken physical size in mailbox Sent: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288)
nov 16 13:53:08 marcos dovecot[3520770]: imap(anarcat)<3594402><M5JHb+zQ3NLAqAB1>: Error: Mailbox Sent: UID=19288: read(/home/anarcat/Maildir/.Sent/cur/1224790447.M898726P9811V000000000000FE06I00794FB1_0.marvin,S=2588:2,S) failed: Cached message size larger than expected (2588 > 2482, box=Sent, UID=19288) (read reason=)
nov 16 13:53:08 marcos dovecot[3520770]: imap-login: Panic: epoll_ctl(del, 7) failed: Bad file descriptor
"wrong W value"
At least the first error was automatically healed by Dovecot (by renaming the file without the W=
flag). The problem is that the FETCH
command fails and mbsync
exits noisily. So you need to constantly restart mbsync
with a silly command like:
while ! mbsync -a; do sleep 1; done
"cached message size larger than expected"
The second problem is much harder to fix, because dovecot does not recover automatically. The above loop was taking too long: one full IMAP roundtrip (with authentication) for every corrupt message...
Workaround
So I read a lot on the Dovecot documentation on the maildir format, and wrote an extensive fix script for those two errors. The script worked and mbsync was able to sync the entire mail spool.
And no, rebuilding the index files didn't work. Also tried doveadm force-resync -u anarcat
which didn't do anything.
In the end I also had to do this, because the wrong cache values were also stored elsewhere.
service dovecot stop ; find -name 'dovecot*' -delete; service dovecot start
This would have totally broken any existing clients, but thankfully I'm starting from scratch (except maybe webmail, but I'm hoping it will self-heal as well, assuming it only has a cache and not a full replica of the mail spool).
Incoherence between Maildir and IMAP
Unfortunately, the first mbsync was incomplete as it was missing about 15,000 mails:
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
384836
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
369221
As it turns out, mbsync
was not at fault here either: this was yet more mail spool corruption.
It's actually 26 folders with different sizes, which can be found with:
for folder in * .[^.]* ; do
printf "%s\t%d\n" $folder $(find "$folder" -type f -a \! -name '.*' | wc -l );
done
The special \! -name '.*'
bit is to ignore the mbsync metadata, which creates .uidvalidity
and .mbsyncstate
in every folder. That ignores about 200 files but since they are spread around all folders, which was making it impossible to review where the problem was.
Here is what the diff looks like:
--- Maildir-list 2021-11-17 20:42:36.504246752 -0500
+++ Maildir-mbsync-list 2021-11-17 20:18:07.731806601 -0500
@@ -6,16 +6,15 @@
[...]
.Archives 1
.Archives.2010 3553
-.Archives.2011 3583
-.Archives.2012 12593
+.Archives.2011 3582
+.Archives.2012 620
.Archives.2013 8576
.Archives.2014 11057
-.Archives.2015 8173
+.Archives.2015 8165
.Archives.2016 54
.band 34
.bitbuck 1
@@ -38,13 +37,12 @@
.couchsurfers 2
-cur 11285
+cur 11280
.current 130
.cv 2
.debbug 262
-.debian 37544
-drafts 1
-.Drafts 4
+.debian 37533
+.Drafts 2
.drone 241
.drupal 188
.drupal-devel 303
[...]
Misfiled messages
It's a bit all over the place, but we can already notice some huge differences between mailboxes, for example in the Archives
folders. As it turns out, at least 12,000 of those missing mails were actually misfiled: instead of being in the Maildir/.Archives.2012/cur/
folder, they were directly in Maildir/.Archives.2012/
. This is something that doesn't matter for SMD (and possibly for notmuch? it does matter, notmuch suddenly found 12,000 new mails) but that definitely matters to Dovecot and therefore mbsync
...
After moving those files around, we still have 4,000 message missing:
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
381196
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385053
The problem is that those 4,000 missing mails are harder to track. Take, for example, .Archives.2011
, which has a single messge missing, out of 3,582. And the files are not identical: the checksums don't match after going through the IMAP transport, so we can't use a tool like hashdeep to compare the trees and find why that one file is missing.
"register" folder
One big chunk of the 4,000, however, is a special folder called register
in my spool, which I was syncing separately (see Securing registration email for details on that setup). That actually covers 3,700 of those messages, so I actually have a more modest 300 messages to figure out, after (easily!) configuring mbsync
to sync that folder separately:
@@ -30,9 +33,29 @@ Slave :anarcat-local:
# Exclude everything under the internal [Gmail] folder, except the interesting folders
#Patterns * ![Gmail]* "[Gmail]/Sent Mail" "[Gmail]/Starred" "[Gmail]/All Mail"
# Or include everything
-Patterns *
+#Patterns *
+Patterns * !register !.register
# Automatically create missing mailboxes, both locally and on the server
#Create Both
Create slave
# Sync the movement of messages between folders and deletions, add after making sure the sync works
#Expunge Both
+
+IMAPAccount anarcat-register
+Host imap.anarc.at
+User register
+PassCmd "pass imap.anarc.at-register"
+SSLType IMAPS
+CertificateFile /etc/ssl/certs/ca-certificates.crt
+
+IMAPStore anarcat-register-remote
+Account anarcat-register
+
+MaildirStore anarcat-register-local
+SubFolders Maildir++
+Inbox ~/Maildir-mbsync/.register/
+
+Channel anarcat-register
+Master :anarcat-register-remote:
+Slave :anarcat-register-local:
+Create slave
"tmp" folders and empty messages
After syncing the "register" messages, I end up with the measly little 160 emails out of sync:
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384900
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
385059
Argh. After more digging, I have found 131 mails in the tmp/
directories of the client's mail spool. Mysterious! On the server side, it's even more files, and not the same ones. Possible that those were mails that were left there during a failed delivery of some sort, during a power failure or some sort of crash? Who knows.
The first thing to do with those is to cleanup a bunch of empty files (21 on angela):
find .[^.]*/tmp -type f -empty -delete
As that turns out, they are all duplicates, in the sense that notmuch can easily find a copy of files with the same message ID in its database. In other words, this hairy command returns nothing
find .[^.]*/tmp -type f | while read path; do
msgid=$(grep -m 1 -i ^message-id "$path" | sed 's/Message-ID: //i;s/[<>]//g');
if notmuch count --exclude=false "id:$msgid" | grep -q 0; then
echo "$path <$msgid> not in notmuch" ;
fi;
done
... which is good. Or, to put it another way, this is safe:
find .[^.]*/tmp -type f -delete
Poof! 314 mails cleaned on the server side. Interestingly, SMD doesn't pick up on those changes at all and still sees files in tmp/
directories on the client side, so we need to operate the same twisted logic there.
notmuch to the rescue again
After cleaning that on the client, we get:
anarcat@angela:~(main)$ find Maildir/ -type f -a \! -name '.*' | wc -l
384928
anarcat@angela:~(main)$ find Maildir-mbsync/ -type f -a \! -name '.*' | wc -l
384901
Ha! 27 mails difference. Those are the really sticky, unclear ones. I was hoping a full sync might clear that up, but after deleting the entire directory and starting from scratch, I end up with:
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385034
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
384993
That is: even more messages missing (now 37). Sigh.
Hopefully, this is something notmuch will be able to help with: it can index all files by Message-ID
(which I learned is case-insensitive, yay) and tell us which messages don't make it through.
Considering the corruption I found in the mail spool, I wouldn't be the least surprised those messages are just skipped by the IMAP server because of corruption. Unfortunately, there's nothing on the Dovecot server logs that would explain the discrepancy.
Here again, notmuch comes to the rescue. We can list all message IDs to figure out that discrepancy:
notmuch search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-msgids
notmuch --config=.notmuch-config-mbsync search --exclude=false --output=messages '*' | pv -s 18M | sort > Maildir-mbsync-msgids
And then we can see how many messages notmuch thinks are missing:
$ wc -l *msgids
372723 Maildir-mbsync-msgids
372752 Maildir-msgids
That's 29 messages. Oddly, it doesn't exactly match the find
output:
anarcat@angela:~(main)$ find Maildir-mbsync -type f -type f -a \! -name '.*' | wc -l
385204
anarcat@angela:~(main)$ find Maildir -type f -type f -a \! -name '.*' | wc -l
385241
That is 10 more messages. Ugh. But actually, I know what those are: more misfiled messages (in a .folder/draft/
directory, bizarrely, so the totals actually match.
In the notmuch output, there's a lot of stuff like this:
id:notmuch-sha1-fb880d673e24f5dae71b6b4d825d4a0d5d01cde4
Those are messages without a valid Message-ID. Notmuch (presumably) constructs one based on the file's checksum. Because the files differ between the IMAP server and the local mail spool (which is unfortunate, but possibly inevitable), those do not match. There are exactly the same number of those on both sides, so I'll go ahead and assume those are all accounted for.
What remains is:
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\-[^-]' | grep -v sha1 | wc -l
2
anarcat@angela:~(main)$ diff -u Maildir-mbsync-msgids Maildir-msgids | grep '^\+[^+]' | grep -v sha1 | wc -l
21
anarcat@angela:~(main)$
ie. 21 missing from mbsync, and, surprisingly, 2 missing from the original mail spool.
Further inspection also showed they were all messages with some sort of "corruption": no body and only headers. I am not sure that is a legal email format in the first place. Since they were mostly spam or administrative emails ("You have been unsubscribed from mailing list..."), it seems fairly harmless to ignore those.
Conclusion
As we'll see in the next article, SMD has stellar performance, but this comes at a huge cost: it accesses the mail storage directly. This can (and has) created significant problems on the mail server. It's unclear exactly why those things happen, but Dovecot expects a particular storage format on its file, and it seems unwise to bypass that.
In the future, I'll try to remember to avoid that, especially since mechanisms like SMD require special server access (SSH) which, in the long term, I am not sure I want to maintain or expect.
In other words, just talking with an IMAP server opens up a lot more possibilities of hosting than setting up a custom synchronisation protocol over SSH. It's also safer and more reliable, as we have seen. Thankfully, I've been able to recover from all the errors I could find, but it could have gone differently and it would have been possible for SMD to permanently corrupt significant part of my mail archives.
I recommend SMD users start looking for alternatives. The project has been archived upstream, and the Debian package has been orphaned. I have seen significant mail box corruption, including entire mail spool destruction, mostly due to incorrect locking code. I have filed a release-critical bug in Debian to make sure it doesn't ship with Debian bookworm.
Alternatives like mbsync
provide fast and reliable transport, including over SSH. See the next article for further discussion of the alternatives.
from Planet Python
via read more
No comments:
Post a Comment