Tuesday, June 29, 2021

Anarcat: Another syncmaildir crash

So I had another major email crash with my syncmaildir setup. This time I was at least able to confirm the issue, and I still haven't lost mail thanks to backups and sheer luck (again).

The crash

It is not really worth going over the crash in details, it's fairly similar to the last one: something bad happened and smd started destroying everything. The hint is that it takes a long time to do what usually takes seconds. It helps that I now have a second monitor showing logs.

I still lost much more mail than the last time. I used to have "301 723 messages", according to notmuch. But then when I ran smd-pull by hand, it was telling me:

95K emails scanned

Oops. You can see notmuch happily noticing the destroyed files on the server:

jun 28 16:33:40 marcos notmuch[28532]: No new mail. Removed 65498 messages. Detected 1699 file renames.
jun 28 16:36:05 marcos notmuch[29746]: No new mail. Removed 68883 messages. Detected 2488 file renames.
jun 28 16:41:40 marcos notmuch[31972]: No new mail. Removed 118295 messages. Detected 3657 file renames.

The final count ended up being 81 042 messages, according to notmuch. A whopping 220 000 mails deleted.

The interesting bit, this time around, is that I caught smd in the act of running two processes in parallel:

jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir. 
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir... 
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir... 

So clearly that is the source of the bug.

Recovery

Emergency stop on curie:

notmuch dump > notmuch.dump
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer

On marcos (the server), guessed the number of messages delivered since the last backup to be 71, just looking at timestamps in the mail log. Made a list:

grep postfix/local /var/log/mail.log | tail -71 > lost-mail

Found postfix queue IDs:

sed 's/.*\]://;s/:.*//' lost-mail > qids

Turn those into message IDs, find those that are missing from the disk (had previously ran notmuch new just to be sure it's up to date):

while read qid ; do 
    grep "$qid: message-id" /var/log/mail.log
done < qids  | sed 's/.*message-id=<//;s/>//' | while read msgid; do
    sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid
done

Copy this back on curie as missing-msgids and:

$ wc -l missing-msgids 
48 missing-msgids
$ while read msgid ; do notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done < missing-msgids
mailman.189.1624881611.23397.nodes-reseaulibre.ca@reseaulibre.ca
AnwMy7rdSpK-N-vt4AiOag@ismtpd0148p1mdw1.sendgrid.net

only two mails missing! whoohoo!

Copy those back onto marcos as really-missing-msgids, and look at the full mail logs to see what they are:

~anarcat/src/koumbit-scripts/mail/postfix-trace --from-file really-missing-msgids2

I actually remembered deleting those, so no mail lost!

Rebuild the list of msgids that were lost, on marcos:

while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids  | sed 's/.*message-id=<//;s/>//'

Copy that on curie as lost-mail-msgids, then copy the files over in a test dir:

while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v  --files-from=- /home/anarcat/Maildir/ shell.anarc.at:restore/Maildir-angela/

If that looks about right, on marcos:

find restore/Maildir-angela/ -type f | wc -l

... should match the number of missing mails, roughly.

Copy if in the real spool:

while read msgid ; do
    notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v  --files-from=- /home/anarcat/Maildir/ shell.anarc.at:Maildir/

Then on the server, notmuch new should find the new emails, and we shouldn't have any lost mail anymore:

while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids  | sed 's/.*message-id=<//;s/>//' | while read msgid; do sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done

Then, crucial moment, try to pull the new mails from the backups on curie:

anarcat@curie:~(main)$ smd-pull  -n  --show-tags -v
Found lockfile of a dead instance. Ignored.
Phase 0: handshake
Phase 1: changes detection
    5K emails scanned
   10K emails scanned
   15K emails scanned
   20K emails scanned
   25K emails scanned
   30K emails scanned
   35K emails scanned
   40K emails scanned
   45K emails scanned
   50K emails scanned
Phase 2: synchronization
Phase 3: agreement
default: smd-client@localhost: TAGS: stats::new-mails(49687), del-mails(0), bytes-received(215752279), xdelta-received(3703852)
"smd-pull  -n  --show-tags -v" took 3 mins 39 secs

This brought me back to the state after the backup plus the mails delivered during the day, which means I had to catchup with all my holiday's read emails (1440 mails!) but thankfully I made a dump of the notmuch database on curie at the start of the procedure, so this actually restored a sane state:

pv notmuch.dump | notmuch restore

Phew!

Workaround

I have filed this as a bug in upstream issue 18. Considering I filed 11 issues and only 3 of those were closed, I'm not holding my breath. I nevertheless filed PR 19 in the hope that this will fix my particular issue, but I'm not even sure this is the right fix...

Fix

At this point, I'm really ready to give up on SMD. It's really, really nice to be able to sync mail over SSH because I don't need to store my IMAP password on disk. But surely there are more reliable syncing mechanisms. I do not remember ever losing that much mail before. At worst, offlineimap would duplicate emails like mad, but never destroy my entire mail spool that way.

As mentioned before, there are other programs that sync mail. I'm looking at:

  • offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful
  • isync/mbsync: might be faster, I remember having trouble switching from offlineimap to this
  • getmail: just downloads email, might not be enough
  • nncp: treat the local spool as another mail server, might not be compatible with my "multiple clients" setup
  • doveadm-sync: requires dovecot on both ends, but supports using SSH to sync, will try this next


from Planet Python
via read more

No comments:

Post a Comment

TestDriven.io: Working with Static and Media Files in Django

This article looks at how to work with static and media files in a Django project, locally and in production. from Planet Python via read...