So I had another major email crash with my syncmaildir setup. This time I was at least able to confirm the issue, and I still haven't lost mail thanks to backups and sheer luck (again).
The crash
It is not really worth going over the crash in details, it's fairly similar to the last one: something bad happened and smd
started destroying everything. The hint is that it takes a long time to do what usually takes seconds. It helps that I now have a second monitor showing logs.
I still lost much more mail than the last time. I used to have "301 723 messages", according to notmuch. But then when I ran smd-pull
by hand, it was telling me:
95K emails scanned
Oops. You can see notmuch
happily noticing the destroyed files on the server:
jun 28 16:33:40 marcos notmuch[28532]: No new mail. Removed 65498 messages. Detected 1699 file renames.
jun 28 16:36:05 marcos notmuch[29746]: No new mail. Removed 68883 messages. Detected 2488 file renames.
jun 28 16:41:40 marcos notmuch[31972]: No new mail. Removed 118295 messages. Detected 3657 file renames.
The final count ended up being 81 042 messages, according to notmuch. A whopping 220 000 mails deleted.
The interesting bit, this time around, is that I caught smd in the act of running two processes in parallel:
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir.
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir...
So clearly that is the source of the bug.
Recovery
Emergency stop on curie:
notmuch dump > notmuch.dump
systemctl --user --now disable smd-pull.service smd-pull.timer smd-push.service smd-push.timer notmuch-new.service notmuch-new.timer
On marcos (the server), guessed the number of messages delivered since the last backup to be 71, just looking at timestamps in the mail log. Made a list:
grep postfix/local /var/log/mail.log | tail -71 > lost-mail
Found postfix queue IDs:
sed 's/.*\]://;s/:.*//' lost-mail > qids
Turn those into message IDs, find those that are missing from the disk (had previously ran notmuch new
just to be sure it's up to date):
while read qid ; do
grep "$qid: message-id" /var/log/mail.log
done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do
sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid
done
Copy this back on curie as missing-msgids
and:
$ wc -l missing-msgids
48 missing-msgids
$ while read msgid ; do notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done < missing-msgids
mailman.189.1624881611.23397.nodes-reseaulibre.ca@reseaulibre.ca
AnwMy7rdSpK-N-vt4AiOag@ismtpd0148p1mdw1.sendgrid.net
only two mails missing! whoohoo!
Copy those back onto marcos as really-missing-msgids
, and look at the full mail logs to see what they are:
~anarcat/src/koumbit-scripts/mail/postfix-trace --from-file really-missing-msgids2
I actually remembered deleting those, so no mail lost!
Rebuild the list of msgids that were lost, on marcos:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//'
Copy that on curie as lost-mail-msgids
, then copy the files over in a test dir:
while read msgid ; do
notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:restore/Maildir-angela/
If that looks about right, on marcos:
find restore/Maildir-angela/ -type f | wc -l
... should match the number of missing mails, roughly.
Copy if in the real spool:
while read msgid ; do
notmuch search --output=files --exclude=false "id:$msgid"
done < lost-mail-msgids | sed 's#/home/anarcat/Maildir/##' | rsync -v --files-from=- /home/anarcat/Maildir/ shell.anarc.at:Maildir/
Then on the server, notmuch new
should find the new emails, and we shouldn't have any lost mail anymore:
while read qid ; do grep "$qid: message-id" /var/log/mail.log; done < qids | sed 's/.*message-id=<//;s/>//' | while read msgid; do sudo -u anarcat notmuch count --exclude=false id:$msgid | grep -q 0 && echo $msgid ; done
Then, crucial moment, try to pull the new mails from the backups on curie:
anarcat@curie:~(main)$ smd-pull -n --show-tags -v
Found lockfile of a dead instance. Ignored.
Phase 0: handshake
Phase 1: changes detection
5K emails scanned
10K emails scanned
15K emails scanned
20K emails scanned
25K emails scanned
30K emails scanned
35K emails scanned
40K emails scanned
45K emails scanned
50K emails scanned
Phase 2: synchronization
Phase 3: agreement
default: smd-client@localhost: TAGS: stats::new-mails(49687), del-mails(0), bytes-received(215752279), xdelta-received(3703852)
"smd-pull -n --show-tags -v" took 3 mins 39 secs
This brought me back to the state after the backup plus the mails delivered during the day, which means I had to catchup with all my holiday's read emails (1440 mails!) but thankfully I made a dump of the notmuch database on curie at the start of the procedure, so this actually restored a sane state:
pv notmuch.dump | notmuch restore
Phew!
Workaround
I have filed this as a bug in upstream issue 18. Considering I filed 11 issues and only 3 of those were closed, I'm not holding my breath. I nevertheless filed PR 19 in the hope that this will fix my particular issue, but I'm not even sure this is the right fix...
Fix
At this point, I'm really ready to give up on SMD. It's really, really nice to be able to sync mail over SSH because I don't need to store my IMAP password on disk. But surely there are more reliable syncing mechanisms. I do not remember ever losing that much mail before. At worst, offlineimap would duplicate emails like mad, but never destroy my entire mail spool that way.
As mentioned before, there are other programs that sync mail. I'm looking at:
- offlineimap3: requires IMAP, used the py2 version in the past, might just still work, first sync painful
- isync/mbsync: might be faster, I remember having trouble switching from offlineimap to this
- getmail: just downloads email, might not be enough
- nncp: treat the local spool as another mail server, might not be compatible with my "multiple clients" setup
- doveadm-sync: requires dovecot on both ends, but supports using SSH to sync, will try this next
from Planet Python
via read more
No comments:
Post a Comment