We’d like to spotlight TrueOS community member Brad Alexander for documenting his experience repairing ZFS dataset replication with TrueOS. Thank you! His notes are posted here, and they’ve been added to the TrueOS handbook Troubleshooting section for later reference.
Forcibly Resetting ZFS Replication Using Command Line lpreserver
ZFS replication can be somewhat complex, and keeping all of the fiddly bits aligned can be fraught with danger. I recently had both of my TrueOS machines start failing to replicate. My desktop is called defiant, and has two pools, NX74205 and NCC1764. My laptop is yukon, and the pool is NCC74602. I am replicating to my FreeNAS server luna, to dataset NX80101/archive/<FQDN>. I will focus on what I did to get yukon working again in this document.
The SysAdm Client tray icon was pulsing red. Right-clicking on the icon and clicking Messages would show the message:
FAILED replication task on NCC74602 -> 192.168.47.20: LOGFILE: /var/log/lpreserver/lpreserver_failed.log
which was lifted from /var/log/lpreserver/lpreserver.log.
/var/log/lpreserver/lastrep-send.log shows very little information:
send from @auto-2017-07-12-01-00-00 to NCC74602/ROOT/12.0-CURRENT-up-20170623_120331@auto-2017-07-14-01-00-00
total estimated size is 0
TIME SENT SNAPSHOT
And no useful errors were being written to the lpreserver_failed.log.
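To look at all three logs mentioned above in one pass, something like the following sketch may help. The function name show_lpreserver_logs and the LOG_DIR variable are hypothetical names made up for illustration; only the directory and file names come from the paths quoted above.

```shell
#!/bin/sh
# Directory taken from the log paths quoted above.
# LOG_DIR is a hypothetical override, useful for trying this elsewhere.
LOG_DIR="${LOG_DIR:-/var/log/lpreserver}"

# Print the last N lines (default 20) of each lpreserver log that exists.
show_lpreserver_logs() {
    lines="${1:-20}"
    for name in lpreserver.log lpreserver_failed.log lastrep-send.log; do
        file="$LOG_DIR/$name"
        if [ -r "$file" ]; then
            echo "==> $file <=="
            tail -n "$lines" "$file"
        else
            echo "==> $file (missing or unreadable) <=="
        fi
    done
}
```

After sourcing the snippet, show_lpreserver_logs 50 would print up to the last 50 lines of each log, which makes it easier to compare what each file recorded around the time of the failure.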
The first approach I tried was to use the SysAdm Client:
I clicked on the dataset in question, then clicked Initialize. After waiting a few minutes, I clicked Start. I was immediately rewarded with a pulsing red icon in the system tray and received the same messages as above.
I was working with @RodMyers and @NorwegianRockCat, and I want to specially thank them. They suggested I use the lpreserver command line. So I issued these commands:
sudo lpreserver replicate init NCC74602 192.168.47.20
sudo lpreserver replicate run NCC74602 192.168.47.20
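Since the run step only makes sense after a successful init, the two commands above could be wrapped so that the second is skipped when the first fails. This is a minimal sketch: the function name reseed_replication is hypothetical, and it assumes lpreserver returns a non-zero exit status on failure, which the source does not confirm.

```shell
#!/bin/sh
# Re-initialize and then run replication for one pool/target pair.
# Uses only the two lpreserver subcommands quoted above.
# Assumption: lpreserver exits non-zero when a step fails.
reseed_replication() {
    pool="$1"
    target="$2"
    if [ -z "$pool" ] || [ -z "$target" ]; then
        echo "usage: reseed_replication <pool> <target-host>" >&2
        return 2
    fi
    # Stop here if init fails; running replication would fail as well.
    sudo lpreserver replicate init "$pool" "$target" || return 1
    sudo lpreserver replicate run "$pool" "$target"
}
```

For the laptop described here, the call would be reseed_replication NCC74602 192.168.47.20.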
Unfortunately, the replication failed again. I got these messages in the logs:
Fri Jul 14 09:03:34 EDT 2017: Removing NX80101/archive/yukon.sonsofthunder.nanobit.org/ROOT - re-created locally
cannot unmount '/mnt/NX80101/archive/yukon.sonsofthunder.nanobit.org/ROOT': Operation not permitted
Failed creating remote dataset!
cannot create 'NX80101/archive/yukon.sonsofthunder.nanobit.org/ROOT': dataset already exists
It turned out the existing dataset on luna had a number of children. I logged into luna (the FreeNAS server) and issued this command as root:
zfs destroy -r NX80101/archive/yukon.sonsofthunder.nanobit.org
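A recursive destroy like the one above cannot be undone, so it may be worth previewing it first. The sketch below lists the dataset with all of its children and snapshots, then asks zfs for a dry run (-n) with verbose output (-v), so nothing is deleted. The function name preview_destroy is hypothetical; it would be run as root on the FreeNAS side.

```shell
#!/bin/sh
# List a dataset with all of its children and snapshots, then show what a
# recursive destroy would remove. Nothing is deleted: -n makes it a dry run.
preview_destroy() {
    dataset="$1"
    if [ -z "$dataset" ]; then
        echo "usage: preview_destroy <dataset>" >&2
        return 2
    fi
    # Show every child dataset and snapshot under the given dataset.
    zfs list -r -t all -o name,used,mountpoint "$dataset" || return 1
    # Dry run: report what a recursive destroy would remove.
    zfs destroy -r -n -v "$dataset"
}
```

Calling preview_destroy NX80101/archive/<FQDN> with the real hostname filled in should report what the recursive destroy would remove. If the list matches expectations, the actual zfs destroy -r can follow.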
I then ran the replicate init and replicate run commands again from the TrueOS host, and replication worked! It has continued to work too, at least until the next fiddly bit breaks.
Be a part of the TrueOS Community! Users are friendly and knowledgeable about TrueOS and general Open Source computing, so stop by one of our channels and ask questions or join a discussion! TrueOS uses Gitter for real time chat and Discourse for our public forum.