[LU-11411] Lustre/ZFS snapshots mount error from llog - enhancement of snapshot-mount logic Created: 20/Sep/18  Updated: 25/Sep/18  Resolved: 25/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Benjamin Kirk Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS 7.5, triple homed Ethernet/FDR/EDR servers


Issue Links:
Duplicate
duplicates LU-11193 lsnapshot mount fails with DNE Resolved
Epic/Theme: zfs
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

(creating an LU based on email traffic, see http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2018-September/015898.html)

We have two filesystems, fsA & fsB (eadc below), both of which get snapshots taken daily, rotated over a week. It's a beautiful feature we've been using in production ever since it was introduced with 2.10.
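For reference, the rotation is essentially the sketch below (the cron-driven weekday naming and error handling are my own illustration; only the lctl snapshot_create/snapshot_destroy subcommands and their -F/-n options come from the 2.10 snapshot feature):

 #!/bin/bash
 # Minimal daily-rotation sketch: one weekday-named snapshot per filesystem,
 # recycled weekly. FSNAME and the naming scheme are illustrative.
 FSNAME=eadc
 SNAP="${FSNAME}_AutoSS-$(date +%a)"     # e.g. eadc_AutoSS-Mon
 # Drop last week's snapshot for this weekday (ignore failure if it is absent),
 # then take today's snapshot.
 lctl snapshot_destroy -F "$FSNAME" -n "$SNAP" || true
 lctl snapshot_create -F "$FSNAME" -n "$SNAP"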

-) We've got Lustre/ZFS 2.10.4 on CentOS 7.5.
-) Both fsA & fsB have changelogs active.
-) fsA has combined mgt/mdt on a single ZFS filesystem.
-) fsB has a single mdt on a single ZFS filesystem.
-) for fsA, I have no issues mounting any of the snapshots via lctl.
-) for fsB, I can mount the three most recent snapshots, then encounter errors:

 [root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Mon
 mounted the snapshot eadc_AutoSS-Mon with fsname 3d40bbc
 [root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Mon
 [root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Sun
 mounted the snapshot eadc_AutoSS-Sun with fsname 584c07a
 [root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Sun
 [root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Sat
 mounted the snapshot eadc_AutoSS-Sat with fsname 4e646fe
 [root@hpfs-fsl-mds0 ~]# lctl snapshot_umount -F eadc -n eadc_AutoSS-Sat
 [root@hpfs-fsl-mds0 ~]# lctl snapshot_mount -F eadc -n eadc_AutoSS-Fri
 mount.lustre: mount metadata/meta-eadc@eadc_AutoSS-Fri at /mnt/eadc_AutoSS-Fri_MDT0000 failed: Read-only file system
 Can't mount the snapshot eadc_AutoSS-Fri: Read-only file system


The relevant bits from dmesg are

 [1353434.417762] Lustre: 3d40bbc-MDT0000: set dev_rdonly on this device
 [1353434.417765] Lustre: Skipped 3 previous similar messages
 [1353434.649480] Lustre: 3d40bbc-MDT0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
 [1353434.649484] Lustre: Skipped 3 previous similar messages
 [1353434.866228] Lustre: 3d40bbc-MDD0000: changelog on
 [1353434.866233] Lustre: Skipped 1 previous similar message
 [1353435.427744] Lustre: 3d40bbc-MDT0000: Connection restored to ...@tcp (at ...@tcp)
 [1353435.427747] Lustre: Skipped 23 previous similar messages
 [1353445.255899] Lustre: Failing over 3d40bbc-MDT0000
 [1353445.255903] Lustre: Skipped 3 previous similar messages
 [1353445.256150] LustreError: 11-0: 3d40bbc-OST0000-osc-MDT0000: operation ost_disconnect to node ...@tcp failed: rc = -107
 [1353445.257896] LustreError: Skipped 23 previous similar messages
 [1353445.353874] Lustre: server umount 3d40bbc-MDT0000 complete
 [1353445.353877] Lustre: Skipped 3 previous similar messages
 [1353475.302224] Lustre: 4e646fe-MDD0000: changelog on
 [1353475.302228] Lustre: Skipped 1 previous similar message
 [1353498.964016] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) 36ca26b-MDT0000-osd: someone try to start transaction under readonly mode, should be disabled.
 [1353498.967260] LustreError: 25582:0:(osd_handler.c:341:osd_trans_create()) Skipped 1 previous similar message
 [1353498.968829] CPU: 6 PID: 25582 Comm: mount.lustre Kdump: loaded Tainted: P OE ------------ 3.10.0-862.6.3.el7.x86_64 #1
 [1353498.968830] Hardware name: Supermicro SYS-6027TR-D71FRF/X9DRT, BIOS 3.2a 08/04/2015
 [1353498.968832] Call Trace:
 [1353498.968841] [<ffffffffb5b0e80e>] dump_stack+0x19/0x1b
 [1353498.968851] [<ffffffffc0cbe5db>] osd_trans_create+0x38b/0x3d0 [osd_zfs]
 [1353498.968876] [<ffffffffc1116044>] llog_destroy+0x1f4/0x3f0 [obdclass]
 [1353498.968887] [<ffffffffc111f0f6>] llog_cat_reverse_process_cb+0x246/0x3f0 [obdclass]
 [1353498.968897] [<ffffffffc111a32c>] llog_reverse_process+0x38c/0xaa0 [obdclass]
 [1353498.968910] [<ffffffffc111eeb0>] ? llog_cat_process_cb+0x4e0/0x4e0 [obdclass]
 [1353498.968922] [<ffffffffc111af69>] llog_cat_reverse_process+0x179/0x270 [obdclass]
 [1353498.968932] [<ffffffffc1115585>] ? llog_init_handle+0xd5/0x9a0 [obdclass]
 [1353498.968943] [<ffffffffc1116e78>] ? llog_open_create+0x78/0x320 [obdclass]
 [1353498.968949] [<ffffffffc12e55f0>] ? mdd_root_get+0xf0/0xf0 [mdd]
 [1353498.968954] [<ffffffffc12ec7af>] mdd_prepare+0x13ff/0x1c70 [mdd]
 [1353498.968966] [<ffffffffc166b037>] mdt_prepare+0x57/0x3b0 [mdt]
 [1353498.968983] [<ffffffffc1183afd>] server_start_targets+0x234d/0x2bd0 [obdclass]
 [1353498.968999] [<ffffffffc1153500>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
 [1353498.969012] [<ffffffffc118541d>] server_fill_super+0x109d/0x185a [obdclass]
 [1353498.969025] [<ffffffffc115cef8>] lustre_fill_super+0x328/0x950 [obdclass]
 [1353498.969038] [<ffffffffc115cbd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
 [1353498.969041] [<ffffffffb561f3bf>] mount_nodev+0x4f/0xb0
 [1353498.969053] [<ffffffffc1154f18>] lustre_mount+0x38/0x60 [obdclass]
 [1353498.969055] [<ffffffffb561ff3e>] mount_fs+0x3e/0x1b0
 [1353498.969060] [<ffffffffb563d4b7>] vfs_kern_mount+0x67/0x110
 [1353498.969062] [<ffffffffb563fadf>] do_mount+0x1ef/0xce0
 [1353498.969066] [<ffffffffb55f7c2c>] ? kmem_cache_alloc_trace+0x3c/0x200
 [1353498.969069] [<ffffffffb5640913>] SyS_mount+0x83/0xd0
 [1353498.969074] [<ffffffffb5b20795>] system_call_fastpath+0x1c/0x21
 [1353498.969079] LustreError: 25582:0:(llog_cat.c:1027:llog_cat_reverse_process_cb()) 36ca26b-MDD0000: fail to destroy empty log: rc = -30
 [1353498.970785] CPU: 6 PID: 25582 Comm: mount.lustre Kdump: loaded Tainted: P OE ------------ 3.10.0-862.6.3.el7.x86_64 #1
 [1353498.970786] Hardware name: Supermicro SYS-6027TR-D71FRF/X9DRT, BIOS 3.2a 08/04/2015
 [1353498.970787] Call Trace:
 [1353498.970790] [<ffffffffb5b0e80e>] dump_stack+0x19/0x1b
 [1353498.970795] [<ffffffffc0cbe5db>] osd_trans_create+0x38b/0x3d0 [osd_zfs]
 [1353498.970807] [<ffffffffc1117921>] llog_cancel_rec+0xc1/0x880 [obdclass]
 [1353498.970817] [<ffffffffc111e13b>] llog_cat_cleanup+0xdb/0x380 [obdclass]
 [1353498.970827] [<ffffffffc111f14d>] llog_cat_reverse_process_cb+0x29d/0x3f0 [obdclass]
 [1353498.970838] [<ffffffffc111a32c>] llog_reverse_process+0x38c/0xaa0 [obdclass]
 [1353498.970848] [<ffffffffc111eeb0>] ? llog_cat_process_cb+0x4e0/0x4e0 [obdclass]
 [1353498.970858] [<ffffffffc111af69>] llog_cat_reverse_process+0x179/0x270 [obdclass]
 [1353498.970868] [<ffffffffc1115585>] ? llog_init_handle+0xd5/0x9a0 [obdclass]
 [1353498.970878] [<ffffffffc1116e78>] ? llog_open_create+0x78/0x320 [obdclass]
 [1353498.970883] [<ffffffffc12e55f0>] ? mdd_root_get+0xf0/0xf0 [mdd]
 [1353498.970887] [<ffffffffc12ec7af>] mdd_prepare+0x13ff/0x1c70 [mdd]
 [1353498.970894] [<ffffffffc166b037>] mdt_prepare+0x57/0x3b0 [mdt]
 [1353498.970908] [<ffffffffc1183afd>] server_start_targets+0x234d/0x2bd0 [obdclass]
 [1353498.970924] [<ffffffffc1153500>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
 [1353498.970938] [<ffffffffc118541d>] server_fill_super+0x109d/0x185a [obdclass]
 [1353498.970950] [<ffffffffc115cef8>] lustre_fill_super+0x328/0x950 [obdclass]
 [1353498.970962] [<ffffffffc115cbd0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
 [1353498.970964] [<ffffffffb561f3bf>] mount_nodev+0x4f/0xb0
 [1353498.970976] [<ffffffffc1154f18>] lustre_mount+0x38/0x60 [obdclass]
 [1353498.970978] [<ffffffffb561ff3e>] mount_fs+0x3e/0x1b0
 [1353498.970980] [<ffffffffb563d4b7>] vfs_kern_mount+0x67/0x110
 [1353498.970982] [<ffffffffb563fadf>] do_mount+0x1ef/0xce0
 [1353498.970984] [<ffffffffb55f7c2c>] ? kmem_cache_alloc_trace+0x3c/0x200
 [1353498.970986] [<ffffffffb5640913>] SyS_mount+0x83/0xd0
 [1353498.970989] [<ffffffffb5b20795>] system_call_fastpath+0x1c/0x21
 [1353498.970996] LustreError: 25582:0:(mdd_device.c:354:mdd_changelog_llog_init()) 36ca26b-MDD0000: changelog init failed: rc = -30
 [1353498.972790] LustreError: 25582:0:(mdd_device.c:427:mdd_changelog_init()) 36ca26b-MDD0000: changelog setup during init failed: rc = -30
 [1353498.974525] LustreError: 25582:0:(mdd_device.c:1061:mdd_prepare()) 36ca26b-MDD0000: failed to initialize changelog: rc = -30
 [1353498.976229] LustreError: 25582:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -30
 [1353499.072002] LustreError: 25582:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-30)

I'm hoping those traces mean something to someone - any ideas?

Thanks!



 Comments   
Comment by Nathaniel Clark [ 20/Sep/18 ]

I believe this is a duplicate of LU-11193

The ZFS system has to have DNE enabled for the mount of the snapshot to fail. 

There is a patch that appears to fix this issue:

https://review.whamcloud.com/33157
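
If you want to try it before it lands in a release, the change can be pulled straight from Gerrit and cherry-picked onto a 2.10.x tree, roughly as below (a sketch only; the tag name and the patch-set number in the refs/changes path are assumptions, so use whatever Gerrit shows for change 33157):

 git clone git://git.whamcloud.com/fs/lustre-release.git
 cd lustre-release
 git checkout v2_10_5      # or your 2.10.x base; tag name assumed here
 # Gerrit convention: refs/changes/<last-two-digits>/<change>/<patchset>
 git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/57/33157/1
 git cherry-pick FETCH_HEAD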

 

Comment by Benjamin Kirk [ 20/Sep/18 ]

We only have a single MDT on each filesystem; we just happen to have two separate filesystems hosted on the same servers.  So it's not clear to me we have DNE in the equation.

Comment by Nathaniel Clark [ 21/Sep/18 ]

Okay. It's the same code path. I guess one of your snapshots had llog data that needed to be cleaned up. DNE always has llog data to clean up.

Comment by Benjamin Kirk [ 21/Sep/18 ]

Ahh good. I’m glad it is something you can repeat in your test environment!

Comment by Benjamin Kirk [ 25/Sep/18 ]

I can confirm that the referenced patch (https://review.whamcloud.com/33157) on top of 2.10.5 allows me to mount all 11 snapshots from fsA and all 7 from fsB.

Thanks!
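
For anyone repeating that kind of sweep, it amounts to mounting and unmounting each snapshot in turn, something like the sketch below (the snapshot_list output parsing is an assumption about its format, adjust as needed):

 FSNAME=eadc
 # Walk every snapshot of $FSNAME, mount it and immediately unmount it again.
 # The awk pattern assumes "snapshot_name: <name>" lines in snapshot_list output.
 lctl snapshot_list -F "$FSNAME" | awk '/snapshot_name/ {print $2}' | while read SNAP; do
     lctl snapshot_mount  -F "$FSNAME" -n "$SNAP" || { echo "mount failed: $SNAP"; break; }
     lctl snapshot_umount -F "$FSNAME" -n "$SNAP"
 done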

 

Comment by Andreas Dilger [ 25/Sep/18 ]

Closing as a duplicate of LU-11193, which already has a patch.
