[LU-15716] OSD-ZFS / panic when mounting Lustre with ZFS_DEBUG enabled Created: 04/Apr/22  Updated: 04/Apr/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Lukasz Flis Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None
Environment:

ZFS: 2.1.3
Lustre: 2.15 (2.15.0_RC2_38_g8e8bbc0)


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While trying to debug LU-15586 we enabled ZFS_DEBUG with the following set of flags:

ZFS_DEBUG_MODIFY | ZFS_DEBUG_DBUF_VERIFY | ZFS_DEBUG_DNODE_VERIFY
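
For completeness, a minimal sketch of how this mask is composed. The bit values are copied here from include/sys/zfs_debug.h as of the 2.1 tree (assumption: they have not moved); the resulting number is what gets written to the zfs_flags module parameter, and the checks behind the bits are only compiled in when ZFS is built with debugging (ZFS_DEBUG) enabled.

#include <stdio.h>

/* Bit values as found in include/sys/zfs_debug.h (ZFS 2.1; assumed unchanged). */
#define ZFS_DEBUG_DBUF_VERIFY   (1 << 1)
#define ZFS_DEBUG_DNODE_VERIFY  (1 << 2)
#define ZFS_DEBUG_MODIFY        (1 << 4)

int main(void)
{
        unsigned int mask = ZFS_DEBUG_MODIFY | ZFS_DEBUG_DBUF_VERIFY |
            ZFS_DEBUG_DNODE_VERIFY;

        /* The printed value is what would be written to
         * /sys/module/zfs/parameters/zfs_flags. */
        printf("zfs_flags = %u (0x%x)\n", mask, mask);
        return 0;
}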

Enabling ZFS_DEBUG switches ZFS to a different set of refcount functions.
To make the Lustre build possible, the zfs module has to be modified to export the zfs_refcount_add() function, which is invoked from osd-zfs.
Modification in module/zfs/refcount.c:

+#if defined(_KERNEL)
+EXPORT_SYMBOL(zfs_refcount_add);
+#endif
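
For context on why the export is needed, the sketch below paraphrases the zfs_refcount_add() definition in include/sys/zfs_refcount.h (exact contents vary by ZFS version, so treat it as an assumption): in a non-debug build the call collapses to an atomic increment, while under ZFS_DEBUG it is a real tracking function implemented in module/zfs/refcount.c, so an out-of-tree caller such as osd-zfs needs the symbol exported.

#ifdef  ZFS_DEBUG
/* Debug build: real function with holder tracking, defined in
 * module/zfs/refcount.c -- this is the symbol osd-zfs needs exported. */
int64_t zfs_refcount_add(zfs_refcount_t *rc, const void *holder);
#else
/* Non-debug build: plain atomic increment, no module symbol involved. */
#define zfs_refcount_add(rc, holder)    atomic_inc_64_nv(&(rc)->rc_count)
#endif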

 

Unfortunately, enabling the extra checks in ZFS results in a kernel panic when mounting OST/MDT resources:

[87796.641026] Kernel panic - not syncing: dirtying dbuf obj=c80092 lvl=0 blkid=10 but not tx_held

[87796.677146] CPU: 18 PID: 16207 Comm: mount.lustre Kdump: loaded Tainted: P          IOE    --------- -  - 4.18.0-348.7.1.el8_5.x86_64 #1
[87796.715272] Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 7.99 03/11/2021
[87796.735224] Call Trace:
[87796.750373]  dump_stack+0x5c/0x80
[87796.766219]  panic+0xe7/0x2a9
[87796.781619]  dmu_tx_dirty_buf+0x117/0x3f0 [zfs]
[87796.798682]  ? rrw_enter_read_impl+0x125/0x220 [zfs]
[87796.815679]  dbuf_dirty+0x5e/0x1530 [zfs]
[87796.831998]  ? dbuf_read+0x139/0x680 [zfs]
[87796.847813]  dmu_write_impl+0x44/0x150 [zfs]
[87796.863641]  dmu_write_by_dnode+0x8e/0xe0 [zfs]
[87796.879597]  osd_write+0x118/0x3a0 [osd_zfs]
[87796.895392]  dt_record_write+0x32/0x110 [obdclass]
[87796.911190]  llog_osd_write_rec+0xd06/0x1ae0 [obdclass]
[87796.927699]  llog_write_rec+0x3f6/0x530 [obdclass]
[87796.943235]  llog_write+0x4df/0x550 [obdclass]
[87796.958243]  llog_process_thread+0xb8e/0x1aa0 [obdclass]
[87796.974051]  ? llog_process_or_fork+0x5e/0x560 [obdclass]
[87796.990079]  ? kmem_cache_alloc_trace+0x131/0x270
[87797.004891]  ? llog_write+0x550/0x550 [obdclass]
[87797.019560]  llog_process_or_fork+0x1c1/0x560 [obdclass]
[87797.034729]  llog_backup+0x354/0x520 [obdclass]
[87797.048909]  mgc_llog_local_copy+0x110/0x420 [mgc]
[87797.063504]  mgc_process_cfg_log+0x971/0xd80 [mgc]
[87797.077634]  mgc_process_log+0x6c3/0x800 [mgc]
[87797.091414]  ? config_log_add+0x3f5/0xa00 [mgc]
[87797.104982]  mgc_process_config+0xb53/0xe60 [mgc]
[87797.118529]  lustre_process_log+0x5fa/0xad0 [obdclass]
[87797.132327]  ? server_register_mount+0x4d1/0x740 [obdclass]
[87797.146529]  server_start_targets+0x1504/0x3010 [obdclass]
[87797.160454]  ? strlcpy+0x2d/0x40
[87797.171898]  ? class_config_dump_handler+0x730/0x730 [obdclass]
[87797.186054]  ? mgc_set_info_async+0x539/0xad0 [mgc]
[87797.198949]  ? mgc_set_info_async+0x539/0xad0 [mgc]
[87797.211583]  ? lustre_start_mgc+0xf7c/0x27c0 [obdclass]
[87797.224758]  server_fill_super+0x8ea/0x10d0 [obdclass]
[87797.237408]  lustre_fill_super+0x3a1/0x3f0 [lustre]
[87797.249568]  ? ll_inode_destroy_callback+0x120/0x120 [lustre]
[87797.262647]  mount_nodev+0x48/0xa0
[87797.273054]  legacy_get_tree+0x27/0x40
[87797.283602]  vfs_get_tree+0x25/0xb0
[87797.294051]  do_mount+0x2e2/0x950
[87797.303806]  ksys_mount+0xb6/0xd0
[87797.313335]  __x64_sys_mount+0x21/0x30
[87797.323140]  do_syscall_64+0x5b/0x1a0
[87797.332660]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[87797.343521] RIP: 0033:0x7fb3c21f892e 
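
The panic itself comes from dmu_tx_dirty_buf() in module/zfs/dmu_tx.c, a verification that only exists in ZFS_DEBUG builds: it checks that the dbuf being dirtied is covered by a hold on the transaction and panics otherwise. Below is a deliberately simplified, self-contained model of that check (our own types and names, not the OpenZFS code) to illustrate what the osd-zfs llog write path appears to be tripping over.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-ins for the transaction hold list and the dbuf. */
struct tx_hold { uint64_t object; struct tx_hold *next; };
struct tx      { struct tx_hold *holds; };
struct dbuf    { uint64_t object; unsigned level; uint64_t blkid; };

/* Model of the ZFS_DEBUG-only consistency check: dirtying a buffer whose
 * object is not covered by any hold on the transaction is fatal. */
static void
model_dmu_tx_dirty_buf(const struct tx *tx, const struct dbuf *db)
{
        for (const struct tx_hold *h = tx->holds; h != NULL; h = h->next)
                if (h->object == db->object)
                        return;         /* covered by a hold */

        fprintf(stderr,
            "dirtying dbuf obj=%llx lvl=%u blkid=%llx but not tx_held\n",
            (unsigned long long)db->object, db->level,
            (unsigned long long)db->blkid);
        abort();
}

int main(void)
{
        struct tx_hold held = { .object = 0xc80091, .next = NULL };
        struct tx tx = { .holds = &held };

        /* Values from the panic message above: obj=c80092 lvl=0 blkid=10.
         * The object was never declared on the tx, so the check fires. */
        struct dbuf db = { .object = 0xc80092, .level = 0, .blkid = 10 };
        model_dmu_tx_dirty_buf(&tx, &db);
        return 0;
}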

For reference, the problem is tracked on the ZFS side here:

https://github.com/openzfs/zfs/issues/13144

