Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version: Lustre 2.12.4
Environment: Dell R740, CentOS 7.8, kernel 3.10.0-1127.el7.x86_64, lustre-2.12.4-1.el7.x86_64, zfs-0.7.13-1.el7.x86_64, spl-0.7.13-1.el7.x86_64
Severity: 3
Description
Hi Folks,
We recently upgraded our Lustre ZFS servers at SUT and have been experiencing an issue with the ZFS filesystem crashing. Last week we upgraded from Lustre 2.10.5 (plus a dozen patches) and ZFS 0.7.9 to Lustre 2.12.4 and ZFS 0.7.13.
Now if we import and mount our main zfs/lustre filesystem, resume Slurm jobs, and then start the Slurm partitions, we hit a kernel panic on the MDS shortly after the partitions are up:
May 8 20:12:37 warble2 kernel: VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg)) failed
May 8 20:12:37 warble2 kernel: PANIC at dnode.c:1635:dnode_setdirty()
May 8 20:12:37 warble2 kernel: Showing stack for process 45209
May 8 20:12:37 warble2 kernel: CPU: 7 PID: 45209 Comm: mdt01_123 Tainted: P OE ------------ 3.10.0-1127.el7.x86_64 #1
May 8 20:12:37 warble2 kernel: Hardware name: Dell Inc. PowerEdge R740/0JM3W2, BIOS 2.5.4 01/13/2020
May 8 20:12:37 warble2 kernel: Call Trace:
May 8 20:12:37 warble2 kernel: [<ffffffff9077ff85>] dump_stack+0x19/0x1b
May 8 20:12:37 warble2 kernel: [<ffffffffc04d4f24>] spl_dumpstack+0x44/0x50 [spl]
May 8 20:12:37 warble2 kernel: [<ffffffffc04d4ff9>] spl_panic+0xc9/0x110 [spl]
May 8 20:12:37 warble2 kernel: [<ffffffff900c7780>] ? wake_up_atomic_t+0x30/0x30
May 8 20:12:37 warble2 kernel: [<ffffffffc0c21073>] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc04d0238>] ? spl_kmem_zalloc+0xd8/0x180 [spl]
May 8 20:12:37 warble2 kernel: [<ffffffff90784002>] ? mutex_lock+0x12/0x2f
May 8 20:12:37 warble2 kernel: [<ffffffffc0c31a2c>] ? dmu_objset_userquota_get_ids+0x23c/0x440 [zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc0c40f39>] dnode_setdirty+0xe9/0xf0 [zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc0c4120c>] dnode_allocate+0x18c/0x230 [zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc0c2dd2b>] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc1630032>] __osd_object_create+0x82/0x170 [osd_zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc163027b>] osd_mksym+0x6b/0x110 [osd_zfs]
May 8 20:12:37 warble2 kernel: [<ffffffff907850c2>] ? down_write+0x12/0x3d
May 8 20:12:37 warble2 kernel: [<ffffffffc162b966>] osd_create+0x316/0xaf0 [osd_zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc18ed9c5>] lod_sub_create+0x1f5/0x480 [lod]
May 8 20:12:37 warble2 kernel: [<ffffffffc18de179>] lod_create+0x69/0x340 [lod]
May 8 20:12:37 warble2 kernel: [<ffffffffc1622690>] ? osd_trans_create+0x410/0x410 [osd_zfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc1958173>] mdd_create_object_internal+0xc3/0x300 [mdd]
May 8 20:12:37 warble2 kernel: [<ffffffffc194122b>] mdd_create_object+0x7b/0x820 [mdd]
May 8 20:12:37 warble2 kernel: [<ffffffffc194b7b8>] mdd_create+0xdd8/0x14a0 [mdd]
May 8 20:12:37 warble2 kernel: [<ffffffffc17d96d4>] mdt_create+0xb54/0x1090 [mdt]
May 8 20:12:37 warble2 kernel: [<ffffffffc119ae94>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
May 8 20:12:37 warble2 kernel: [<ffffffffc17d9d7b>] mdt_reint_create+0x16b/0x360 [mdt]
May 8 20:12:37 warble2 kernel: [<ffffffffc17dc963>] mdt_reint_rec+0x83/0x210 [mdt]
May 8 20:12:37 warble2 kernel: [<ffffffffc17b9273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
May 8 20:12:37 warble2 kernel: [<ffffffffc17c46e7>] mdt_reint+0x67/0x140 [mdt]
May 8 20:12:37 warble2 kernel: [<ffffffffc14af64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May 8 20:12:37 warble2 kernel: [<ffffffffc1488d91>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
May 8 20:12:37 warble2 kernel: [<ffffffffc07dcbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
May 8 20:12:37 warble2 kernel: [<ffffffffc145447b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May 8 20:12:37 warble2 kernel: [<ffffffffc1451295>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
May 8 20:12:37 warble2 kernel: [<ffffffff900d3dc3>] ? __wake_up+0x13/0x20
May 8 20:12:37 warble2 kernel: [<ffffffffc1457de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May 8 20:12:37 warble2 kernel: [<ffffffffc14572b0>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
May 8 20:12:37 warble2 kernel: [<ffffffff900c6691>] kthread+0xd1/0xe0
May 8 20:12:37 warble2 kernel: [<ffffffff900c65c0>] ? insert_kthread_work+0x40/0x40
May 8 20:12:37 warble2 kernel: [<ffffffff90792d1d>] ret_from_fork_nospec_begin+0x7/0x21
May 8 20:12:37 warble2 kernel: [<ffffffff900c65c0>] ? insert_kthread_work+0x40/0x40
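For what it's worth, our reading of the panic is that dnode_setdirty() takes a "dirty hold" on the dnode via dnode_add_ref(), and dnode_add_ref() refuses to add a hold when the dnode's hold count has already dropped to zero (i.e. the dnode is racing with its own release), which is what trips the VERIFY during object creation on the MDT. Below is a tiny, self-contained user-space sketch of that assertion pattern; the names (toy_dnode, toy_add_ref, toy_setdirty) are made up for illustration and this is not the actual ZFS code:

/* Toy model of the failing VERIFY pattern, not ZFS source. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define VERIFY(expr) \
        do { if (!(expr)) { \
                fprintf(stderr, "PANIC: VERIFY(%s) failed\n", #expr); \
                abort(); \
        } } while (0)

struct toy_dnode { long holds; };       /* stands in for dn_holds */

static bool toy_add_ref(struct toy_dnode *dn)
{
        if (dn->holds == 0)             /* last hold already dropped */
                return false;           /* caller must not dirty this dnode */
        dn->holds++;
        return true;
}

static void toy_setdirty(struct toy_dnode *dn)
{
        /* mirrors the VERIFY(dnode_add_ref(dn, ...)) in dnode_setdirty() */
        VERIFY(toy_add_ref(dn));
}

int main(void)
{
        struct toy_dnode dn = { .holds = 1 };
        toy_setdirty(&dn);              /* fine: a hold still exists */
        dn.holds = 0;                   /* simulate a racing release/free */
        toy_setdirty(&dn);              /* trips the VERIFY, as on our MDS */
        return 0;
}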
This issue came up once last week and twice tonight. We note there is a bit of chatter over at https://github.com/openzfs/zfs/issues/8705, but no real feedback yet, and it has been open for some time. Are there any recommendations, based on the Lustre developers' experience, on how we might mitigate this particular problem?
Right now we're cloning our server image to include ZFS 0.8.3 to see if that will help.
Cheers,
Simon