Details
Type: Bug
Resolution: Fixed
Priority: Major
Affects Version/s: Lustre 2.12.7
Environment:
lustre-2.12.7_2.llnl-2
zfs-0.7.11-9.8llnl.ch6.x86_64
3.10.0-1160.45.1.1chaos.ch6.x86_64
rhel7-based
Severity: 3
Description
Looks like the same issue as LU-10433 to me. It occurred during shutdown of an MDT. The vmcore-dmesg.txt file from the crash dump shows:
[7949565.740966] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b09c0[0x4, 1, [0x1:0x0:0x0] hash exist]{
[7949565.756016] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0a10
[7949565.768738] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c80140osd-zfs-object@ffffa03aa6c80140
[7949565.783886] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b09c0
[7949565.795741] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) header@ffffa03aa81b00c0[0x4, 1, [0xa:0x0:0x0] hash exist]{
[7949565.810788] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....local_storage@ffffa03aa81b0110
[7949565.823509] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) ....osd-zfs@ffffa03aa6c803c0osd-zfs-object@ffffa03aa6c803c0
[7949565.838649] LustreError: 4224:0:(osd_handler.c:1351:osd_device_free()) } header@ffffa03aa81b00c0
[7949565.850558] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) ASSERTION( !cfs_hash_with_assert_empty(hs) ) failed: hash lu_site_osd-zfs bucket 1(3) is not empty: 1 items left
[7949565.868513] LustreError: 4224:0:(hash.c:1111:cfs_hash_destroy()) LBUG
[7949565.875900] Pid: 4224, comm: umount 3.10.0-1160.36.2.1chaos.ch6.x86_64 #1 SMP Wed Jul 21 15:34:23 PDT 2021
[7949565.886871] Call Trace:
[7949565.889807] [<ffffffffc12407ec>] libcfs_call_trace+0x8c/0xd0 [libcfs]
[7949565.897316] [<ffffffffc12408ac>] lbug_with_loc+0x4c/0xa0 [libcfs]
[7949565.904423] [<ffffffffc124f85c>] cfs_hash_putref+0x3cc/0x520 [libcfs]
[7949565.911928] [<ffffffffc1529e54>] lu_site_fini+0x54/0xa0 [obdclass]
[7949565.919162] [<ffffffffc133d0cb>] osd_device_free+0x9b/0x2e0 [osd_zfs]
[7949565.926665] [<ffffffffc14fcf82>] class_free_dev+0x4c2/0x720 [obdclass]
[7949565.934267] [<ffffffffc14fd3e0>] class_export_put+0x200/0x2d0 [obdclass]
[7949565.942059] [<ffffffffc14fef05>] class_unlink_export+0x145/0x180 [obdclass]
[7949565.950159] [<ffffffffc1514990>] class_decref+0x80/0x160 [obdclass]
[7949565.950169] [<ffffffffc1514e13>] class_detach+0x1d3/0x300 [obdclass]
[7949565.950179] [<ffffffffc151bae8>] class_process_config+0x1a38/0x2830 [obdclass]
[7949565.950189] [<ffffffffc151cac0>] class_manual_cleanup+0x1e0/0x710 [obdclass]
[7949565.950197] [<ffffffffc133cd15>] osd_obd_disconnect+0x165/0x1a0 [osd_zfs]
[7949565.950208] [<ffffffffc1526cc6>] lustre_put_lsi+0x106/0x4d0 [obdclass]
[7949565.950217] [<ffffffffc1527200>] lustre_common_put_super+0x170/0x270 [obdclass]
[7949565.950230] [<ffffffffc154ea00>] server_put_super+0x120/0xd00 [obdclass]
[7949565.950235] [<ffffffffbca61e5d>] generic_shutdown_super+0x6d/0x110
[7949565.950236] [<ffffffffbca61f12>] kill_anon_super+0x12/0x20
[7949565.950246] [<ffffffffc151f6b2>] lustre_kill_super+0x32/0x50 [obdclass]
[7949565.950247] [<ffffffffbca6189e>] deactivate_locked_super+0x4e/0x70
[7949565.950248] [<ffffffffbca61906>] deactivate_super+0x46/0x60
[7949565.950251] [<ffffffffbca8372f>] cleanup_mnt+0x3f/0x80
[7949565.950253] [<ffffffffbca837c2>] __cleanup_mnt+0x12/0x20
[7949565.950255] [<ffffffffbc8c7cab>] task_work_run+0xbb/0xf0
[7949565.950258] [<ffffffffbc82dd95>] do_notify_resume+0xa5/0xc0
[7949565.950262] [<ffffffffbcfc44ef>] int_signal+0x12/0x17
[7949565.950293] [<ffffffffffffffff>] 0xffffffffffffffff
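For context, the LBUG is the emptiness check run when the lu_site object hash is torn down during lu_site_fini()/cfs_hash_putref() from osd_device_free(): if some lu_object (here the local_storage/osd-zfs objects dumped just above) is still referenced at umount time, its hash bucket is not empty and the assertion fires. The following is only a minimal standalone sketch of that failure pattern, not the actual Lustre code; the toy_site/toy_object names and the simplified reference counting are hypothetical and purely illustrative.

/* Toy illustration of "object leaked at teardown -> assert on non-empty bucket". */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 4

struct toy_object {
    int refcount;                 /* stand-in for an lu_object reference count */
    struct toy_object *next;
};

struct toy_site {
    struct toy_object *bucket[NBUCKETS];  /* stand-in for the lu_site object hash */
};

/* Drop a reference; only unlink and free the object when the count hits zero. */
static void toy_object_put(struct toy_site *site, int b, struct toy_object *obj)
{
    if (--obj->refcount > 0)
        return;
    struct toy_object **pp = &site->bucket[b];
    while (*pp && *pp != obj)
        pp = &(*pp)->next;
    if (*pp) {
        *pp = obj->next;
        free(obj);
    }
}

/* Teardown path: every bucket must already be empty; a leftover
 * (still-referenced) object trips the assertion, i.e. the LBUG-equivalent. */
static void toy_site_fini(struct toy_site *site)
{
    for (int b = 0; b < NBUCKETS; b++) {
        if (site->bucket[b] != NULL)
            fprintf(stderr, "bucket %d is not empty: object leaked\n", b);
        assert(site->bucket[b] == NULL);
    }
}

int main(void)
{
    struct toy_site site = { 0 };
    struct toy_object *obj = calloc(1, sizeof(*obj));

    obj->refcount = 2;              /* an extra reference is taken and never dropped */
    site.bucket[1] = obj;

    toy_object_put(&site, 1, obj);  /* only one put: refcount stays at 1 */
    toy_site_fini(&site);           /* aborts: bucket 1 is not empty */
    return 0;
}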
We had never seen this before, but started seeing it frequently on 2021-11-10. Before that, there were two changes that may be related:
1. We enabled changelogs on the filesystem ("brass") and started consuming them
2. We updated the system to the above-mentioned kernel and Lustre versions. I'll find the prior versions and post them.
We've seen this crash 5 times so far (in about 1 month), when failing over MDTs for maintenance on the nodes.
For Lustre patch stack, see https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl
For ZFS patch stack, see https://github.com/LLNL/zfs/releases/tag/zfs-0.7.11-8llnl
For SPL patch stack, see https://github.com/LLNL/spl/releases/tag/spl-0.7.11-8llnl
(Note that the spl/zfs rpm versions are misleading; it really is the spl- and zfs-0.7.11-8llnl tags that were used to build those rpms.)