Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.12.4
- Fix Version/s: None
- Environment: CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64
- Severity: 3
Description
We have been running lfs migrate -m 1 on a client for several days now to free up inodes on an MDT. However, when we tried to launch multiple lfs migrate -m 1 commands (more than 4) on different directory trees at the same time, from a single client, we crashed the MDS serving fir-MDT0001 with the following assertion:
[Fri May 1 16:46:48 2020][3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
[Fri May 1 16:46:48 2020][3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
backtrace:
[3824898.399369] Lustre: fir-MDT0001: Connection restored to 7862f6c9-0098-4 (at 10.50.8.41@o2ib2)
[3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
[3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
[3824911.248150] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) LBUG
[3824911.254941] Pid: 22403, comm: mdt00_022 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[3824911.265305] Call Trace:
[3824911.267941] [<ffffffffc0c9b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[3824911.274687] [<ffffffffc0c9b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[3824911.281077] [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
[3824911.287838] [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
[3824911.295127] [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[3824911.302240] [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
[3824911.308908] [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
[3824911.315306] [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[3824911.322778] [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
[3824911.329526] [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
[3824911.335765] [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[3824911.342508] [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
[3824911.348401] [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[3824911.355537] [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3824911.363446] [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3824911.369949] [<ffffffffb04c2e81>] kthread+0xd1/0xe0
[3824911.375036] [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
[3824911.381683] [<ffffffffffffffff>] 0xffffffffffffffff
[3824911.386893] Kernel panic - not syncing: LBUG
[3824911.391339] CPU: 28 PID: 22403 Comm: mdt00_022 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
[3824911.404191] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
[3824911.412016] Call Trace:
[3824911.414649] [<ffffffffb0b65147>] dump_stack+0x19/0x1b
[3824911.419967] [<ffffffffb0b5e850>] panic+0xe8/0x21f
[3824911.424938] [<ffffffffc0c9b8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[3824911.431323] [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
[3824911.438044] [<ffffffffc0e9c39b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass]
[3824911.445715] [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
[3824911.452951] [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[3824911.460011] [<ffffffffc1830a7e>] ? lod_xattr_get+0xee/0x700 [lod]
[3824911.466387] [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
[3824911.473014] [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
[3824911.479378] [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[3824911.486780] [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
[3824911.493499] [<ffffffffc0eb3fa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
[3824911.500654] [<ffffffffc0eb4bf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
[3824911.508318] [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
[3824911.514503] [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[3824911.521213] [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
[3824911.527097] [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[3824911.534180] [<ffffffffc11bc0b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[3824911.541925] [<ffffffffc0c9bbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[3824911.549176] [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3824911.557038] [<ffffffffc1183565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[3824911.564004] [<ffffffffb04cfeb4>] ? __wake_up+0x44/0x50
[3824911.569438] [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3824911.575911] [<ffffffffc118a270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[3824911.583478] [<ffffffffb04c2e81>] kthread+0xd1/0xe0
[3824911.588528] [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
[3824911.594796] [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
[3824911.601407] [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
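For reference, the workload that triggered this looked roughly like the following (a hypothetical sketch; the directory paths and the exact number of concurrent migrations are illustrative, not the real ones):

# launch several directory migrations to MDT index 1 in parallel from a single client
for dir in /fir/users/lab{1..5}; do
    lfs migrate -m 1 "$dir" &
done
wait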
Note: please ignore the following lines in the logs; they are not relevant. They come from a script that periodically tried to access the wrong sysfs files (i.e. this is not a backend device error):
mpt3sas_cm0: log_info(0x31200205): originator(PL), code(0x20), sub_code(0x0205)
Shortly before the crash, we can see the following lines in syslog:
[3824620.448341] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x2400576ec:0x149ae:0x0]: rc = -2
[3824640.928979] LustreError: 42016:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xcf72:0x0]: rc = -2
[3824658.778134] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xe2f2:0x0]: rc = -2
[3824678.792561] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xf468:0x0]: rc = -2
[3824696.395767] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x10477:0x0]: rc = -2
[3824714.310806] LustreError: 42123:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x11206:0x0]: rc = -2
[3824730.506605] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x120e7:0x0]: rc = -2
[3824768.104569] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x13f7d:0x0]: rc = -2
[3824768.117096] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message
[3824822.439361] LustreError: 28226:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0001: object [0x240057703:0x16d59:0x0] not found: rc = -2
[3824840.123675] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x17b2e:0x0]: rc = -2
I tried fid2path on those FIDs from a client, but they could not be found.
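For reference, the lookup was of this form (assuming the client mount point is /fir, which is an assumption on my part):

lfs fid2path /fir [0x240057703:0x16d59:0x0]
# no path is returned, which matches the rc = -2 (-ENOENT) errors seen on the MDS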
This issue has occurred only once, on May 1. I'm attaching vmcore-dmesg.txt as fir-md1-s2_20200501_vmcore-dmesg.txt.
vmcore:
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Fri May 1 16:46:48 2020
      UPTIME: 44 days, 06:26:43
LOAD AVERAGE: 1.98, 1.74, 1.43
       TASKS: 1919
    NODENAME: fir-md1-s2
     RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
     VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
     MACHINE: x86_64  (1996 Mhz)
      MEMORY: 255.6 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 22403
     COMMAND: "mdt00_022"
        TASK: ffff8b0d6f65d140  [THREAD_INFO: ffff8afd6cfc4000]
         CPU: 28
       STATE: TASK_RUNNING (PANIC)
I have uploaded this vmcore to WC's FTP server as fir-md1-s2_20200501164658_vmcore.
Also attached is the output of crash's "foreach bt" command as fir-md1-s2_crash_foreach_bt_20200501164658.txt.
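In case it helps reproduce the analysis, that file was generated with the crash utility roughly as follows (a sketch; the exact invocation may have differed):

# open the vmcore with the matching debuginfo vmlinux
crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux vmcore
# then, at the crash prompt, write a backtrace of every task to a file
crash> foreach bt > fir-md1-s2_crash_foreach_bt_20200501164658.txt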
Let me know if you need anything else that could help avoid this in the future. Thanks!