Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Affects Version/s: Lustre 2.12.4
- Fix Version/s: None
- Environment: CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64
- Severity: 3
Description
We have been running lfs migrate -m 1 on a client for several days now to free up inodes on an MDT. However, when we tried to launch multiple lfs migrate -m 1 commands (more than 4) on different directory trees at the same time, from a single client, we crashed the MDS serving fir-MDT0001 with the following assertion:
[Fri May 1 16:46:48 2020][3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
[Fri May 1 16:46:48 2020][3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
backtrace:
[3824898.399369] Lustre: fir-MDT0001: Connection restored to 7862f6c9-0098-4 (at 10.50.8.41@o2ib2)
[3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
[3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
[3824911.248150] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) LBUG
[3824911.254941] Pid: 22403, comm: mdt00_022 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[3824911.265305] Call Trace:
[3824911.267941] [<ffffffffc0c9b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[3824911.274687] [<ffffffffc0c9b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[3824911.281077] [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
[3824911.287838] [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
[3824911.295127] [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[3824911.302240] [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
[3824911.308908] [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
[3824911.315306] [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[3824911.322778] [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
[3824911.329526] [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
[3824911.335765] [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[3824911.342508] [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
[3824911.348401] [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[3824911.355537] [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3824911.363446] [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3824911.369949] [<ffffffffb04c2e81>] kthread+0xd1/0xe0
[3824911.375036] [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
[3824911.381683] [<ffffffffffffffff>] 0xffffffffffffffff
[3824911.386893] Kernel panic - not syncing: LBUG
[3824911.391339] CPU: 28 PID: 22403 Comm: mdt00_022 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
[3824911.404191] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
[3824911.412016] Call Trace:
[3824911.414649] [<ffffffffb0b65147>] dump_stack+0x19/0x1b
[3824911.419967] [<ffffffffb0b5e850>] panic+0xe8/0x21f
[3824911.424938] [<ffffffffc0c9b8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[3824911.431323] [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
[3824911.438044] [<ffffffffc0e9c39b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass]
[3824911.445715] [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
[3824911.452951] [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[3824911.460011] [<ffffffffc1830a7e>] ? lod_xattr_get+0xee/0x700 [lod]
[3824911.466387] [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
[3824911.473014] [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
[3824911.479378] [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[3824911.486780] [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
[3824911.493499] [<ffffffffc0eb3fa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
[3824911.500654] [<ffffffffc0eb4bf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
[3824911.508318] [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
[3824911.514503] [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[3824911.521213] [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
[3824911.527097] [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[3824911.534180] [<ffffffffc11bc0b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[3824911.541925] [<ffffffffc0c9bbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[3824911.549176] [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3824911.557038] [<ffffffffc1183565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[3824911.564004] [<ffffffffb04cfeb4>] ? __wake_up+0x44/0x50
[3824911.569438] [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3824911.575911] [<ffffffffc118a270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[3824911.583478] [<ffffffffb04c2e81>] kthread+0xd1/0xe0
[3824911.588528] [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
[3824911.594796] [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
[3824911.601407] [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
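For reference, the workload that triggered this looked roughly like the following (a hypothetical sketch; the directory paths and the exact number of concurrent migrations are illustrative, not the real ones):

# launch several directory migrations to MDT index 1 in parallel from a single client
for dir in /fir/users/lab{1..5}; do
    lfs migrate -m 1 "$dir" &
done
wait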
Note: please ignore the following lines in the logs; they are not relevant. They come from a script that periodically tried to access the wrong sysfs files (i.e. this is not a backend device error):
mpt3sas_cm0: log_info(0x31200205): originator(PL), code(0x20), sub_code(0x0205)
Shortly before the crash, we can see the following lines in syslog:
[3824620.448341] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x2400576ec:0x149ae:0x0]: rc = -2
[3824640.928979] LustreError: 42016:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xcf72:0x0]: rc = -2
[3824658.778134] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xe2f2:0x0]: rc = -2
[3824678.792561] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xf468:0x0]: rc = -2
[3824696.395767] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x10477:0x0]: rc = -2
[3824714.310806] LustreError: 42123:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x11206:0x0]: rc = -2
[3824730.506605] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x120e7:0x0]: rc = -2
[3824768.104569] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x13f7d:0x0]: rc = -2
[3824768.117096] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message
[3824822.439361] LustreError: 28226:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0001: object [0x240057703:0x16d59:0x0] not found: rc = -2
[3824840.123675] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x17b2e:0x0]: rc = -2
I tried fid2path on those FIDs from a client, but they could not be found.
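For reference, the lookup was of this form (assuming the client mount point is /fir, which is an assumption on my part):

lfs fid2path /fir [0x240057703:0x16d59:0x0]
# no path is returned, which matches the rc = -2 (-ENOENT) errors seen on the MDS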
This issue has occurred only once, on May 1. I'm attaching vmcore-dmesg.txt as fir-md1-s2_20200501_vmcore-dmesg.txt.
vmcore:
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Fri May 1 16:46:48 2020
      UPTIME: 44 days, 06:26:43
LOAD AVERAGE: 1.98, 1.74, 1.43
       TASKS: 1919
    NODENAME: fir-md1-s2
     RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
     VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
     MACHINE: x86_64  (1996 Mhz)
      MEMORY: 255.6 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 22403
     COMMAND: "mdt00_022"
        TASK: ffff8b0d6f65d140  [THREAD_INFO: ffff8afd6cfc4000]
         CPU: 28
       STATE: TASK_RUNNING (PANIC)
I have uploaded this vmcore to WC's FTP server as fir-md1-s2_20200501164658_vmcore.
Also attached is the output of crash's "foreach bt" command as fir-md1-s2_crash_foreach_bt_20200501164658.txt.
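In case it helps reproduce the analysis, that file was generated with the crash utility roughly as follows (a sketch; the exact invocation may have differed):

# open the vmcore with the matching debuginfo vmlinux
crash /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux vmcore
# then, at the crash prompt, write a backtrace of every task to a file
crash> foreach bt > fir-md1-s2_crash_foreach_bt_20200501164658.txt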
Let me know if you need anything else that could help avoid this in the future. Thanks!