[LU-13511] MDS 2.12.4 ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
Affects Version/s: Lustre 2.12.4
Labels:
None
Environment:
CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64

Severity:
3
Rank (Obsolete):
9223372036854775807

Description

We have been running lfs migrate -m 1 on a client for several days now to free up inodes on a MDT, but when trying to launch multiple lfs migrate -m 1 (like more than 4) on different directory trees, at the same time, on a (single) client, we ended up crashing the MDS of fir-MDT0001 with the following assertion:

[Fri May  1 16:46:48 2020][3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]^M
[Fri May  1 16:46:48 2020][3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed: ^M

backtrace:

[3824898.399369] Lustre: fir-MDT0001: Connection restored to 7862f6c9-0098-4 (at 10.50.8.41@o2ib2)
[3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
[3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
[3824911.248150] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) LBUG
[3824911.254941] Pid: 22403, comm: mdt00_022 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[3824911.265305] Call Trace:
[3824911.267941]  [<ffffffffc0c9b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[3824911.274687]  [<ffffffffc0c9b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[3824911.281077]  [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
[3824911.287838]  [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
[3824911.295127]  [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[3824911.302240]  [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
[3824911.308908]  [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
[3824911.315306]  [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[3824911.322778]  [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
[3824911.329526]  [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
[3824911.335765]  [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[3824911.342508]  [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
[3824911.348401]  [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[3824911.355537]  [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3824911.363446]  [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3824911.369949]  [<ffffffffb04c2e81>] kthread+0xd1/0xe0
[3824911.375036]  [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
[3824911.381683]  [<ffffffffffffffff>] 0xffffffffffffffff
[3824911.386893] Kernel panic - not syncing: LBUG
[3824911.391339] CPU: 28 PID: 22403 Comm: mdt00_022 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
[3824911.404191] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
[3824911.412016] Call Trace:
[3824911.414649]  [<ffffffffb0b65147>] dump_stack+0x19/0x1b
[3824911.419967]  [<ffffffffb0b5e850>] panic+0xe8/0x21f
[3824911.424938]  [<ffffffffc0c9b8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[3824911.431323]  [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
[3824911.438044]  [<ffffffffc0e9c39b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass]
[3824911.445715]  [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
[3824911.452951]  [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[3824911.460011]  [<ffffffffc1830a7e>] ? lod_xattr_get+0xee/0x700 [lod]
[3824911.466387]  [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
[3824911.473014]  [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
[3824911.479378]  [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[3824911.486780]  [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
[3824911.493499]  [<ffffffffc0eb3fa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
[3824911.500654]  [<ffffffffc0eb4bf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
[3824911.508318]  [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
[3824911.514503]  [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[3824911.521213]  [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
[3824911.527097]  [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[3824911.534180]  [<ffffffffc11bc0b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[3824911.541925]  [<ffffffffc0c9bbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[3824911.549176]  [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[3824911.557038]  [<ffffffffc1183565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[3824911.564004]  [<ffffffffb04cfeb4>] ? __wake_up+0x44/0x50
[3824911.569438]  [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[3824911.575911]  [<ffffffffc118a270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[3824911.583478]  [<ffffffffb04c2e81>] kthread+0xd1/0xe0
[3824911.588528]  [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
[3824911.594796]  [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
[3824911.601407]  [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40

Note: please ignore the following lines in the logs, they are not relevant, it's just a script that tried periodically to access some wrong sysfs files (ie. it is not a backend device error):

mpt3sas_cm0: log_info(0x31200205): originator(PL), code(0x20), sub_code(0x0205)

Shortly before the crash, we can see the following lines in syslog:

[3824620.448341] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x2400576ec:0x149ae:0x0]: rc = -2

[3824640.928979] LustreError: 42016:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xcf72:0x0]: rc = -2
[3824658.778134] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xe2f2:0x0]: rc = -2
[3824678.792561] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xf468:0x0]: rc = -2
[3824696.395767] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x10477:0x0]: rc = -2
[3824714.310806] LustreError: 42123:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x11206:0x0]: rc = -2
[3824730.506605] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x120e7:0x0]: rc = -2

[3824768.104569] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x13f7d:0x0]: rc = -2
[3824768.117096] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message
[3824822.439361] LustreError: 28226:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0001: object [0x240057703:0x16d59:0x0] not found: rc = -2
[3824840.123675] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x17b2e:0x0]: rc = -2

I tried a fid2path on those FIDs from a client, but they cannot be found.

This issue has occurred only once, on May 1. I'm attaching vmcore-dmesg.txt as fir-md1-s2_20200501_vmcore-dmesg.txt.

vmcore:

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Fri May  1 16:46:48 2020
      UPTIME: 44 days, 06:26:43
LOAD AVERAGE: 1.98, 1.74, 1.43
       TASKS: 1919
    NODENAME: fir-md1-s2
     RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
     VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
     MACHINE: x86_64  (1996 Mhz)
      MEMORY: 255.6 GB
       PANIC: "Kernel panic - not syncing: LBUG"
         PID: 22403
     COMMAND: "mdt00_022"
        TASK: ffff8b0d6f65d140  [THREAD_INFO: ffff8afd6cfc4000]
         CPU: 28
       STATE: TASK_RUNNING (PANIC)

I have uploaded this vmcore to WC's FTP server as fir-md1-s2_20200501164658_vmcore
Also attached the output of "foreach bt" as fir-md1-s2_crash_foreach_bt_20200501164658.txt

Let me know if you need anything else that could help in avoiding that in the future. Thanks!

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

fir-md1-s2_20200501_vmcore-dmesg.txt
806 kB
03/May/20 8:40 PM
fir-md1-s2_crash_foreach_bt_20200501164658.txt
873 kB
03/May/20 8:46 PM
fir-md1-s4_vmcore-dmesg2020_08_08_05_13_40.txt
561 kB
08/Aug/20 4:25 PM

Activity

[LU-13511] MDS 2.12.4 ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed

Gerrit Updater added a comment - 29/Oct/20 7:49 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40304/
Subject: ~~LU-13511~~ obdclass: don't initialize obj for zero FID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 44f354e53f42e25a4bfa98d50dcdc4397f06a9e2

Gerrit Updater added a comment - 29/Oct/20 7:49 AM Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40304/ Subject: LU-13511 obdclass: don't initialize obj for zero FID Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 44f354e53f42e25a4bfa98d50dcdc4397f06a9e2

Gerrit Updater added a comment - 19/Oct/20 7:11 PM

Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40304
Subject: ~~LU-13511~~ obdclass: don't initialize obj for zero FID
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 050863780310ae1166fc5c00b083675ac16502c4

Gerrit Updater added a comment - 19/Oct/20 7:11 PM Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40304 Subject: LU-13511 obdclass: don't initialize obj for zero FID Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 050863780310ae1166fc5c00b083675ac16502c4

Peter Jones added a comment - 02/Oct/20 2:50 AM

Landed for 2.14

Peter Jones added a comment - 02/Oct/20 2:50 AM Landed for 2.14

Gerrit Updater added a comment - 02/Oct/20 12:19 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39792/
Subject: ~~LU-13511~~ obdclass: don't initialize obj for zero FID
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 22ea9767956c89aa08ef6d80ad04aaccde647755

Gerrit Updater added a comment - 02/Oct/20 12:19 AM Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39792/ Subject: LU-13511 obdclass: don't initialize obj for zero FID Project: fs/lustre-release Branch: master Current Patch Set: Commit: 22ea9767956c89aa08ef6d80ad04aaccde647755

Stephane Thiell added a comment - 10/Sep/20 10:56 PM

Hi Lai,

I think your patch fixed this problem. We have been running it on a MDS (the one receiving new files, that is, the MDS of the target MDT of lfs migrate -m) for 6 days now while migrations were running and no crash. We're not done yet with our migrations so I will let you know if we notice any issue.

Stephane Thiell added a comment - 10/Sep/20 10:56 PM Hi Lai, I think your patch fixed this problem. We have been running it on a MDS (the one receiving new files, that is, the MDS of the target MDT of lfs migrate -m) for 6 days now while migrations were running and no crash. We're not done yet with our migrations so I will let you know if we notice any issue.

Gerrit Updater added a comment - 02/Sep/20 9:57 AM

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39792
Subject: ~~LU-13511~~ obdclass: don't initialize obj for zero FID
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e3e930b6ac01585e95e4ed561aa9e62c5e792f5c

Gerrit Updater added a comment - 02/Sep/20 9:57 AM Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39792 Subject: LU-13511 obdclass: don't initialize obj for zero FID Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e3e930b6ac01585e95e4ed561aa9e62c5e792f5c

Stephane Thiell added a comment - 02/Sep/20 12:00 AM

Hi Lai,

Do you have any idea on how to avoid this crash? I would be happy to try a patch.

Thanks!

Stephane Thiell added a comment - 02/Sep/20 12:00 AM Hi Lai, Do you have any idea on how to avoid this crash? I would be happy to try a patch. Thanks!

Stephane Thiell added a comment - 08/Aug/20 4:26 PM

Hi Lai,

Argh, this just hit us again with 2.12.5 on a MDS. In fir-md1-s4_vmcore-dmesg2020_08_08_05_13_40.txt, we can see "failed to get lu_attr" errors just before this LBUG. I have two vmcore's if needed (the MDS crashed again just after recovery). I had to shut down all of our lfs migrate -m and the robinhood server that was reading changelogs to be able to start again. A restart of MDT0 was also needed to fully clear all timeouts.

[2639627.935716] Lustre: fir-MDT0003: Client edb25609-39fb-4 (at 10.49.0.63@o2ib1) reconnecting
[2639627.944184] Lustre: fir-MDT0003: Connection restored to  (at 10.49.0.63@o2ib1)
[2663131.800332] LustreError: 68703:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0xc90f:0x0]: rc = -2
[2663136.343377] LustreError: 100694:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1697e:0x0]: rc = -2
[2663136.355987] LustreError: 100694:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message
[2663136.905828] LustreError: 66683:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17604:0x0] not found: rc = -2
[2663137.608393] LustreError: 66810:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1764d:0x0] not found: rc = -2
[2663137.620611] LustreError: 66810:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 3 previous similar messages
[2663138.330280] LustreError: 126913:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1771d:0x0]: rc = -2
[2663138.342891] LustreError: 126913:0:(mdd_object.c:3249:mdd_close()) Skipped 3 previous similar messages
[2663138.705822] LustreError: 67056:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17780:0x0] not found: rc = -2
[2663138.717962] LustreError: 67056:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 6 previous similar messages
[2663140.353575] LustreError: 67074:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x17904:0x0]: rc = -2
[2663140.366103] LustreError: 67074:0:(mdd_object.c:3249:mdd_close()) Skipped 5 previous similar messages
[2663140.950475] LustreError: 66692:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17943:0x0] not found: rc = -2
[2663140.962567] LustreError: 66692:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 12 previous similar messages
[2663144.507278] LustreError: 67061:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x17c2b:0x0]: rc = -2
[2663144.519822] LustreError: 67061:0:(mdd_object.c:3249:mdd_close()) Skipped 7 previous similar messages
[2663145.078058] LustreError: 66755:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17c57:0x0] not found: rc = -2
[2663145.090149] LustreError: 66755:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 18 previous similar messages
[2663152.600233] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x182ec:0x0]: rc = -2
[2663152.612774] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) Skipped 37 previous similar messages
[2663153.134962] LustreError: 66753:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1836f:0x0] not found: rc = -2
[2663153.147065] LustreError: 66753:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 38 previous similar messages
[2663175.740735] LustreError: 67051:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x18bf2:0x0] not found: rc = -2
[2663175.752900] LustreError: 67051:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 63 previous similar messages
[2663176.789503] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x19344:0x0]: rc = -2
[2663176.802031] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) Skipped 50 previous similar messages
[2663207.817889] LustreError: 66659:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1aa2a:0x0] not found: rc = -2
[2663207.829992] LustreError: 66659:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 32 previous similar messages
[2663209.047366] LustreError: 67065:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1a8fc:0x0]: rc = -2
[2663209.059905] LustreError: 67065:0:(mdd_object.c:3249:mdd_close()) Skipped 22 previous similar messages
[2663271.909174] LustreError: 66627:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1db03:0x0] not found: rc = -2
[2663271.921256] LustreError: 66627:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 113 previous similar messages
[2663273.161523] LustreError: 74686:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1dbf1:0x0]: rc = -2
[2663273.174067] LustreError: 74686:0:(mdd_object.c:3249:mdd_close()) Skipped 94 previous similar messages
[2663403.135529] LustreError: 66218:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044522:0x4449:0x0] not found: rc = -2
[2663403.147555] LustreError: 66218:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 378 previous similar messages
[2663403.645800] LustreError: 100695:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044522:0x4bea:0x0]: rc = -2
[2663403.658315] LustreError: 100695:0:(mdd_object.c:3249:mdd_close()) Skipped 323 previous similar messages
[2663455.535626] LustreError: 66692:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0003-mdtlov: invalid FID [0x0:0x0:0x0]
[2663455.545914] LustreError: 66692:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed: 
[2663455.560424] LustreError: 66692:0:(lu_object.c:146:lu_object_put()) LBUG
[2663455.567250] Pid: 66692, comm: mdt02_035 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[2663455.577613] Call Trace:
[2663455.580267]  [<ffffffffc0bc87cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[2663455.587021]  [<ffffffffc0bc887c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[2663455.593439]  [<ffffffffc0d12fd6>] lu_object_put+0x336/0x3e0 [obdclass]
[2663455.600209]  [<ffffffffc0d13096>] lu_object_put_nocache+0x16/0x20 [obdclass]
[2663455.607502]  [<ffffffffc0d1329e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[2663455.614604]  [<ffffffffc0d13b16>] lu_object_find+0x16/0x20 [obdclass]
[2663455.621293]  [<ffffffffc15af2cb>] mdt_object_find+0x4b/0x170 [mdt]
[2663455.627727]  [<ffffffffc15c4c88>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[2663455.635193]  [<ffffffffc15cf1cd>] mdt_reint_migrate+0x8bd/0x11d0 [mdt]
[2663455.641939]  [<ffffffffc15cfb63>] mdt_reint_rec+0x83/0x210 [mdt]
[2663455.648176]  [<ffffffffc15ac273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[2663455.654933]  [<ffffffffc15b78d7>] mdt_reint+0x67/0x140 [mdt]
[2663455.660810]  [<ffffffffc100666a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[2663455.667959]  [<ffffffffc0fa944b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2663455.675866]  [<ffffffffc0facdb4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[2663455.682391]  [<ffffffffb5ec2e81>] kthread+0xd1/0xe0
[2663455.687479]  [<ffffffffb6577c24>] ret_from_fork_nospec_begin+0xe/0x21
[2663455.694143]  [<ffffffffffffffff>] 0xffffffffffffffff
[2663455.699346] Kernel panic - not syncing: LBUG
[2663455.703792] CPU: 22 PID: 66692 Comm: mdt02_035 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
[2663455.716670] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.12.2 11/15/2019
[2663455.724504] Call Trace:
[2663455.727152]  [<ffffffffb6565147>] dump_stack+0x19/0x1b
[2663455.732474]  [<ffffffffb655e850>] panic+0xe8/0x21f
[2663455.737460]  [<ffffffffc0bc88cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[2663455.743849]  [<ffffffffc0d12fd6>] lu_object_put+0x336/0x3e0 [obdclass]
[2663455.750580]  [<ffffffffc0d0f42b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass]
[2663455.758261]  [<ffffffffc0d13096>] lu_object_put_nocache+0x16/0x20 [obdclass]
[2663455.765507]  [<ffffffffc0d1329e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
[2663455.772569]  [<ffffffffc16ccbfe>] ? lod_xattr_get+0xee/0x700 [lod]
[2663455.778947]  [<ffffffffc0d13b16>] lu_object_find+0x16/0x20 [obdclass]
[2663455.785576]  [<ffffffffc15af2cb>] mdt_object_find+0x4b/0x170 [mdt]
[2663455.791945]  [<ffffffffc15c4c88>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
[2663455.799349]  [<ffffffffc15cf1cd>] mdt_reint_migrate+0x8bd/0x11d0 [mdt]
[2663455.806077]  [<ffffffffc0d270a9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
[2663455.813234]  [<ffffffffc0d27cf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
[2663455.820902]  [<ffffffffc15cfb63>] mdt_reint_rec+0x83/0x210 [mdt]
[2663455.827089]  [<ffffffffc15ac273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[2663455.833797]  [<ffffffffc15b78d7>] mdt_reint+0x67/0x140 [mdt]
[2663455.839683]  [<ffffffffc100666a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[2663455.846777]  [<ffffffffc0fde0d1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[2663455.854527]  [<ffffffffc0bc8bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[2663455.861785]  [<ffffffffc0fa944b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[2663455.869656]  [<ffffffffc0fa5575>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[2663455.876621]  [<ffffffffb5ecfeb4>] ? __wake_up+0x44/0x50
[2663455.882061]  [<ffffffffc0facdb4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[2663455.888544]  [<ffffffffc0fac280>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[2663455.896114]  [<ffffffffb5ec2e81>] kthread+0xd1/0xe0
[2663455.901175]  [<ffffffffb5ec2db0>] ? insert_kthread_work+0x40/0x40
[2663455.907450]  [<ffffffffb6577c24>] ret_from_fork_nospec_begin+0xe/0x21
[2663455.914070]  [<ffffffffb5ec2db0>] ? insert_kthread_work+0x40/0x40

Stephane Thiell added a comment - 08/Aug/20 4:26 PM Hi Lai, Argh, this just hit us again with 2.12.5 on a MDS. In fir-md1-s4_vmcore-dmesg2020_08_08_05_13_40.txt , we can see "failed to get lu_attr" errors just before this LBUG. I have two vmcore's if needed (the MDS crashed again just after recovery). I had to shut down all of our lfs migrate -m and the robinhood server that was reading changelogs to be able to start again. A restart of MDT0 was also needed to fully clear all timeouts. [2639627.935716] Lustre: fir-MDT0003: Client edb25609-39fb-4 (at 10.49.0.63@o2ib1) reconnecting [2639627.944184] Lustre: fir-MDT0003: Connection restored to (at 10.49.0.63@o2ib1) [2663131.800332] LustreError: 68703:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0xc90f:0x0]: rc = -2 [2663136.343377] LustreError: 100694:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1697e:0x0]: rc = -2 [2663136.355987] LustreError: 100694:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message [2663136.905828] LustreError: 66683:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17604:0x0] not found: rc = -2 [2663137.608393] LustreError: 66810:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1764d:0x0] not found: rc = -2 [2663137.620611] LustreError: 66810:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 3 previous similar messages [2663138.330280] LustreError: 126913:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1771d:0x0]: rc = -2 [2663138.342891] LustreError: 126913:0:(mdd_object.c:3249:mdd_close()) Skipped 3 previous similar messages [2663138.705822] LustreError: 67056:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17780:0x0] not found: rc = -2 [2663138.717962] LustreError: 67056:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 6 previous similar messages [2663140.353575] LustreError: 67074:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x17904:0x0]: rc = -2 [2663140.366103] LustreError: 67074:0:(mdd_object.c:3249:mdd_close()) Skipped 5 previous similar messages [2663140.950475] LustreError: 66692:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17943:0x0] not found: rc = -2 [2663140.962567] LustreError: 66692:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 12 previous similar messages [2663144.507278] LustreError: 67061:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x17c2b:0x0]: rc = -2 [2663144.519822] LustreError: 67061:0:(mdd_object.c:3249:mdd_close()) Skipped 7 previous similar messages [2663145.078058] LustreError: 66755:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17c57:0x0] not found: rc = -2 [2663145.090149] LustreError: 66755:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 18 previous similar messages [2663152.600233] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x182ec:0x0]: rc = -2 [2663152.612774] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) Skipped 37 previous similar messages [2663153.134962] LustreError: 66753:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1836f:0x0] not found: rc = -2 [2663153.147065] LustreError: 66753:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 38 previous similar messages [2663175.740735] LustreError: 67051:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x18bf2:0x0] not found: rc = -2 [2663175.752900] LustreError: 67051:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 63 previous similar messages [2663176.789503] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x19344:0x0]: rc = -2 [2663176.802031] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) Skipped 50 previous similar messages [2663207.817889] LustreError: 66659:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1aa2a:0x0] not found: rc = -2 [2663207.829992] LustreError: 66659:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 32 previous similar messages [2663209.047366] LustreError: 67065:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1a8fc:0x0]: rc = -2 [2663209.059905] LustreError: 67065:0:(mdd_object.c:3249:mdd_close()) Skipped 22 previous similar messages [2663271.909174] LustreError: 66627:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1db03:0x0] not found: rc = -2 [2663271.921256] LustreError: 66627:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 113 previous similar messages [2663273.161523] LustreError: 74686:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1dbf1:0x0]: rc = -2 [2663273.174067] LustreError: 74686:0:(mdd_object.c:3249:mdd_close()) Skipped 94 previous similar messages [2663403.135529] LustreError: 66218:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044522:0x4449:0x0] not found: rc = -2 [2663403.147555] LustreError: 66218:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 378 previous similar messages [2663403.645800] LustreError: 100695:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044522:0x4bea:0x0]: rc = -2 [2663403.658315] LustreError: 100695:0:(mdd_object.c:3249:mdd_close()) Skipped 323 previous similar messages [2663455.535626] LustreError: 66692:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0003-mdtlov: invalid FID [0x0:0x0:0x0] [2663455.545914] LustreError: 66692:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed: [2663455.560424] LustreError: 66692:0:(lu_object.c:146:lu_object_put()) LBUG [2663455.567250] Pid: 66692, comm: mdt02_035 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019 [2663455.577613] Call Trace: [2663455.580267] [<ffffffffc0bc87cc>] libcfs_call_trace+0x8c/0xc0 [libcfs] [2663455.587021] [<ffffffffc0bc887c>] lbug_with_loc+0x4c/0xa0 [libcfs] [2663455.593439] [<ffffffffc0d12fd6>] lu_object_put+0x336/0x3e0 [obdclass] [2663455.600209] [<ffffffffc0d13096>] lu_object_put_nocache+0x16/0x20 [obdclass] [2663455.607502] [<ffffffffc0d1329e>] lu_object_find_at+0x1fe/0xa60 [obdclass] [2663455.614604] [<ffffffffc0d13b16>] lu_object_find+0x16/0x20 [obdclass] [2663455.621293] [<ffffffffc15af2cb>] mdt_object_find+0x4b/0x170 [mdt] [2663455.627727] [<ffffffffc15c4c88>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt] [2663455.635193] [<ffffffffc15cf1cd>] mdt_reint_migrate+0x8bd/0x11d0 [mdt] [2663455.641939] [<ffffffffc15cfb63>] mdt_reint_rec+0x83/0x210 [mdt] [2663455.648176] [<ffffffffc15ac273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] [2663455.654933] [<ffffffffc15b78d7>] mdt_reint+0x67/0x140 [mdt] [2663455.660810] [<ffffffffc100666a>] tgt_request_handle+0xada/0x1570 [ptlrpc] [2663455.667959] [<ffffffffc0fa944b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [2663455.675866] [<ffffffffc0facdb4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [2663455.682391] [<ffffffffb5ec2e81>] kthread+0xd1/0xe0 [2663455.687479] [<ffffffffb6577c24>] ret_from_fork_nospec_begin+0xe/0x21 [2663455.694143] [<ffffffffffffffff>] 0xffffffffffffffff [2663455.699346] Kernel panic - not syncing: LBUG [2663455.703792] CPU: 22 PID: 66692 Comm: mdt02_035 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 [2663455.716670] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.12.2 11/15/2019 [2663455.724504] Call Trace: [2663455.727152] [<ffffffffb6565147>] dump_stack+0x19/0x1b [2663455.732474] [<ffffffffb655e850>] panic+0xe8/0x21f [2663455.737460] [<ffffffffc0bc88cb>] lbug_with_loc+0x9b/0xa0 [libcfs] [2663455.743849] [<ffffffffc0d12fd6>] lu_object_put+0x336/0x3e0 [obdclass] [2663455.750580] [<ffffffffc0d0f42b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass] [2663455.758261] [<ffffffffc0d13096>] lu_object_put_nocache+0x16/0x20 [obdclass] [2663455.765507] [<ffffffffc0d1329e>] lu_object_find_at+0x1fe/0xa60 [obdclass] [2663455.772569] [<ffffffffc16ccbfe>] ? lod_xattr_get+0xee/0x700 [lod] [2663455.778947] [<ffffffffc0d13b16>] lu_object_find+0x16/0x20 [obdclass] [2663455.785576] [<ffffffffc15af2cb>] mdt_object_find+0x4b/0x170 [mdt] [2663455.791945] [<ffffffffc15c4c88>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt] [2663455.799349] [<ffffffffc15cf1cd>] mdt_reint_migrate+0x8bd/0x11d0 [mdt] [2663455.806077] [<ffffffffc0d270a9>] ? check_unlink_entry+0x19/0xd0 [obdclass] [2663455.813234] [<ffffffffc0d27cf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass] [2663455.820902] [<ffffffffc15cfb63>] mdt_reint_rec+0x83/0x210 [mdt] [2663455.827089] [<ffffffffc15ac273>] mdt_reint_internal+0x6e3/0xaf0 [mdt] [2663455.833797] [<ffffffffc15b78d7>] mdt_reint+0x67/0x140 [mdt] [2663455.839683] [<ffffffffc100666a>] tgt_request_handle+0xada/0x1570 [ptlrpc] [2663455.846777] [<ffffffffc0fde0d1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc] [2663455.854527] [<ffffffffc0bc8bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs] [2663455.861785] [<ffffffffc0fa944b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc] [2663455.869656] [<ffffffffc0fa5575>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc] [2663455.876621] [<ffffffffb5ecfeb4>] ? __wake_up+0x44/0x50 [2663455.882061] [<ffffffffc0facdb4>] ptlrpc_main+0xb34/0x1470 [ptlrpc] [2663455.888544] [<ffffffffc0fac280>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc] [2663455.896114] [<ffffffffb5ec2e81>] kthread+0xd1/0xe0 [2663455.901175] [<ffffffffb5ec2db0>] ? insert_kthread_work+0x40/0x40 [2663455.907450] [<ffffffffb6577c24>] ret_from_fork_nospec_begin+0xe/0x21 [2663455.914070] [<ffffffffb5ec2db0>] ? insert_kthread_work+0x40/0x40

Lai Siyao added a comment - 09/May/20 4:28 AM

This is due to a striped directory layout is broken, and some stripe FID is [0:0:0], such FID is used internally, and should not be used normally, I'll add a patch to fix this.

Lai Siyao added a comment - 09/May/20 4:28 AM This is due to a striped directory layout is broken, and some stripe FID is [0:0:0] , such FID is used internally, and should not be used normally, I'll add a patch to fix this.

Peter Jones added a comment - 04/May/20 5:31 PM

Hongchao

Could you please investigate?

Thanks

Peter

Peter Jones added a comment - 04/May/20 5:31 PM Hongchao Could you please investigate? Thanks Peter

People

Assignee:: Lai Siyao

Reporter:: Stephane Thiell

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 03/May/20 8:49 PM

Updated:: 29/Oct/20 3:10 PM

Resolved:: 02/Oct/20 2:50 AM