  Lustre / LU-13511

MDS 2.12.4 ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Environment: CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64
    • Severity: 3

    Description

      We have been running lfs migrate -m 1 on a client for several days to free up inodes on an MDT. However, when we launched multiple lfs migrate -m 1 instances (more than 4) on different directory trees at the same time, from a single client, we crashed the MDS of fir-MDT0001 with the following assertion:

      [Fri May  1 16:46:48 2020][3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
      [Fri May  1 16:46:48 2020][3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
      

      backtrace:

      [3824898.399369] Lustre: fir-MDT0001: Connection restored to 7862f6c9-0098-4 (at 10.50.8.41@o2ib2)
      [3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
      [3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
      [3824911.248150] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) LBUG
      [3824911.254941] Pid: 22403, comm: mdt00_022 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [3824911.265305] Call Trace:
      [3824911.267941]  [<ffffffffc0c9b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [3824911.274687]  [<ffffffffc0c9b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [3824911.281077]  [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
      [3824911.287838]  [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
      [3824911.295127]  [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
      [3824911.302240]  [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
      [3824911.308908]  [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
      [3824911.315306]  [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
      [3824911.322778]  [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
      [3824911.329526]  [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
      [3824911.335765]  [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [3824911.342508]  [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
      [3824911.348401]  [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [3824911.355537]  [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [3824911.363446]  [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [3824911.369949]  [<ffffffffb04c2e81>] kthread+0xd1/0xe0
      [3824911.375036]  [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [3824911.381683]  [<ffffffffffffffff>] 0xffffffffffffffff
      [3824911.386893] Kernel panic - not syncing: LBUG
      [3824911.391339] CPU: 28 PID: 22403 Comm: mdt00_022 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
      [3824911.404191] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
      [3824911.412016] Call Trace:
      [3824911.414649]  [<ffffffffb0b65147>] dump_stack+0x19/0x1b
      [3824911.419967]  [<ffffffffb0b5e850>] panic+0xe8/0x21f
      [3824911.424938]  [<ffffffffc0c9b8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [3824911.431323]  [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
      [3824911.438044]  [<ffffffffc0e9c39b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [3824911.445715]  [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
      [3824911.452951]  [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
      [3824911.460011]  [<ffffffffc1830a7e>] ? lod_xattr_get+0xee/0x700 [lod]
      [3824911.466387]  [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
      [3824911.473014]  [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
      [3824911.479378]  [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
      [3824911.486780]  [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
      [3824911.493499]  [<ffffffffc0eb3fa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
      [3824911.500654]  [<ffffffffc0eb4bf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
      [3824911.508318]  [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
      [3824911.514503]  [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [3824911.521213]  [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
      [3824911.527097]  [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [3824911.534180]  [<ffffffffc11bc0b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [3824911.541925]  [<ffffffffc0c9bbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [3824911.549176]  [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [3824911.557038]  [<ffffffffc1183565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [3824911.564004]  [<ffffffffb04cfeb4>] ? __wake_up+0x44/0x50
      [3824911.569438]  [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [3824911.575911]  [<ffffffffc118a270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [3824911.583478]  [<ffffffffb04c2e81>] kthread+0xd1/0xe0
      [3824911.588528]  [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
      [3824911.594796]  [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [3824911.601407]  [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
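      For context, the LASSERT that fired in lu_object_put() verifies that the object header's hash-list node (loh_hash) is fully unlinked, i.e. both its next and pprev pointers are NULL, before the object is released; from the trace above, lu_object_put_nocache() is releasing an object whose setup failed partway through lu_object_find_at(), and that invariant did not hold. A minimal user-space sketch of the check itself, with a struct mirroring the kernel's hlist_node (illustrative only, not Lustre source):

```c
#include <assert.h>
#include <stddef.h>

/* Mirror of the kernel's struct hlist_node, the type embedded in
 * lu_object_header::loh_hash. */
struct hlist_node {
    struct hlist_node *next;    /* next entry on the hash chain */
    struct hlist_node **pprev;  /* address of the previous ->next pointer */
};

/* The failed assertion is this test: a node may only be freed when it
 * sits on no hash chain at all (both linkage pointers cleared). */
static int node_fully_unlinked(const struct hlist_node *n)
{
    return n->next == NULL && n->pprev == NULL;
}
```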
      

      Note: please ignore the following lines in the logs; they are not relevant. They come from a script that periodically tried to access some wrong sysfs files (i.e. this is not a backend device error):

      mpt3sas_cm0: log_info(0x31200205): originator(PL), code(0x20), sub_code(0x0205)
      

       

      Shortly before the crash, we can see the following lines in syslog:

      [3824620.448341] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x2400576ec:0x149ae:0x0]: rc = -2
      
      [3824640.928979] LustreError: 42016:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xcf72:0x0]: rc = -2
      [3824658.778134] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xe2f2:0x0]: rc = -2
      [3824678.792561] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xf468:0x0]: rc = -2
      [3824696.395767] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x10477:0x0]: rc = -2
      [3824714.310806] LustreError: 42123:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x11206:0x0]: rc = -2
      [3824730.506605] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x120e7:0x0]: rc = -2
      
      [3824768.104569] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x13f7d:0x0]: rc = -2
      [3824768.117096] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message
      [3824822.439361] LustreError: 28226:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0001: object [0x240057703:0x16d59:0x0] not found: rc = -2
      [3824840.123675] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x17b2e:0x0]: rc = -2
      

      I tried fid2path on those FIDs from a client, but they cannot be found.
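      For reference, the bracketed triples in those messages (e.g. [0x240057703:0xcf72:0x0]) are Lustre FIDs printed as [sequence:object-id:version] in hex, and rc = -2 is -ENOENT, consistent with fid2path not finding the objects. A small standalone sketch of parsing that textual form (fid_parse is a hypothetical helper for illustration, not part of the Lustre API):

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified mirror of Lustre's struct lu_fid. */
struct lu_fid {
    uint64_t f_seq;  /* sequence number (allocated per target) */
    uint32_t f_oid;  /* object id within the sequence */
    uint32_t f_ver;  /* version, normally 0 */
};

/* Parse the log form "[0x240057703:0xcf72:0x0]".
 * Returns 0 on success, -1 if the string is malformed. */
static int fid_parse(const char *s, struct lu_fid *fid)
{
    unsigned long long seq;
    unsigned int oid, ver;

    /* %llx/%x accept the 0x prefix that Lustre prints. */
    if (sscanf(s, "[%llx:%x:%x]", &seq, &oid, &ver) != 3)
        return -1;
    fid->f_seq = seq;
    fid->f_oid = oid;
    fid->f_ver = ver;
    return 0;
}
```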

      This issue has occurred only once, on May 1. I'm attaching vmcore-dmesg.txt as fir-md1-s2_20200501_vmcore-dmesg.txt.

      vmcore:

            KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 48
              DATE: Fri May  1 16:46:48 2020
            UPTIME: 44 days, 06:26:43
      LOAD AVERAGE: 1.98, 1.74, 1.43
             TASKS: 1919
          NODENAME: fir-md1-s2
           RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
           VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
           MACHINE: x86_64  (1996 Mhz)
            MEMORY: 255.6 GB
             PANIC: "Kernel panic - not syncing: LBUG"
               PID: 22403
           COMMAND: "mdt00_022"
              TASK: ffff8b0d6f65d140  [THREAD_INFO: ffff8afd6cfc4000]
               CPU: 28
             STATE: TASK_RUNNING (PANIC)
      

      I have uploaded this vmcore to WC's FTP server as fir-md1-s2_20200501164658_vmcore
      Also attached the output of "foreach bt" as fir-md1-s2_crash_foreach_bt_20200501164658.txt

      Let me know if you need anything else that could help avoid this in the future. Thanks!

      Attachments

        Activity

          [LU-13511] MDS 2.12.4 ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40304/
          Subject: LU-13511 obdclass: don't initialize obj for zero FID
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set:
          Commit: 44f354e53f42e25a4bfa98d50dcdc4397f06a9e2

          gerrit Gerrit Updater added a comment -

          Jian Yu (yujian@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40304
          Subject: LU-13511 obdclass: don't initialize obj for zero FID
          Project: fs/lustre-release
          Branch: b2_12
          Current Patch Set: 1
          Commit: 050863780310ae1166fc5c00b083675ac16502c4
          pjones Peter Jones added a comment -

          Landed for 2.14

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39792/
          Subject: LU-13511 obdclass: don't initialize obj for zero FID
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 22ea9767956c89aa08ef6d80ad04aaccde647755

          sthiell Stephane Thiell added a comment -

          Hi Lai,

          I think your patch fixed this problem. We have been running it for six days on the MDS receiving new files (that is, the MDS hosting the target MDT of lfs migrate -m) while migrations were running, and there has been no crash. We're not done with our migrations yet, so I will let you know if we notice any issue.

          gerrit Gerrit Updater added a comment -

          Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39792
          Subject: LU-13511 obdclass: don't initialize obj for zero FID
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: e3e930b6ac01585e95e4ed561aa9e62c5e792f5c

          sthiell Stephane Thiell added a comment -

          Hi Lai,

          Do you have any idea how to avoid this crash? I would be happy to try a patch.

          Thanks!

          sthiell Stephane Thiell added a comment -

          Hi Lai,

          Argh, this just hit us again with 2.12.5 on an MDS. In fir-md1-s4_vmcore-dmesg2020_08_08_05_13_40.txt, we can see "failed to get lu_attr" errors just before this LBUG. I have two vmcores if needed (the MDS crashed again just after recovery). I had to shut down all of our lfs migrate -m runs and the robinhood server that was reading changelogs to be able to start again. A restart of MDT0 was also needed to fully clear all timeouts.

          [2639627.935716] Lustre: fir-MDT0003: Client edb25609-39fb-4 (at 10.49.0.63@o2ib1) reconnecting
          [2639627.944184] Lustre: fir-MDT0003: Connection restored to  (at 10.49.0.63@o2ib1)
          [2663131.800332] LustreError: 68703:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0xc90f:0x0]: rc = -2
          [2663136.343377] LustreError: 100694:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1697e:0x0]: rc = -2
          [2663136.355987] LustreError: 100694:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message
          [2663136.905828] LustreError: 66683:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17604:0x0] not found: rc = -2
          [2663137.608393] LustreError: 66810:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1764d:0x0] not found: rc = -2
          [2663137.620611] LustreError: 66810:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 3 previous similar messages
          [2663138.330280] LustreError: 126913:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1771d:0x0]: rc = -2
          [2663138.342891] LustreError: 126913:0:(mdd_object.c:3249:mdd_close()) Skipped 3 previous similar messages
          [2663138.705822] LustreError: 67056:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17780:0x0] not found: rc = -2
          [2663138.717962] LustreError: 67056:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 6 previous similar messages
          [2663140.353575] LustreError: 67074:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x17904:0x0]: rc = -2
          [2663140.366103] LustreError: 67074:0:(mdd_object.c:3249:mdd_close()) Skipped 5 previous similar messages
          [2663140.950475] LustreError: 66692:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17943:0x0] not found: rc = -2
          [2663140.962567] LustreError: 66692:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 12 previous similar messages
          [2663144.507278] LustreError: 67061:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x17c2b:0x0]: rc = -2
          [2663144.519822] LustreError: 67061:0:(mdd_object.c:3249:mdd_close()) Skipped 7 previous similar messages
          [2663145.078058] LustreError: 66755:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x17c57:0x0] not found: rc = -2
          [2663145.090149] LustreError: 66755:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 18 previous similar messages
          [2663152.600233] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x182ec:0x0]: rc = -2
          [2663152.612774] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) Skipped 37 previous similar messages
          [2663153.134962] LustreError: 66753:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1836f:0x0] not found: rc = -2
          [2663153.147065] LustreError: 66753:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 38 previous similar messages
          [2663175.740735] LustreError: 67051:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x18bf2:0x0] not found: rc = -2
          [2663175.752900] LustreError: 67051:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 63 previous similar messages
          [2663176.789503] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x19344:0x0]: rc = -2
          [2663176.802031] LustreError: 66965:0:(mdd_object.c:3249:mdd_close()) Skipped 50 previous similar messages
          [2663207.817889] LustreError: 66659:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1aa2a:0x0] not found: rc = -2
          [2663207.829992] LustreError: 66659:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 32 previous similar messages
          [2663209.047366] LustreError: 67065:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1a8fc:0x0]: rc = -2
          [2663209.059905] LustreError: 67065:0:(mdd_object.c:3249:mdd_close()) Skipped 22 previous similar messages
          [2663271.909174] LustreError: 66627:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044520:0x1db03:0x0] not found: rc = -2
          [2663271.921256] LustreError: 66627:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 113 previous similar messages
          [2663273.161523] LustreError: 74686:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044520:0x1dbf1:0x0]: rc = -2
          [2663273.174067] LustreError: 74686:0:(mdd_object.c:3249:mdd_close()) Skipped 94 previous similar messages
          [2663403.135529] LustreError: 66218:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0003: object [0x280044522:0x4449:0x0] not found: rc = -2
          [2663403.147555] LustreError: 66218:0:(mdd_object.c:400:mdd_xattr_get()) Skipped 378 previous similar messages
          [2663403.645800] LustreError: 100695:0:(mdd_object.c:3249:mdd_close()) fir-MDD0003: failed to get lu_attr of [0x280044522:0x4bea:0x0]: rc = -2
          [2663403.658315] LustreError: 100695:0:(mdd_object.c:3249:mdd_close()) Skipped 323 previous similar messages
          [2663455.535626] LustreError: 66692:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0003-mdtlov: invalid FID [0x0:0x0:0x0]
          [2663455.545914] LustreError: 66692:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed: 
          [2663455.560424] LustreError: 66692:0:(lu_object.c:146:lu_object_put()) LBUG
          [2663455.567250] Pid: 66692, comm: mdt02_035 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
          [2663455.577613] Call Trace:
          [2663455.580267]  [<ffffffffc0bc87cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
          [2663455.587021]  [<ffffffffc0bc887c>] lbug_with_loc+0x4c/0xa0 [libcfs]
          [2663455.593439]  [<ffffffffc0d12fd6>] lu_object_put+0x336/0x3e0 [obdclass]
          [2663455.600209]  [<ffffffffc0d13096>] lu_object_put_nocache+0x16/0x20 [obdclass]
          [2663455.607502]  [<ffffffffc0d1329e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
          [2663455.614604]  [<ffffffffc0d13b16>] lu_object_find+0x16/0x20 [obdclass]
          [2663455.621293]  [<ffffffffc15af2cb>] mdt_object_find+0x4b/0x170 [mdt]
          [2663455.627727]  [<ffffffffc15c4c88>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
          [2663455.635193]  [<ffffffffc15cf1cd>] mdt_reint_migrate+0x8bd/0x11d0 [mdt]
          [2663455.641939]  [<ffffffffc15cfb63>] mdt_reint_rec+0x83/0x210 [mdt]
          [2663455.648176]  [<ffffffffc15ac273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
          [2663455.654933]  [<ffffffffc15b78d7>] mdt_reint+0x67/0x140 [mdt]
          [2663455.660810]  [<ffffffffc100666a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
          [2663455.667959]  [<ffffffffc0fa944b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
          [2663455.675866]  [<ffffffffc0facdb4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
          [2663455.682391]  [<ffffffffb5ec2e81>] kthread+0xd1/0xe0
          [2663455.687479]  [<ffffffffb6577c24>] ret_from_fork_nospec_begin+0xe/0x21
          [2663455.694143]  [<ffffffffffffffff>] 0xffffffffffffffff
          [2663455.699346] Kernel panic - not syncing: LBUG
          [2663455.703792] CPU: 22 PID: 66692 Comm: mdt02_035 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
          [2663455.716670] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.12.2 11/15/2019
          [2663455.724504] Call Trace:
          [2663455.727152]  [<ffffffffb6565147>] dump_stack+0x19/0x1b
          [2663455.732474]  [<ffffffffb655e850>] panic+0xe8/0x21f
          [2663455.737460]  [<ffffffffc0bc88cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
          [2663455.743849]  [<ffffffffc0d12fd6>] lu_object_put+0x336/0x3e0 [obdclass]
          [2663455.750580]  [<ffffffffc0d0f42b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass]
          [2663455.758261]  [<ffffffffc0d13096>] lu_object_put_nocache+0x16/0x20 [obdclass]
          [2663455.765507]  [<ffffffffc0d1329e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
          [2663455.772569]  [<ffffffffc16ccbfe>] ? lod_xattr_get+0xee/0x700 [lod]
          [2663455.778947]  [<ffffffffc0d13b16>] lu_object_find+0x16/0x20 [obdclass]
          [2663455.785576]  [<ffffffffc15af2cb>] mdt_object_find+0x4b/0x170 [mdt]
          [2663455.791945]  [<ffffffffc15c4c88>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
          [2663455.799349]  [<ffffffffc15cf1cd>] mdt_reint_migrate+0x8bd/0x11d0 [mdt]
          [2663455.806077]  [<ffffffffc0d270a9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
          [2663455.813234]  [<ffffffffc0d27cf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
          [2663455.820902]  [<ffffffffc15cfb63>] mdt_reint_rec+0x83/0x210 [mdt]
          [2663455.827089]  [<ffffffffc15ac273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
          [2663455.833797]  [<ffffffffc15b78d7>] mdt_reint+0x67/0x140 [mdt]
          [2663455.839683]  [<ffffffffc100666a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
          [2663455.846777]  [<ffffffffc0fde0d1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
          [2663455.854527]  [<ffffffffc0bc8bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
          [2663455.861785]  [<ffffffffc0fa944b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
          [2663455.869656]  [<ffffffffc0fa5575>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
          [2663455.876621]  [<ffffffffb5ecfeb4>] ? __wake_up+0x44/0x50
          [2663455.882061]  [<ffffffffc0facdb4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
          [2663455.888544]  [<ffffffffc0fac280>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
          [2663455.896114]  [<ffffffffb5ec2e81>] kthread+0xd1/0xe0
          [2663455.901175]  [<ffffffffb5ec2db0>] ? insert_kthread_work+0x40/0x40
          [2663455.907450]  [<ffffffffb6577c24>] ret_from_fork_nospec_begin+0xe/0x21
          [2663455.914070]  [<ffffffffb5ec2db0>] ? insert_kthread_work+0x40/0x40
          

           

          laisiyao Lai Siyao added a comment -

          This is because a striped directory layout is broken and some stripe FID is [0x0:0x0:0x0]. Such a FID is reserved for internal use and should never appear in a normal lookup. I'll add a patch to fix this.
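          The idea of the resulting patch ("obdclass: don't initialize obj for zero FID") can be sketched as a guard that rejects the reserved zero FID before any object setup begins, so the lookup fails cleanly (here with -EINVAL) instead of tearing down a half-built object and tripping the LASSERT. A user-space illustration of that idea (fid_is_zero_sketch and object_find_checked are invented names for this sketch, not the actual change; see the gerrit links above for the real patch):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified mirror of Lustre's struct lu_fid. */
struct lu_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* [0x0:0x0:0x0] is reserved for internal use and must never reach
 * the object lookup path. */
static bool fid_is_zero_sketch(const struct lu_fid *fid)
{
    return fid->f_seq == 0 && fid->f_oid == 0;
}

/* Hypothetical wrapper: bail out before any object state is set up,
 * rather than initializing an object that lod_fld_lookup() will
 * reject and that lu_object_put() will then LBUG on. */
static int object_find_checked(const struct lu_fid *fid)
{
    if (fid_is_zero_sketch(fid)) {
        fprintf(stderr, "invalid FID [0x%llx:0x%x:0x%x]\n",
                (unsigned long long)fid->f_seq, fid->f_oid, fid->f_ver);
        return -22;  /* -EINVAL */
    }
    return 0;  /* would proceed to the real lu_object_find() here */
}
```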

          pjones Peter Jones added a comment -

          Hongchao

          Could you please investigate?

          Thanks

          Peter


          People

            Assignee: laisiyao Lai Siyao
            Reporter: sthiell Stephane Thiell
            Votes: 0
            Watchers: 5
