Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-13511

MDS 2.12.4 ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed

    XMLWordPrintable

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Lustre 2.12.4
    • Lustre 2.14.0, Lustre 2.12.6
    • None
    • CentOS 7.6 3.10.0-957.27.2.el7_lustre.pl2.x86_64
    • 3
    • 9223372036854775807

    Description

      We have been running lfs migrate -m 1 on a client for several days now to free up inodes on a MDT, but when trying to launch multiple lfs migrate -m 1 (like more than 4) on different directory trees, at the same time, on a (single) client, we ended up crashing the MDS of fir-MDT0001 with the following assertion:

      [Fri May  1 16:46:48 2020][3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]^M
      [Fri May  1 16:46:48 2020][3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed: ^M
      

      backtrace:

      [3824898.399369] Lustre: fir-MDT0001: Connection restored to 7862f6c9-0098-4 (at 10.50.8.41@o2ib2)
      [3824911.223375] LustreError: 22403:0:(lod_dev.c:132:lod_fld_lookup()) fir-MDT0001-mdtlov: invalid FID [0x0:0x0:0x0]
      [3824911.233641] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) ASSERTION( top->loh_hash.next == ((void *)0) && top->loh_hash.pprev == ((void *)0) ) failed:
      [3824911.248150] LustreError: 22403:0:(lu_object.c:146:lu_object_put()) LBUG
      [3824911.254941] Pid: 22403, comm: mdt00_022 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [3824911.265305] Call Trace:
      [3824911.267941]  [<ffffffffc0c9b7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [3824911.274687]  [<ffffffffc0c9b87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [3824911.281077]  [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
      [3824911.287838]  [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
      [3824911.295127]  [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
      [3824911.302240]  [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
      [3824911.308908]  [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
      [3824911.315306]  [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
      [3824911.322778]  [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
      [3824911.329526]  [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
      [3824911.335765]  [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [3824911.342508]  [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
      [3824911.348401]  [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [3824911.355537]  [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [3824911.363446]  [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [3824911.369949]  [<ffffffffb04c2e81>] kthread+0xd1/0xe0
      [3824911.375036]  [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [3824911.381683]  [<ffffffffffffffff>] 0xffffffffffffffff
      [3824911.386893] Kernel panic - not syncing: LBUG
      [3824911.391339] CPU: 28 PID: 22403 Comm: mdt00_022 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
      [3824911.404191] Hardware name: Dell Inc. PowerEdge R6415/065PKD, BIOS 1.10.6 08/15/2019
      [3824911.412016] Call Trace:
      [3824911.414649]  [<ffffffffb0b65147>] dump_stack+0x19/0x1b
      [3824911.419967]  [<ffffffffb0b5e850>] panic+0xe8/0x21f
      [3824911.424938]  [<ffffffffc0c9b8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [3824911.431323]  [<ffffffffc0e9ff66>] lu_object_put+0x336/0x3e0 [obdclass]
      [3824911.438044]  [<ffffffffc0e9c39b>] ? lu_object_start.isra.35+0x8b/0x120 [obdclass]
      [3824911.445715]  [<ffffffffc0ea0026>] lu_object_put_nocache+0x16/0x20 [obdclass]
      [3824911.452951]  [<ffffffffc0ea022e>] lu_object_find_at+0x1fe/0xa60 [obdclass]
      [3824911.460011]  [<ffffffffc1830a7e>] ? lod_xattr_get+0xee/0x700 [lod]
      [3824911.466387]  [<ffffffffc0ea0aa6>] lu_object_find+0x16/0x20 [obdclass]
      [3824911.473014]  [<ffffffffc17130db>] mdt_object_find+0x4b/0x170 [mdt]
      [3824911.479378]  [<ffffffffc1728ab8>] mdt_migrate_lookup.isra.40+0x158/0xa60 [mdt]
      [3824911.486780]  [<ffffffffc1732eba>] mdt_reint_migrate+0x8ea/0x1310 [mdt]
      [3824911.493499]  [<ffffffffc0eb3fa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
      [3824911.500654]  [<ffffffffc0eb4bf8>] ? upcall_cache_get_entry+0x218/0x8b0 [obdclass]
      [3824911.508318]  [<ffffffffc1733963>] mdt_reint_rec+0x83/0x210 [mdt]
      [3824911.514503]  [<ffffffffc1710273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [3824911.521213]  [<ffffffffc171b6e7>] mdt_reint+0x67/0x140 [mdt]
      [3824911.527097]  [<ffffffffc11e464a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [3824911.534180]  [<ffffffffc11bc0b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [3824911.541925]  [<ffffffffc0c9bbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [3824911.549176]  [<ffffffffc118743b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [3824911.557038]  [<ffffffffc1183565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [3824911.564004]  [<ffffffffb04cfeb4>] ? __wake_up+0x44/0x50
      [3824911.569438]  [<ffffffffc118ada4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [3824911.575911]  [<ffffffffc118a270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [3824911.583478]  [<ffffffffb04c2e81>] kthread+0xd1/0xe0
      [3824911.588528]  [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
      [3824911.594796]  [<ffffffffb0b77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [3824911.601407]  [<ffffffffb04c2db0>] ? insert_kthread_work+0x40/0x40
      

      Note: please ignore the following lines in the logs, they are not relevant, it's just a script that tried periodically to access some wrong sysfs files (ie. it is not a backend device error):

      mpt3sas_cm0: log_info(0x31200205): originator(PL), code(0x20), sub_code(0x0205)
      

       

      Shortly before the crash, we can see the following lines in syslog:

      [3824620.448341] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x2400576ec:0x149ae:0x0]: rc = -2
      
      [3824640.928979] LustreError: 42016:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xcf72:0x0]: rc = -2
      [3824658.778134] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xe2f2:0x0]: rc = -2
      [3824678.792561] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0xf468:0x0]: rc = -2
      [3824696.395767] LustreError: 22546:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x10477:0x0]: rc = -2
      [3824714.310806] LustreError: 42123:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x11206:0x0]: rc = -2
      [3824730.506605] LustreError: 42121:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x120e7:0x0]: rc = -2
      
      [3824768.104569] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x13f7d:0x0]: rc = -2
      [3824768.117096] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) Skipped 1 previous similar message
      [3824822.439361] LustreError: 28226:0:(mdd_object.c:400:mdd_xattr_get()) fir-MDD0001: object [0x240057703:0x16d59:0x0] not found: rc = -2
      [3824840.123675] LustreError: 22344:0:(mdd_object.c:3249:mdd_close()) fir-MDD0001: failed to get lu_attr of [0x240057703:0x17b2e:0x0]: rc = -2
      

      I tried a fid2path on those FIDs from a client, but they cannot be found.

      This issue has occurred only once, on May 1. I'm attaching vmcore-dmesg.txt as fir-md1-s2_20200501_vmcore-dmesg.txt.

      vmcore:

            KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.27.2.el7_lustre.pl2.x86_64/vmlinux
          DUMPFILE: vmcore  [PARTIAL DUMP]
              CPUS: 48
              DATE: Fri May  1 16:46:48 2020
            UPTIME: 44 days, 06:26:43
      LOAD AVERAGE: 1.98, 1.74, 1.43
             TASKS: 1919
          NODENAME: fir-md1-s2
           RELEASE: 3.10.0-957.27.2.el7_lustre.pl2.x86_64
           VERSION: #1 SMP Thu Nov 7 15:26:16 PST 2019
           MACHINE: x86_64  (1996 Mhz)
            MEMORY: 255.6 GB
             PANIC: "Kernel panic - not syncing: LBUG"
               PID: 22403
           COMMAND: "mdt00_022"
              TASK: ffff8b0d6f65d140  [THREAD_INFO: ffff8afd6cfc4000]
               CPU: 28
             STATE: TASK_RUNNING (PANIC)
      

      I have uploaded this vmcore to WC's FTP server as fir-md1-s2_20200501164658_vmcore
      Also attached the output of "foreach bt" as fir-md1-s2_crash_foreach_bt_20200501164658.txt

      Let me know if you need anything else that could help in avoiding that in the future. Thanks!

      Attachments

        Activity

          People

            laisiyao Lai Siyao
            sthiell Stephane Thiell
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: