mdt_reconstruct_open() ASSERTION failure

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 7028

    Description

      The transno shouldn't be cleared for an EREMOTE operation.
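
The condition in the LBUG message quoted in the activity log, `(!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0))`, says that a failed open must carry a transno of zero. A minimal sketch of that check, and of one possible relaxation for EREMOTE, might look like the following. This is not the landed patch: `struct reply_model`, `strict_invariant()` and `relaxed_invariant()` are hypothetical names used only for illustration, and the relaxed form is an assumption about what keeping the transno for EREMOTE would require of the assertion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the reply state; not a Lustre structure. */
struct reply_model {
    int      rc;      /* result of the original open, negative errno on failure */
    uint64_t transno; /* transaction number stored in the reply, 0 if none */
};

/* The invariant as printed in the LBUG message:
 * if the open failed (rc < 0), the reply transno must be 0. */
static bool strict_invariant(const struct reply_model *r)
{
    return !(r->rc < 0) || r->transno == 0;
}

/* Assumed relaxation (illustration only): a reply with rc == -EREMOTE
 * is allowed to keep a non-zero transno; all other failures are not. */
static bool relaxed_invariant(const struct reply_model *r)
{
    return !(r->rc < 0) || r->rc == -EREMOTE || r->transno == 0;
}
```

Under this model, a reply with `rc == -EREMOTE` and a non-zero transno fails the strict check but passes the relaxed one, while any other failed open with a non-zero transno fails both.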

      Attachments

        Activity

          [LU-2927] mdt_reconstruct_open() ASSERTION failure

          nedbass Ned Bass (Inactive) added a comment - Andreas, we hit this last night on a production MDS still running 2.3.63. In case you're still interested I'm attaching the console log and lustre debug log. Not sure what the client load was like at the time. I see a lot of changelog and fid2path activity which is probably from a RobinHood scan.
          pjones Peter Jones added a comment - Landed for 2.4

          adilger Andreas Dilger added a comment - Andriy, any information on how this bug was triggered? Was it under testing, or some user load? MDS recovery, network errors, etc?
          di.wang Di Wang added a comment - http://review.whamcloud.com/#change,5694

          keith Keith Mannthey (Inactive) added a comment - How do you know the assertion was wrong?

          Original call trace:

          Jan 21 09:34:50 snx11026n003 kernel: [999566.276882] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
          Jan 21 09:34:50 snx11026n003 kernel: [999566.293035] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) LBUG
          Jan 21 09:34:50 snx11026n003 kernel: [999566.301506] Pid: 129089, comm: mdt_500
          Jan 21 09:34:50 snx11026n003 kernel: [999566.305899]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.305900] Call Trace:
          Jan 21 09:34:50 snx11026n003 kernel: [999566.310714] [<ffffffffa0498825>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.318711] [<ffffffffa0498e37>] lbug_with_loc+0x47/0xb0 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.325867] [<ffffffffa0c9de47>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.333590] [<ffffffffa0c8f7c5>] mdt_reconstruct+0x45/0x120 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.340724] [<ffffffffa0c7bd59>] mdt_reint_internal+0x709/0x8f0 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.348248] [<ffffffffa0c7c20d>] mdt_intent_reint+0x1ed/0x500 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.355579] [<ffffffffa0c7add9>] mdt_intent_policy+0x369/0x680 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.363043] [<ffffffffa0724bc1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.370801] [<ffffffffa074b5fa>] ldlm_handle_enqueue0+0x48a/0xf40 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.378815] [<ffffffffa0c7b246>] mdt_enqueue+0x46/0x130 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.385546] [<ffffffffa0c709f2>] mdt_handle_common+0x922/0x1760 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.393062] [<ffffffffa0c71905>] mdt_regular_handle+0x15/0x20 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.400428] [<ffffffffa0778e3a>] ptlrpc_server_handle_request+0x43a/0x1000 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.409382] [<ffffffffa049957e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.416603] [<ffffffffa04a69bf>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.424462] [<ffffffffa0771ec0>] ? ptlrpc_wait_event+0xb0/0x2b0 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.432274] [<ffffffff8104f7a3>] ? __wake_up+0x53/0x70
          Jan 21 09:34:50 snx11026n003 kernel: [999566.438368] [<ffffffffa077a27a>] ptlrpc_main+0x87a/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.445644] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.452859] [<ffffffff8100c1ca>] child_rip+0xa/0x20
          Jan 21 09:34:50 snx11026n003 kernel: [999566.458677] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.465931] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.473166] [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
          Jan 21 09:34:50 snx11026n003 kernel: [999566.479105]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.481480] Kernel panic - not syncing: LBUG
          

          The LBUG was hit with a single MDT, which is odd in itself.
          The idea was to fix the assertion.

          askulysh Andriy Skulysh added a comment - Original call trace and remarks as given above.

          People

            di.wang Di Wang
            askulysh Andriy Skulysh