mdt_reconstruct_open() ASSERTION failure

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • 7028

    Description

      The transno shouldn't be cleared for an EREMOTE operation.

      Attachments

        Activity

          [LU-2927] mdt_reconstruct_open() ASSERTION failure

          nedbass Ned Bass (Inactive) added a comment - Andreas, we hit this last night on a production MDS still running 2.3.63. In case you're still interested I'm attaching the console log and lustre debug log. Not sure what the client load was like at the time. I see a lot of changelog and fid2path activity which is probably from a RobinHood scan.
          pjones Peter Jones added a comment - Landed for 2.4

          adilger Andreas Dilger added a comment - Andriy, any information on how this bug was triggered? Was it under testing, or some user load? MDS recovery, network errors, etc?
          di.wang Di Wang added a comment - http://review.whamcloud.com/#change,5694

          keith Keith Mannthey (Inactive) added a comment - How do you know the assertion was wrong?

          askulysh Andriy Skulysh added a comment - Original call trace:

          Jan 21 09:34:50 snx11026n003 kernel: [999566.276882] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
          Jan 21 09:34:50 snx11026n003 kernel: [999566.293035] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) LBUG
          Jan 21 09:34:50 snx11026n003 kernel: [999566.301506] Pid: 129089, comm: mdt_500
          Jan 21 09:34:50 snx11026n003 kernel: [999566.305899]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.305900] Call Trace:
          Jan 21 09:34:50 snx11026n003 kernel: [999566.310714] [<ffffffffa0498825>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.318711] [<ffffffffa0498e37>] lbug_with_loc+0x47/0xb0 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.325867] [<ffffffffa0c9de47>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.333590] [<ffffffffa0c8f7c5>] mdt_reconstruct+0x45/0x120 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.340724] [<ffffffffa0c7bd59>] mdt_reint_internal+0x709/0x8f0 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.348248] [<ffffffffa0c7c20d>] mdt_intent_reint+0x1ed/0x500 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.355579] [<ffffffffa0c7add9>] mdt_intent_policy+0x369/0x680 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.363043] [<ffffffffa0724bc1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.370801] [<ffffffffa074b5fa>] ldlm_handle_enqueue0+0x48a/0xf40 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.378815] [<ffffffffa0c7b246>] mdt_enqueue+0x46/0x130 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.385546] [<ffffffffa0c709f2>] mdt_handle_common+0x922/0x1760 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.393062] [<ffffffffa0c71905>] mdt_regular_handle+0x15/0x20 [mdt]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.400428] [<ffffffffa0778e3a>] ptlrpc_server_handle_request+0x43a/0x1000 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.409382] [<ffffffffa049957e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.416603] [<ffffffffa04a69bf>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.424462] [<ffffffffa0771ec0>] ? ptlrpc_wait_event+0xb0/0x2b0 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.432274] [<ffffffff8104f7a3>] ? __wake_up+0x53/0x70
          Jan 21 09:34:50 snx11026n003 kernel: [999566.438368] [<ffffffffa077a27a>] ptlrpc_main+0x87a/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.445644] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.452859] [<ffffffff8100c1ca>] child_rip+0xa/0x20
          Jan 21 09:34:50 snx11026n003 kernel: [999566.458677] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.465931] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.473166] [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
          Jan 21 09:34:50 snx11026n003 kernel: [999566.479105]
          Jan 21 09:34:50 snx11026n003 kernel: [999566.481480] Kernel panic - not syncing: LBUG

          The LBUG was hit with a single MDT, which is strange in itself.
          The idea was to fix the assertion.
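
The condition printed in the ASSERTION line is an implication written in `!A || B` form: a negative `rc` is only acceptable when the reply's transno is zero. The sketch below restates that predicate outside of Lustre so the failing combination is easy to see. The function names `reply_state_is_consistent` and `check_reply_state` are hypothetical, `rc` and `transno` are plain parameters standing in for the request state, and `assert` stands in for an LASSERT-style check. It is an illustration of the logic only, not the code from either patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the condition shown in the LBUG message.
 * rc and transno are ordinary parameters here, not Lustre request state.
 */
static bool reply_state_is_consistent(int rc, uint64_t transno)
{
	/* "rc < 0 implies transno == 0", spelled as !A || B. */
	return !(rc < 0) || transno == 0;
}

/* Hypothetical wrapper that aborts when the predicate does not hold. */
static void check_reply_state(int rc, uint64_t transno)
{
	assert(reply_state_is_consistent(rc, transno));
}
```

With these definitions, a negative `rc` such as `-EREMOTE` paired with a nonzero transno is the only shape of input that makes the predicate false; success with any transno, or an error with transno 0, passes.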

          adilger Andreas Dilger added a comment - Dropping this from the blocker list. The patch is incorrect and we have no information about how this bug was hit or the symptoms of the failure (stack trace, error logs, etc), or how often it is hit, so no way to know how common or rare the problem is.

          adilger Andreas Dilger added a comment - How was this bug hit, and how often is it seen? Ideally there would also be a test case for this.
          askulysh Andriy Skulysh added a comment - PATCH: http://review.whamcloud.com/5632

          People

            di.wang Di Wang
            askulysh Andriy Skulysh