  Lustre / LU-6189

LustreError: (mdt_handler.c:4078:mdt_intent_reint()) ASSERTION( rc == 0 ) failed: Error occurred but lock handle is still in use, rc = -116

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.5.3
    • None
    • Severity: 2
    • 17312
    Description

      This morning we hit this LBUG twice within a few hours of each other, and it crashed the MDS both times. After the first reboot we had to abort recovery to get Lustre back. We have a crashdump from the MDS.

      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.805235] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 375s: evicting client at 4966@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881ec6e16c80/0xfc6e8aed747d1af2 lrc: 4/0,0 mode: CR/CR res: [0x2001a597a:0x85:0x0].0 bits 0x2 rrc: 4 type: IBT flags: 0x60200000000020 nid: 4966@gni100 remote: 0x20ee476ee499c158 expref: 132 pid: 16827 timeout: 4301930544 lvb_type: 0
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.858358] LustreError: 16827:0:(mdt_handler.c:4078:mdt_intent_reint()) ASSERTION( rc == 0 ) failed: Error occurred but lock handle is still in use, rc = -116
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.874660] LustreError: 16827:0:(mdt_handler.c:4078:mdt_intent_reint()) LBUG
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.882757] Pid: 16827, comm: mdt00_224
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.887151]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.887152] Call Trace:
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.891770] [<ffffffffa0407895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.899670] [<ffffffffa0407e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.906710] [<ffffffffa0d4379a>] mdt_intent_reint+0x51a/0x520 [mdt]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.913933] [<ffffffffa0d40c4e>] mdt_intent_policy+0x3ae/0x770 [mdt]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.921281] [<ffffffffa06de2e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.928910] [<ffffffffa0707d0b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.936903] [<ffffffff81069f75>] ? enqueue_entity+0x125/0x450
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.943544] [<ffffffffa0d41116>] mdt_enqueue+0x46/0xe0 [mdt]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.950094] [<ffffffffa0d4602a>] mdt_handle_common+0x52a/0x1470 [mdt]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.957515] [<ffffffffa0d833e5>] mds_regular_handle+0x15/0x20 [mdt]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.964770] [<ffffffffa0737fe5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.973547] [<ffffffffa04084ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.980677] [<ffffffffa04193cf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.988407] [<ffffffffa072f6c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.996116] [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.002774] [<ffffffffa073934d>] ptlrpc_main+0xaed/0x1760 [ptlrpc]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.009920] [<ffffffffa0738860>] ? ptlrpc_main+0x0/0x1760 [ptlrpc]
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.017040] [<ffffffff8109ab56>] kthread+0x96/0xa0
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.022607] [<ffffffff8100c20a>] child_rip+0xa/0x20
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.028267] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.033930] [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.039782]
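
      For context, errno 116 on Linux is ESTALE ("Stale file handle"), so the LBUG fires when the reint intent handler returns -ESTALE while the LDLM lock handle taken for the intent is still marked as in use. The sketch below is a minimal user-space illustration of that failing check, not Lustre code: the struct, the handle_is_used() helper and mock_intent_reint() are hypothetical stand-ins, and only the assertion message mirrors the real LASSERTF() in mdt_intent_reint().

      /*
       * Illustrative user-space mock of the failing check; NOT the Lustre
       * implementation. The real check is an LASSERTF(rc == 0, ...) in
       * mdt_intent_reint() that fires when the operation returned an error
       * but the lock handle is still in use.
       */
      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>

      struct mock_lock_handle {
              unsigned long long cookie;  /* non-zero means "handle in use" */
      };

      static int handle_is_used(const struct mock_lock_handle *lh)
      {
              return lh->cookie != 0;
      }

      /* Pretend the reint operation took a lock but then failed with -ESTALE. */
      static int mock_intent_reint(struct mock_lock_handle *lh)
      {
              lh->cookie = 0xfc6e8aed747d1af2ULL;  /* lock granted for the intent */
              return -ESTALE;                      /* ...but the operation failed */
      }

      int main(void)
      {
              struct mock_lock_handle lh = { 0 };
              int rc = mock_intent_reint(&lh);

              /* Prints: rc = -116 (Stale file handle) */
              printf("rc = %d (%s)\n", rc, strerror(-rc));

              if (handle_is_used(&lh))
                      /* The kernel code LBUGs here; assert() aborts in user space. */
                      assert(rc == 0 && "Error occurred but lock handle is still in use");

              return 0;
      }

      Compiled and run, this aborts at the assert, which is the user-space analogue of the LBUG and stack dump above.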

          Activity

            pjones Peter Jones added a comment -

            As per ORNL, ok to close as LU-5934 has landed.

            pjones Peter Jones added a comment -

            Good news. Thanks for the update. I will drop the severity to S2 and will continue to monitor in case there are any further complications.


            curtispb Philip B Curtis added a comment -

            We have rebooted into the new RPMs with the patch. Lustre has started and I will continue to monitor. Thank you for your help.

            Philip

            curtispb Philip B Curtis added a comment -

            Nope. I will get you those crashdumps once I have those instructions, and I will see about getting this patched version in place; then we will go from there.

            simmonsja James A Simmons added a comment -

            We are running what is in the ORNL GitHub. We attempted an upgrade but it failed after a few days. I generally don't upgrade the ORNL branch for a few weeks after an upgrade, just in case something goes wrong.

            curtispb Philip B Curtis added a comment -

            No, I do not have instructions for the ftp site. That is correct, we are at the tip of the code there.
            pjones Peter Jones added a comment -

            Philip

            This is a patch that needs to be applied to the MDS only. Is there anything else that you need from us at this point before attempting to bring the filesystem back up?

            Peter


            bzzz Alex Zhuravlev added a comment -

            I'm quite sure this is fixed with http://review.whamcloud.com/#/c/12828/
            pjones Peter Jones added a comment -

            ok. I think that it is best to start uploading the crash dump to our ftp site in case that is useful. Do you have the instructions on how to do that? Also, is the code being run exactly in sync with the tip of your b2_5 branch on GitHub? https://github.com/ORNL-TechInt/lustre/commits/b2_5


            curtispb Philip B Curtis added a comment -

            Peter

            No, the first time this occurred Lustre was restarted. I haven't brought Lustre back up this time since it followed so closely after the first occurrence. I wanted to get Intel involved before I attempted another start.

            Philip

            People

              Assignee: pjones Peter Jones
              Reporter: curtispb Philip B Curtis
              Votes: 0
              Watchers: 7
