Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6334

racer test_1: ldlm_lock_destroy_internal() LBUG

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.7.0
    • OpenSFS cluster running lustre 2.7.0-RC-4 build # 29 with two MDSs with two MDTs each, three OSSs with two OSTs each and three clients.
    • 3
    • 17740

    Description

      Racer test 1 hangs and second MDS crashes. From the client test log:

      == racer test 1: racer on clients: c11,c12,c13 DURATION=900 == 01:00:24 (1425546
      024)
      racers pids: 24447 24448 24449 24450 24452 24455 24457 24463
      c12: ./file_exec.sh: line 12: 32546 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12:  3014 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 22190 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12: 20868 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12: 27709 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12:  4393 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12: 30807 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12: 27894 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12:  6553 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12:  8077 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c13: ./file_exec.sh: line 12: 15490 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 25110 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 22466 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 13403 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 31697 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12:  4156 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 13081 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 13991 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 31225 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12:  5779 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 23907 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c12: ./file_exec.sh: line 12: 14579 Segmentation fault      $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 13126 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 12056 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 12121 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 21271 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 26034 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 24871 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 27385 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12:  2404 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12:  5912 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12:  6463 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12:  8118 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 10024 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 10552 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 17440 Segmentation fault      (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 21595 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12:  3143 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12:  9341 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12: 18994 Bus error               (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_exec.sh: line 12:  2735 Bus error               $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
      c11: ./file_mknod.sh: line 11:  5348 Terminated              $MCREATE $file 2> /dev/null
      c11: ./file_mknod.sh: line 11:  5349 Terminated              $MCREATE $file 2> /dev/null
      c11: ./file_mknod.sh: line 11:  5323 Terminated              $MCREATE $file 2> /dev/null
      …
      

      The second MDS crashed with the following on the console:

      Message from syslogd@mds02 at Mar  5 01:02:21 ...
       kernel:LustreError: 24677:0:(ldlm_lock.c:364:ldlm_lock_destroy_internal()) LBUG
      

      In the dmesg before the crash:

      <4>Lustre: DEBUG MARKER: == racer test 1: racer on clients: c11,c12,c13 DURATION=900 == 01:00:24 (1425546024)
      <3>LustreError: 25276:0:(mdt_reint.c:1575:mdt_reint_migrate_internal()) scratch-MDT0001: can not migrate striped dir [0x3c0002b17:0xa:0x0]: rc = -1
      <3>LustreError: 10002:0:(mdd_dir.c:4034:mdd_migrate()) scratch-MDD0003: [0x4400013a9:0x4:0x0]0 is already opened count 1: rc = -16
      <3>LustreError: 9999:0:(mdt_reint.c:1536:mdt_reint_migrate_internal()) scratch-MDT0001: parent [0x3c0002b10:0x41e:0x0] is still on the same MDT, which should be migrated first: rc = -1
      <3>LustreError: 9993:0:(mdt_reint.c:1173:mdt_reint_link()) scratch-MDT0001: source inode [0x400002347:0x25:0x0] on remote MDT from [0x3c0002b12:0x7:0x0]
      <3>LustreError: 9993:0:(mdd_dir.c:4034:mdd_migrate()) scratch-MDD0001: [0x3c0002b17:0x6:0x0]13 is already opened count 1: rc = -16
      <3>LustreError: 25282:0:(mdt_reint.c:1536:mdt_reint_migrate_internal()) scratch-MDT0003: parent [0x4400013a0:0xa:0x0] is still on the same MDT, which should be migrated first: rc = -1
      <3>LustreError: 27158:0:(mdt_reint.c:1173:mdt_reint_link()) scratch-MDT0003: source inode [0x400002348:0x3c:0x0] on remote MDT from [0x4400013aa:0x1:0x0]
      <3>LustreError: 27158:0:(mdt_reint.c:1173:mdt_reint_link()) Skipped 1 previous similar message
      …
      

      and call trace:

      <3>LustreError: 24677:0:(ldlm_lock.c:363:ldlm_lock_destroy_internal()) ### lock 
      still on resource ns: mdt-scratch-MDT0003_UUID lock: ffff8806b331dd00/0x652e27c3
      5471591f lrc: 3/0,0 mode: PR/PR res: [0x4400013a5:0x1e94:0x0].0 bits 0x20 rrc: 1
       type: IBT flags: 0x50200000000000 nid: 192.168.2.111@o2ib remote: 0xe6ac309763e
      dc3fe expref: 30 pid: 9995 timeout: 0 lvb_type: 0
      <3>LustreError: 9997:0:(mdt_open.c:1564:mdt_cross_open()) scratch-MDT0001: [0x3c0002b14:0x3d8:0x0] doesn't exist!: rc = -14
      <3>LustreError: 9997:0:(mdt_open.c:1564:mdt_cross_open()) Skipped 6 previous similar messages
      <0>LustreError: 24677:0:(ldlm_lock.c:364:ldlm_lock_destroy_internal()) LBUG
      <4>Pid: 24677, comm: mdt00_001
      <4>
      <4>Call Trace:
      <4> [<ffffffffa0565895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa0565e97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa089b2c1>] ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
      <4> [<ffffffffa089cef5>] ldlm_lock_destroy+0x35/0x130 [ptlrpc] 
      <4> [<ffffffffa089d525>] ldlm_lock_enqueue+0x155/0x9d0 [ptlrpc]
      <4> [<ffffffffa08c964b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
      <4> [<ffffffffa094a2f1>] tgt_enqueue+0x61/0x230 [ptlrpc]
      <4> [<ffffffffa094af3e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
      <4> [<ffffffffa08fa9b1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      <4> [<ffffffffa08f9b70>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      <4> [<ffffffff8109abf6>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 24677, comm: mdt00_001 Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff81528fdc>] ? panic+0xa7/0x16f
      <4> [<ffffffffa0565eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa089b2c1>] ? ldlm_lock_destroy_internal+0x251/0x2c0 [ptlrpc]
      <4> [<ffffffffa089cef5>] ? ldlm_lock_destroy+0x35/0x130 [ptlrpc]
      <4> [<ffffffffa089d525>] ? ldlm_lock_enqueue+0x155/0x9d0 [ptlrpc]
      <4> [<ffffffffa08c964b>] ? ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
      <4> [<ffffffffa094a2f1>] ? tgt_enqueue+0x61/0x230 [ptlrpc]
      <4> [<ffffffffa094af3e>] ? tgt_request_handle+0x8be/0x1000 [ptlrpc]
      <4> [<ffffffffa08fa9b1>] ? ptlrpc_main+0xe41/0x1960 [ptlrpc]
      <4> [<ffffffffa08f9b70>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      <4> [<ffffffff8109abf6>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      I will upload vmcore and node logs when they are available.

      Attachments

        Issue Links

          Activity

            [LU-6334] racer test_1: ldlm_lock_destroy_internal() LBUG

            Reopen to change resolution.

            adilger Andreas Dilger added a comment - Reopen to change resolution.

            Moved to LU-7173

            simmonsja James A Simmons added a comment - Moved to LU-7173

            The OOM happened on MDT

            simmonsja James A Simmons added a comment - The OOM happened on MDT

            Hmm, did this happen on MDT or OST?

            di.wang Di Wang (Inactive) added a comment - Hmm, did this happen on MDT or OST?
            simmonsja James A Simmons added a comment - - edited

            Yes I do have vmcore dumps but no lctl debug logs. We believe it happened when a bunch of MPIIO were executed at various scales on titan at slightly varying start times. Once we set the debug logs levels the problem went away Later we went back to the original debug level; plus with ldlm set; to see if we could duplicate the problem. No luck duplicating the problem so we don't know what exact work load caused it.

            Oleg suggest I open a need ticket and link to this one.

            simmonsja James A Simmons added a comment - - edited Yes I do have vmcore dumps but no lctl debug logs. We believe it happened when a bunch of MPIIO were executed at various scales on titan at slightly varying start times. Once we set the debug logs levels the problem went away Later we went back to the original debug level; plus with ldlm set; to see if we could duplicate the problem. No luck duplicating the problem so we don't know what exact work load caused it. Oleg suggest I open a need ticket and link to this one.

            James: Do you have vmcore dump or debug log for the LBUG? If you can tell me how to reproduce the problem, that would be great. Thanks!

            di.wang Di Wang (Inactive) added a comment - James: Do you have vmcore dump or debug log for the LBUG? If you can tell me how to reproduce the problem, that would be great. Thanks!

            We just hit this bug on our 2.5.4 servers during our test shot with 2.8 clients.

            simmonsja James A Simmons added a comment - We just hit this bug on our 2.5.4 servers during our test shot with 2.8 clients.
            pjones Peter Jones added a comment -

            Thanks Di!

            pjones Peter Jones added a comment - Thanks Di!

            No, I did not see this for long time, yes, I can not be closed for now.

            di.wang Di Wang (Inactive) added a comment - No, I did not see this for long time, yes, I can not be closed for now.

            Di, is this something you still see on DNE racer, or could it be closed?

            adilger Andreas Dilger added a comment - Di, is this something you still see on DNE racer, or could it be closed?

            People

              di.wang Di Wang (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: