Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • Severity: 3
    • 8703

    Description

      Hi,

      We have been testing v2.4 and have hit this LBUG, which we never experienced in v1.8.x for similar workloads. It looks like it is related to doing an rm/unlink on certain files. I had to abort recovery and stop the ongoing file deletion in order to keep the MDS from repeatedly crashing with the same LBUG. We can supply more debug info should you need it.

      Cheers,

      Daire

      <0>LustreError: 6274:0:(linkea.c:169:linkea_links_find()) ASSERTION( ldata->ld_leh != ((void *)0) ) failed:
      <0>LustreError: 6274:0:(linkea.c:169:linkea_links_find()) LBUG
      <4>Pid: 6274, comm: mdt01_004
      <4>
      <4>Call Trace:
      <4> [<ffffffffa044b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa044be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa05b47d6>] linkea_links_find+0x186/0x190 [obdclass]
      <4> [<ffffffffa0b87206>] ? mdo_xattr_get+0x26/0x30 [mdd]
      <4> [<ffffffffa0b8a645>] mdd_linkea_prepare+0x95/0x430 [mdd]
      <4> [<ffffffffa0b8ab01>] mdd_links_rename+0x121/0x540 [mdd]
      <4> [<ffffffffa0b8eae6>] mdd_unlink+0xb86/0xe30 [mdd]
      <4> [<ffffffffa0e0db98>] mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e10f40>] mdt_reint_unlink+0x820/0x1010 [mdt]
      <4> [<ffffffffa0e0d891>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0df2b03>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0df2e04>] mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0df7ab8>] mdt_handle_common+0x648/0x1660 [mdt]
      <4> [<ffffffffa0e31165>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa0730388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa044c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa045dd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07276e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa073171e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6274, comm: mdt01_004 Tainted: G --------------- T 2.6.32-358.6.2.el6_lustre.g230b174.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8150d878>] ? panic+0xa7/0x16f
      <4> [<ffffffffa044beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa05b47d6>] ? linkea_links_find+0x186/0x190 [obdclass]
      <4> [<ffffffffa0b87206>] ? mdo_xattr_get+0x26/0x30 [mdd]
      <4> [<ffffffffa0b8a645>] ? mdd_linkea_prepare+0x95/0x430 [mdd]
      <4> [<ffffffffa0b8ab01>] ? mdd_links_rename+0x121/0x540 [mdd]
      <4> [<ffffffffa0b8eae6>] ? mdd_unlink+0xb86/0xe30 [mdd]
      <4> [<ffffffffa0e0db98>] ? mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e10f40>] ? mdt_reint_unlink+0x820/0x1010 [mdt]
      <4> [<ffffffffa0e0d891>] ? mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0df2b03>] ? mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0df2e04>] ? mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0df7ab8>] ? mdt_handle_common+0x648/0x1660 [mdt]
      <4> [<ffffffffa0e31165>] ? mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa0730388>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa044c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa045dd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07276e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa073171e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
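
      For context, the assertion fires because ldata->ld_leh is NULL, i.e. the file's linkEA (the xattr recording its parent links) is missing or could not be parsed, and linkea_links_find() treats that as fatal on the unlink path. The sketch below is only an illustration of the assertion-to-error pattern involved, using hypothetical stand-in types rather than the real Lustre structures; consult the actual patch under review for the real change.

      ```c
      #include <errno.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Hypothetical stand-ins for the Lustre linkEA structures. */
      struct link_ea_header { int leh_reccount; };
      struct linkea_data    { struct link_ea_header *ld_leh; };

      /*
       * Defensive variant: instead of LASSERT(ldata->ld_leh != NULL),
       * which turns a missing or corrupt linkEA into an MDS panic,
       * report -ENODATA so the caller (e.g. the unlink path) can skip
       * linkEA maintenance and carry on.
       */
      static int linkea_links_find_safe(struct linkea_data *ldata)
      {
              if (ldata == NULL || ldata->ld_leh == NULL)
                      return -ENODATA;  /* no linkEA to search */
              if (ldata->ld_leh->leh_reccount == 0)
                      return -ENODATA;  /* header present but empty */
              return 0;                 /* a real search would continue here */
      }

      int main(void)
      {
              struct linkea_data missing = { .ld_leh = NULL };
              struct link_ea_header leh = { .leh_reccount = 1 };
              struct linkea_data present = { .ld_leh = &leh };

              printf("missing linkEA -> %s\n",
                     linkea_links_find_safe(&missing) == -ENODATA ? "ENODATA" : "ok");
              printf("present linkEA -> %s\n",
                     linkea_links_find_safe(&present) == 0 ? "ok" : "error");
              return 0;
      }
      ```

      The design point is that a missing linkEA is recoverable metadata damage, not an invariant violation, so an error return that degrades linkEA bookkeeping is preferable to panicking the whole MDS.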

          Activity

            [LU-3474] MDS LBUG on unlink?

            Patch landed to Master. Closing ticket. Please let me know if more work is needed and I will reopen.

            jlevi Jodi Levi (Inactive) added a comment

            FWIW, I've applied #6672 and #6676 and have not hit the issue with our test workload (we did hit it repeatedly without #6676).

            prakash Prakash Surya (Inactive) added a comment

            Bruno,

            I have patched it in and haven't seen the issue again yet. However, I have not yet had the opportunity to run the same workload (large unlinks); that should happen between now and next week. I will update if we have any further issues. Thanks for the help.

            daire Daire Byrne (Inactive) added a comment

            Daire, have you finally been able to test patch-set #4 of http://review.whamcloud.com/6676?

            bfaccini Bruno Faccini (Inactive) added a comment

            Wow, I am sorry Daire, I don't know how this happened, but patch-set #3 of http://review.whamcloud.com/6676 contained a regression from patch-sets #1/#2 (in fact it did not contain the main change from patch-set #1 that must be in place to prevent the LBUG!). Can you give patch-set #4 a try? It should be the definitive one.

            bfaccini Bruno Faccini (Inactive) added a comment

            I finally got around to testing the two patches - the LBUG has returned. I patched v2.4.0:

            1. cd /usr/src/lustre-2.4.0/
            2. patch -p1 < mdd_dir.c.patch
            3. patch -p1 < /tmp/mdt_handler.c.patch
            4. ./configure
            5. make rpms

            Jul 3 12:36:50 bmds1 kernel: LustreError: 13174:0:(linkea.c:169:linkea_links_find()) ASSERTION( ldata->ld_leh != ((void *)0) ) failed:
            Jul 3 12:36:50 bmds1 kernel: LustreError: 13174:0:(linkea.c:169:linkea_links_find()) LBUG
            Jul 3 12:36:50 bmds1 kernel: Pid: 13174, comm: mdt01_010
            Jul 3 12:36:50 bmds1 kernel:
            Jul 3 12:36:50 bmds1 kernel: Call Trace:
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa05a47d6>] linkea_links_find+0x186/0x190 [obdclass]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b65206>] ? mdo_xattr_get+0x26/0x30 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b68645>] mdd_linkea_prepare+0x95/0x430 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b68b01>] mdd_links_rename+0x121/0x520 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b6cac6>] mdd_unlink+0xb86/0xe30 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dddb88>] mdo_unlink+0x18/0x50 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0de0f30>] mdt_reint_unlink+0x820/0x1010 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0ddd881>] mdt_reint_rec+0x41/0xe0 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc2b03>] mdt_reint_internal+0x4c3/0x780 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc2e04>] mdt_reint+0x44/0xe0 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc7ab8>] mdt_handle_common+0x648/0x1660 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0e01155>] mds_regular_handle+0x15/0x20 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa071e388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043d5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa044ed8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa07156e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071f71e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]

            daire Daire Byrne (Inactive) added a comment

            Hello Daniel,
            Thanks for the feedback too!
            But patch-set #2 was not in accordance with the error reporting rules in use, so I just pushed patch-set #3 to fix that.

            bfaccini Bruno Faccini (Inactive) added a comment

            Hello

            At our site, we hit this problem with version 2.4.50. It was triggered by moving a directory with lots of files to another destination.

            I can confirm that Patch-set #2 has fixed the problem.

            dbasabe Daniel Basabe (Inactive) added a comment
            spitzcor Cory Spitz added a comment -

            I was mistaken, Cray has not yet tested with 6772 applied. However, 6676 ps1 did test successfully.


            Thanks for the feedback Cory, #6676 patch-set #2 should fix the LBUG AND the annoying (and erroneous!) messages...

            bfaccini Bruno Faccini (Inactive) added a comment
            spitzcor Cory Spitz added a comment -

            Cray testing on change #6676 ps1 and 6772 shows that the changes resolve our problems with the LBUG.


            People

              bfaccini Bruno Faccini (Inactive)
              daire Daire Byrne (Inactive)
              Votes: 0
              Watchers: 16
