Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • 3
    • 8703

    Description

      Hi,

      We have been testing v2.4 and have hit this LBUG, which we never experienced in v1.8.x under similar workloads. It looks like it is related to doing an rm/unlink on certain files. I had to abort recovery and stop the ongoing file deletion to keep the MDS from repeatedly crashing with the same LBUG. We can supply more debug info should you need it.

      Cheers,

      Daire

      <0>LustreError: 6274:0:(linkea.c:169:linkea_links_find()) ASSERTION( ldata->ld_leh != ((void *)0) ) failed:
      <0>LustreError: 6274:0:(linkea.c:169:linkea_links_find()) LBUG
      <4>Pid: 6274, comm: mdt01_004
      <4>
      <4>Call Trace:
      <4> [<ffffffffa044b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa044be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa05b47d6>] linkea_links_find+0x186/0x190 [obdclass]
      <4> [<ffffffffa0b87206>] ? mdo_xattr_get+0x26/0x30 [mdd]
      <4> [<ffffffffa0b8a645>] mdd_linkea_prepare+0x95/0x430 [mdd]
      <4> [<ffffffffa0b8ab01>] mdd_links_rename+0x121/0x540 [mdd]
      <4> [<ffffffffa0b8eae6>] mdd_unlink+0xb86/0xe30 [mdd]
      <4> [<ffffffffa0e0db98>] mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e10f40>] mdt_reint_unlink+0x820/0x1010 [mdt]
      <4> [<ffffffffa0e0d891>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0df2b03>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0df2e04>] mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0df7ab8>] mdt_handle_common+0x648/0x1660 [mdt]
      <4> [<ffffffffa0e31165>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa0730388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa044c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa045dd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07276e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa073171e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6274, comm: mdt01_004 Tainted: G --------------- T 2.6.32-358.6.2.el6_lustre.g230b174.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8150d878>] ? panic+0xa7/0x16f
      <4> [<ffffffffa044beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa05b47d6>] ? linkea_links_find+0x186/0x190 [obdclass]
      <4> [<ffffffffa0b87206>] ? mdo_xattr_get+0x26/0x30 [mdd]
      <4> [<ffffffffa0b8a645>] ? mdd_linkea_prepare+0x95/0x430 [mdd]
      <4> [<ffffffffa0b8ab01>] ? mdd_links_rename+0x121/0x540 [mdd]
      <4> [<ffffffffa0b8eae6>] ? mdd_unlink+0xb86/0xe30 [mdd]
      <4> [<ffffffffa0e0db98>] ? mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e10f40>] ? mdt_reint_unlink+0x820/0x1010 [mdt]
      <4> [<ffffffffa0e0d891>] ? mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0df2b03>] ? mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0df2e04>] ? mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0df7ab8>] ? mdt_handle_common+0x648/0x1660 [mdt]
      <4> [<ffffffffa0e31165>] ? mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa0730388>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa044c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa045dd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07276e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa073171e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
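
      For context, `linkea_links_find()` asserts that the parsed linkEA header (`ldata->ld_leh`) is non-NULL before searching it, so an object with no linkEA xattr (e.g. a file created under 1.8.x, before linkEA existed) trips the assertion during unlink and panics the MDS. The sketch below is a simplified, hypothetical model of the failure mode and of a defensive alternative, not the actual Lustre code; the struct and function names are stand-ins:

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Simplified stand-in for Lustre's struct linkea_data (hypothetical). */
      struct linkea_data {
          void *ld_leh;   /* parsed linkEA header; NULL when the xattr is absent */
      };

      /* Asserting variant: mirrors the LASSERT that panicked the MDS. */
      static int links_find_asserting(struct linkea_data *ldata)
      {
          assert(ldata->ld_leh != NULL);   /* LBUG when the linkEA is missing */
          return 0;                        /* ... would search the linkEA ... */
      }

      /* Defensive variant: return "no such attribute" so the caller can skip
       * linkEA maintenance for objects that predate linkEA (1.8-era files). */
      static int links_find_defensive(struct linkea_data *ldata)
      {
          if (ldata->ld_leh == NULL)
              return -ENODATA;
          return 0;
      }

      int main(void)
      {
          struct linkea_data present = { .ld_leh = &present };
          struct linkea_data missing = { .ld_leh = NULL };

          assert(links_find_asserting(&present) == 0);
          assert(links_find_defensive(&missing) == -ENODATA);
          printf("missing linkEA handled: rc=%d\n",
                 links_find_defensive(&missing));
          return 0;
      }
      ```

      The design point is that a missing xattr is an expected state for upgraded filesystems, so it should surface as an error code the caller can handle rather than as a fatal assertion.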

      Activity

            [LU-3474] MDS LBUG on unlink?

            I think I had mentioned #6772 in this ticket, but in any case I should have merged both patches, as Di suggested at the time, to avoid such an oversight!

            bfaccini Bruno Faccini (Inactive) added a comment
            adilger Andreas Dilger added a comment (edited) - Cherry-pick http://review.whamcloud.com/6772 to b2_4: http://review.whamcloud.com/10464

            It seems that http://review.whamcloud.com/6676 was landed to b2_4 for 2.4.1, but http://review.whamcloud.com/6772 (which was not mentioned anywhere in this bug, but attributed to LU-3474) was only landed to master for 2.4.52 and not b2_4. This causes "lfs fid2path" on old IGIF FIDs with 2.4.2 servers to incorrectly return success when there is no linkEA, printing only the root path:

            lfs fid2path /myth [0x10b466:0xfce641b5:0x0]
            /myth//
            

            There is never a linkEA for upgraded 1.x files until LFSCK 1.5 is run on a 2.5+ MDS.

            adilger Andreas Dilger added a comment
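
            The behaviour Andreas describes can be modelled in a few lines: if a missing linkEA is treated as an empty but valid link list, the path walk toward the root terminates immediately, so the call "succeeds" while producing only the mount point. The sketch below uses hypothetical helper names and is not the real fid2path implementation; it assumes -ENODATA is the appropriate error for a missing xattr:

            ```c
            #include <assert.h>
            #include <errno.h>
            #include <stdio.h>
            #include <string.h>

            /* A missing linkEA, modelled as a link count of zero (hypothetical). */
            static int linkea_entries(int has_linkea) { return has_linkea ? 1 : 0; }

            /* Buggy variant: no entries means the walk toward "/" ends at once,
             * so the caller prints just the mount point ("/myth//") and
             * reports success. */
            static int fid2path_buggy(int has_linkea, char *buf, size_t len)
            {
                snprintf(buf, len, "/");
                if (linkea_entries(has_linkea) > 0)
                    snprintf(buf, len, "/dir/file");
                return 0;                    /* "success" even with no linkEA */
            }

            /* Fixed variant: a missing linkEA is an error, not an empty path. */
            static int fid2path_fixed(int has_linkea, char *buf, size_t len)
            {
                if (linkea_entries(has_linkea) == 0)
                    return -ENODATA;
                return fid2path_buggy(has_linkea, buf, len);
            }

            int main(void)
            {
                char buf[64];
                assert(fid2path_buggy(0, buf, sizeof(buf)) == 0);  /* wrong */
                assert(strcmp(buf, "/") == 0);                     /* root only */
                assert(fid2path_fixed(0, buf, sizeof(buf)) == -ENODATA);
                printf("fixed rc for missing linkEA: %d\n",
                       fid2path_fixed(0, buf, sizeof(buf)));
                return 0;
            }
            ```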

            Patch landed to Master. Closing ticket. Please let me know if more work is needed and I will reopen.

            jlevi Jodi Levi (Inactive) added a comment

            FWIW, I've applied #6672 and #6676 and have not hit the issue with our test workload (we did hit it repeatedly without #6676).

            prakash Prakash Surya (Inactive) added a comment

            Bruno,

            I have patched it in and haven't seen the issue again yet. However, I have not yet had the opportunity to run the same workload (large unlinks); that should happen between now and next week. I will update if we have any further issues. Thanks for the help.

            daire Daire Byrne (Inactive) added a comment

            Daire, have you finally been able to test patch-set #4 of http://review.whamcloud.com/6676?

            bfaccini Bruno Faccini (Inactive) added a comment

            Wow, I am sorry Daire, I don't know how this happened, but patch-set #3 of http://review.whamcloud.com/6676 contained a regression from patch-sets #1/#2 (in fact it did not contain the main change from patch-set #1 that must be in place to prevent the LBUG!). Can you give patch-set #4, which should be the definitive one, a try?

            bfaccini Bruno Faccini (Inactive) added a comment

            I finally got around to testing the two patches, and the LBUG has returned. I patched v2.4.0:

            1. cd /usr/src/lustre-2.4.0/
            2. patch -p1 < mdd_dir.c.patch
            3. patch -p1 < /tmp/mdt_handler.c.patch
            4. ./configure
            5. make rpms

            Jul 3 12:36:50 bmds1 kernel: LustreError: 13174:0:(linkea.c:169:linkea_links_find()) ASSERTION( ldata->ld_leh != ((void *)0) ) failed:
            Jul 3 12:36:50 bmds1 kernel: LustreError: 13174:0:(linkea.c:169:linkea_links_find()) LBUG
            Jul 3 12:36:50 bmds1 kernel: Pid: 13174, comm: mdt01_010
            Jul 3 12:36:50 bmds1 kernel:
            Jul 3 12:36:50 bmds1 kernel: Call Trace:
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa05a47d6>] linkea_links_find+0x186/0x190 [obdclass]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b65206>] ? mdo_xattr_get+0x26/0x30 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b68645>] mdd_linkea_prepare+0x95/0x430 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b68b01>] mdd_links_rename+0x121/0x520 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b6cac6>] mdd_unlink+0xb86/0xe30 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dddb88>] mdo_unlink+0x18/0x50 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0de0f30>] mdt_reint_unlink+0x820/0x1010 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0ddd881>] mdt_reint_rec+0x41/0xe0 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc2b03>] mdt_reint_internal+0x4c3/0x780 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc2e04>] mdt_reint+0x44/0xe0 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc7ab8>] mdt_handle_common+0x648/0x1660 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0e01155>] mds_regular_handle+0x15/0x20 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa071e388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043d5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa044ed8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa07156e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071f71e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]

            daire Daire Byrne (Inactive) added a comment

            Hello Daniel,
            Thanks for the feedback too!
            But patch-set #2 was not in accordance with the error-reporting rules in use, so I just pushed patch-set #3 to fix that.

            bfaccini Bruno Faccini (Inactive) added a comment

            People

              Assignee: bfaccini Bruno Faccini (Inactive)
              Reporter: daire Daire Byrne (Inactive)
              Votes: 0
              Watchers: 16

              Dates

                Created:
                Updated:
                Resolved: