Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.0
    • 3
    • 8703

    Description

      Hi,

      We have been testing v2.4 and have hit this LBUG, which we never experienced in v1.8.x with similar workloads. It looks like it is related to doing an rm/unlink on certain files. I had to abort recovery and stop the ongoing file deletion to keep the MDS from repeatedly crashing with the same LBUG. We can supply more debug info should you need it.

      Cheers,

      Daire

      <0>LustreError: 6274:0:(linkea.c:169:linkea_links_find()) ASSERTION( ldata->ld_leh != ((void *)0) ) failed:
      <0>LustreError: 6274:0:(linkea.c:169:linkea_links_find()) LBUG
      <4>Pid: 6274, comm: mdt01_004
      <4>
      <4>Call Trace:
      <4> [<ffffffffa044b895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      <4> [<ffffffffa044be97>] lbug_with_loc+0x47/0xb0 [libcfs]
      <4> [<ffffffffa05b47d6>] linkea_links_find+0x186/0x190 [obdclass]
      <4> [<ffffffffa0b87206>] ? mdo_xattr_get+0x26/0x30 [mdd]
      <4> [<ffffffffa0b8a645>] mdd_linkea_prepare+0x95/0x430 [mdd]
      <4> [<ffffffffa0b8ab01>] mdd_links_rename+0x121/0x540 [mdd]
      <4> [<ffffffffa0b8eae6>] mdd_unlink+0xb86/0xe30 [mdd]
      <4> [<ffffffffa0e0db98>] mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e10f40>] mdt_reint_unlink+0x820/0x1010 [mdt]
      <4> [<ffffffffa0e0d891>] mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0df2b03>] mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0df2e04>] mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0df7ab8>] mdt_handle_common+0x648/0x1660 [mdt]
      <4> [<ffffffffa0e31165>] mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa0730388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa044c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa045dd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07276e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa073171e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      <4>
      <0>Kernel panic - not syncing: LBUG
      <4>Pid: 6274, comm: mdt01_004 Tainted: G --------------- T 2.6.32-358.6.2.el6_lustre.g230b174.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff8150d878>] ? panic+0xa7/0x16f
      <4> [<ffffffffa044beeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      <4> [<ffffffffa05b47d6>] ? linkea_links_find+0x186/0x190 [obdclass]
      <4> [<ffffffffa0b87206>] ? mdo_xattr_get+0x26/0x30 [mdd]
      <4> [<ffffffffa0b8a645>] ? mdd_linkea_prepare+0x95/0x430 [mdd]
      <4> [<ffffffffa0b8ab01>] ? mdd_links_rename+0x121/0x540 [mdd]
      <4> [<ffffffffa0b8eae6>] ? mdd_unlink+0xb86/0xe30 [mdd]
      <4> [<ffffffffa0e0db98>] ? mdo_unlink+0x18/0x50 [mdt]
      <4> [<ffffffffa0e10f40>] ? mdt_reint_unlink+0x820/0x1010 [mdt]
      <4> [<ffffffffa0e0d891>] ? mdt_reint_rec+0x41/0xe0 [mdt]
      <4> [<ffffffffa0df2b03>] ? mdt_reint_internal+0x4c3/0x780 [mdt]
      <4> [<ffffffffa0df2e04>] ? mdt_reint+0x44/0xe0 [mdt]
      <4> [<ffffffffa0df7ab8>] ? mdt_handle_common+0x648/0x1660 [mdt]
      <4> [<ffffffffa0e31165>] ? mds_regular_handle+0x15/0x20 [mdt]
      <4> [<ffffffffa0730388>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
      <4> [<ffffffffa044c5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
      <4> [<ffffffffa045dd8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
      <4> [<ffffffffa07276e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
      <4> [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
      <4> [<ffffffffa073171e>] ? ptlrpc_main+0xace/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffffa0730c50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
      <4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
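
The assertion in the trace says that linkea_links_find() was reached while ldata->ld_leh was NULL. A minimal sketch of that failure mode follows. The types and function names (demo_link_hdr, demo_link_data, demo_links_find, demo_links_find_safe) are hypothetical stand-ins for illustration, not the Lustre definitions, and the guard shown is one possible defensive pattern rather than the fix that was actually merged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the Lustre structures. */
struct demo_link_hdr {
    size_t      count;     /* number of link names stored */
    const char *names[4];  /* link names (fixed size for the sketch) */
};

struct demo_link_data {
    struct demo_link_hdr *hdr; /* NULL when no link xattr was loaded */
};

/* Mirrors the shape of the failing check: the caller must supply a header. */
static int demo_links_find(const struct demo_link_data *ld, const char *name)
{
    size_t i;

    assert(ld->hdr != NULL); /* analogue of the ASSERTION in the trace */
    for (i = 0; i < ld->hdr->count; i++)
        if (strcmp(ld->hdr->names[i], name) == 0)
            return 0;
    return -ENOENT;
}

/* Guarded caller: report "no data" instead of tripping the assertion. */
static int demo_links_find_safe(const struct demo_link_data *ld,
                                const char *name)
{
    if (ld == NULL || ld->hdr == NULL)
        return -ENODATA;
    return demo_links_find(ld, name);
}
```

With a guard of this kind, an object that has no link xattr yields an error code to the caller rather than a kernel panic.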

          Activity

            [LU-3474] MDS LBUG on unlink?

            bfaccini Bruno Faccini (Inactive) added a comment -

            Wow, I am sorry Daire. I don't know how this happened, but patch-set #3 of http://review.whamcloud.com/6676 contained a regression from patch-sets #1/#2: it did not contain the main change from patch-set #1 that must be in to prevent the LBUG. Can you give patch-set #4 a try? It should be the definitive one.

            daire Daire Byrne (Inactive) added a comment -

            I finally got around to testing the two patches, and the LBUG has returned. I patched v2.4.0:

            cd /usr/src/lustre-2.4.0/
            patch -p1 < mdd_dir.c.patch
            patch -p1 < /tmp/mdt_handler.c.patch
            ./configure
            make rpms

            Jul 3 12:36:50 bmds1 kernel: LustreError: 13174:0:(linkea.c:169:linkea_links_find()) ASSERTION( ldata->ld_leh != ((void *)0) ) failed:
            Jul 3 12:36:50 bmds1 kernel: LustreError: 13174:0:(linkea.c:169:linkea_links_find()) LBUG
            Jul 3 12:36:50 bmds1 kernel: Pid: 13174, comm: mdt01_010
            Jul 3 12:36:50 bmds1 kernel:
            Jul 3 12:36:50 bmds1 kernel: Call Trace:
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043c895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043ce97>] lbug_with_loc+0x47/0xb0 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa05a47d6>] linkea_links_find+0x186/0x190 [obdclass]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b65206>] ? mdo_xattr_get+0x26/0x30 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b68645>] mdd_linkea_prepare+0x95/0x430 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b68b01>] mdd_links_rename+0x121/0x520 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0b6cac6>] mdd_unlink+0xb86/0xe30 [mdd]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dddb88>] mdo_unlink+0x18/0x50 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0de0f30>] mdt_reint_unlink+0x820/0x1010 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0ddd881>] mdt_reint_rec+0x41/0xe0 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc2b03>] mdt_reint_internal+0x4c3/0x780 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc2e04>] mdt_reint+0x44/0xe0 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0dc7ab8>] mdt_handle_common+0x648/0x1660 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa0e01155>] mds_regular_handle+0x15/0x20 [mdt]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa071e388>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa043d5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa044ed8f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffffa07156e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
            Jul 3 12:36:50 bmds1 kernel: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071f71e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
            Jul 3 12:36:51 bmds1 kernel: [<ffffffffa071ec50>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]

            bfaccini Bruno Faccini (Inactive) added a comment -

            Hello Daniel,
            Thanks for the feedback too!
            However, patch-set #2 did not follow the error reporting rules in use, so I just pushed patch-set #3 to fix that.

            dbasabe Daniel Basabe (Inactive) added a comment -

            Hello,

            At our site, we hit this problem with version 2.4.50. It was triggered by moving a directory containing lots of files to another destination.

            I can confirm that patch-set #2 has fixed the problem.
            spitzcor Cory Spitz added a comment -

            I was mistaken, Cray has not yet tested w/6772 applied. However, 6676 ps1 did test successfully.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Thanks for the feedback, Cory. Patch-set #2 of #6676 should fix both the LBUG and the annoying (and erroneous!) messages.
            spitzcor Cory Spitz added a comment -

            Cray testing on change #6676 ps1 and 6772 shows that the changes resolve our problems with the LBUG.


            bfaccini Bruno Faccini (Inactive) added a comment -

            Just pushed a new version (patch-set #2) of change http://review.whamcloud.com/6676. On top of the original fix, it adds a few ENODATA error handling fixes to avoid unnecessary messages and to prevent an early return.

            http://review.whamcloud.com/6772 is a cosmetic patch for similar linkea_init() error handling in the mdt layer.
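
The ENODATA handling described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of change 6676: demo_result and demo_unlink are hypothetical names, read_rc stands in for the return code of a link xattr read, and the choice to log and return on other errors is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical result of an update step; not a Lustre type. */
struct demo_result {
    int rc;           /* 0 on success, negative errno on failure */
    int logged;       /* number of error messages emitted */
    int main_op_done; /* 1 if the primary operation ran */
};

/* read_rc stands in for the return code of a link xattr read. */
static struct demo_result demo_unlink(int read_rc)
{
    struct demo_result res = { 0, 0, 0 };

    assert(read_rc <= 0); /* callers pass 0 or a negative errno */

    if (read_rc == -ENODATA) {
        /* Missing xattr: nothing to update, stay quiet and carry on. */
        read_rc = 0;
    } else if (read_rc < 0) {
        /* Assumed policy for other errors: report once and stop. */
        fprintf(stderr, "link xattr read failed: rc = %d\n", read_rc);
        res.logged++;
        res.rc = read_rc;
        return res;
    }

    res.main_op_done = 1; /* the primary operation still runs */
    return res;
}
```

The point of the sketch is that a missing xattr is treated as an expected condition: no message is printed and the operation is not cut short.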

            prakash Prakash Surya (Inactive) added a comment -

            And we have a 2.1-formatted FS upgraded to Lustre 2.4 RPMs.

            amk Ann Koehler (Inactive) added a comment -

            Cray sees the bug on a file system formatted with 2.4.

            daire Daire Byrne (Inactive) added a comment -

            The filesystem was formatted using the latest v2.3 release, so many of the hardlinks would have been created under that version.

            People

              bfaccini Bruno Faccini (Inactive)
              daire Daire Byrne (Inactive)
              Votes: 0
              Watchers: 16
