Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.8.0
    • Lustre 2.7.0

    Description

      With no sign or indication (i.e., no lustre-log output or error messages), the OSS unexpectedly crashed (please see the console image).

      /var/log/messages is attached

      Attachments

        1. 23-6.png
          47 kB
        2. log.28119.gz
          388 kB
        3. lustre-logs.tgz
          0.2 kB
        4. messages13
          271 kB
        5. panda-oss-23-6_messages
          1003 kB

        Activity

          [LU-6584] OSS hit LBUG and crash

          rpwagner Rick Wagner (Inactive) added a comment -

          We've scheduled a maintenance window for Sep. 8 to roll out this latest patch after testing.

          Andreas, I'll consider changing the recordsize on some of the OSTs. The most likely scenario where we get solid information from this is if the LBUG is still hit on one of the OSSes with the changed setting. I am being a little cautious considering this, since it will mean having ZFS datasets with varying recordsizes. I don't believe the ZFS layer will care, but it's not something I've dealt with before.
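
          For reference, checking and changing the recordsize on a single OST dataset might look like the following (a sketch only; the pool/dataset names are placeholders, not the actual ones on these servers, and a recordsize change only applies to newly written blocks):

          # current recordsize on one OST dataset (placeholder names)
          zfs get recordsize oss-pool/ost0000

          # drop it back to the 128K default on that one dataset
          zfs set recordsize=128K oss-pool/ost0000

          # confirm the setting across the pool
          zfs get -r recordsize oss-pool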
          bobijam Zhenyu Xu added a comment -

          http://review.whamcloud.com/#/c/14926/ has been updated to add more remote/local buffer checks.


          adilger Andreas Dilger added a comment -

          Rick, the other possible avenue for debugging is to disable the 1MB blocksize tunable on one or more of your OST datasets and see whether this correlates with a reduction or elimination of this failure. This is one of the main deltas between your ZFS environment and those of other ZFS users, so it would allow us to isolate the memory corruption to the code handling the 1MB blocksize.

          adilger Andreas Dilger added a comment -

          Bobijam,
          can you please make a new patch that checks the contents of niobuf_remote when it is first accessed by the OST (tgt_brw_read() and tgt_brw_write()) to verify that the contents are sane, and prints out all the values under D_BUFFS debugging? If the values are incorrect, a CERROR() should be printed and an -EPROTO error returned to the client, so we can debug this problem as network corruption.

          This niobuf verification should be in a helper function that can also be called before the currently-failing LASSERT() checks (and elsewhere in the code if you think it is helpful), and those functions can return an -EIO error to the caller rather than triggering the LASSERT. At that point the client should resend the BRW RPC via brw_interpret()->osc_recoverable_error(), and hopefully it will succeed on the second try.

          While I don't think this is a proper solution, it will at least tell us if the corruption is happening on the client and/or on the network, or in memory on the OSS, and it will potentially allow debugging to continue without the high frequency of OSS failures.


          rpwagner Rick Wagner (Inactive) added a comment -

          Andreas, yes, we're using 1MB block sizes on the ZFS datasets that handle the OSTs.

          dimm Dmitry Mishin (Inactive) added a comment -

          We're using the f1512ee61e commit from the master ZFS branch (large block support). It's later than 0.6.4.1, and I had problems running with the latest master version.

          adilger Andreas Dilger added a comment -

          In the boot log I see you are using ZFS 0.6.4. It looks like there are fixes in 0.6.4.1 and 0.6.4.2 that may be helpful for this issue and for LU-6607 (reducing CPU usage on metadata-intensive workloads):
          https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.1

          Fixed io-spare.sh script for ZED.
          Fixed multiple deadlocks which might occur when reclaiming memory.
          Fixed excessive CPU usage for meta data heavy workloads when reclaiming the ARC.

          https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.4.2

          Fix panic due to corrupt nvlist when running utilities zfsonlinux/zfs#3335
          Fix hard lockup due to infinite loop in zfs_zget() zfsonlinux/zfs#3349
          Fix panic on unmount due to iput taskq zfsonlinux/zfs#3281
          Improve metadata shrinker performance on pre-3.1 kernels zfsonlinux/zfs#3501
          Linux 4.1 compat: use read_iter() / write_iter()
          Linux 3.12 compat: NUMA-aware per-superblock shrinker
          Fix spurious hung task watchdog stack traces zfsonlinux/zfs#3402
          Fix module loading in zfs import systemd service zfsonlinux/zfs#3440
          Fix intermittent libzfs_init() failure to open /dev/zfs zfsonlinux/zfs#2556

          Also, just to confirm - are these OSS nodes running with 1MB ZFS blocksize?

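          A quick way to confirm which ZFS/SPL module versions are actually loaded on an OSS (a sketch, assuming a standard zfsonlinux build):

          # version reported by the loaded kernel modules
          cat /sys/module/zfs/version /sys/module/spl/version

          # or from the module metadata / boot messages
          modinfo zfs | grep -i '^version'
          dmesg | grep -i 'ZFS: Loaded module'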

          adilger Andreas Dilger added a comment -

          If the OSS is crashing repeatedly after startup, could it mean that the bad data is arriving from the client during replay and is not being verified properly? Are there checks higher up, in the ost layer, to verify that the niobuf_remote contains valid data before it is used in the OSD? It may be that the corruption is happening on the network or on the client.
          bobijam Zhenyu Xu added a comment -

          1. Unfortunately we don't know the reason for the LBUG; we just saw that some I/O has an invalid page length, and it seems there is some specific memory corruption, so the debug patch tries to print out information about the I/O pages when something abnormal happens.

          2. You can set up kdump to capture the kernel crash dump, as described in the guide at http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes.

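          For an EL6-style OSS node, enabling kdump boils down to roughly the following (a sketch only; the crashkernel size and dump target should be tuned per node):

          # reserve memory for the capture kernel: append crashkernel=auto (or e.g. crashkernel=256M)
          # to the kernel line in /boot/grub/grub.conf, then reboot

          # enable and start the kdump service
          chkconfig kdump on
          service kdump start
          service kdump status

          # after a panic, the vmcore is written under /var/crash/<timestamp>/ by default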

          haisong Haisong Cai (Inactive) added a comment -

          Hi Zhenyu,

          We will try to apply the new debug patch, but since this is our production file system, we will have to plan for it.
          We hit this LBUG randomly, depending on how the file system is used. For example, we just had 7 hits during the last 24 hours,
          after a quiet period of about 12 days.

          So I have two questions:

          1) Do you by any chance know a way to reproduce the LBUG? The reason I ask is that we have a non-production file system that
          we could use for this debugging.
          2) About collecting the core dump: do you mean a core dump of specific Lustre processes or of the entire system? When the LBUG is hit,
          the OSS servers panic, and I want to make sure we are set up so that the core dump is saved properly.

          I would also like to remind everyone that this LBUG usually occurs 2 or 3 times consecutively: the OSS crashes, we bring it back up, it crashes again,
          we bring it up again, and sometimes a third time, before it stabilizes. Here are examples from the last 24 hours.
          Please note the server names and timestamps:

          [root@oasis-panda log]# grep -i lbug messages
          Aug 11 12:32:47 panda-oss-25-4 kernel: LustreError: 10267:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
          Aug 11 12:32:47 panda-oss-25-4 kernel: [<ffffffffa0920dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
          Aug 11 12:32:47 panda-oss-25-4 kernel: Kernel panic - not syncing: LBUG
          Aug 11 12:32:47 panda-oss-25-4 kernel: [<ffffffffa0920e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
          Aug 11 13:29:54 panda-oss-25-4 kernel: LustreError: 10206:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
          Aug 11 13:29:54 panda-oss-25-4 kernel: [<ffffffffa0927dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
          Aug 11 13:29:55 panda-oss-25-4 kernel: Kernel panic - not syncing: LBUG
          Aug 11 13:29:55 panda-oss-25-4 kernel: [<ffffffffa0927e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
          Aug 11 13:36:50 panda-oss-24-4 kernel: LustreError: 11848:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
          Aug 11 13:45:55 panda-oss-24-4 kernel: LustreError: 10229:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
          Aug 11 13:45:55 panda-oss-24-4 kernel: [<ffffffffa0923dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
          Aug 12 08:09:43 panda-oss-25-3 kernel: LustreError: 11728:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
          Aug 12 09:48:28 panda-oss-25-3 kernel: LustreError: 11232:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
          Aug 12 09:48:29 panda-oss-25-3 kernel: [<ffffffffa0919dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
          Aug 12 09:48:29 panda-oss-25-3 kernel: Kernel panic - not syncing: LBUG
          Aug 12 09:48:29 panda-oss-25-3 kernel: [<ffffffffa0919e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
          Aug 12 09:52:56 panda-oss-25-3 kernel: LustreError: 10548:0:(client.c:211:__ptlrpc_prep_bulk_page()) LBUG
          Aug 12 09:52:56 panda-oss-25-3 kernel: [<ffffffffa0916dd7>] lbug_with_loc+0x47/0xc0 [libcfs]
          Aug 12 09:52:58 panda-oss-25-3 kernel: Kernel panic - not syncing: LBUG
          Aug 12 09:52:58 panda-oss-25-3 kernel: [<ffffffffa0916e3b>] lbug_with_loc+0xab/0xc0 [libcfs]
          [root@oasis-panda log]#

          Thanks,
          Haisong

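          To make sure a vmcore will actually be saved when one of these panics hits, a quick pre-flight check on each OSS might look like this (a sketch; adjust if kdump writes somewhere other than the default /var/crash):

          # confirm crash-kernel memory is reserved and the capture kernel is loaded
          grep -o 'crashkernel=[^ ]*' /proc/cmdline
          cat /sys/kernel/kexec_crash_loaded    # should print 1 when kdump is armed

          # confirm the kdump service is running and list any dumps already captured
          service kdump status
          ls -l /var/crash/
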
          bobijam Zhenyu Xu added a comment -

          Hi Haisong,

          http://review.whamcloud.com/#/c/14926/ has been updated to include more debug information to catch the LBUG. Would you please apply it, re-hit the issue, and collect a core dump? Please also add the "ha" debug flag before re-running the test, as follows:

          lctl set_param debug="+ha" 
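
          In addition to the command above, it may help to confirm the flag took effect, enlarge the in-memory debug buffer, and (when the node survives long enough) dump the Lustre debug log after the next hit; a sketch:

          # confirm "ha" is now in the debug mask
          lctl get_param debug

          # enlarge the in-memory debug buffer so the interesting messages are not overwritten (size in MB)
          lctl set_param debug_mb=512

          # after reproducing the problem, dump the debug buffer to a file for upload
          lctl dk /tmp/lustre-debug.$(hostname).log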

          People

            Assignee: tappro Mikhail Pershin
            Reporter: haisong Haisong Cai (Inactive)
            Votes: 0
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved: