Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.7.0

    Description

      With no sign or indication (i.e. no lustre-log or error messages), the OSS unexpectedly crashed (please see the console image).

      /var/log/messages is attached

      Attachments

        1. 23-6.png
          47 kB
        2. log.28119.gz
          388 kB
        3. lustre-logs.tgz
          0.2 kB
        4. messages13
          271 kB
        5. panda-oss-23-6_messages
          1003 kB

        Activity

          [LU-6584] OSS hit LBUG and crash
          pjones Peter Jones added a comment -

          Fix landed for 2.8. We'll reopen if this issue is still hit on Hyperion. If there is still an issue at SDSC and it is not, as hoped, a duplicate of this issue, then please open a new ticket to track that issue.


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/16685/
          Subject: LU-6584 osd: prevent int type overflow in osd_read_prep()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: efe3842c76b8041a048457779554ffa5ba76567d

          tappro Mikhail Pershin added a comment -

          Rick, this particular issue exists in the I/O READ code path and is not related to LU-7106. I checked the OSD code quickly and didn't notice other similar issues at first glance.

          rpwagner Rick Wagner (Inactive) added a comment -

          Yes, we're scheduling a PM and will push this out. Could this patch be related to LU-7106? In other words, could the current code create an error that propagates back to the client as ENOSPC even when there's capacity on the OST?
          pjones Peter Jones added a comment -

          Will SDSC be able to try this patch out to confirm whether it fixes the issues that they have been experiencing?


          tappro Mikhail Pershin added a comment -

          It seems the cause of this issue is an int type overflow in lnb_rc. Instead of writing (eof - file_offset) directly into lnb_rc, we have to first check that it is not negative.
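          The overflow described above can be illustrated with a minimal standalone sketch. This is not the actual osd_read_prep() code; the function names read_len_buggy/read_len_fixed and the simplified signatures are hypothetical, kept only to show why storing a 64-bit (eof - file_offset) straight into an int goes wrong, and how clamping it first avoids that:

          ```c
          /* Hypothetical sketch of the lnb_rc overflow; not Lustre code.
           * The difference (eof - file_offset) is a 64-bit quantity, but
           * lnb_rc is an int, so a huge or negative difference is
           * truncated when assigned. The fix clamps it before the store. */
          #include <assert.h>
          #include <stdint.h>

          typedef int64_t off64;            /* stands in for kernel loff_t */

          /* buggy: stores the raw 64-bit difference into an int */
          static int read_len_buggy(off64 eof, off64 file_offset, int requested)
          {
              int rc = (int)(eof - file_offset);   /* may truncate/overflow */
              if (rc > requested)
                  rc = requested;
              return rc;
          }

          /* fixed: clamp the 64-bit value before converting to int */
          static int read_len_fixed(off64 eof, off64 file_offset, int requested)
          {
              off64 diff = eof - file_offset;
              if (diff < 0)
                  diff = 0;                  /* offset past EOF: nothing to read */
              if (diff > requested)
                  diff = requested;
              return (int)diff;              /* now guaranteed to fit in an int */
          }

          int main(void)
          {
              /* An EOF far beyond the read offset: the ~8 EiB difference
               * does not fit in a 32-bit int. */
              off64 eof = INT64_MAX;
              off64 off = 0;

              assert(read_len_fixed(eof, off, 4096) == 4096); /* sane length */
              assert(read_len_buggy(eof, off, 4096) < 0);     /* truncated negative */
              assert(read_len_fixed(100, 200, 4096) == 0);    /* past EOF clamps to 0 */
              return 0;
          }
          ```

          The key point is that the comparison and clamping must happen in the 64-bit type; once the value has been narrowed to int, the sign and magnitude information is already lost.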

          gerrit Gerrit Updater added a comment -

          Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/16685
          Subject: LU-6584 osd: prevent int type overflow in osd_read_prep()
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 687338302147dad5b09b964b8615a3b3adb78a7d

          rpwagner Rick Wagner (Inactive) added a comment -

          Hi Andreas, since our last update to the code tree based on http://review.whamcloud.com/#/c/14926/ we've been stable. It's possible that we pulled in a bugfix along with the debugging patch, although I couldn't point to a specific one.

          We are looking at ZFS 0.6.5 to get away from the unreleased version of ZFS we've had to run. I would probably do that along with another rebase to a later unpatched tag of Lustre, maybe once LU-4865 is included.

          On a related note, I think this issue could be removed from the 2.8 blocker list, since we started with patched versions of Lustre and ZFS.

          adilger Andreas Dilger added a comment -

          Hi Rick, any news on this front? Have you looked into upgrading to ZFS 0.6.5 to get the native large block support? The patch http://review.whamcloud.com/15127 "LU-4865 zfs: grow block size by write pattern" should also help performance when dealing with files under 1MB in size.

          rpwagner Rick Wagner (Inactive) added a comment -

          We've scheduled a maintenance window for Sep. 8 to roll out this latest patch after testing.

          Andreas, I'll consider changing the recordsize on some of the OSTs. The most likely scenario in which we get solid information from this is if the LBUG is still hit on one of the OSSes with the changed setting. I am being a little cautious about this since it will mean having ZFS datasets with varying recordsizes. I don't believe the ZFS layer will care, but it's not something I've dealt with before.

          People

            Assignee: tappro Mikhail Pershin
            Reporter: haisong Haisong Cai (Inactive)
            Votes: 0
            Watchers: 15

            Dates

              Created:
              Updated:
              Resolved: