[LU-7510] (vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version/s: Lustre 2.5.3
    • Environment: Servers and clients: 2.5.4-11chaos-11chaos--PRISTINE-2.6.32-573.7.1.1chaos.ch5.4.x86_64, ZFS back end
    • Severity: 3

Description

We have some production applications and rsync processes failing writes with ENOSPC errors, on the ZFS-backed file system only. It is currently at ~79% full. There are no server-side errors; the -28 errors shown above appear in the client logs.

I see that LU-3522 and LU-2049 may have a bearing on this issue. Is there a 2.5 backport or an equivalent fix available?
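For anyone triaging the same symptom, a minimal sketch of the client- and server-side checks that distinguish a genuinely full OST from grant exhaustion; the mount point is a placeholder, and note that tot_granted is reported in bytes while kbytesfree is in KB:

    # Client: free space as the filesystem reports it, and the write grant
    # each OSC currently holds.
    lfs df -h /mnt/lustre
    lctl get_param osc.*.cur_grant_bytes

    # Each OSS: free space per OST versus the total grant handed out to
    # clients; grant approaching free space is the usual precursor to
    # client-side -28 even though the OSTs are not actually full.
    lctl get_param obdfilter.*.kbytesfree obdfilter.*.kbytesavail
    lctl get_param obdfilter.*.tot_granted obdfilter.*.tot_dirty obdfilter.*.tot_pending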

Attachments

    1. lu-7510-lbug.txt (14 kB)
    2. zfs.lfs-out.12.02 (10 kB)
    3. zfs.tot_granted.12.02 (3 kB)


Activity


jfc John Fuchs-Chesney (Inactive) added a comment -

Thanks Ruth.

~ jfc.

ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

The file system usage has been reduced to ~70%, and we haven't seen -28 issues or LBUGs since then.

You can close this one; we'll consider the fix for the -28 issues to be an upgrade to Lustre 2.8 on the servers at some point in the future.

If the LBUG recurs, I'll open a new ticket.

Thanks,
Ruth

utopiabound Nathaniel Clark added a comment -

The LBUG in question hasn't been changed, though the grant code has been reworked (à la LU-2049) upstream. The negative grant resulting in the LBUG should be a separate bug, though it's probably 2.5-only.

ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

And a specific question: is the LBUG likely addressed by changes upstream, or should this be a separate ticket?

ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

Nearly all OSS nodes on this file system became inaccessible yesterday; 3 of them showed the LBUG at ofd_grant.c:352:ofd_grant_incoming with negative grant values. I disabled the automated grant-release workaround in case it is related to this occurrence. The OSTs are 77-79% full at the moment. After that, another OSS went down with the same LBUG.

This coincides with the addition of a new cluster, but we haven't done any I/O from it so far, just mounting. Any advice/thoughts?
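Not from the ticket itself, but a minimal sketch of the kind of per-OSS polling that can catch a runaway tot_granted before ofd_grant_incoming() trips its assertion; the interval and log path below are arbitrary choices, not site settings:

    # Record total grant and free space per OST every 10 minutes, with a
    # timestamp, so a sudden jump (or grant exceeding free space) is visible.
    while true; do
        echo "== $(date -Is) =="
        lctl get_param obdfilter.*.tot_granted obdfilter.*.kbytesfree
        sleep 600
    done >> /var/log/lustre-grant-watch.log 2>&1 &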

ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

Each of the OSTs has shown a couple of decreases, in the 3.8-3.9 TB range.

ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

After deactivating the OSTs on that node, the rate of increase is slower, but the grant there is still much larger than on all the others and is not decreasing so far, sitting at about ~3.7 TB.
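For reference, deactivating OSTs on a 2.5-era system is normally done from the MDS as below; the device name is an example placeholder, not one of the actual targets in this filesystem:

    # On the MDS: list the OSC devices the MDT uses, then stop new object
    # allocation on the chosen OST.
    lctl dl | grep osc
    lctl --device <fsname>-OST0004-osc-MDT0000 deactivate

    # To resume allocations on that OST later:
    lctl --device <fsname>-OST0004-osc-MDT0000 activate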

ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

There is nothing prior to the LBUG. Here are the traces.

The ofd code, at least, does not differ between the 11chaos and 12chaos versions, as far as I can see.

utopiabound Nathaniel Clark added a comment -

FYI: I don't have an exact version for 2.5.4-11chaos (12chaos and 4chaos are tagged in our system, so I have a good idea).

Do you have any logging leading up to the LBUG, by any chance?
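A minimal sketch of how some debug history could be kept around for the next occurrence; the mask, buffer size, and output path are illustrative, not values taken from this site:

    # On each OSS: widen the debug mask slightly and enlarge the in-memory
    # debug buffer so there is context to dump when an LBUG happens.
    lctl set_param debug="+info +warning"
    lctl set_param debug_mb=256

    # After an incident (or just before rebooting a wedged node), dump the
    # kernel debug buffer to a file for attachment to the ticket.
    lctl dk /tmp/lustre-debug.$(hostname).$(date +%s)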

ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment -

It turns out that grant release does work, even on the problem node, once the grant on the problem server reaches ~4.9 TB. It decreased to ~4.0 TB over the course of a day before the LBUG on 2 different targets. The other servers respond to grant release at levels as low as 1.4 TB. The usage levels are similar, with all OSTs between 75-80% full. The only difference I can find is that on the other zpools the last item in the history is the activation of compression back in March, so this one server was rebooted after that compression activation and all the rest were not. I'm wondering whether the size computation is affected by compression being on or off? All zpools are reporting compression ratios of 1.03-1.05.
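A quick way to compare compression state and pool history across the OSS nodes (the pool name is a placeholder):

    # Compression setting and achieved ratio for the pool and its datasets.
    zfs get -r compression,compressratio <pool>

    # When compression was enabled, and whether anything (e.g. a reimport on
    # reboot) was recorded afterwards.
    zpool history <pool> | tail -n 20

    # Raw capacity view, for cross-checking the ~75-80% usage figures.
    zpool list -o name,size,alloc,free,cap <pool>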

People

    Assignee: utopiabound Nathaniel Clark
    Reporter: ruth.klundt@gmail.com Ruth Klundt (Inactive)
    Votes: 0
    Watchers: 7
