LU-7510

(vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Environment: Servers and clients: 2.5.4-11chaos-11chaos--PRISTINE-2.6.32-573.7.1.1chaos.ch5.4.x86_64,
      ZFS back end

    Description

      We have some production apps and rsync processes failing writes with ENOSPC errors on the ZFS-backed FS only. That filesystem is currently at ~79% usage. There are no server-side errors; -28 errors like the one above appear in the client logs.

      I see that LU-3522 and LU-2049 may have a bearing on this issue. Is there a 2.5 backport or equivalent fix available?

      Attachments

        1. lu-7510-lbug.txt
          14 kB
        2. zfs.lfs-out.12.02
          10 kB
        3. zfs.tot_granted.12.02
          3 kB


          Activity

            [LU-7510] (vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28

            We are running the workaround to release grant on the clients from the epilog (i.e., after each job), just to proactively keep that under control. We have not seen ENOSPC errors in the logs again, and usage has hit the 80% mark several times since.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment
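
            A minimal sketch of what such an epilog step might look like, assuming the workaround is the cur_grant_bytes release discussed further down in this ticket; the script framing is hypothetical and site-specific:

              #!/bin/bash
              # Hypothetical job epilog snippet: ask every OSC on this client to
              # return most of its grant to the OSTs once the job has finished.
              # The 2M value is the example used elsewhere in this ticket.
              lctl set_param osc.*.cur_grant_bytes=2M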

            John,
            FYI...We've been hitting this at LLNL the last week or so. I'll note it on LU-2049 as well.

            charr Cameron Harr added a comment

            We are resolving this as a duplicate.

            Ruth – if the problem recurs and you need more help please either ask us to reopen this ticket, or open a new one, as you prefer.

            Thanks,
            ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added a comment

            Thanks, much appreciated. I'll keep an eye on those messages; they have not resumed since the usage went down.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            Ruth, the cur_grant_bytes command to release grants is something that you can try as a workaround if the -ENOSPC errors are being hit again. It doesn't hurt to run this now, or occasionally, though it may cause a very brief hiccup in IO performance as the grant is released. The main reason this isn't useful to do (much) in advance of the problem is that this command asks clients to try to return their grant to the server, but if the server isn't low on space it will just return the grant back to the client.

            The real fix for this problem is indeed the LU-2049 patch. The reason you see this problem when LLNL does not is that you have many more clients connected directly to the filesystem (6500 vs 768) and their OSTs are 72TB vs 30TB, so they wouldn't hit this until they reach 99% full.

            We are working to get the LU-2049 patch landed to resolve this issue permanently.

            adilger Andreas Dilger added a comment
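
            A small sketch of how the effect of that release can be observed from a single client, following the mechanism described above; osc.*.cur_grant_bytes is assumed here to be readable as well as writable:

              # Current grant held by each OSC on this client
              lctl get_param osc.*.cur_grant_bytes

              # Ask the OSCs to shrink their grant (the workaround)
              lctl set_param osc.*.cur_grant_bytes=2M

              # Re-read: if the OSTs are not actually short on space, the servers
              # may simply hand the grant back, so the values can grow again
              lctl get_param osc.*.cur_grant_bytes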

            The max_pages_per_rpc value is the default of 256, and the ZFS recordsize is 128K. We have 3 OSTs on each OSS rather than just one. We have ~6500 clients mounting the file system.

            We requested that some of the heavy users clean up, so the FS is at 75% now. Also moved a couple of affected users to the other (ldiskfs) file system.

            No messages so far today. I will go ahead and release some grant if you think it's still necessary or beneficial.

            I was guessing a combination of the bug + heavy user activity + high fs usage may have triggered this. Our FS usage tends to run high around here.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment
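
            For reference, a hedged sketch of where those figures can be read back (client RPC size, the OST dataset recordsize, and a rough client count on the servers); the dataset name is taken from the example in the comment below, and num_exports is assumed to be available on this Lustre version:

              # On a client: RPC size in pages (the default of 256 pages = 1MB RPCs
              # with 4KB pages)
              lctl get_param osc.*.max_pages_per_rpc

              # On an OSS: recordsize of an OST dataset (name is a placeholder)
              zfs get recordsize lustre/lustre-OST0000

              # On the servers: number of connected exports (clients) per OST
              lctl get_param obdfilter.*.num_exports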
            adilger Andreas Dilger added a comment - edited

            Ruth, could you please post the output of "lfs df" and "lfs df -i" on your filesystem(s). On the OSS nodes, could you please collect "lctl get_param obdfilter.*.tot_granted" to see if this is the actual cause of the ENOSPC errors. Also, how many clients are connected to the filesystem?

            One potential workaround is to release some of the grant from the clients using "lctl set_param osc.*.cur_grant_bytes=2M" and then check "lctl get_param obdfilter.*.tot_granted" on the OSS nodes again to see if the total grant space has been reduced.

            Have you modified the client's maximum RPC size (via lctl set_param osc.*.max_pages_per_rpc=4M, e.g. to have 4MB RPC size), or the ZFS maximum blocksize (via zfs set recordsize=1048576 lustre/lustre-OST0000 or similar)? That will aggravate this problem until the patch from LU-3522 is landed.
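
            Putting the server-side checks together: the question is whether the grant already promised to clients is approaching the space still free on each OST, in which case new writes fail with -28/ENOSPC even though the OSTs do not look full. tot_granted is the parameter quoted above; kbytesavail is an assumed companion parameter name:

              # On each OSS: total space already granted to clients, per OST
              lctl get_param obdfilter.*.tot_granted

              # Space still available on each OST (assumed parameter name); compare
              # with tot_granted to see whether grant exhaustion explains the -28s
              lctl get_param obdfilter.*.kbytesavail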


            People

              Assignee: Nathaniel Clark (utopiabound)
              Reporter: Ruth Klundt (Inactive) (ruth.klundt@gmail.com)
              Votes: 0
              Watchers: 7
