LU-7510

(vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Environment: Servers and clients: 2.5.4-11chaos-11chaos--PRISTINE-2.6.32-573.7.1.1chaos.ch5.4.x86_64,
      ZFS back end

    Description

      We have some production apps and rsync processes failing writes with ENOSPC errors on the ZFS-backed FS only. That filesystem is currently at ~79% usage. There are no server-side errors; -28 errors like the one above appear in the client logs.

      I see that LU-3522 and LU-2049 may have a bearing on this issue. Is there a 2.5 backport or equivalent fix available?

      Attachments

        1. lu-7510-lbug.txt
          14 kB
        2. zfs.lfs-out.12.02
          10 kB
        3. zfs.tot_granted.12.02
          3 kB


          Activity

            [LU-7510] (vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28

            We are running the workaround to release grant on the clients from the epilog (i.e., after each job), just to proactively keep that under control. We have not seen ENOSPC errors in the logs again, and usage has hit the 80% mark several times since.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment
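
            A minimal sketch of what such an epilog step might look like, assuming the workaround is the cur_grant_bytes release discussed further down in this ticket; the script framing is hypothetical and site-specific:

              #!/bin/bash
              # Hypothetical job epilog snippet: ask every OSC on this client to
              # return most of its grant to the OSTs once the job has finished.
              # The 2M value is the example used elsewhere in this ticket.
              lctl set_param osc.*.cur_grant_bytes=2M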

            John,
            FYI...We've been hitting this at LLNL the last week or so. I'll note it on LU-2049 as well.

            charr Cameron Harr added a comment

            We are resolving this as a duplicate.

            Ruth – if the problem recurs and you need more help please either ask us to reopen this ticket, or open a new one, as you prefer.

            Thanks,
            ~ jfc.

            jfc John Fuchs-Chesney (Inactive) added a comment

            Thanks, much appreciated. I'll keep an eye on those messages; they have not resumed since the usage went down.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            Ruth, the cur_grant_bytes command to release grants is something that you can try as a workaround if the -ENOSPC errors are being hit again. It doesn't hurt to run this now, or occasionally, though it may cause a very brief hiccup in IO performance as the grant is released. The main reason this isn't useful to do (much) in advance of the problem is that this command asks clients to try to return their grant to the server, but if the server isn't low on space it will just return the grant back to the client.

            The real fix for this problem is indeed the LU-2049 patch. The reason you see this problem when LLNL does not is that you have many more clients connected directly to the filesystem (6500 vs 768) and their OSTs are 72TB vs 30TB, so they wouldn't hit this until they reach 99% full.

            We are working to get the LU-2049 patch landed to resolve this issue permanently.

            adilger Andreas Dilger added a comment
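
            A small sketch of how the effect of that release can be observed from a single client, following the mechanism described above; osc.*.cur_grant_bytes is assumed here to be readable as well as writable:

              # Current grant held by each OSC on this client
              lctl get_param osc.*.cur_grant_bytes

              # Ask the OSCs to shrink their grant (the workaround)
              lctl set_param osc.*.cur_grant_bytes=2M

              # Re-read: if the OSTs are not actually short on space, the servers
              # may simply hand the grant back, so the values can grow again
              lctl get_param osc.*.cur_grant_bytes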

            The max_pages_per_rpc value is the default of 256, and the ZFS recordsize is 128K. We have 3 OSTs on each OSS rather than just one. We have ~6500 clients mounting the file system.

            We requested that some of the heavy users clean up, so the FS is at 75% now. Also moved a couple of affected users to the other (ldiskfs) file system.

            No messages so far today. I will go ahead and release some grant if you think it's still necessary or beneficial.

            I was guessing a combination of the bug + heavy user activity + high fs usage may have triggered this. Our FS usage tends to run high around here.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment
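
            For reference, a hedged sketch of where those figures can be read back (client RPC size, the OST dataset recordsize, and a rough client count on the servers); the dataset name is taken from the example in the comment below, and num_exports is assumed to be available on this Lustre version:

              # On a client: RPC size in pages (the default of 256 pages = 1MB RPCs
              # with 4KB pages)
              lctl get_param osc.*.max_pages_per_rpc

              # On an OSS: recordsize of an OST dataset (name is a placeholder)
              zfs get recordsize lustre/lustre-OST0000

              # On the servers: number of connected exports (clients) per OST
              lctl get_param obdfilter.*.num_exports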
            adilger Andreas Dilger added a comment - edited

            Ruth, could you please post the output of "lfs df" and "lfs df -i" on your filesystem(s). On the OSS nodes, could you please collect "lctl get_param obdfilter.*.tot_granted" to see if this is the actual cause of the ENOSPC errors. Also, how many clients are connected to the filesystem?

            One potential workaround is to release some of the grant from the clients using "lctl set_param osc.*.cur_grant_bytes=2M" and then check "lctl get_param obdfilter.*.tot_granted" on the OSS nodes again to see if the total grant space has been reduced.

            Have you modified the client's maximum RPC size (via lctl set_param osc.*.max_pages_per_rpc=4M, e.g. to have 4MB RPC size), or the ZFS maximum blocksize (via zfs set recordsize=1048576 lustre/lustre-OST0000 or similar)? That will aggravate this problem until the patch from LU-3522 is landed.
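
            Putting the server-side checks together: the question is whether the grant already promised to clients is approaching the space still free on each OST, in which case new writes fail with -28/ENOSPC even though the OSTs do not look full. tot_granted is the parameter quoted above; kbytesavail is an assumed companion parameter name:

              # On each OSS: total space already granted to clients, per OST
              lctl get_param obdfilter.*.tot_granted

              # Space still available on each OST (assumed parameter name); compare
              # with tot_granted to see whether grant exhaustion explains the -28s
              lctl get_param obdfilter.*.kbytesavail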


            People

              Assignee: Nathaniel Clark (utopiabound)
              Reporter: Ruth Klundt (Inactive) (ruth.klundt@gmail.com)
              Votes: 0
              Watchers: 7
