LU-7510: (vvp_io.c:1088:vvp_io_commit_write()) Write page 962977 of inode ffff880fbea44b78 failed -28

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version: Lustre 2.5.3
    • Environment: Servers and clients: 2.5.4-11chaos-11chaos--PRISTINE-2.6.32-573.7.1.1chaos.ch5.4.x86_64,
      ZFS back end

    Description

      We have some production apps and rsync processes failing writes with ENOSPC (-28) errors on the ZFS-backed file system only; it is currently at ~79% full. There are no server-side errors; the -28 errors as above appear only in the client logs.

      I see that LU-3522 and LU-2049 may have a bearing on this issue; is there a 2.5 backport or equivalent fix available?
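
      For reference, a minimal sketch of how the per-OST fullness and the client-side -28 errors can be checked (the mount point /mnt/fscratch is an assumption; substitute the real one):

        # show per-OST usage so the fullest targets stand out
        lfs df -h /mnt/fscratch
        # look for the vvp_io_commit_write() -28 messages in the client kernel log
        dmesg | grep 'failed -28'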

      Attachments

        1. lu-7510-lbug.txt
          14 kB
        2. zfs.lfs-out.12.02
          10 kB
        3. zfs.tot_granted.12.02
          3 kB


          Activity


            The file system usage has been reduced to ~70%, and we haven't seen -28 issues or LBUGs since then.

            You can close this one; we'll consider the fix for the -28 issues to be an upgrade to Lustre 2.8 on the servers at some point in the future.

            If the LBUG re-occurs I'll open a new ticket.

            Thanks,
            Ruth

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            The LBUG in question hasn't been changed, though the grant code has been reworked (a la LU-2049) upstream. The negative grant resulting in the LBUG should be a separate bug, though it's probably 2.5-only.

            utopiabound Nathaniel Clark added a comment

            And a specific question: is the LBUG likely addressed by changes upstream, or should this be a separate ticket?

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            Nearly all OSS nodes on this file system became inaccessible yesterday; 3 of them showed the LBUG at ofd_grant.c:352:ofd_grant_incoming() with negative grant values. I disabled the automated grant release workaround in case it is related to this occurrence. The OSTs are 77-79% full at the moment. After that, another OSS went down with the same LBUG.

            This coincides with the addition of a new cluster, but we haven't done any I/O from it so far, just mounting. Any advice/thoughts?

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment
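
            For reference, a minimal sketch, assuming the 2.5-era obdfilter proc names, of how the server-side grant accounting can be inspected on an OSS:

                # total grant, dirty and pending bytes tracked per OST on this OSS
                lctl get_param obdfilter.fscratch-OST*.tot_granted
                lctl get_param obdfilter.fscratch-OST*.tot_dirty
                lctl get_param obdfilter.fscratch-OST*.tot_pending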

            Each of the OSTs has shown a couple of decreases, in the 3.8-3.9T range.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            After deactivating the OSTs on that node, the rate of increase is slower, but the grant is still much larger than on all the others and has not decreased so far, holding at about ~3.7T.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            There is nothing prior to the LBUG. Here are the traces.

            The ofd code at least does not differ between the 11chaos and 12chaos versions, as far as I can see.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment

            FYI: I don't have an exact version for 2.5.4-11chaos (12chaos and 4chaos are tagged in our system, so I have a good idea).

            Do you have any logging leading up to the LBUG, by any chance?

            utopiabound Nathaniel Clark added a comment
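
            In case it helps, a minimal sketch of how logging leading up to a future LBUG could be captured, assuming the standard lctl debug facilities:

                # widen the debug mask on the OSS so cache/grant activity is recorded
                lctl set_param debug=+cache
                # after an LBUG (if the node stays up), dump the in-memory debug buffer
                lctl dk /tmp/lu-7510-debug.log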

            It turns out that the grant release does work, even on the problem node, once the grant on the problem server reaches ~4.9T. It decreased to ~4.0T over the course of a day before the LBUG hit on 2 different targets. The other servers respond to grant release at levels as low as 1.4T. The usage levels are similar, with all OSTs between 75-80% full. The only difference I can find is that on the other zpools the last item in the pool history is the activation of compression back in March, so this one server was rebooted after compression was activated and all the rest were not. Wondering if the size computation is affected by whether compression is on or off? All zpools are reporting 1.03-1.05 compression ratios.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment
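
            A minimal sketch of how the compression state and pool history could be compared across servers (the dataset/pool name fscratch-ost0 is hypothetical):

                # compression setting and achieved ratio for the dataset backing the OST
                zfs get compression,compressratio fscratch-ost0
                # pool history, to see when compression was enabled relative to reboots
                zpool history fscratch-ost0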

            Yesterday we had a server go down with this LBUG:

            LustreError: 8117:0:(ofd_grant.c:352:ofd_grant_incoming()) fscratch-OST001d: cli 8c1795e2-8806-4e65-5865-4e42489eac9b/ffff8807dee68400 dirty 33554432 pend 0 grant -54657024
            LustreError: 8117:0:(ofd_grant.c:354:ofd_grant_incoming()) LBUG

            The client-side grant release doesn't seem to be taking effect for that node. The value of tot_granted on the server side had increased to ~5x10^12 on that node only; all the other nodes had values of ~2x10^12.

            This node has gone down several times in the last few weeks; this is the first time we got any log messages before it died. It seems we should also deactivate those OSTs and increase the priority of upgrading the servers, although it appears a 2.5 upgrade path is not yet available (looking at LU-2049).

            But I'm puzzled as to why the grant release doesn't work for just that node. None of the others have been rebooted during this time period since we started the workaround.

            ruth.klundt@gmail.com Ruth Klundt (Inactive) added a comment
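
            The ticket does not show the workaround itself; as an assumption, a client-side grant release of the kind described might look roughly like the sketch below. Whether cur_grant_bytes is writable on a given 2.5 client is itself an assumption:

                # current grant held by this client for each OST (bytes)
                lctl get_param osc.fscratch-OST*.cur_grant_bytes
                # assumed workaround: ask the client to shrink its grant for one OST back to ~32 MB
                lctl set_param osc.fscratch-OST001d-osc-*.cur_grant_bytes=33554432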
            charr Cameron Harr added a comment -

            Ruth,
            Thanks for letting us know you went down the grant release route. I had noticed that 32 of our 80 OSTs were ~90% full (the others ~65%), so I deactivated those 32 fuller OSTs, and that seems to have resolved the problem for now.

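            A minimal sketch, assuming the usual MDS-side procedure, of how a nearly full OST can be taken out of new-object allocation (the device name is illustrative, based on the target in the log above):

                # on the MDS: stop allocating new objects on the full OST
                lctl --device fscratch-OST001d-osc-MDT0000 deactivate
                # later, once space has been freed, re-enable it
                lctl --device fscratch-OST001d-osc-MDT0000 activate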

            People

              utopiabound Nathaniel Clark
              ruth.klundt@gmail.com Ruth Klundt (Inactive)
              Votes: 0
              Watchers: 7
