[LU-8480] Server syslog: ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 69347328 != fo_tot_granted 102901760 Created: 05/Aug/16 Updated: 15/Jun/17 Resolved: 29/Aug/16
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Nathan Dauchy (Inactive) | Assignee: | Jian Yu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Mostly CentOS-6.8, with 6.7 kernel 2.6.32_573.26.1.el6 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
Error message seen in the system log on at least one production server:

May 20 18:38:58 nbp1-oss1 kernel: LustreError: 38115:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 98847872 != fo_tot_granted 100945024

Reproduced on a test system (running lustre-2.7.2-2nasS_mofed32v1.el67.20160517v2) by mounting just 8 clients, running an IOR write from one, an IOR read from another, and mdtest on the other 6. Upon unmounting all 8 clients simultaneously, the message appears on each of the two OSS nodes:

Aug 5 08:06:36 service320 kernel: LustreError: 76476:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 69347328 != fo_tot_granted 102901760
Aug 5 08:06:36 service323 kernel: LustreError: 76171:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 69347328 != fo_tot_granted 102901760

What does this message mean, and is it cause for major concern? We are testing in preparation for upgrading all remaining production file systems on 8/15, so we need to know whether to proceed ASAP.
| Comments |
| Comment by Andreas Dilger [ 08/Aug/16 ] |
It looks like the delta between the grant amounts is about 32MB, which is the amount of grant held by a single client after it has done some writes to the OST. It looks like there is a race window between traversing all of the client exports to accumulate their grant and updating the running total that is kept for the whole OST when a client is being unmounted, and it can be triggered when multiple clients are unmounted at the same time. This doesn't seem harmful in itself, and for systems with more than 100 clients this check is skipped because it slows down the unmount too much (it is O(n^2) to accumulate the per-export grants as each client unmounts). It looks like there may be a very simple fix, for which I can push a patch if you could give it a test.
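To make the race easier to picture, here is a minimal sketch of that kind of consistency check: sum the grant held by every connected export and compare it against the running total kept for the whole OST. All names and values below (export_stub, sum_export_grants, grant_sanity_check, the 32MB grants) are illustrative assumptions, not the actual ofd_grant_sanity_check() code in ofd_grant.c.

```c
#include <stdint.h>
#include <stdio.h>

struct export_stub {
	uint64_t grant;            /* grant currently held by this client */
	struct export_stub *next;  /* next export in the server's list */
};

/* Walk every connected export and total the grant each one holds.
 * The walk is O(n), so running it once per disconnecting client makes
 * the overall check O(n^2), which is why it is skipped on servers with
 * many clients. */
static uint64_t sum_export_grants(const struct export_stub *exp)
{
	uint64_t total = 0;

	for (; exp != NULL; exp = exp->next)
		total += exp->grant;
	return total;
}

/* Compare the per-export sum against the running total the server keeps.
 * If a concurrently disconnecting client has already been dropped from
 * one counter but not yet from the other, the two values briefly
 * disagree and a message like the one in this ticket is logged. */
static void grant_sanity_check(const struct export_stub *exports,
			       uint64_t tot_granted, const char *caller)
{
	uint64_t summed = sum_export_grants(exports);

	if (summed != tot_granted)
		fprintf(stderr, "%s: tot_granted %llu != fo_tot_granted %llu\n",
			caller, (unsigned long long)summed,
			(unsigned long long)tot_granted);
}

int main(void)
{
	/* Two clients each still hold 32MB of grant, but the running
	 * total still reflects a third client whose disconnect is in
	 * flight: the check reports a mismatch of one client's grant. */
	struct export_stub b = { 32 << 20, NULL };
	struct export_stub a = { 32 << 20, &b };

	grant_sanity_check(&a, (uint64_t)3 * (32 << 20), "ofd_obd_disconnect");
	return 0;
}
```

In this sketch the mismatch is exactly one client's worth of grant, matching the roughly 32MB delta described above; once the in-flight disconnect finishes, the two counters agree again.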
| Comment by Gerrit Updater [ 08/Aug/16 ] |
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/21813
| Comment by Nathan Dauchy (Inactive) [ 08/Aug/16 ] |
Thanks for the quick review. Since this shouldn't be harmful, and is skipped for more than 100 clients anyway, we may not rush to get a rebuild into testing prior to the upgrade planned for next week. Will test as soon as time permits though. NOTE: we will need a backport to 2.7.2.
| Comment by Peter Jones [ 08/Aug/16 ] |
Jian, could you please port this fix to 2.7 FE once it has landed to master? Thanks, Peter
| Comment by Jian Yu [ 09/Aug/16 ] |
Here is the back-ported patch for Lustre 2.7 FE: http://review.whamcloud.com/22018
| Comment by Nathan Dauchy (Inactive) [ 23/Aug/16 ] |
For some reason, I am unable to view the backported patch. Is the link correct, or is there perhaps a permissions issue with my gerrit account?
| Comment by Peter Jones [ 23/Aug/16 ] |
Nathan, it must be the latter. Please send me an email about this issue and I'll get it sorted out. Peter
| Comment by Gerrit Updater [ 29/Aug/16 ] |
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21813/
| Comment by Peter Jones [ 29/Aug/16 ] |
Landed for 2.9
| Comment by Nathan Dauchy (Inactive) [ 29/Aug/16 ] |
Please re-open until the backport patch lands to 2.7 FE.