[LU-8480] Server syslog: ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 69347328 != fo_tot_granted 102901760
Created: 05/Aug/16  Updated: 15/Jun/17  Resolved: 29/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.9.0

Type: Bug
Priority: Major
Reporter: Nathan Dauchy (Inactive)
Assignee: Jian Yu
Resolution: Fixed
Votes: 0
Labels: None
Environment:

Mostly CentOS-6.8, with 6.7 kernel 2.6.32_573.26.1.el6
Lustre 2.7.2-2nasS_mofed32v1.el67.20160517v2


Issue Links:
Duplicate
Related
Severity: 3

 Description   

Error message seen in system log on at least one production server:
(running lustre-2.7.2-1.1nasS_mofed32v1)

May 20 18:38:58 nbp1-oss1 kernel: LustreError: 38115:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 98847872 != fo_tot_granted 100945024

Reproduced on a test system (running lustre-2.7.2-2nasS_mofed32v1.el67.20160517v2) by mounting just 8 clients and running an IOR write from one, an IOR read from another, and mdtest on the other 6. Upon unmounting all 8 clients simultaneously, the following messages appear on each of the two OSS nodes:

Aug  5 08:06:36 service320 kernel: LustreError: 76476:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 69347328 != fo_tot_granted 102901760
Aug  5 08:06:36 service323 kernel: LustreError: 76171:0:(ofd_grant.c:183:ofd_grant_sanity_check()) ofd_obd_disconnect: tot_granted 69347328 != fo_tot_granted 102901760
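
To quantify the mismatch in the messages above, the two counters can be pulled out of each log line and subtracted. The following is a minimal Python sketch, not part of Lustre; the helper name grant_delta and the regular expression are illustrative assumptions based only on the message format shown in this ticket.

```python
import re

# Matches the two counters printed by the sanity-check message.
GRANT_RE = re.compile(r"tot_granted (\d+) != fo_tot_granted (\d+)")

# Message tails copied from the log lines quoted in this ticket.
PROD_LINE = "ofd_obd_disconnect: tot_granted 98847872 != fo_tot_granted 100945024"
TEST_LINE = "ofd_obd_disconnect: tot_granted 69347328 != fo_tot_granted 102901760"


def grant_delta(line):
    """Return fo_tot_granted - tot_granted in bytes, or None if no match."""
    m = GRANT_RE.search(line)
    if m is None:
        return None
    tot_granted, fo_tot_granted = map(int, m.groups())
    return fo_tot_granted - tot_granted


for name, line in (("production", PROD_LINE), ("test", TEST_LINE)):
    delta = grant_delta(line)
    print(f"{name}: {delta} bytes ({delta / (1 << 20):g} MiB)")
```

For the test-system lines the difference is 33554432 bytes (32 MiB); for the production line it is 2097152 bytes (2 MiB).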

What does this message mean, and is it cause for major concern?

We are testing in preparation for upgrading all remaining production file systems on 8/15, so we need to know as soon as possible whether to proceed.



 Comments   
Comment by Andreas Dilger [ 08/Aug/16 ]

The delta between the grant amounts in the test-system messages is 32MB, which is the amount of grant held by a single client after it has done some writes to the OST. It looks like there is a race window, while a client is being unmounted, between traversing all of the client exports to accumulate their grant and reading the running total that is kept for the whole OST; it can be triggered when multiple clients are unmounted at the same time. This doesn't seem harmful in itself, and on systems with more than 100 clients the check is skipped because it slows down the unmount too much (accumulating the per-export grants as each client unmounts is O(n^2)).
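
The race described above can be illustrated with a toy model. This is a hedged Python sketch, not the Lustre code: the class OstGrantModel, its methods, and the interleave hook are hypothetical, the lock merely stands in for obd_dev_lock, and the ordering of the two reads is an assumption chosen so that the mismatch has the same sign as in the logged messages.

```python
import threading

MIB = 1 << 20


class OstGrantModel:
    """Toy model: per-export grants plus a running total for one OST."""

    def __init__(self, grants):
        self.lock = threading.Lock()  # hypothetical stand-in for obd_dev_lock
        self.exports = dict(grants)   # client name -> granted bytes
        self.fo_tot_granted = sum(self.exports.values())  # running total

    def disconnect(self, client):
        # Remove one export and subtract its grant from the running total.
        with self.lock:
            self.fo_tot_granted -= self.exports.pop(client)

    def check_racy(self, interleave=None):
        # Read the running total outside the lock (assumed ordering) ...
        running = self.fo_tot_granted
        if interleave is not None:
            interleave()  # another client finishes unmounting in this window
        # ... then walk the exports under the lock.
        with self.lock:
            tot_granted = sum(self.exports.values())
        return tot_granted, running

    def check_locked(self):
        # Hold the lock across both the walk and the comparison.
        with self.lock:
            return sum(self.exports.values()), self.fo_tot_granted


ost = OstGrantModel({f"client{i}": 32 * MIB for i in range(3)})

tot_granted, running = ost.check_racy(lambda: ost.disconnect("client0"))
racy_delta = running - tot_granted  # one client's grant

tot_granted, running = ost.check_locked()
locked_delta = running - tot_granted
```

In this model the racy check sees a mismatch equal to one client's 32 MiB grant, while the check that holds the lock across both the walk and the comparison sees none. The accounting itself stays consistent in both cases; only the snapshot taken by the check is stale.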

It looks like there may be a very simple fix, for which I can push a patch if you could give it a test.

Comment by Gerrit Updater [ 08/Aug/16 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: http://review.whamcloud.com/21813
Subject: LU-8480 ofd: hold obd_dev_lock across grant comparison
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 14d7415e9048a287aabde9a5d03dd2d9b6c48ae9

Comment by Nathan Dauchy (Inactive) [ 08/Aug/16 ]

Thanks for the quick review. Since this shouldn't be harmful, and is skipped for more than 100 clients anyway, we may not rush to get a rebuild into testing prior to the upgrade planned for next week. Will test as soon as time permits though.

NOTE: we will need a backport to 2.7.2.

Comment by Peter Jones [ 08/Aug/16 ]

Jian

Could you please port this fix to 2.7 FE once it has landed to master?

Thanks

Peter

Comment by Jian Yu [ 09/Aug/16 ]

Here is the back-ported patch for Lustre 2.7 FE: http://review.whamcloud.com/22018

Comment by Nathan Dauchy (Inactive) [ 23/Aug/16 ]

For some reason, I am unable to view the backported patch. Is the link correct, or is there perhaps a permissions issue with my gerrit account?

Comment by Peter Jones [ 23/Aug/16 ]

Nathan

It must be the latter. Please send me an email about this issue and I'll get it sorted out.

Peter

Comment by Gerrit Updater [ 29/Aug/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21813/
Subject: LU-8480 ofd: hold obd_dev_lock across grant comparison
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cf54bda257dc3287458a9157bb4647f05d8f8469

Comment by Peter Jones [ 29/Aug/16 ]

Landed for 2.9

Comment by Nathan Dauchy (Inactive) [ 29/Aug/16 ]

Please re-open until the backport patch lands to 2.7 FE.
