[LU-15095] lctl: error invoking upcall /usr/sbin/lctl set_param *.*.lbug_on_grant_miscount=1 Created: 13/Oct/21 Updated: 16/Mar/22 Resolved: 22/Jan/22 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Upstream |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Alex Zhuravlev | Assignee: | Vladimir Saveliev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
I'm getting many messages like this: |
| Comments |
| Comment by Alex Zhuravlev [ 13/Oct/21 ] |
|
yet another interesting side-effect of that patch:
== sanity test 901: don't leak a mgc lock on client umount ========================================================== 10:40:03 (1634121603)
192.168.121.177@tcp:/lustre /mnt/lustre lustre rw,checksum,flock,user_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt 0 0
Stopping client tmp.hAgqpwOV43 /mnt/lustre (opts:)
Starting client: tmp.hAgqpwOV43: -o user_xattr,flock tmp.hAgqpwOV43@tcp:/lustre /mnt/lustre
 sanity test_901: @@@@@@ FAIL: mgc lock leak (16 != 17)
 Trace dump:
 = ./../tests/test-framework.sh:6330:error()
 = sanity.sh:27444:test_901()
 = ./../tests/test-framework.sh:6634:run_one()
 = ./../tests/test-framework.sh:6681:run_one_logged()
 = ./../tests/test-framework.sh:6522:run_test()
 = sanity.sh:27449:main()
I disabled this line:
do_node $(mgs_node) "$LCTL set_param -P *.*.lbug_on_grant_miscount=1"
and sanity/901 stops failing |
| Comment by James A Simmons [ 13/Oct/21 ] |
|
Looking at the patch, there is no lbug_on_grant_miscount on the MGS node, only on the MDS servers. Is lbug_on_grant_miscount meant for MGS servers or MDS servers? |
| Comment by Alex Zhuravlev [ 13/Oct/21 ] |
|
hmm, I think MGS was specified to make the parameter persistent, i.e. write it to the config logs for MDTs and OSTs (which do grants) |
| Comment by James A Simmons [ 19/Oct/21 ] |
|
Looking at it with fresh eyes, I see the -P now. Looking at the error, I see a corner case was missed. Normally, when persistent values are set, a udev event is sent out. In this case it isn't, and we end up with the fallback of using the upcall, which fails. The reason is that in process_param2_config() we use kset_find_obj() to find a top-level kobj from which to send the event. In this case the name is '*', which doesn't exist. In this case we should just use the top kset kobject to send the event. |
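The lookup-with-fallback idea in the comment above can be modelled in a few lines of shell. This is only a toy sketch of the proposed behaviour, not the kernel code: pick_event_source and the "top-kset" label are hypothetical names, and the list of object names stands in for whatever kset_find_obj() would search.

```shell
# Toy model (hypothetical names): choose the object an event is sent from.
# $1 is the first component of the parameter name; the remaining arguments
# are the names of the objects that exist at the top level.
pick_event_source() {
    local name=$1
    shift
    local obj
    for obj in "$@"; do
        if [ "$obj" = "$name" ]; then
            # Exact match found: send the event from this object.
            echo "$obj"
            return 0
        fi
    done
    # No object with that name (e.g. the name is a wildcard such as '*'):
    # fall back to the top-level kset instead of failing over to the upcall.
    echo "top-kset"
}
```

Called as pick_event_source '*' mdt obdfilter, the function takes the fallback branch, which is the case that currently ends in the failing upcall.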
| Comment by James A Simmons [ 09/Nov/21 ] |
|
I think the simple solution is to change *.*.lbug_on_grant_miscount=1 to [mdt|obdfilter].*.lbug_on_grant_miscount=1 |
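One way to read that suggestion is to issue one persistent set_param per target type rather than a single wildcard. The sketch below only builds the command strings and does not run them; grant_miscount_cmds is a hypothetical helper, and it assumes the parameter exists under both mdt and obdfilter, as the comment implies.

```shell
# Hypothetical helper: print the persistent set_param commands with the
# leading wildcard replaced by explicit target types.
# $1 is the lctl binary to use (defaults to "lctl").
grant_miscount_cmds() {
    local lctl=${1:-lctl}
    local target
    for target in mdt obdfilter; do
        # The pattern stays quoted so the shell never expands the '*'.
        echo "$lctl set_param -P $target.*.lbug_on_grant_miscount=1"
    done
}
```

In the test framework each printed line would be passed to do_node $(mgs_node), in the same way as the original wildcard command.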
| Comment by Alex Zhuravlev [ 09/Nov/21 ] |
|
do we really need to save this to the log? why not use a variable like cfs_fail_loc or ldiskfs_track_declares_assert ? |
| Comment by James A Simmons [ 09/Nov/21 ] |
|
That is a really good point. Do we want lbug_on_grant_miscount set across reboots? |
| Comment by Vladimir Saveliev [ 10/Nov/21 ] |
Without that, the parameter lbug_on_grant_miscount would get turned off in tests which include failover. I did not get "lctl: error invoking upcall" in my tests and will debug the issue.
|
| Comment by Vladimir Saveliev [ 10/Nov/21 ] |
Do you mean to make it a parameter of the ptlrpc module, like this one?
int ldiskfs_track_declares_assert;
module_param(ldiskfs_track_declares_assert, int, 0644);
It sounds like a good idea, thanks. |
| Comment by Alex Zhuravlev [ 10/Nov/21 ] |
|
in some cases this parameter is set (written) into the config a few times. to reproduce, it should be enough to run llmount.sh locally:
[ 30.040858] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
[ 30.161330] LustreError: 5029:0:(obd_config.c:1326:process_param2_config()) lctl: error invoking upcall /usr/sbin/lctl set_param *.*.lbug_on_grant_miscount=1: rc = -2; time 191us |
| Comment by Alex Zhuravlev [ 10/Nov/21 ] |
yes |
| Comment by Gerrit Updater [ 10/Nov/21 ] |
|
"Vladimir Saveliev <vlaidimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45521 |
| Comment by Gerrit Updater [ 23/Dec/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45521/ |
| Comment by Peter Jones [ 23/Dec/21 ] |
|
Landed for 2.15 |
| Comment by Andreas Dilger [ 16/Jan/22 ] |
|
Moving this patch to a module parameter is causing RHEL7.9 testing to fail 100% with:
ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
[console] Lustre: Lustre: Build Version: 2.14.56_68_g5914687
[console] ptlrpc: Unknown parameter `lbug_on_grant_miscount'
modprobe: ERROR: could not insert 'ptlrpc': Unknown symbol in module, or unknown parameter (see dmesg)
The build version is the same on the client and server, so it isn't a case of an old build being used on the client. I think the problem is that this is a client-only build being tested, and the module parameter is only for the server, so it just doesn't exist on the el7.9 client. I think the test-framework needs to be changed to only set this parameter on the OSS and MDS and not the client nodes. |
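The change proposed above amounts to choosing the ptlrpc module options per node role. A minimal sketch follows, assuming the test framework can tell which role a node has; ptlrpc_opts_for_role and the role labels "mds", "oss" and "client" are hypothetical and are not taken from test-framework.sh.

```shell
# Hypothetical helper: print the ptlrpc module options for a node role.
# Only server roles get lbug_on_grant_miscount, because a client-only
# build does not define that module parameter.
ptlrpc_opts_for_role() {
    case "$1" in
    mds|oss)
        echo "lbug_on_grant_miscount=1"
        ;;
    *)
        # Clients (and unknown roles) get no extra options, so modprobe
        # is never handed a parameter the module does not know.
        echo ""
        ;;
    esac
}
```

A caller would then add the result to the module load only when it is non-empty, for example by running modprobe ptlrpc $(ptlrpc_opts_for_role "$role").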
| Comment by Gerrit Updater [ 19/Jan/22 ] |
|
"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46185 |
| Comment by Gerrit Updater [ 21/Jan/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46185/ |