[LU-15095] lctl: error invoking upcall /usr/sbin/lctl set_param *.*.lbug_on_grant_miscount=1 Created: 13/Oct/21  Updated: 16/Mar/22  Resolved: 22/Jan/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: Lustre 2.15.0

Type: Bug
Priority: Critical
Reporter: Alex Zhuravlev
Assignee: Vladimir Saveliev
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-14543 tgt_grant_discard(): avoid tgd->tgd_... Resolved
Severity: 3

 Description   

I'm getting many messages like this in the logs:
lctl: error invoking upcall /usr/sbin/lctl set_param *.*.lbug_on_grant_miscount=1
This was introduced in bb5d81ea95 ("LU-14543 target: prevent overflowing of tgd->tgd_tot_granted").
I don't quite understand why this clearly debugging-only tunable needs to be persistent in the config logs.
IMO, it would be better to have a node-wide, non-persistent tunable like osd's track_declare_assert.



 Comments   
Comment by Alex Zhuravlev [ 13/Oct/21 ]

yet another interesting side-effect of that patch:

== sanity test 901: don't leak a mgc lock on client umount ========================================================== 10:40:03 (1634121603)
192.168.121.177@tcp:/lustre /mnt/lustre lustre rw,checksum,flock,user_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt 0 0
Stopping client tmp.hAgqpwOV43 /mnt/lustre (opts:)
Starting client: tmp.hAgqpwOV43:  -o user_xattr,flock tmp.hAgqpwOV43@tcp:/lustre /mnt/lustre
 sanity test_901: @@@@@@ FAIL: mgc lock leak (16 != 17) 
  Trace dump:
  = ./../tests/test-framework.sh:6330:error()
  = sanity.sh:27444:test_901()
  = ./../tests/test-framework.sh:6634:run_one()
  = ./../tests/test-framework.sh:6681:run_one_logged()
  = ./../tests/test-framework.sh:6522:run_test()
  = sanity.sh:27449:main()

I disabled this line:

do_node $(mgs_node) "$LCTL set_param -P *.*.lbug_on_grant_miscount=1"

and sanity/901 stopped failing

Comment by James A Simmons [ 13/Oct/21 ]

Looking at the patch, there is no lbug_on_grant_miscount on the MGS node, only on the MDS servers. Is lbug_on_grant_miscount meant for MGS servers or MDS servers?

Comment by Alex Zhuravlev [ 13/Oct/21 ]

hmm, I think MGS was specified to make the parameter persistent, i.e. write it to the config logs for MDTs and OSTs (which do grants)

Comment by James A Simmons [ 19/Oct/21 ]

Looking at it with fresh eyes, I see the -P now. Looking at the error, I see a corner case that was missed. Normally, when a persistent value is set, a udev event is sent out. In this case it isn't, and we end up falling back to the upcall, which fails. The reason is that process_param2_config() uses kset_find_obj() to find a top-level kobject through which to send the event. Here the name is '*', which doesn't exist. In this case we should just use the top kset's kobject to send the event.
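
The lookup problem described above can be illustrated with a toy model in shell. This is only a sketch of the reasoning, not the kernel code: the list of top-level object names, both function names, and the "<top kset>" placeholder are hypothetical and exist purely for illustration.

```shell
#!/bin/bash
# Toy model only: the real logic lives in the kernel, in
# process_param2_config(). TOP_LEVEL_OBJS is a hypothetical list of
# top-level kobject names.
TOP_LEVEL_OBJS="mdt obdfilter osc mdc"

# Print the object that would emit the event for a parameter such as
# "mdt.*.lbug_on_grant_miscount"; return 1 if no such object exists.
find_event_source() {
	local param=$1
	local first=${param%%.*}	# first dot-separated component
	local obj

	for obj in $TOP_LEVEL_OBJS; do
		if [ "$obj" = "$first" ]; then
			echo "$obj"
			return 0
		fi
	done
	return 1
}

# Suggested behaviour: fall back to the top-level kset instead of failing.
find_event_source_with_fallback() {
	find_event_source "$1" || echo "<top kset>"
}
```

With the wildcard name, the first component is a literal '*', so the plain lookup finds nothing, while the fallback variant still yields an object to send the event through.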

Comment by James A Simmons [ 09/Nov/21 ]

I think the simple solution is to change *.*.lbug_on_grant_miscount=1 to [mdt|obdfilter].*.lbug_on_grant_miscount=1

Comment by Alex Zhuravlev [ 09/Nov/21 ]

do we really need to save this to the log? why not use a variable like cfs_fail_loc or ldiskfs_track_declares_assert ?

Comment by James A Simmons [ 09/Nov/21 ]

That is a really good point. Do we want lbug_on_grant_miscount set across reboots?

Comment by Vladimir Saveliev [ 10/Nov/21 ]

Do we want lbug_on_grant_miscount set across reboots?

Without that the parameter lbug_on_grant_miscount would get turned off in tests which include failover.
That is, it was made permanent intentionally.

I did not get "lctl: error invoking upcall" in my tests and will debug the issue.

 

Comment by Vladimir Saveliev [ 10/Nov/21 ]

do we really need to save this to the log? why not use a variable like cfs_fail_loc or ldiskfs_track_declares_assert ?

Do you mean to make it as parameter of module ptlrpc?

int ldiskfs_track_declares_assert;
module_param(ldiskfs_track_declares_assert, int, 0644);

It sounds like a good idea, thanks.
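
A parameter declared with module_param() and mode 0644 is normally exposed as a writable file under /sys/module/<module>/parameters/. A minimal sketch of setting one at runtime, assuming the tunable ends up as a ptlrpc parameter named lbug_on_grant_miscount; the helper name and the overridable sysfs root are hypothetical and not part of Lustre or test-framework.sh:

```shell
#!/bin/bash
# Hypothetical helper, not part of Lustre or test-framework.sh.
# Usage: set_module_param <module> <param> <value> [sysfs_root]
# sysfs_root defaults to /sys; it can be overridden so the helper can be
# exercised against a scratch directory.
set_module_param() {
	local module=$1 param=$2 value=$3 root=${4:-/sys}
	local file="$root/module/$module/parameters/$param"

	# The file is missing if the module is not loaded or was built
	# without this parameter.
	[ -w "$file" ] || return 1
	echo "$value" > "$file"
}
```

On a server this would be called as `set_module_param ptlrpc lbug_on_grant_miscount 1`. A value set this way is not written to the config logs and is lost when the module is reloaded, unless it is also passed as a module option at load time.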

Comment by Alex Zhuravlev [ 10/Nov/21 ]

in some cases this parameter is set (written) into the config a few times.

to reproduce, it should be enough to run llmount.sh locally:

[   30.040858] Lustre: Modifying parameter general.*.*.lbug_on_grant_miscount in log params
[   30.161330] LustreError: 5029:0:(obd_config.c:1326:process_param2_config()) lctl: error invoking upcall /usr/sbin/lctl set_param *.*.lbug_on_grant_miscount=1: rc = -2; time 191us

Comment by Alex Zhuravlev [ 10/Nov/21 ]

Do you mean to make it as parameter of module ptlrpc?

yes

Comment by Gerrit Updater [ 10/Nov/21 ]

"Vladimir Saveliev <vlaidimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45521
Subject: LU-15095 target: lbug_on_grant_miscount module parameter
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 91f914fe01c71280523c5fee3bf2d31db593c9e5

Comment by Gerrit Updater [ 23/Dec/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45521/
Subject: LU-15095 target: lbug_on_grant_miscount module parameter
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 2c787065441ee60c6c163dc77851d0964f81a89c

Comment by Peter Jones [ 23/Dec/21 ]

Landed for 2.15

Comment by Andreas Dilger [ 16/Jan/22 ]

Moving this tunable to a module parameter is causing RHEL7.9 testing to fail 100% of the time with:
https://testing.whamcloud.com/test_sets/8a2e1a74-c60f-4e33-bbfa-2c3a7efa7e13

ptlrpc/ptlrpc options: 'lbug_on_grant_miscount=1'
[console] Lustre: Lustre: Build Version: 2.14.56_68_g5914687
[console] ptlrpc: Unknown parameter `lbug_on_grant_miscount'
modprobe: ERROR: could not insert 'ptlrpc': Unknown symbol in module, or unknown parameter (see dmesg)

The build version is the same on the client and server, so it isn't a case of an old build being used on the client.

I think the problem is that this is a client-only build being tested, and the module parameter is only for the server, so it just doesn't exist on the el7.9 client. I think the test-framework needs to be changed to only set this parameter on the OSS and MDS and not the client nodes.
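
The suggestion above can be sketched as choosing the ptlrpc module options per node role. This is a hypothetical illustration, not the code from the patch that landed: the function name and the role names (mds, oss, client) are assumptions.

```shell
#!/bin/bash
# Hypothetical sketch, not the code from the landed patch.
# Print the ptlrpc module options to use for a node of the given role.
# The role names (mds, oss, client) are illustrative.
ptlrpc_opts_for_role() {
	case "$1" in
	mds|oss)
		# Server builds define the parameter.
		echo "lbug_on_grant_miscount=1"
		;;
	*)
		# Client-only builds do not define it, and modprobe
		# rejects unknown parameters, so pass nothing.
		echo ""
		;;
	esac
}
```

A caller would append the printed string to the module options only when it is non-empty, so client nodes load ptlrpc without the server-only parameter.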

Comment by Gerrit Updater [ 19/Jan/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/46185
Subject: LU-15095 tests: skip lbug_on_grant_miscount on client
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c168c309860dbf9745af1105ed3b236c0e2ce89c

Comment by Gerrit Updater [ 21/Jan/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/46185/
Subject: LU-15095 tests: skip lbug_on_grant_miscount on client
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 49e29f38343ce0389df0aecf308b0986de94c029
