Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15095

lctl: error invoking upcall /usr/sbin/lctl set_param *.*.lbug_on_grant_miscount=1

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.15.0
    • Upstream
    • None
    • 3
    • 9223372036854775807

    Description

      I'm getting many messages like this:
      lctl: error invoking upcall /usr/sbin/lctl set_param ..lbug_on_grant_miscount=1
      in the logs.
      this was introduced in bb5d81ea95 ("LU-14543 target: prevent overflowing of tgd->tgd_tot_granted")
      I don't quite understand why this clearly debugging tunable needs to be persisten in the config logs?
      IMO, the better would be to have node-wide non-persisten tunable like osd's track_declare_assert

      Attachments

        Issue Links

          Activity

            [LU-15095] lctl: error invoking upcall /usr/sbin/lctl set_param *.*.lbug_on_grant_miscount=1

            do we really need to save this to the log? why not use a variable like cfs_fail_loc or ldiskfs_track_declares_assert ?

            Do you mean to make it as parameter of module ptlrpc?

            int ldiskfs_track_declares_assert;
            module_param(ldiskfs_track_declares_assert, int, 0644);
            

            It sounds like a good idea, thanks.

            vsaveliev Vladimir Saveliev added a comment - do we really need to save this to the log? why not use a variable like cfs_fail_loc or ldiskfs_track_declares_assert ? Do you mean to make it as parameter of module ptlrpc? int ldiskfs_track_declares_assert; module_param(ldiskfs_track_declares_assert, int , 0644); It sounds like a good idea, thanks.
            vsaveliev Vladimir Saveliev added a comment - - edited

            Do we want lbug_on_grant_miscount set across reboots?

            Without that the parameter lbug_on_grant_miscount would get turned off in tests which include failover.
            That is, it was made permanent intentionally.

            I did not get "lctl: error invoking upcall" in my tests and will debug the issue.

             

            vsaveliev Vladimir Saveliev added a comment - - edited Do we want lbug_on_grant_miscount set across reboots? Without that the parameter lbug_on_grant_miscount would get turned off in tests which include failover. That is, it was made permanent intentionally. I did not get "lctl: error invoking upcall" in my tests and will debug the issue.  

            That is really good point. Do we want lbug_on_grant_miscount set across reboots?

            simmonsja James A Simmons added a comment - That is really good point. Do we want lbug_on_grant_miscount set across reboots?

            do we really need to save this to the log? why not use a variable like cfs_fail_loc or ldiskfs_track_declares_assert ?

            bzzz Alex Zhuravlev added a comment - do we really need to save this to the log? why not use a variable like cfs_fail_loc or ldiskfs_track_declares_assert ?

            I think the simple solution is change ..lbug_on_grant_miscount=1 to [mdt|obdfilter].*.lbug_on_grant_miscount=1

            simmonsja James A Simmons added a comment - I think the simple solution is change . .lbug_on_grant_miscount=1 to [mdt|obdfilter] .*.lbug_on_grant_miscount=1

            Looking at it with fresh eyes and I see the -P now. Looking at the error I see a corner case missed. Normally when setting persistent values an udev event is sent out. In this case it doesn't and ends up with the fall back of using the upcall which does fail. The reason for this is that in process_param2_config() we use kset_find_obj() to find a top level kobj to use to send the event. In this case its '*' which doesn't exist. In this case we should just use the top kset kobject to send the event.

            simmonsja James A Simmons added a comment - Looking at it with fresh eyes and I see the -P now. Looking at the error I see a corner case missed. Normally when setting persistent values an udev event is sent out. In this case it doesn't and ends up with the fall back of using the upcall which does fail. The reason for this is that in process_param2_config() we use kset_find_obj() to find a top level kobj to use to send the event. In this case its '*' which doesn't exist. In this case we should just use the top kset kobject to send the event.

            hmm, I think MGS was specified to make the parameter persistent, i.e. write it to the config logs for MDTs and OSTs (which do grants)

            bzzz Alex Zhuravlev added a comment - hmm, I think MGS was specified to make the parameter persistent, i.e. write it to the config logs for MDTs and OSTs (which do grants)
            simmonsja James A Simmons added a comment - - edited

            Looking at the patch their is no lbug_on_grant_miscount on the MGS node. Only the MDS servers. Is lbug_on_grant_miscount meant for MGS servers or MDS servers?

            simmonsja James A Simmons added a comment - - edited Looking at the patch their is no lbug_on_grant_miscount on the MGS node. Only the MDS servers. Is lbug_on_grant_miscount meant for MGS servers or MDS servers?

            yet another interesting side-effect of that patch:

            == sanity test 901: don't leak a mgc lock on client umount ========================================================== 10:40:03 (1634121603)
            192.168.121.177@tcp:/lustre /mnt/lustre lustre rw,checksum,flock,user_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt 0 0
            Stopping client tmp.hAgqpwOV43 /mnt/lustre (opts:)
            Starting client: tmp.hAgqpwOV43:  -o user_xattr,flock tmp.hAgqpwOV43@tcp:/lustre /mnt/lustre
             sanity test_901: @@@@@@ FAIL: mgc lock leak (16 != 17) 
              Trace dump:
              = ./../tests/test-framework.sh:6330:error()
              = sanity.sh:27444:test_901()
              = ./../tests/test-framework.sh:6634:run_one()
              = ./../tests/test-framework.sh:6681:run_one_logged()
              = ./../tests/test-framework.sh:6522:run_test()
              = sanity.sh:27449:main()
            

            I disabled this line:

            do_node $(mgs_node) "$LCTL set_param -P *.*.lbug_on_grant_miscount=1"

            and sanity/901 stops to fail

            bzzz Alex Zhuravlev added a comment - yet another interesting side-effect of that patch: == sanity test 901: don't leak a mgc lock on client umount ========================================================== 10:40:03 (1634121603) 192.168.121.177@tcp:/lustre /mnt/lustre lustre rw,checksum,flock,user_xattr,lruresize,lazystatfs,nouser_fid2path,verbose,noencrypt 0 0 Stopping client tmp.hAgqpwOV43 /mnt/lustre (opts:) Starting client: tmp.hAgqpwOV43: -o user_xattr,flock tmp.hAgqpwOV43@tcp:/lustre /mnt/lustre sanity test_901: @@@@@@ FAIL: mgc lock leak (16 != 17) Trace dump: = ./../tests/test-framework.sh:6330:error() = sanity.sh:27444:test_901() = ./../tests/test-framework.sh:6634:run_one() = ./../tests/test-framework.sh:6681:run_one_logged() = ./../tests/test-framework.sh:6522:run_test() = sanity.sh:27449:main() I disabled this line: do_node $(mgs_node) "$LCTL set_param -P *.*.lbug_on_grant_miscount=1" and sanity/901 stops to fail

            People

              vsaveliev Vladimir Saveliev
              bzzz Alex Zhuravlev
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: