
conf-sanity test_76a fails on RHEL7, SLES12

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.7.0
    • Environment: el7 client, sles12 client
    • Severity: 3
    • Rank: 16882

    Description

      conf-sanity test_76a fails every time on any el7 client, as far as I can tell. This test verifies permanent param changes made with 'lctl set_param -P'. This mechanism doesn't seem to work at all when the client is el7.

      I can reproduce the problem manually: mount a lustre filesystem, check the 'max_dirty_mb' param on the client with 'lctl get_param osc.*.max_dirty_mb', then change that param by executing 'lctl set_param -P osc.*.max_dirty_mb=64' from the command line on the mgs. With the filesystem mounted on both an el6 and an el7 client, the get_param output on the el6 client changes from 32 (the default) to 64 after a few seconds. The value never changes on the el7 client; it stays at the default of 32 indefinitely.

      The fact that the change can be observed on an el6 client indicates that the change on the mgs really is happening and eventually reaches the el6 client, but somehow it is never applied on the el7 client.

      There must be some significant difference on el7 causing the failure there, but I'm at a loss to explain it. I think I need a higher level expert to help with this problem. Without some solution I don't think we will get a 100% test run on an el7 client ever.

          Activity

            [LU-6063] conf-sanity test_76a fails on RHEL7, SLES12
            pjones Peter Jones added a comment -

            Landed for 2.7


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13677/
            Subject: LU-6063 kernel: use proper flags for call_usermodehelper
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 8febfe0e30c5febdf716e4591c355199de4a6ab8

            adilger Andreas Dilger added a comment -

            James, the details of the current lctl set_param -P implementation are in LU-2629 if you are interested. It isn't really a performance-critical operation, but like anything there is probably room for improvement.

            bogl Bob Glossman (Inactive) added a comment -

            Verified the mod does indeed fix the problem, at least for el7 clients. The problem can no longer be reproduced either by manual command line commands or by conf-sanity test_76a.

            Good call, James!
            simmonsja James A Simmons added a comment - - edited

            Correct. Also, the logic for UMH_WAIT_PROC and UMH_NO_WAIT was the same at one time. See https://lkml.org/lkml/2010/3/9/368.

            bogl Bob Glossman (Inactive) added a comment -

            If I'm understanding the commit header, this problem is due to the fact that UMH_WAIT_PROC was 1 in el6, but is 2 in el7 and later. If we had used the #define'd name it would have been right in all builds, but using a literal number instead made it wrong in newer kernels.
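            The literal-versus-name point above can be illustrated with a small userspace sketch. It is not the Lustre patch and does not use the real kernel headers: the struct, the two tables, and the helper umh_meaning() are hypothetical names made up for illustration. Only the UMH_WAIT_PROC values (1 on el6, 2 on el7) come from this ticket; the UMH_NO_WAIT and UMH_WAIT_EXEC values are assumptions based on the upstream headers of the corresponding kernel generations.

```c
#include <string.h>

/* Userspace model of two UMH_* numbering schemes; NOT the kernel headers. */
struct umh_flags {
    int no_wait;    /* models UMH_NO_WAIT   */
    int wait_exec;  /* models UMH_WAIT_EXEC */
    int wait_proc;  /* models UMH_WAIT_PROC */
};

/* el6-era numbering: UMH_WAIT_PROC == 1 (from this ticket);
 * the other two values are assumed from upstream headers. */
static const struct umh_flags umh_old = { -1, 0, 1 };

/* el7-era numbering: UMH_WAIT_PROC == 2 (from this ticket);
 * the other two values are assumed from upstream headers. */
static const struct umh_flags umh_new = { 0, 1, 2 };

/* Hypothetical helper: describe what a raw wait value selects
 * under a given numbering scheme. */
static const char *umh_meaning(const struct umh_flags *f, int wait)
{
    if (wait == f->wait_proc)
        return "wait for helper to exit";
    if (wait == f->wait_exec)
        return "wait for exec only";
    if (wait == f->no_wait)
        return "do not wait";
    return "unknown";
}
```

            Under these assumptions, umh_meaning(&umh_old, 1) selects "wait for helper to exit", while the same literal 1 under umh_new selects "wait for exec only". Passing the named field (umh_new.wait_proc here, UMH_WAIT_PROC in kernel code) selects the intended behaviour under either scheme.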

            gerrit Gerrit Updater added a comment -

            James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13677
            Subject: LU-6063 kernel: use proper flags for call_usermodehelper
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 15139bcaefc1e3d222b86ed6077eea89eee1136c
            bogl Bob Glossman (Inactive) added a comment - - edited

            Not sure how this maps to upcall problems on the MDS or MGS. The problem is seen with an el6 MDS; el7 (or sles12) is only on the clients.

            bogl Bob Glossman (Inactive) added a comment -

            James, I haven't noticed any problems with upcalls (besides possibly this one), but I haven't been looking carefully. Doesn't extended group membership use them? I think there are some sanity tests for that.
            simmonsja James A Simmons added a comment - - edited

            Bob, have you had any problems with the upcall functionality on the MDS with RHEL7 testing? Looking at the source, it seems that call_usermodehelper passes the right flag.

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: bogl Bob Glossman (Inactive)
              Votes: 0
              Watchers: 7