conf-sanity test_76a fails on RHEL7, SLES12

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.7.0
    • Lustre 2.7.0
    • el7 client, sles12 client
    • 3
    • 16882

    Description

      conf-sanity test_76a fails every time on any el7 client, as far as I can tell. The test verifies that permanent param changes made with 'lctl set_param -P' take effect. This mechanism doesn't seem to work at all when the client is el7.

      I can reproduce the problem manually: mount a Lustre filesystem, check the 'max_dirty_mb' param on the client with 'lctl get_param osc.*.max_dirty_mb', then change that param on the MGS by running 'lctl set_param -P osc.*.max_dirty_mb=64' from the command line. With the filesystem mounted on both an el6 and an el7 client, the get_param output on the el6 client changes from 32 (the default) to 64 after a few seconds. The value never changes on the el7 client; it stays at the default of 32 indefinitely.
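
      The manual check above can be scripted so the client-side wait is repeatable. This is a minimal sketch, not part of conf-sanity: the function name wait_for_param, the LCTL variable, and the 30 second default timeout are assumptions made for illustration. It only polls 'lctl get_param -n' on the client and reports whether every matching device reached the expected value.

```shell
#!/bin/sh
# Sketch only: poll a client-side tunable until it reaches the expected value.
# LCTL, wait_for_param, and the default timeout are illustrative assumptions.
LCTL="${LCTL:-lctl}"

# wait_for_param PARAM EXPECTED [TIMEOUT_SECONDS]
# Returns 0 once every matching instance of PARAM reports EXPECTED,
# or 1 if that has not happened by the time the timeout expires.
wait_for_param() {
    param=$1
    expected=$2
    timeout=${3:-30}
    elapsed=0
    while :; do
        # get_param -n prints values only, one line per matching device.
        values=$($LCTL get_param -n "$param" 2>/dev/null) || values=""
        # Succeed only if there is output and no line differs from EXPECTED.
        if [ -n "$values" ] &&
           ! printf '%s\n' "$values" | grep -qvxF "$expected"; then
            return 0
        fi
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

      To use it, run 'lctl set_param -P osc.*.max_dirty_mb=64' on the MGS, then run "wait_for_param 'osc.*.max_dirty_mb' 64 60" on each client. Based on the behavior described above, the call would be expected to succeed on the el6 client and time out on the el7 client.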

      The fact that the change can be observed on an el6 client indicates the change on the mgs is really happening and is eventually reaching the el6 client, but somehow it is never reflected back into the el7 client.

      There must be some significant difference on el7 causing the failure there, but I'm at a loss to explain it. I think I need a higher level expert to help with this problem. Without some solution I don't think we will get a 100% test run on an el7 client ever.

          Activity

            [LU-6063] conf-sanity test_76a fails on RHEL7, SLES12

            simmonsja James A Simmons added a comment - Examining the logs I see the MGS is doing the right thing and sending the llog changes to the client. I'm thinking the bug is in the class_config_llog_handler code.

            adilger Andreas Dilger added a comment - James, any suggestions on how to "re-hook" the /proc entries to the handler functions?

            bogl Bob Glossman (Inactive) added a comment - no, haven't looked at anything beyond the first failure related to max_dirty_mb. In running the test as is it never gets beyond that. In trying to reproduce the failure manually I focused on max_dirty_mb only.

            simmonsja James A Simmons added a comment - Just got back today. I can easily reproduce the problem. For that test have you tried to see if the obdfilter.*.client_cache_count is also broken on either RHEL6.6 or RHEL7 servers?

            simmonsja James A Simmons added a comment - Looked more closely at this problem and it reminds me of when ORNL encountered LU-1014. It's one of those cases of class_process_config not working or not being called. Due to travel I might not get to it this week. I will see if I can duplicate the problem as soon as I can.

            bogl Bob Glossman (Inactive) added a comment - Randomly casting around for things RHEL 7 and SLES 12 have in common, I notice that they both have systemd while older versions don't. Not saying that this has anything to do with anything, but it is a significant difference in runtime environment.

            bogl Bob Glossman (Inactive) added a comment - James, see previous comment: "btw, all cases where I can reproduce the problem are with el6 servers."

            simmonsja James A Simmons added a comment - Bob, does the server back end need to be RHEL7, or does this problem show up with just upgraded clients?

            simmonsja James A Simmons added a comment - Actually no one is using the old proc handling methods. It all has been ported over to seq_file. I can take a look at why it is failing.

            adilger Andreas Dilger added a comment - James S., any ideas on this? I'd guess that the RHEL7 and SLES12 kernels are using a new /proc implementation, and this isn't working properly with the MGS/MGC-driven tunables?

            bogl Bob Glossman (Inactive) added a comment - this problem seems to be not exclusive to el7. I see similar behavior on sles12 clients. btw, all cases where I can reproduce the problem are with el6 servers.

            People

              bogl Bob Glossman (Inactive)
              bogl Bob Glossman (Inactive)
              Votes: 0
              Watchers: 7
