[LU-6063] conf-sanity test_76a fails on RHEL7, SLES12 Created: 21/Dec/14  Updated: 01/Apr/18  Resolved: 08/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug
Priority: Critical
Reporter: Bob Glossman (Inactive)
Assignee: Bob Glossman (Inactive)
Resolution: Fixed
Votes: 0
Labels: HB
Environment:

el7 client, sles12 client


Issue Links:
Related
is related to LU-6123 conf-sanity test_72: FAIL: mount clie... Resolved
is related to LU-6048 Kernel update [RHEL7 3.10.0-123.13.2.... Resolved
is related to LU-4416 support for 3.12 linux kernel Resolved
is related to LU-10869 conf-sanity test 76a fails with 'erro... Resolved
Severity: 3
Rank (Obsolete): 16882

 Description   

conf-sanity test_76a fails every time on any el7 client as far as I can tell. This test attempts to verify permanent param changes made with 'lctl set_param -P'. This mechanism doesn't seem to work at all when the client is el7.

I can reproduce the problem manually: mount a lustre filesystem, observe the 'max_dirty_mb' param on the client with 'lctl get_param osc.*.max_dirty_mb', then alter that param on the mgs by executing 'lctl set_param -P osc.*.max_dirty_mb=64' from the command line on the mgs. If I have the lustre filesystem mounted on both an el6 and an el7 client, I can see the value change from 32 (the default) to 64 in the get_param output on the el6 client after a few seconds. The value never changes on the el7 client; it stays at the default value of 32.

The fact that the change can be observed on an el6 client indicates the change on the mgs is really happening and is eventually reaching the el6 client, but somehow it is never reflected back into the el7 client.

There must be some significant difference on el7 causing the failure there, but I'm at a loss to explain it. I think I need a higher level expert to help with this problem. Without some solution I don't think we will get a 100% test run on an el7 client ever.



 Comments   
Comment by Peter Jones [ 22/Dec/14 ]

Mike

Could you please look into this one?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 22/Dec/14 ]

This problem does not seem to be exclusive to el7. I see similar behavior on sles12 clients.

btw, all cases where I can reproduce the problem are with el6 servers.

Comment by Andreas Dilger [ 15/Jan/15 ]

James S., any ideas on this? I'd guess that the RHEL7 and SLES12 kernels are using a new /proc implementation, and this isn't working properly with the MGS/MGC-driven tunables?

Comment by James A Simmons [ 15/Jan/15 ]

Actually no one is using the old proc handling methods. It all has been ported over to seq_file. I can take a look at why it is failing.

Comment by James A Simmons [ 16/Jan/15 ]

Bob, does the server back end need to be RHEL7, or does this problem show up with just upgraded clients?

Comment by Bob Glossman (Inactive) [ 16/Jan/15 ]

James, see previous comment:

"btw, all cases where I can reproduce the problem are with el6 servers."

Comment by Bob Glossman (Inactive) [ 19/Jan/15 ]

Randomly casting around for things RHEL 7 and SLES 12 have in common, I notice that they both have systemd while older versions don't. Not saying that this has anything to do with anything, but it is a significant difference in runtime environment.

Comment by James A Simmons [ 20/Jan/15 ]

Looked more closely at this problem and it reminds me of when ORNL encountered LU-1014. It's one of those cases of class_process_config not working or not being called. Due to travel I might not get to it this week. I will see if I can duplicate the problem as soon as I can.

Comment by James A Simmons [ 26/Jan/15 ]

Just got back today. I can easily reproduce the problem. For that test have you tried to see if the obdfilter.*.client_cache_count is also broken on either RHEL6.6 or RHEL7 servers?

Comment by Bob Glossman (Inactive) [ 26/Jan/15 ]

no, haven't looked at anything beyond the first failure related to max_dirty_mb. In running the test as is it never gets beyond that. In trying to reproduce the failure manually I focused on max_dirty_mb only.

Comment by Andreas Dilger [ 03/Feb/15 ]

James, any suggestions on how to "re-hook" the /proc entries to the handler functions?

Comment by James A Simmons [ 04/Feb/15 ]

Examining the logs I see the MGS is doing the right thing and sending the llog changes to the client. I'm thinking the bug is in the class_config_llog_handler code.

Comment by James A Simmons [ 06/Feb/15 ]

I finished examining the logs and have determined that the client side is doing the right thing. Once the client receives the packet so it can sync its llog with the MGS, it then does an upcall to lctl using the call_usermodehelper code. For some reason lctl fails to update the proc parameters. IMNSHO calling a userland utility to change proc entries from the kernel is ugly. I will see if any changes have happened to the usermodehelper API.
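
As a rough userspace analogue of an upcall that waits for the helper to finish, the pattern might look like the sketch below. This is not the Lustre or kernel code: `run_and_wait` is a hypothetical name used only for illustration, and it relies on plain fork/execv/waitpid rather than call_usermodehelper.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the program at argv[0] with the given
 * NULL-terminated argument vector and block until it exits.
 * Returns the helper's exit status, or -1 on any failure.
 */
int run_and_wait(char *const argv[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;		/* fork failed */

	if (pid == 0) {
		execv(argv[0], argv);	/* only returns on error */
		_exit(127);
	}

	/* Wait for the helper to complete; retry if interrupted. */
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return -1;
	}

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The caller only learns whether the helper succeeded because it waits for the process to exit. In the kernel, the comparable choice is made through the wait argument passed to call_usermodehelper.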

Comment by James A Simmons [ 06/Feb/15 ]

Bob, have you had any problems with the upcall functionality on the MDS with RHEL7 testing? Looking at the source it seems that call_usermodehelper passes the right flag.

Comment by Bob Glossman (Inactive) [ 06/Feb/15 ]

James, haven't noticed any problems with upcalls (besides possibly this one) but haven't been looking carefully. Doesn't extended group membership use it some? think there are some sanity tests for that.

Comment by Bob Glossman (Inactive) [ 06/Feb/15 ]

not sure how this maps to upcall problems on MDS or MGS. problem is seen with el6 MDS, el7 (or sles12) only on clients.

Comment by Gerrit Updater [ 06/Feb/15 ]

James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13677
Subject: LU-6063 kernel: use proper flags for call_usermodehelper
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 15139bcaefc1e3d222b86ed6077eea89eee1136c

Comment by Bob Glossman (Inactive) [ 06/Feb/15 ]

If I'm understanding the commit header, this problem is due to the fact that UMH_WAIT_PROC was 1 in el6, but is 2 in el7 and later. If we had used the #define'd name it would have been right in all builds, but using a literal number instead made it wrong in newer kernels.

Comment by James A Simmons [ 06/Feb/15 ]

Correct. Also the logic for UMH_WAIT_PROC and UMH_NO_WAIT was the same at one time. See https://lkml.org/lkml/2010/3/9/368.
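
The effect of the renumbering can be illustrated with a small userspace sketch. The `OLD_*` and `NEW_*` macros, the `umh_abi` enum, and the `umh_mode_name` helper are hypothetical names that exist only for this illustration. The UMH_WAIT_PROC values (1 and 2) come from the discussion in this ticket; the values for the other two modes are recalled from the kernel's include/linux/kmod.h and should be verified against the actual kernel tree in use.

```c
/* Hypothetical labels for the two header generations being compared. */
enum umh_abi { UMH_ABI_OLD, UMH_ABI_NEW };

/* Older kernels (el6 era). Assumed values; check include/linux/kmod.h. */
#define OLD_UMH_NO_WAIT		(-1)
#define OLD_UMH_WAIT_EXEC	0
#define OLD_UMH_WAIT_PROC	1

/* Newer kernels (el7, sles12 era). Assumed values; check include/linux/kmod.h. */
#define NEW_UMH_NO_WAIT		0
#define NEW_UMH_WAIT_EXEC	1
#define NEW_UMH_WAIT_PROC	2

/* Decode a raw wait value the way each header generation would read it. */
const char *umh_mode_name(enum umh_abi abi, int wait)
{
	if (abi == UMH_ABI_OLD) {
		switch (wait) {
		case OLD_UMH_NO_WAIT:	return "UMH_NO_WAIT";
		case OLD_UMH_WAIT_EXEC:	return "UMH_WAIT_EXEC";
		case OLD_UMH_WAIT_PROC:	return "UMH_WAIT_PROC";
		}
	} else {
		switch (wait) {
		case NEW_UMH_NO_WAIT:	return "UMH_NO_WAIT";
		case NEW_UMH_WAIT_EXEC:	return "UMH_WAIT_EXEC";
		case NEW_UMH_WAIT_PROC:	return "UMH_WAIT_PROC";
		}
	}
	return "unknown";
}
```

Under these assumed values, a hard-coded 1 decodes as UMH_WAIT_PROC with the older numbering but as UMH_WAIT_EXEC with the newer one, so passing the symbolic UMH_WAIT_PROC is the portable choice.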

Comment by Bob Glossman (Inactive) [ 06/Feb/15 ]

Verified the mod does indeed fix the problem, at least for el7 clients. The problem can no longer be reproduced either by manual command line commands or by conf-sanity, test 76a.

Good call, James!

Comment by Andreas Dilger [ 07/Feb/15 ]

James, the details of the current lctl set_param -P implementation are in LU-2629 if you are interested. It isn't really a performance-critical operation, but like anything there is probably room for improvement.

Comment by Gerrit Updater [ 08/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13677/
Subject: LU-6063 kernel: use proper flags for call_usermodehelper
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8febfe0e30c5febdf716e4591c355199de4a6ab8

Comment by Peter Jones [ 08/Feb/15 ]

Landed for 2.7
