[LU-6063] conf-sanity test_76a fails on RHEL7, SLES12 Created: 21/Dec/14 Updated: 01/Apr/18 Resolved: 08/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Bob Glossman (Inactive) | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | HB |
| Environment: | el7 client, sles12 client |
| Severity: | 3 |
| Rank (Obsolete): | 16882 |
| Description |
|
conf-sanity test_76a fails every time on any el7 client, as far as I can tell. This test attempts to verify permanent param changes made with 'lctl set_param -P'. This mechanism doesn't seem to work at all when the client is el7. I can manually reproduce the problem by mounting a lustre filesystem, observing the 'max_dirty_mb' param on the client with 'lctl get_param osc.*.max_dirty_mb', then altering that param on the mgs by executing 'lctl set_param -P osc.*.max_dirty_mb=64' from the command line on the mgs. If I have the lustre filesystem mounted on both an el6 and an el7 client, I can see the change from 32 (the default) up to 64 in the results of the get_param cmd on the el6 client after a few seconds. The value is never seen to change on the el7 client; it stays at the default value of 32 forever. The fact that the change can be observed on an el6 client indicates the change on the mgs is really happening and is eventually reaching the el6 client, but somehow it is never reflected on the el7 client. There must be some significant difference on el7 causing the failure there, but I'm at a loss to explain it. I think I need a higher level expert to help with this problem. Without some solution I don't think we will ever get a 100% test run on an el7 client. |
| Comments |
| Comment by Peter Jones [ 22/Dec/14 ] |
|
Mike Could you please look into this one? Thanks Peter |
| Comment by Bob Glossman (Inactive) [ 22/Dec/14 ] |
|
This problem does not seem to be exclusive to el7. I see similar behavior on sles12 clients. btw, all cases where I can reproduce the problem are with el6 servers. |
| Comment by Andreas Dilger [ 15/Jan/15 ] |
|
James S., any ideas on this? I'd guess that the RHEL7 and SLES12 kernels are using a new /proc implementation, and this isn't working properly with the MGS/MGC-driven tunables? |
| Comment by James A Simmons [ 15/Jan/15 ] |
|
Actually no one is using the old proc handling methods. It all has been ported over to seq_file. I can take a look at why it is failing. |
| Comment by James A Simmons [ 16/Jan/15 ] |
|
Bob, does the server back end need to be RHEL7, or does this problem show up with just upgraded clients? |
| Comment by Bob Glossman (Inactive) [ 16/Jan/15 ] |
|
James, see previous comment: "btw, all cases where I can reproduce the problem are with el6 servers." |
| Comment by Bob Glossman (Inactive) [ 19/Jan/15 ] |
|
randomly casting around for things RHEL 7 and SLES 12 have in common I notice that they both have systemd while older versions don't. not saying that this has anything to do with anything, but it is a significant diff in runtime environment. |
| Comment by James A Simmons [ 20/Jan/15 ] |
|
Looked more closely at this problem and it reminds me of when ORNL encountered |
| Comment by James A Simmons [ 26/Jan/15 ] |
|
Just got back today. I can easily reproduce the problem. For that test have you tried to see if the obdfilter.*.client_cache_count is also broken on either RHEL6.6 or RHEL7 servers? |
| Comment by Bob Glossman (Inactive) [ 26/Jan/15 ] |
|
no, haven't looked at anything beyond the first failure related to max_dirty_mb. In running the test as is it never gets beyond that. In trying to reproduce the failure manually I focused on max_dirty_mb only. |
| Comment by Andreas Dilger [ 03/Feb/15 ] |
|
James, any suggestions on how to "re-hook" the /proc entries to the handler functions? |
| Comment by James A Simmons [ 04/Feb/15 ] |
|
Examining the logs I see the MGS is doing the right thing and sending the llog changes to the client. I'm thinking the bug is in the class_config_llog_handler code. |
| Comment by James A Simmons [ 06/Feb/15 ] |
|
I finished examining the logs and have determined that the client side is doing the right thing. Once the client receives the packet so it can sync its llog with the MGS, it then does an upcall to lctl using the call_usermodehelper code. For some reason lctl fails to update the proc parameters. IMNSHO calling a userland utility to change proc entries from the kernel is ugly. I will see if any changes have happened to the usermodehelper API. |
| Comment by James A Simmons [ 06/Feb/15 ] |
|
Bob, have you had any problems with the upcall functionality on the MDS with RHEL7 testing? Looking at the source it seems that call_usermodehelper passes the right flag. |
| Comment by Bob Glossman (Inactive) [ 06/Feb/15 ] |
|
James, haven't noticed any problems with upcalls (besides possibly this one) but haven't been looking carefully. Doesn't extended group membership use it some? think there are some sanity tests for that. |
| Comment by Bob Glossman (Inactive) [ 06/Feb/15 ] |
|
not sure how this maps to upcall problems on MDS or MGS. problem is seen with el6 MDS, el7 (or sles12) only on clients. |
| Comment by Gerrit Updater [ 06/Feb/15 ] |
|
James Simmons (uja.ornl@gmail.com) uploaded a new patch: http://review.whamcloud.com/13677 |
| Comment by Bob Glossman (Inactive) [ 06/Feb/15 ] |
|
If I'm understanding the commit header, this problem is due to the fact that UMH_WAIT_PROC was 1 in el6, but is 2 in el7 and later. If we had used the #define'd name it would have been right in all builds, but using a literal number instead made it wrong in newer kernels. |
| Comment by James A Simmons [ 06/Feb/15 ] |
|
Correct. Also the logic for UMH_WAIT_PROC and UMH_NO_WAIT was the same at one time. See https://lkml.org/lkml/2010/3/9/368. |
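A minimal userspace sketch of the mismatch described above, not the actual Lustre or kernel code. Only the two UMH_WAIT_PROC values (1 on el6, 2 on el7) come from this ticket; the EL6_/EL7_ macro names, the other flag values, and the decode helpers are hypothetical and exist only to show why a literal 1 changes meaning between the two numberings.

```c
/* What the caller of call_usermodehelper() ends up waiting for. */
enum wait_behaviour { NO_WAIT, WAIT_FOR_EXEC, WAIT_FOR_EXIT, UNKNOWN };

/* Hypothetical stand-ins for the el6-era numbering. */
#define EL6_UMH_NO_WAIT   (-1)  /* assumed for illustration */
#define EL6_UMH_WAIT_EXEC 0     /* assumed for illustration */
#define EL6_UMH_WAIT_PROC 1     /* value stated in this ticket */

/* Hypothetical stand-ins for the el7-era numbering. */
#define EL7_UMH_NO_WAIT   0     /* assumed for illustration */
#define EL7_UMH_WAIT_EXEC 1     /* assumed for illustration */
#define EL7_UMH_WAIT_PROC 2     /* value stated in this ticket */

/* Interpret a raw wait argument under the el6-style numbering. */
static enum wait_behaviour decode_el6(int wait)
{
    switch (wait) {
    case EL6_UMH_NO_WAIT:   return NO_WAIT;
    case EL6_UMH_WAIT_EXEC: return WAIT_FOR_EXEC;
    case EL6_UMH_WAIT_PROC: return WAIT_FOR_EXIT;
    default:                return UNKNOWN;
    }
}

/* Interpret the same raw wait argument under the el7-style numbering. */
static enum wait_behaviour decode_el7(int wait)
{
    switch (wait) {
    case EL7_UMH_NO_WAIT:   return NO_WAIT;
    case EL7_UMH_WAIT_EXEC: return WAIT_FOR_EXEC;
    case EL7_UMH_WAIT_PROC: return WAIT_FOR_EXIT;
    default:                return UNKNOWN;
    }
}
```

Under these assumptions, decode_el6(1) yields WAIT_FOR_EXIT while decode_el7(1) yields WAIT_FOR_EXEC, so a caller passing a literal 1 no longer waits for the helper process to finish. Passing the named constant gives WAIT_FOR_EXIT under both numberings.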
| Comment by Bob Glossman (Inactive) [ 06/Feb/15 ] |
|
Verified the mod does indeed fix the problem, at least for el7 clients. The problem can no longer be reproduced either by manual command line commands or by conf-sanity, test 76a. Good call, James! |
| Comment by Andreas Dilger [ 07/Feb/15 ] |
|
James, the details of the current lctl set_param -P implementation are in |
| Comment by Gerrit Updater [ 08/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13677/ |
| Comment by Peter Jones [ 08/Feb/15 ] |
|
Landed for 2.7 |