[LU-3798] replay-single test_86: configuration log errors Created: 20/Aug/13  Updated: 14/Dec/21  Resolved: 14/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: zfs

Issue Links:
Duplicate
is duplicated by LU-3800 Interop 2.1.5<->2.5 failure on test s... Resolved
Related
is related to LU-3155 Permanent parameters with lctl set_pa... Resolved
Severity: 3
Rank (Obsolete): 9816

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ae12d972-092d-11e3-b004-52540035b04c.

The sub-test test_86 failed with the following error:

test_86 failed with 5

Info required for matching: replay-single 86

Client console log:

02:47:43:if [ $running -eq 0 ] ; then
02:47:43:    mkdir -p /mnt/lustre;
02:47:43:    mount -t lustre -o user_xattr,acl,flock wtm-10vm7@tcp:/lustre /mnt/lustre;
02:47:43:    rc=$?;
02:47:43:fi;
02:47:43:exit $rc
02:47:43:LustreError: 152-6: Ignoring deprecated mount option 'acl'.
02:47:43:LustreError: 15c-8: MGC10.10.16.120@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
02:47:43:LustreError: 9196:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
02:47:43:Lustre: Unmounted lustre-client


 Comments   
Comment by Andreas Dilger [ 21/Aug/13 ]

I suspect this is some kind of race between mounting the MDS/MGS (which is slow for some reason) and remounting the client while the MGS is not yet ready. Could we have the client retry the mount a couple of times (with the "-o retry=5" mount option on the client) to see if that solves the problem?
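
For illustration, a minimal sketch of that suggested retry, reusing the mount command from the client console log above (the mount point, fsname, and MGS NID are taken from that log, not from the test scripts):

# same client mount as in the console log, but asking mount.lustre to retry
# up to 5 times in case the MGS is not yet ready
mkdir -p /mnt/lustre
mount -t lustre -o user_xattr,acl,flock,retry=5 wtm-10vm7@tcp:/lustre /mnt/lustre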

Comment by Andreas Dilger [ 21/Aug/13 ]

It looks like this test only started failing regularly on July 31st (only 4 failures ever before then, maybe once every few months), and has failed fairly consistently since then (40 failures in 21 days), so it is almost certainly a regression landed on 2013-07-31.

The patch in LU-3155 http://review.whamcloud.com/6025 would be a prime culprit, since it is one of the major changes to configuration that was landed at that time. It would be worthwhile to look through the test failures and find the latest commit that is common to all of them.

The query below finds all of the test_86 failures. For each one, go to the main "replay-single" test log and use the git commit hash recorded for that run to find the "parent" commit on which the patch was based. The latest common parent among all the failures (excepting possibly the failing patch itself) is the likely source of the regression.

https://maloo.whamcloud.com/sub_tests/query?commit=Update+results&page=2&sub_test[query_bugs]=&sub_test[status]=FAIL&sub_test[sub_test_script_id]=fcadf0d2-32c3-11e0-a61c-52540025f9ae&test_node[architecture_type_id]=&test_node[distribution_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node[os_type_id]=&test_node_network[network_type_id]=&test_session[query_date]=&test_session[query_recent_period]=2419200&test_session[test_group]=&test_session[test_host]=&test_session[user_id]=&test_set[test_set_script_id]=f6a12204-32c3-11e0-a61c-52540025f9ae&utf8=%E2%9C%93
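
For illustration, one way to compute that latest common parent once the per-run commit hashes have been collected (a sketch only, with placeholder hashes, assuming the commits under test have been fetched into a local lustre git tree):

# each placeholder stands for the commit under test in one failing run;
# merge-base --octopus prints the newest commit that is an ancestor of all of them,
# i.e. the latest common parent the failing patches were based on
git merge-base --octopus <hash-from-run-1> <hash-from-run-2> <hash-from-run-3>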

Comment by Peter Jones [ 26/Aug/13 ]

Bob

Could you please look into this one?

Thanks

Peter

Comment by Bob Glossman (Inactive) [ 28/Aug/13 ]

A couple of key features seen in all the instances found:
1) Only seen in review-zfs test runs, none in plain review runs.
2) The test_86 failure isn't the first failed test in replay-single. It always follows other failures, though not always the same ones; subtests of test_58 and test_85 are the most common.

Based on 1) we might want to get a ZFS expert looking at this.

I am continuing to try to narrow it down.

Comment by Bob Glossman (Inactive) [ 28/Aug/13 ]

Another fact: there haven't been any instances seen at all since 8/19. Wondering if the problem may have been fixed by a more recent commit.

Comment by Bob Glossman (Inactive) [ 28/Aug/13 ]

Surveying all the instances, the latest common parent is the commit for LU-3155. That is the same one Andreas called out as suspicious in his comment. The evidence is stacking up that it is the cause.

Comment by Bob Glossman (Inactive) [ 04/Sep/13 ]

Since the evidence suggests LU-3155 was the cause of this problem, we need the author to comment.

Comment by Bob Glossman (Inactive) [ 04/Sep/13 ]

There is now at least one counterexample to the connection of this bug to ZFS. At least one recent failure was seen in a review run: https://maloo.whamcloud.com/test_sets/5df51836-132a-11e3-8c44-52540035b04c

Still happens a lot more with zfs.

Comment by Artem Blagodarenko (Inactive) [ 05/Sep/13 ]

From replay-single.test_86.debug_log.client-26vm2.1378029549:

00000100:02020000:0.0:1378029519.217418:0:22466:0:(client.c:1168:ptlrpc_check_status()) 11-0: MGC10.10.4.154@tcp: Communicating with 10.10.4.154@tcp, operation ldlm_enqueue failed with -107.
.. 

10000000:01000000:0.0:1378029519.217530:0:22466:0:(mgc_request.c:1849:mgc_process_log()) Can't get cfg lock: -107
10000000:01000000:0.0:1378029519.217535:0:22466:0:(mgc_request.c:1868:mgc_process_log()) MGC10.10.4.154@tcp: configuration from log 'lustre-client' failed (-5).

The test is called "replay-single test 86: umount server after clear nid_stats should not hit LBUG", and -107 is ENOTCONN, so it looks like the server is unmounted before the client has finished processing the config log.
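
For reference, a rough manual reproduction of the same symptom (a sketch only, assuming a combined MGS/MDT mounted at /mnt/mds1 on a hypothetical node mds1 and the fsname "lustre"; this is not the exact test_86 sequence):

# unmount the MGS/MDT first, then attempt the client mount;
# with the MGS gone the client cannot take the config lock (the -107 / ENOTCONN above)
# and mgc_process_log() fails the 'lustre-client' log with -5, as in the console log
ssh mds1 umount /mnt/mds1
mount -t lustre mds1@tcp:/lustre /mnt/lustre   # expected: 15c-8 "configuration from log 'lustre-client' failed (-5)"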

Comment by Sarah Liu [ 25/Aug/15 ]

Hit this error in interop testing between a 2.7.0 server and a master RHEL6.6 client:
https://testing.hpdd.intel.com/test_sets/bbeec3b0-454b-11e5-a64b-5254006e85c2

Comment by Sarah Liu [ 14/Sep/15 ]

Another instance:

https://testing.hpdd.intel.com/test_sets/3d5c7066-5157-11e5-9f68-5254006e85c2
