[LU-3798] replay-single test_86: configuration log errors Created: 20/Aug/13 Updated: 14/Dec/21 Resolved: 14/Dec/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0, Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Bob Glossman (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9816 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>.
This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ae12d972-092d-11e3-b004-52540035b04c.
The sub-test test_86 failed with the following error:
Info required for matching: replay-single 86
Client console log:
02:47:43:if [ $running -eq 0 ] ; then
02:47:43: mkdir -p /mnt/lustre;
02:47:43: mount -t lustre -o user_xattr,acl,flock wtm-10vm7@tcp:/lustre /mnt/lustre;
02:47:43: rc=$?;
02:47:43:fi;
02:47:43:exit $rc
02:47:43:LustreError: 152-6: Ignoring deprecated mount option 'acl'.
02:47:43:LustreError: 15c-8: MGC10.10.16.120@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
02:47:43:LustreError: 9196:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
02:47:43:Lustre: Unmounted lustre-client |
| Comments |
| Comment by Andreas Dilger [ 21/Aug/13 ] |
|
I suspect this is some kind of race between mounting the MDS/MGS (which is slow for some reason) and remounting the client, so that the client tries to mount before the MGS is ready. We could have the client retry the mount a couple of times (with the "-o retry=5" mount option on the client) to see if that solves the problem. |
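The retry idea can also be sketched at the script level, as a rough illustration only. `retry_cmd` is a hypothetical helper name (it is not part of the Lustre test framework), and the attempt count and delay are arbitrary; the commented usage line reuses the mount command from the console log with the suggested `retry=5` option added.

```shell
# Hypothetical helper (not part of the Lustre test framework):
# retry_cmd ATTEMPTS DELAY CMD [ARGS...]
# Runs CMD until it succeeds or ATTEMPTS is exhausted, sleeping DELAY
# seconds between attempts. Returns 0 on success, otherwise the exit
# status of the last failed attempt.
retry_cmd() {
    _attempts=$1
    _delay=$2
    shift 2
    _i=1
    while :; do
        "$@" && return 0
        _rc=$?
        [ "$_i" -ge "$_attempts" ] && return "$_rc"
        _i=$((_i + 1))
        sleep "$_delay"
    done
}

# Usage (assumed values; the mount command is the one from the console log):
#   retry_cmd 5 2 mount -t lustre -o user_xattr,acl,flock,retry=5 \
#       wtm-10vm7@tcp:/lustre /mnt/lustre
```

A wrapper like this only hides the race; if the retried mount succeeds, that supports the theory that the MGS was not ready, but it does not explain why the MDS/MGS mount is slow.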
| Comment by Andreas Dilger [ 21/Aug/13 ] |
|
It looks like this test only started failing regularly on July 31st (only 4 failures ever before then, maybe once every few months), and it has failed fairly consistently since then (40 failures in 21 days), so it is almost certainly a regression that landed on 2013-07-31. The patch in The below query finds all of the test_86 failures. For each failure, go to the main "replay-single" test log, then use the specific git commit hash for that test to find the "parent" on which the patch was based. The latest common parent among all the failures (excepting possibly the failing patch itself) is the likely source of the regression. |
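The last step of that procedure can be sketched with git, once the commit hashes have been collected by hand from the test logs. This is a minimal sketch under two assumptions: each tested patch is a single commit whose first parent is the commit it was based on, and all of those commits are present in the local repository. `latest_common_parent` is a hypothetical helper name.

```shell
# Hypothetical helper: given the commit hashes of the failing test runs,
# print the latest commit that is an ancestor of every patch's parent.
# Assumes each patch is one commit on top of its base (first parent).
latest_common_parent() {
    _parents=""
    for _c in "$@"; do
        # "<commit>^" names the first parent of the tested patch
        _parents="$_parents $(git rev-parse "$_c^")" || return 1
    done
    # Intentionally unquoted so each hash becomes a separate argument
    git merge-base --octopus $_parents
}

# Usage (hashes are placeholders, to be taken from the test logs):
#   latest_common_parent <hash1> <hash2> <hash3>
```

The regression should be at or before the commit this prints, since every failing patch was built on top of it.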
| Comment by Peter Jones [ 26/Aug/13 ] |
|
Bob, could you please look into this one? Thanks, Peter |
| Comment by Bob Glossman (Inactive) [ 28/Aug/13 ] |
|
A couple of key features seen in all the instances found: based on 1) we might want to get a zfs expert looking at this. I am continuing to try to narrow it down. |
| Comment by Bob Glossman (Inactive) [ 28/Aug/13 ] |
|
Another fact: no instances have been seen at all since 8/19. I wonder whether the problem has been fixed by a more recent commit. |
| Comment by Bob Glossman (Inactive) [ 28/Aug/13 ] |
|
Surveying all the instances the latest common parent is the commit for |
| Comment by Bob Glossman (Inactive) [ 04/Sep/13 ] |
|
Since evidence suggests |
| Comment by Bob Glossman (Inactive) [ 04/Sep/13 ] |
|
There is now at least one counterexample to the connection of this bug to zfs. At least 1 recent failure was seen in a review run: https://maloo.whamcloud.com/test_sets/5df51836-132a-11e3-8c44-52540035b04c Still happens a lot more with zfs. |
| Comment by Artem Blagodarenko (Inactive) [ 05/Sep/13 ] |
|
From replay-single.test_86.debug_log.client-26vm2.1378029549:
00000100:02020000:0.0:1378029519.217418:0:22466:0:(client.c:1168:ptlrpc_check_status()) 11-0: MGC10.10.4.154@tcp: Communicating with 10.10.4.154@tcp, operation ldlm_enqueue failed with -107.
..
10000000:01000000:0.0:1378029519.217530:0:22466:0:(mgc_request.c:1849:mgc_process_log()) Can't get cfg lock: -107
10000000:01000000:0.0:1378029519.217535:0:22466:0:(mgc_request.c:1868:mgc_process_log()) MGC10.10.4.154@tcp: configuration from log 'lustre-client' failed (-5).
The test is called "replay-single test 86: umount server after clear nid_stats should not hit LBUG", so it looks like we unmount the server before the config file is processed. |
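The error chain in this excerpt (the config lock request failing with -107, then the config log failing with -5) can be pulled out of a debug log mechanically. A minimal sketch, assuming the log line shapes shown above; `errno_name` and `mgc_errors` are hypothetical helper names, and the errno table covers only the two Linux codes seen in this ticket (5 is EIO, 107 is ENOTCONN).

```shell
# Hypothetical helper: name the two Linux errno values seen in this ticket.
errno_name() {
    case "${1#-}" in
        5)   echo EIO ;;
        107) echo ENOTCONN ;;
        *)   echo UNKNOWN ;;
    esac
}

# Hypothetical helper: read debug log lines on stdin and, for each
# mgc_process_log() line that ends in a negative return code,
# print "<code> <errno name>".
mgc_errors() {
    grep 'mgc_process_log()' |
    sed -n 's/.*\(-[0-9][0-9]*\)[).]*$/\1/p' |
    while read -r rc; do
        echo "$rc $(errno_name "$rc")"
    done
}

# Usage (file name taken from the comment above):
#   mgc_errors < replay-single.test_86.debug_log.client-26vm2.1378029549
```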
| Comment by Sarah Liu [ 25/Aug/15 ] |
|
hit this error in interop testing between 2.7.0 server and master RHEL6.6 client: |
| Comment by Sarah Liu [ 14/Sep/15 ] |
|
another instance: https://testing.hpdd.intel.com/test_sets/3d5c7066-5157-11e5-9f68-5254006e85c2 |