
replay-single test_86: configuration log errors

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.0, Lustre 2.8.0
    • 3
    • 9816

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/ae12d972-092d-11e3-b004-52540035b04c.

      The sub-test test_86 failed with the following error:

      test_86 failed with 5

      Info required for matching: replay-single 86

      Client console log:

      02:47:43:if [ $running -eq 0 ] ; then
      02:47:43:    mkdir -p /mnt/lustre;
      02:47:43:    mount -t lustre -o user_xattr,acl,flock wtm-10vm7@tcp:/lustre /mnt/lustre;
      02:47:43:    rc=$?;
      02:47:43:fi;
      02:47:43:exit $rc
      02:47:43:LustreError: 152-6: Ignoring deprecated mount option 'acl'.
      02:47:43:LustreError: 15c-8: MGC10.10.16.120@tcp: The configuration from log 'lustre-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      02:47:43:LustreError: 9196:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
      02:47:43:Lustre: Unmounted lustre-client
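
      For reference, a minimal diagnostic sketch for re-running the failing client mount by hand and capturing the config-log processing (the NID, fsname, mount point, and options are taken from the console log above; the output path is arbitrary, and the steps are an illustration rather than the test's own procedure):

      lctl set_param debug=+config     # add config-log tracing to the debug mask
      lctl clear                       # start from an empty kernel debug buffer
      mkdir -p /mnt/lustre
      mount -t lustre -o user_xattr,acl,flock wtm-10vm7@tcp:/lustre /mnt/lustre || true
      lctl dk /tmp/test_86-mount.dk    # dump the debug buffer to a file
      grep -E 'mgc_process_log|ll_fill_super' /tmp/test_86-mount.dk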
      

      Attachments

        Issue Links

          Activity

            [LU-3798] replay-single test_86: configuration log errors
            sarah Sarah Liu added a comment - another instance: https://testing.hpdd.intel.com/test_sets/3d5c7066-5157-11e5-9f68-5254006e85c2
            sarah Sarah Liu added a comment -

            hit this error in interop testing between 2.7.0 server and master RHEL6.6 client:
            https://testing.hpdd.intel.com/test_sets/bbeec3b0-454b-11e5-a64b-5254006e85c2


            artem_blagodarenko Artem Blagodarenko (Inactive) added a comment - - edited

            from replay-single.test_86.debug_log.client-26vm2.1378029549 :

            00000100:02020000:0.0:1378029519.217418:0:22466:0:(client.c:1168:ptlrpc_check_status()) 11-0: MGC10.10.4.154@tcp: Communicating with 10.10.4.154@tcp, operation ldlm_enqueue failed with -107.
            .. 
            
            10000000:01000000:0.0:1378029519.217530:0:22466:0:(mgc_request.c:1849:mgc_process_log()) Can't get cfg lock: -107
            10000000:01000000:0.0:1378029519.217535:0:22466:0:(mgc_request.c:1868:mgc_process_log()) MGC10.10.4.154@tcp: configuration from log 'lustre-client' failed (-5).
            

            The test is called "replay-single test 86: umount server after clear nid_stats should not hit LBUG", so it looks like we unmount the server before the client has finished processing the configuration log.

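            A minimal sketch for spotting this pattern in a client debug log (the file name is taken from the comment above; the grep patterns match the quoted mgc_request.c and llite messages):

            log=replay-single.test_86.debug_log.client-26vm2.1378029549
            grep -E 'ptlrpc_check_status|mgc_process_log|ll_fill_super' "$log"

            Here -107 is -ENOTCONN (the MGC lost its connection to the MGS during the mount), and the configuration failure is then surfaced to ll_fill_super() as -5 (-EIO).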

            bogl Bob Glossman (Inactive) added a comment -

            There is now at least one counterexample to the connection of this bug to zfs: at least one recent failure was seen in a review run: https://maloo.whamcloud.com/test_sets/5df51836-132a-11e3-8c44-52540035b04c

            It still happens a lot more with zfs.

            bogl Bob Glossman (Inactive) added a comment - - edited

            Since evidence suggests LU-3155 was the cause of this problem, we need the author to comment.


            bogl Bob Glossman (Inactive) added a comment -

            Surveying all the instances, the latest common parent is the commit for LU-3155. That is the same one Andreas called out as suspicious in his comment. Evidence is stacking up that it's the cause.

            bogl Bob Glossman (Inactive) added a comment - - edited

            Another fact: there haven't been any instances seen at all since 8/19. I'm wondering if the problem may have been fixed by a more recent commit.


            bogl Bob Glossman (Inactive) added a comment -

            A couple of key features seen in all the instances found:
            1) only seen in review-zfs test runs, none in review
            2) the test_86 failure isn't the first failed test in replay-single; it always follows other failures, though not always the same ones. Subtests of test_58 and test_85 are the most common.

            Based on 1) we might want to get a zfs expert looking at this.

            I am continuing to try to narrow it down.

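            Following up on point 2) above, one way to check whether test_86 really depends on the earlier failures is to re-run it alone and together with the suspected predecessors using the standard lustre/tests ONLY mechanism (a sketch; the choice of 58/85 simply mirrors the observation above, and FSTYPE=zfs is assumed to select the zfs backend):

            cd lustre/tests
            FSTYPE=zfs ONLY=86 sh replay-single.sh            # test_86 in isolation
            FSTYPE=zfs ONLY="58 85 86" sh replay-single.sh    # test_86 after the usual predecessors

            If test_86 only fails in the second run, the problem is in the state left behind by the earlier subtests rather than in test_86 itself.
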
            pjones Peter Jones added a comment -

            Bob

            Could you please look into this one?

            Thanks

            Peter


            adilger Andreas Dilger added a comment -

            It looks like this test only started failing regularly on July 31st (only 4 failures ever before then, maybe once every few months), and has failed fairly consistently since then (40 failures in 21 days), so it is almost certainly a regression landed on 2013-07-31.

            The patch in LU-3155 http://review.whamcloud.com/6025 would be a prime culprit, since it is one of the major changes to configuration that was landed at that time. It would be worthwhile to look through the test failures and find the latest commit that is common to all of them.

            The query below finds all of the test_86 failures. For each one, go to the main "replay-single" test log and use that run's git commit hash to find the "parent" commit on which the patch was based. The latest common parent among all the failures (excepting possibly the failing patch itself) is the likely source of the regression.

            https://maloo.whamcloud.com/sub_tests/query?commit=Update+results&page=2&sub_test[query_bugs]=&sub_test[status]=FAIL&sub_test[sub_test_script_id]=fcadf0d2-32c3-11e0-a61c-52540025f9ae&test_node[architecture_type_id]=&test_node[distribution_type_id]=&test_node[file_system_type_id]=&test_node[lustre_branch_id]=24a6947e-04a9-11e1-bb5f-52540025f9af&test_node[os_type_id]=&test_node_network[network_type_id]=&test_session[query_date]=&test_session[query_recent_period]=2419200&test_session[test_group]=&test_session[test_host]=&test_session[user_id]=&test_set[test_set_script_id]=f6a12204-32c3-11e0-a61c-52540025f9ae&utf8=%E2%9C%93
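
            A sketch of that "latest common parent" search, assuming the git hash reported by each failing run has been collected (the hashes below are placeholders, not real failure data):

            cd lustre-release
            base=$(git merge-base --octopus 1111111 2222222 3333333)
            git log --oneline ${base}..origin/master   # candidate commits landed after the common ancestor

            If the LU-3155 change (http://review.whamcloud.com/6025) is the culprit, it should appear in that candidate list.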


            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: