Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker

    Activity

          [LU-1926] Reboots during test runs
          yujian Jian Yu added a comment -

          After running those failed tests manually, it turned out that these were Lustre failures:

          Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

          parallel-scale-nfsv4 test_compilebench: LU-1928
          parallel-scale-nfsv3 test_compilebench: LU-1928
          parallel-scale test_compilebench: LU-1925
          large-scale test_3a: LU-1927
          performance-sanity test_3: LU-1906, LU-1909, LU-1929
          sanity test_32n: LU-1863
          Console log and syslog missing issue: TT-875

          yujian Jian Yu added a comment -

          Chris, I'm still creating new LU tickets based on the manual test runs. After all of the tests are done and the corresponding LU tickets are created, we can close this one.

          The remaining issue for TT is that the Oops messages on the consoles were not gathered, and the syslogs on Maloo were empty.


          chris Chris Gearing (Inactive) added a comment -

          Just assigning this to you for clarity.


          chris Chris Gearing (Inactive) added a comment -

          From Skype

          [13/09/2012 08:21:29] Yu Jian

          aha, parallel-scale test compilebench hung in manual run, the MDS is being rebooted
          more logs occur
          ok, Chris, TT-851 is not a test environment issue, it's a Lustre issue, I'll gather logs, file new ticket and update TT-851
          thank you, Chris

          yujian Jian Yu added a comment -

          Oleg, Bobi, do you have any idea whether the above MDS reboot issue is a Lustre issue or a test environment issue? Thanks.
          This is really blocking the Lustre b2_3 testing now.

          yujian Jian Yu added a comment -

          Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

          The issue still exists:
          parallel-scale-nfsv4 test_compilebench: https://maloo.whamcloud.com/test_sets/f2b8c2b8-fc85-11e1-a4a6-52540035b04c
          parallel-scale-nfsv3 test_compilebench: https://maloo.whamcloud.com/test_sets/d241d4ca-fc85-11e1-a4a6-52540035b04c
          parallel-scale test_compilebench: https://maloo.whamcloud.com/test_sets/3b4a8f4e-fc85-11e1-a4a6-52540035b04c
          large-scale test_3a: https://maloo.whamcloud.com/test_sets/733bf24e-fc85-11e1-a4a6-52540035b04c

          BTW, all of the syslogs in the above reports are empty. I checked the syslogs from the brent node but still found nothing useful for debugging.

          However, compared to the results on b2_3 build #16, although performance-sanity test_3 and sanity test_32n also hit the MDS reboot issue, there are error messages in the MDS console logs on build #17 (no such messages on build #16); please refer to LU-1906, LU-1909 and LU-1863.

          So I'm not sure whether the above parallel-scale* and large-scale failures were caused by Lustre issues, as there were no specific error messages in their logs.

          yujian Jian Yu added a comment -

          BTW, autotest did not gather syslogs for the above reports; I had to find the syslogs in brent:/scratch/logs/syslog.
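
          (For anyone repeating this triage step: a minimal sketch of scanning a syslog copied from brent:/scratch/logs/syslog for MDS reboot markers is below. The marker strings and the local file name are assumptions, not autotest output; client-30vm3 is the MDS name mentioned elsewhere in this ticket.)

#!/usr/bin/env python
# Minimal sketch (not part of autotest): scan a syslog copied from
# brent:/scratch/logs/syslog for lines that suggest an MDS reboot or a
# Lustre error.  The marker strings below are assumptions.
import sys

MARKERS = (
    "Linux version",   # kernel banner printed at every boot
    "LustreError",     # Lustre console error prefix
)

def scan(path, host="client-30vm3"):
    """Yield (line_number, line) for lines from `host` matching a marker."""
    with open(path, errors="replace") as f:
        for lineno, line in enumerate(f, 1):
            if host in line and any(m in line for m in MARKERS):
                yield lineno, line.rstrip()

if __name__ == "__main__":
    for lineno, line in scan(sys.argv[1]):
        print("%8d  %s" % (lineno, line))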

          yujian Jian Yu added a comment -

          Hi Chris,
          Please take a look at my comments above on the syslog of performance-sanity: https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c. The MDS (client-30vm3) was rebooted before the timeout.


          chris Chris Gearing (Inactive) added a comment -

          My point is that the failures are real timeouts. The reboot happens after the test has timed out.

          And we do capture the logs; autotest was restarted because it had stopped capturing them.
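
          (A hedged sketch of how the ordering question above could be settled from timestamps: compare the first boot banner in the MDS syslog against the timeout message in the suite log. The file names and the "Timeout occurred" marker are hypothetical, and the parser assumes classic "Mon DD HH:MM:SS" syslog prefixes; adjust both to the actual log formats.)

#!/usr/bin/env python
# Sketch: did the MDS reboot before or after the test timed out?
# File names and the "Timeout occurred" marker are hypothetical; stamp()
# assumes classic "Mon DD HH:MM:SS" syslog prefixes.
from datetime import datetime

def stamp(line, year=2012):
    """Parse the leading 'Mon DD HH:MM:SS' timestamp of a syslog-style line."""
    return datetime.strptime("%d %s" % (year, " ".join(line.split()[:3])),
                             "%Y %b %d %H:%M:%S")

def first_time(path, needle):
    """Timestamp of the first line in `path` containing `needle`, or None."""
    with open(path, errors="replace") as f:
        for line in f:
            if needle in line:
                return stamp(line)
    return None

reboot = first_time("mds-syslog", "Linux version")      # first boot banner
timeout = first_time("suite.log", "Timeout occurred")   # hypothetical marker
if reboot and timeout:
    print("MDS rebooted %s the test timeout"
          % ("before" if reboot < timeout else "after"))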

          yujian Jian Yu added a comment -

          same with the other failures this morning, I can't see why the previous did not have the LBUG but I can see any that replicate it.

          The two conf-sanity failures are caused by http://review.whamcloud.com/#change,3671 and http://review.whamcloud.com/#change,3670, which are still under development.

          The failure instances I reported in this ticket are on the main b2_3 branch. There is no such LBUG.


          People

            yujian Jian Yu
            pjones Peter Jones