Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker

    Attachments

      Issue Links

        Activity

          [LU-1926] Reboots during test runs
          yujian Jian Yu added a comment -

          Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

          The issue still exists:
          parallel-scale-nfsv4 test_compilebench: https://maloo.whamcloud.com/test_sets/f2b8c2b8-fc85-11e1-a4a6-52540035b04c
          parallel-scale-nfsv3 test_compilebench: https://maloo.whamcloud.com/test_sets/d241d4ca-fc85-11e1-a4a6-52540035b04c
          parallel-scale test_compilebench: https://maloo.whamcloud.com/test_sets/3b4a8f4e-fc85-11e1-a4a6-52540035b04c
          large-scale test_3a: https://maloo.whamcloud.com/test_sets/733bf24e-fc85-11e1-a4a6-52540035b04c

          BTW, all of the syslogs in the above reports are empty. I checked the syslogs from the brent node but still found nothing useful for debugging.

          However, compared to the results on b2_3 build #16: although performance-sanity test_3 and sanity test_32n also hit the MDS reboot issue, there are error messages on the MDS console logs on build #17 (no such messages on build #16); please refer to LU-1906, LU-1909 and LU-1863.

          So I'm not sure whether the above parallel-scale* and large-scale failures were caused by Lustre issues or not, as there were no specific error messages in their logs.

          yujian Jian Yu added a comment -

          BTW, autotest did not gather syslogs for the above reports; I had to find the syslogs in brent:/scratch/logs/syslog.
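
          A minimal sketch of pulling the relevant slice out of brent:/scratch/logs/syslog; the node name and date pattern below are placeholders and would need to match the actual run:

          #!/bin/bash
          # Sketch: extract one node's syslog lines from the shared log host
          # mentioned above (brent:/scratch/logs/syslog). NODE and DATE_PATTERN
          # are placeholders for the run being debugged.
          LOG_HOST=brent
          LOG_FILE=/scratch/logs/syslog
          NODE=client-30vm3        # e.g. the MDS node (placeholder)
          DATE_PATTERN='Sep  8'    # syslog uses two spaces for single-digit days

          # Copy back only the matching lines instead of the whole syslog.
          ssh "$LOG_HOST" "grep '$DATE_PATTERN' $LOG_FILE | grep $NODE" \
              > "./${NODE}-syslog-extract.log"

          wc -l "./${NODE}-syslog-extract.log"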

          yujian Jian Yu added a comment -

          Hi Chris,
          Please take a look at my comments above on the syslog of performance-sanity: https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c; the MDS (client-30vm3) was rebooted before the timeout.


          chris Chris Gearing (Inactive) added a comment -

          My point is that the failures are real timeouts: the reboot happens after the test has timed out.

          And we do capture the logs; autotest was restarted because it had stopped capturing logs.
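
          A quick way to check that ordering in a captured console log is to pull out the autotest timeout marker and the ConMan disconnect/reconnect lines (the same lines quoted in the examples further down); the file name here is only a placeholder:

          #!/bin/bash
          # Sketch: show the autotest timeout and the console reboot window with
          # line numbers, so their order is obvious. CONSOLE_LOG is a placeholder
          # for a console log downloaded from the Maloo report.
          CONSOLE_LOG=${1:-console-client-30vm6.log}

          echo "Autotest timeout marker:"
          grep -n 'Timeout by autotest system' "$CONSOLE_LOG"

          echo "Console disconnect/reconnect (reboot window):"
          grep -n '<ConMan> Console' "$CONSOLE_LOG"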
          yujian Jian Yu added a comment -

          "same with the other failures this morning, I can't see why the previous did not have the LBUG but I can see any that replicate it."

          The two conf-sanity failures are caused by http://review.whamcloud.com/#change,3671 and http://review.whamcloud.com/#change,3670, which are still under development.

          The failure instances I reported in this ticket are on the main b2_3 branch. There is no such LBUG.

          bobijam Zhenyu Xu added a comment -

          We need to collect all relevant logs before rebooting the node; otherwise we know nothing about the issue.
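
          A rough sketch of what that collection step could look like (this is not what autotest does today; the node name and destination are placeholders), run from a node that can still reach the one about to be rebooted:

          #!/bin/bash
          # Sketch: save the logs that are lost on reboot before power-cycling a
          # stuck node. NODE and DEST are placeholders.
          NODE=client-30vm3
          DEST=/scratch/logs/pre-reboot
          STAMP=$(date +%Y%m%d-%H%M%S)

          mkdir -p "$DEST"

          # Kernel ring buffer, Lustre kernel debug buffer ('lctl dk') and syslog.
          # Each step is time-bounded in case the node is wedged.
          timeout 60 ssh "$NODE" dmesg                 > "$DEST/$NODE-dmesg-$STAMP.log"     || true
          timeout 60 ssh "$NODE" lctl dk               > "$DEST/$NODE-lustre-dk-$STAMP.log" || true
          timeout 60 ssh "$NODE" cat /var/log/messages > "$DEST/$NODE-messages-$STAMP.log"  || true

          echo "logs collected under $DEST; safe to reboot $NODE"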


          chris Chris Gearing (Inactive) added a comment -

          An issue did exist for a short while where the console log was not being shut before the report. This led to the reboot appearing in the test log when the test was being correctly timed out.

          I'm closing this issue as I believe this is now resolved.
          chris Chris Gearing (Inactive) added a comment - - edited

          Of the 4 examples from Yu Jian:

          1st
          https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c

          Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct.

          02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
          02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
          02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
          02:21:56:nfs: server client-30vm3 not responding, still trying
          03:10:59:********** Timeout by autotest system **********03:11:08:
          03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
          03:11:29:
          03:11:29:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 03:11.
          03:11:29:
          Press any key to continue.
          03:11:29:

          2nd
          https://maloo.whamcloud.com/test_sets/8ec57d46-fa73-11e1-887d-52540035b04c

          Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct.

          00:02:38:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 151 previous similar messages
          00:12:49:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347174744/real 1347174747] req@ffff880076483000 x1412612131656402/t0(0) o250->MGC10.10.4.182@tcp@10.10.4.182@tcp:26/25 lens 400/544 e 0 to 1 dl 1347174769 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          00:12:49:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 146 previous similar messages
          00:23:01:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347175374/real 1347175374] req@ffff880076483000 x1412612131657436/t0(0) o250->MGC10.10.4.182@tcp@10.10.4.182@tcp:26/25 lens 400/544 e 0 to 1 dl 1347175399 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
          00:23:01:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 151 previous similar messages
          00:26:58:********** Timeout by autotest system **********00:28:03:
          00:28:03:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 00:27.
          00:28:24:
          00:28:24:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 00:28.
          00:28:24:
          Press any key to continue.
          00:28:24:

          3rd
          https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c

          Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct.

          02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
          02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
          02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
          02:21:56:nfs: server client-30vm3 not responding, still trying
          03:10:59:********** Timeout by autotest system **********03:11:08:
          03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
          03:11:29:
          03:11:29:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 03:11.
          03:11:29:
          Press any key to continue.

          4th
          https://maloo.whamcloud.com/test_sessions/757ba820-fb85-11e1-8e05-52540035b04c

          Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct.

          02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
          02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
          02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
          02:21:56:nfs: server client-30vm3 not responding, still trying
          03:10:59:********** Timeout by autotest system **********03:11:08:
          03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
          03:11:29:
          03:11:29


          chris Chris Gearing (Inactive) added a comment -

          I'm looking at some more that look similar; we now have stack traces etc.

          1////

          https://maloo.whamcloud.com/test_sessions/0b9ab03c-fbed-11e1-a4a6-52540035b04c

          Conf Sanity reboots MDS

          https://maloo.whamcloud.com/test_logs/7b2c562e-fbef-11e1-a4a6-52540035b04c/show_text

          but prior to the reboot we see an assertion

          21:29:47:LustreError: 32529:0:(obd_class.h:527:obd_set_info_async()) obd_set_info_async: dev 0 no operation
          21:29:48:LustreError: 402:0:(mgs_internal.h:283:mgs_env_info()) ASSERTION( info != ((void *)0) ) failed:
          21:29:48:LustreError: 402:0:(mgs_internal.h:283:mgs_env_info()) LBUG
          21:29:48:Pid: 402, comm: llog_process_th

          2////

          https://maloo.whamcloud.com/test_sessions/a41f27a2-fbde-11e1-bde2-52540035b04c

          same on conf-sanity

          https://maloo.whamcloud.com/test_logs/2e3dc9cc-fbe2-11e1-bde2-52540035b04c/show_text

          again a reboot, but an assertion before that

          20:16:48:Lustre: MGS: Client 9711f524-c291-d894-f430-704c69e8e03d (at 0@lo) reconnecting
          20:16:48:LustreError: 32139:0:(obd_class.h:527:obd_set_info_async()) obd_set_info_async: dev 0 no operation
          20:16:48:LustreError: 32478:0:(mgs_internal.h:288:mgs_env_info()) ASSERTION( info != ((void *)0) ) failed:
          20:16:48:LustreError: 32478:0:(mgs_internal.h:288:mgs_env_info()) LBUG
          20:16:48:Pid: 32478, comm: llog_process_th

          same with the other failures this morning, I can't see why the previous did not have the LBUG but I can see any that replicate it.

          I'm going to create a ticket for the bugs I found this morning.
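
          For cross-checking, the new instances can be found by scanning a directory of downloaded console logs for the same assertion signature (the directory name is a placeholder):

          # Sketch: list console logs containing the mgs_env_info assertion, then
          # show the LBUG lines with file names and line numbers.
          grep -l 'mgs_env_info.*ASSERTION' ./console-logs/*.log
          grep -n 'LBUG' ./console-logs/*.log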

          yujian Jian Yu added a comment -

          performance-sanity: https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c

          The above report showed that the MDS (client-30vm3) was out of service during the test. Its syslog showed:

          Sep 8 20:30:43 client-30vm3 kernel: Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh
          Sep 8 20:30:43 client-30vm3 rshd[23422]: pam_unix(rsh:session): session closed for user root
          Sep 8 20:30:43 client-30vm3 xinetd[1549]: EXIT: shell status=0 pid=23422 duration=0(sec)
          Sep 8 20:31:33 client-30vm3 kernel: MTRR variable ranges enabled:
          Sep 8 20:31:33 client-30vm3 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
          Sep 8 20:31:33 client-30vm3 kernel: 1 disabled
          Sep 8 20:31:33 client-30vm3 kernel: 2 disabled
          Sep 8 20:31:33 client-30vm3 kernel: 3 disabled
          Sep 8 20:31:33 client-30vm3 kernel: 4 disabled
          Sep 8 20:31:33 client-30vm3 kernel: 5 disabled
          Sep 8 20:31:33 client-30vm3 kernel: 6 disabled
          Sep 8 20:31:33 client-30vm3 kernel: 7 disabled
          Sep 8 20:31:33 client-30vm3 kernel: PAT not supported by CPU.
          Sep 8 20:31:33 client-30vm3 kernel: initial memory mapped : 0 - 20000000

          The MDS node was indeed rebooted, for an unknown reason, during the test.
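
          A small sketch of how such a silent reboot shows up in a syslog extract (the file name is a placeholder): early-boot kernel messages like the MTRR lines above appear in the middle of the run, with no panic, oops or LBUG before them:

          #!/bin/bash
          # Sketch: flag early-boot kernel markers and any crash indication in a
          # node's syslog extract. SYSLOG is a placeholder file name.
          SYSLOG=${1:-client-30vm3-syslog-extract.log}

          echo "Early-boot markers (kernel restarted):"
          grep -nE 'MTRR variable ranges enabled|initial memory mapped|PAT not supported by CPU' "$SYSLOG"

          echo "Crash indications (if any) before the restart:"
          grep -nE 'Oops|panic|LBUG' "$SYSLOG"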

          yujian Jian Yu added a comment -

          More instances on b2_3 with the same issue:
          https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c
          https://maloo.whamcloud.com/test_sets/8ec57d46-fa73-11e1-887d-52540035b04c
          https://maloo.whamcloud.com/test_sets/891ac004-fa73-11e1-887d-52540035b04c
          https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c

          People

            yujian Jian Yu
            pjones Peter Jones
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:
              Resolved: