Details
- Type: Bug
- Resolution: Duplicate
- Priority: Blocker
Description
Attachments
Issue Links
- is duplicated by LU-1863: Test failure with MDS spontaneous rebooting (test suite sanity, subtest test_32n) (Resolved)
Activity
The failures are real timeouts; that is my point. The reboot happens after the test has timed out.
And we do capture the logs; autotest was restarted because it had stopped capturing logs.
The same applies to the other failures this morning: I can't see why the previous runs did not have the LBUG, but I can see many that replicate it.
The two conf-sanity failures are caused by http://review.whamcloud.com/#change,3671 and http://review.whamcloud.com/#change,3670, which are still under development.
The failure instances I reported in this ticket are on the main b2_3 branch. There is no such LBUG.
We need to collect all relevant logs before rebooting the node; otherwise we know nothing about the issue.
An issue did exist for a short while where the console log was not being shut off before the report. This led to the reboot appearing in the test log even when the test was being correctly timed out.
I'm closing this issue, as I believe it is now resolved.
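The ordering described above (timeout first, reboot captured afterwards) can be checked mechanically. A minimal sketch, not part of autotest, using marker strings visible in the console logs in this ticket:

```python
# Hypothetical helper: decide whether the reboot seen in a captured console
# log happened after the autotest timeout marker, i.e. the test was
# correctly timed out before the node went down.
TIMEOUT_MARKER = "********** Timeout by autotest system **********"
REBOOT_MARKER = "<ConMan> Console"  # disconnect/reconnect lines mark the reboot


def reboot_after_timeout(console_log: str) -> bool:
    timeout_pos = console_log.find(TIMEOUT_MARKER)
    reboot_pos = console_log.find(REBOOT_MARKER)
    if timeout_pos == -1 or reboot_pos == -1:
        return False
    return reboot_pos > timeout_pos
```

All four of the examples below satisfy this check: the ConMan disconnect/reconnect appears only after the timeout marker.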
Of the four examples from Yu Jian:
1st
https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
02:21:56:nfs: server client-30vm3 not responding, still trying
03:10:59:********** Timeout by autotest system **********03:11:08:
03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
03:11:29:
03:11:29:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 03:11.
03:11:29:
Press any key to continue.
03:11:29:
2nd
https://maloo.whamcloud.com/test_sets/8ec57d46-fa73-11e1-887d-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
00:02:38:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 151 previous similar messages
00:12:49:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347174744/real 1347174747] req@ffff880076483000 x1412612131656402/t0(0) o250->MGC10.10.4.182@tcp@10.10.4.182@tcp:26/25 lens 400/544 e 0 to 1 dl 1347174769 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
00:12:49:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 146 previous similar messages
00:23:01:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347175374/real 1347175374] req@ffff880076483000 x1412612131657436/t0(0) o250->MGC10.10.4.182@tcp@10.10.4.182@tcp:26/25 lens 400/544 e 0 to 1 dl 1347175399 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
00:23:01:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 151 previous similar messages
00:26:58:********** Timeout by autotest system **********00:28:03:
00:28:03:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 00:27.
00:28:24:
00:28:24:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 00:28.
00:28:24:
Press any key to continue.
00:28:24:
3rd
https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
02:21:56:nfs: server client-30vm3 not responding, still trying
03:10:59:********** Timeout by autotest system **********03:11:08:
03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
03:11:29:
03:11:29:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 03:11.
03:11:29:
Press any key to continue.
4th
https://maloo.whamcloud.com/test_sessions/757ba820-fb85-11e1-8e05-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
02:21:56:nfs: server client-30vm3 not responding, still trying
03:10:59:********** Timeout by autotest system **********03:11:08:
03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
03:11:29:
03:11:29
I'm looking at some more that look similar; we now have stack traces, etc.
1)
https://maloo.whamcloud.com/test_sessions/0b9ab03c-fbed-11e1-a4a6-52540035b04c
conf-sanity reboots the MDS:
https://maloo.whamcloud.com/test_logs/7b2c562e-fbef-11e1-a4a6-52540035b04c/show_text
but prior to the reboot we see an assertion:
21:29:47:LustreError: 32529:0:(obd_class.h:527:obd_set_info_async()) obd_set_info_async: dev 0 no operation
21:29:48:LustreError: 402:0:(mgs_internal.h:283:mgs_env_info()) ASSERTION( info != ((void *)0) ) failed:
21:29:48:LustreError: 402:0:(mgs_internal.h:283:mgs_env_info()) LBUG
21:29:48:Pid: 402, comm: llog_process_th
2)
https://maloo.whamcloud.com/test_sessions/a41f27a2-fbde-11e1-bde2-52540035b04c
The same on conf-sanity:
https://maloo.whamcloud.com/test_logs/2e3dc9cc-fbe2-11e1-bde2-52540035b04c/show_text
again a reboot, but with an assertion before it:
20:16:48:Lustre: MGS: Client 9711f524-c291-d894-f430-704c69e8e03d (at 0@lo) reconnecting
20:16:48:LustreError: 32139:0:(obd_class.h:527:obd_set_info_async()) obd_set_info_async: dev 0 no operation
20:16:48:LustreError: 32478:0:(mgs_internal.h:288:mgs_env_info()) ASSERTION( info != ((void *)0) ) failed:
20:16:48:LustreError: 32478:0:(mgs_internal.h:288:mgs_env_info()) LBUG
20:16:48:Pid: 32478, comm: llog_process_th
The same applies to the other failures this morning: I can't see why the previous runs did not have the LBUG, but I can see many that replicate it.
I'm going to create a ticket for the bugs I found this morning.
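To separate these LBUG reboots from the silent ones, a console log can be classified by whether an assertion precedes the reboot. A hypothetical helper, with marker strings taken from the excerpts above:

```python
def classify_reboot(console_log: str) -> str:
    """Classify a captured console log: 'LBUG' if an assertion failure
    precedes the reboot, 'timeout' if only the autotest timeout marker
    is present, 'silent' otherwise."""
    if "LBUG" in console_log or "ASSERTION(" in console_log:
        return "LBUG"
    if "Timeout by autotest system" in console_log:
        return "timeout"
    return "silent"
```

Under this split, the two conf-sanity failures above are 'LBUG', while Yu Jian's four examples are 'timeout'.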
performance-sanity: https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c
The above report showed that the MDS (client-30vm3) was out of service during the test. Its syslog showed that:
Sep 8 20:30:43 client-30vm3 kernel: Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh
Sep 8 20:30:43 client-30vm3 rshd[23422]: pam_unix(rsh:session): session closed for user root
Sep 8 20:30:43 client-30vm3 xinetd[1549]: EXIT: shell status=0 pid=23422 duration=0(sec)
Sep 8 20:31:33 client-30vm3 kernel: MTRR variable ranges enabled:
Sep 8 20:31:33 client-30vm3 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
Sep 8 20:31:33 client-30vm3 kernel: 1 disabled
Sep 8 20:31:33 client-30vm3 kernel: 2 disabled
Sep 8 20:31:33 client-30vm3 kernel: 3 disabled
Sep 8 20:31:33 client-30vm3 kernel: 4 disabled
Sep 8 20:31:33 client-30vm3 kernel: 5 disabled
Sep 8 20:31:33 client-30vm3 kernel: 6 disabled
Sep 8 20:31:33 client-30vm3 kernel: 7 disabled
Sep 8 20:31:33 client-30vm3 kernel: PAT not supported by CPU.
Sep 8 20:31:33 client-30vm3 kernel: initial memory mapped : 0 - 20000000
The MDS node was indeed rebooted for an unknown reason during the test.
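The reboot can be spotted mechanically in the syslog: the "MTRR variable ranges enabled" line is the first kernel boot message, so the gap between it and the preceding message bounds when the node went down. A minimal sketch (the boot marker, timestamp format, and year are assumptions based on the excerpts in this ticket):

```python
from datetime import datetime

BOOT_MARKER = "MTRR variable ranges enabled"  # first kernel boot message in these syslogs


def find_reboot_gap(syslog_lines, year=2012):
    """Return (timestamp of the last message before the boot marker,
    timestamp of the boot marker itself), or None if no boot marker is
    found. Timestamps are parsed from the standard syslog prefix,
    e.g. 'Sep 8 20:30:43'."""
    prev_ts = None
    for line in syslog_lines:
        month, day, clock = line.split()[:3]
        ts = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        if BOOT_MARKER in line:
            return prev_ts, ts
        prev_ts = ts
    return None
```

On the client-30vm3 excerpt above this bounds the reboot to the 50-second window between 20:30:43 and 20:31:33.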
More instances on b2_3 with the same issue:
https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c
https://maloo.whamcloud.com/test_sets/8ec57d46-fa73-11e1-887d-52540035b04c
https://maloo.whamcloud.com/test_sets/891ac004-fa73-11e1-887d-52540035b04c
https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/16
parallel-scale-nfsv4 test compilebench timed out: https://maloo.whamcloud.com/test_sets/f7a291b0-fa72-11e1-a03c-52540035b04c
There is no error message in the test output.
Console log on Client 2 (client-28vm2) showed that:
02:18:51:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:18:50 (1347182330)
02:18:51:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:18:51:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
02:30:43:nfs: server client-28vm7 not responding, still trying
03:18:59:********** Timeout by autotest system **********03:19:54:
Console log on MDS (client-28vm7) showed that:
02:18:56:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:18:50 (1347182330)
02:18:56:Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:mdt
02:18:56:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:18:56:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
03:18:59:********** Timeout by autotest system **********03:20:39:
The MDS (client-28vm7) was not responding during the test; however, there was no error message in its console log. Now let's take a look at its syslog from /scratch/logs/syslog/client-28vm7.log:
Sep 9 02:18:51 client-28vm7 kernel: Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
Sep 9 02:18:51 client-28vm7 rshd[5258]: pam_unix(rsh:session): session closed for user root
Sep 9 02:18:51 client-28vm7 xinetd[1661]: EXIT: shell status=0 pid=5258 duration=0(sec)
Sep 9 02:19:38 client-28vm7 ntpd[1669]: synchronized to 10.10.0.2, stratum 3
Sep 9 02:19:38 client-28vm7 ntpd[1669]: time reset -0.393295 s
Sep 9 02:19:38 client-28vm7 ntpd[1669]: kernel time sync status change 2001
Sep 9 02:24:25 client-28vm7 ntpd[1669]: synchronized to 10.10.0.1, stratum 2
Sep 9 03:21:09 client-28vm7 kernel: MTRR variable ranges enabled:
Sep 9 03:21:09 client-28vm7 kernel: 0 base 00E0000000 mask FFE0000000 uncachable
Sep 9 03:21:09 client-28vm7 kernel: 1 disabled
Sep 9 03:21:09 client-28vm7 kernel: 2 disabled
Sep 9 03:21:09 client-28vm7 kernel: 3 disabled
Sep 9 03:21:09 client-28vm7 kernel: 4 disabled
Sep 9 03:21:09 client-28vm7 kernel: 5 disabled
Sep 9 03:21:09 client-28vm7 kernel: 6 disabled
Sep 9 03:21:09 client-28vm7 kernel: 7 disabled
Sep 9 03:21:09 client-28vm7 kernel: PAT not supported by CPU.
Sep 9 03:21:09 client-28vm7 kernel: initial memory mapped : 0 - 20000000
As we can see, before the MDS node was rebooted at 03:21:09, there was no error message in either the console log or the syslog. The only suspect is the ntpd running on the MDS node during the test. I will check more test reports associated with this ticket.
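The silence window can be computed directly from the two timestamps above; a small sketch (the year 2012 is assumed from the epoch times elsewhere in these logs):

```python
from datetime import datetime

# Timestamps taken from the client-28vm7 syslog excerpt above:
# last message before the gap (ntpd sync) and first boot message.
last_msg = datetime.strptime("2012 Sep 9 02:24:25", "%Y %b %d %H:%M:%S")
first_boot = datetime.strptime("2012 Sep 9 03:21:09", "%Y %b %d %H:%M:%S")
silence = first_boot - last_msg
print(silence)  # 0:56:44 of silence before the boot messages
```

Nearly an hour with no output at all, so whatever killed the node left nothing behind in either log.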
Could you provide details of where the reboot was? The traces are long, and the test does not look particularly successful. Since you have already found the unexpected MDS reboots in the logs, it would be helpful to pass that information on to me.
Hi Chris,
Please take a look at my comments above on the syslog of performance-sanity: https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c. The MDS (client-30vm3) was rebooted before the timeout.