Of the 4 examples from Yu Jian:
1st
https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
02:21:56:nfs: server client-30vm3 not responding, still trying
03:10:59:********** Timeout by autotest system **********03:11:08:
03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
03:11:29:
03:11:29:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 03:11.
03:11:29:
Press any key to continue.
03:11:29:
2nd
https://maloo.whamcloud.com/test_sets/8ec57d46-fa73-11e1-887d-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
00:02:38:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 151 previous similar messages
00:12:49:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1347174744/real 1347174747] req@ffff880076483000 x1412612131656402/t0(0) o250->MGC10.10.4.182@tcp@10.10.4.182@tcp:26/25 lens 400/544 e 0 to 1 dl 1347174769 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
00:12:49:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 146 previous similar messages
00:23:01:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1347175374/real 1347175374] req@ffff880076483000 x1412612131657436/t0(0) o250->MGC10.10.4.182@tcp@10.10.4.182@tcp:26/25 lens 400/544 e 0 to 1 dl 1347175399 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
00:23:01:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 151 previous similar messages
00:26:58:********** Timeout by autotest system **********00:28:03:
00:28:03:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 00:27.
00:28:24:
00:28:24:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 00:28.
00:28:24:
Press any key to continue.
00:28:24:
3rd
https://maloo.whamcloud.com/test_sets/dc68d638-fa73-11e1-887d-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
02:21:56:nfs: server client-30vm3 not responding, still trying
03:10:59:********** Timeout by autotest system **********03:11:08:
03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
03:11:29:
03:11:29:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 03:11.
03:11:29:
Press any key to continue.
4th
https://maloo.whamcloud.com/test_sessions/757ba820-fb85-11e1-8e05-52540035b04c
Times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout itself is correct.
02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)
02:10:44:Lustre: DEBUG MARKER: /usr/sbin/lctl mark .\/compilebench -D \/mnt\/lustre\/d0.compilebench -i 2 -r 2 --makej
02:10:44:Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej
02:21:56:nfs: server client-30vm3 not responding, still trying
03:10:59:********** Timeout by autotest system **********03:11:08:
03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.
03:11:29:
03:11:29
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17
The issue still exists:
parallel-scale-nfsv4 test_compilebench: https://maloo.whamcloud.com/test_sets/f2b8c2b8-fc85-11e1-a4a6-52540035b04c
parallel-scale-nfsv3 test_compilebench: https://maloo.whamcloud.com/test_sets/d241d4ca-fc85-11e1-a4a6-52540035b04c
parallel-scale test_compilebench: https://maloo.whamcloud.com/test_sets/3b4a8f4e-fc85-11e1-a4a6-52540035b04c
large-scale test_3a: https://maloo.whamcloud.com/test_sets/733bf24e-fc85-11e1-a4a6-52540035b04c
BTW, all of the syslogs in the above reports are empty. I checked the syslogs from the brent node but still found nothing useful for debugging.
However, compared with the results on b2_3 build #16, although performance-sanity test_3 and sanity test_32n also hit the MDS reboot issue, there are error messages on the MDS console logs on build #17 (no such messages on build #16); please refer to LU-1906, LU-1909 and LU-1863. So, I'm not sure whether the above parallel-scale* and large-scale failures were caused by Lustre issues, although there were no specific error messages in their logs.
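For reference, the "timeout first, reboot captured after" ordering seen in all of the examples above can be checked mechanically by scanning a console log for the autotest timeout marker followed by a ConMan reconnect. This is only a minimal sketch, not part of the autotest system: the two marker strings are taken verbatim from the logs above, while the function itself and its name are hypothetical helper code.

```python
import re

# Marker strings copied from the console logs above; the rest is a
# hypothetical helper, not autotest code.
TIMEOUT_MARK = "********** Timeout by autotest system **********"
RECONNECT_RE = re.compile(r"<ConMan> Console \[(\S+)\] connected")

def timeout_then_reboot(lines):
    """Return True if the autotest timeout marker appears before a
    ConMan reconnect (i.e. the reboot was captured after the timeout)."""
    timeout_at = None
    for i, line in enumerate(lines):
        if timeout_at is None and TIMEOUT_MARK in line:
            timeout_at = i
        elif timeout_at is not None and RECONNECT_RE.search(line):
            return True  # reconnect seen after the timeout marker
    return False

# Example with lines from the 1st report:
log = [
    "03:10:59:********** Timeout by autotest system **********03:11:08:",
    "03:11:08:<ConMan> Console [client-30vm6] disconnected from <client-30:6005> at 09-09 03:11.",
    "03:11:29:<ConMan> Console [client-30vm6] connected to <client-30:6005> at 09-09 03:11.",
]
print(timeout_then_reboot(log))
```

Note that the regex deliberately requires `] connected`, so the intermediate `disconnected` line does not count as the reconnect.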