[LU-1926] Reboots during test runs Created: 30/Aug/12 Updated: 31/Jul/13 Resolved: 13/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Peter Jones | Assignee: | Jian Yu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | releases |
| Issue Links: | |
| Story Points: | 1 |
| Business Value: | 0.1 |
| Severity: | 3 |
| Rank (Obsolete): | 2220 |
| Description |
|
See https://maloo.whamcloud.com/test_sessions/9ddf73ba-f23d-11e1-87d6-52540035b04c |
| Comments |
| Comment by Chris Gearing (Inactive) [ 30/Aug/12 ] |
|
I have no idea; we don't even know for sure whether some other entity caused the reboots. Nothing in the logs suggests that autotest rebooted the nodes. We shall just have to monitor the situation. |
| Comment by Peter Jones [ 02/Sep/12 ] |
|
Didn't you isolate and fix this one? |
| Comment by Chris Gearing (Inactive) [ 03/Sep/12 ] |
|
Fixed by TT-852 and the upgrade over the weekend of 1/9/2012. |
| Comment by Sarah Liu [ 05/Sep/12 ] |
|
It looks like the MDS rebooted for some reason. |
| Comment by Chris Gearing (Inactive) [ 10/Sep/12 ] |
|
Could you provide details of where the reboot was? The traces are long and the test does not look particularly successful. Since you have already found the unexpected MDS reboots in the logs, it would help to pass that information on to me. |
| Comment by Jian Yu [ 10/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/16

parallel-scale-nfsv4 test compilebench timed out: https://maloo.whamcloud.com/test_sets/f7a291b0-fa72-11e1-a03c-52540035b04c

There is no error message in the test output. The console log on Client 2 (client-28vm2) showed:

02:18:51:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:18:50 (1347182330)

The console log on the MDS (client-28vm7) showed:

02:18:56:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:18:50 (1347182330)

The MDS (client-28vm7) was not responding during the test, yet there was no error message in its console log. Now let's take a look at its syslog from /scratch/logs/syslog/client-28vm7.log:

Sep 9 02:18:51 client-28vm7 kernel: Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.compilebench -i 2 -r 2 --makej

As we can see, before the MDS node was rebooted at 03:21:09 there was no error message in either the console log or the syslog. The only suspect is the ntpd running on the MDS node during the test. I will check more test reports associated with this ticket. |
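For reference, a minimal sketch of how the syslog window around that marker can be pulled out; the log path is the one quoted above, while the grep pattern and the amount of surrounding context are arbitrary choices:

```bash
# Minimal sketch: print the syslog lines around the compilebench DEBUG MARKER
# on the MDS. The path comes from the comment above; the context sizes
# (50 lines before, 200 after) are arbitrary.
LOG=/scratch/logs/syslog/client-28vm7.log
grep -n 'DEBUG MARKER: .*compilebench' "$LOG" | cut -d: -f1 |
while read -r n; do
    start=$(( n > 50 ? n - 50 : 1 ))
    sed -n "${start},$(( n + 200 ))p" "$LOG"
done
```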
| Comment by Jian Yu [ 10/Sep/12 ] |
|
More instances on b2_3 with the same issue: |
| Comment by Jian Yu [ 10/Sep/12 ] |
|
performance-sanity: https://maloo.whamcloud.com/test_sets/7c320096-fa73-11e1-887d-52540035b04c

The above report showed that the MDS (client-30vm3) was out of service during the test. Its syslog showed:

Sep 8 20:30:43 client-30vm3 kernel: Lustre: DEBUG MARKER: ===== mdsrate-create-small.sh

The MDS node was indeed rebooted for an unknown reason during the test. |
| Comment by Chris Gearing (Inactive) [ 11/Sep/12 ] |
|
I'm looking at some more that look similar; we now have stack traces etc.

1. https://maloo.whamcloud.com/test_sessions/0b9ab03c-fbed-11e1-a4a6-52540035b04c conf-sanity reboots the MDS https://maloo.whamcloud.com/test_logs/7b2c562e-fbef-11e1-a4a6-52540035b04c/show_text but prior to the reboot we see an assertion: 21:29:47:LustreError: 32529:0:(obd_class.h:527:obd_set_info_async()) obd_set_info_async: dev 0 no operation

2. https://maloo.whamcloud.com/test_sessions/a41f27a2-fbde-11e1-bde2-52540035b04c same on conf-sanity https://maloo.whamcloud.com/test_logs/2e3dc9cc-fbe2-11e1-bde2-52540035b04c/show_text again a reboot, but an assertion before that: 20:16:48:Lustre: MGS: Client 9711f524-c291-d894-f430-704c69e8e03d (at 0@lo) reconnecting

The same goes for the other failures this morning. I can't see why the previous ones did not have the LBUG, but I can't see any that replicate it. I'm going to create a ticket for the bugs I found this morning. |
| Comment by Chris Gearing (Inactive) [ 11/Sep/12 ] |
|
Of Yu Jian's four examples:

1st: times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct. 02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)

2nd: times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct. 00:02:38:Lustre: 2964:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 151 previous similar messages

3rd: times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct. 02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836)

4th: times out correctly and reboots; the timeout message can be seen. Strangely, the reboot is still captured after the timeout, but the timeout is correct. 02:10:44:Lustre: DEBUG MARKER: == parallel-scale-nfsv4 test compilebench: compilebench == 02:10:36 (1347181836) |
| Comment by Chris Gearing (Inactive) [ 11/Sep/12 ] |
|
An issue did exist for a short while where the console log was not being shut before the report. This led to the reboot appearing in the test log when the test was being correctly timed out. I'm closing this issue as I believe it is now resolved. |
| Comment by Zhenyu Xu [ 11/Sep/12 ] |
|
We need to collect all relevant logs before rebooting the node; otherwise we know nothing about the issue. |
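A minimal sketch of such a pre-reboot collection step, assuming lctl is available on the node; the destination host and directory are only placeholders, not the real autotest setup:

```bash
# Minimal sketch: save the Lustre kernel debug buffer and the syslog off the
# node before it is rebooted, so there is something to analyse afterwards.
# "collector-host" and the target directory are placeholders.
lctl dk > /tmp/lustre-debug.$(hostname).log
scp /tmp/lustre-debug.$(hostname).log /var/log/messages \
    collector-host:/scratch/logs/pre-reboot/
```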
| Comment by Jian Yu [ 11/Sep/12 ] |
The two conf-sanity failures are caused by http://review.whamcloud.com/#change,3671 and http://review.whamcloud.com/#change,3670, which are still under development. The failure instances I reported in this ticket are on the main b2_3 branch. There is no such LBUG. |
| Comment by Chris Gearing (Inactive) [ 11/Sep/12 ] |
|
My point is that the failures are real timeouts. The reboot happens after the test has timed out, and we do capture the logs; autotest was restarted because it had stopped capturing logs. |
| Comment by Jian Yu [ 11/Sep/12 ] |
|
Hi Chris, |
| Comment by Jian Yu [ 11/Sep/12 ] |
|
BTW, autotest did not gather syslogs for the above reports; I had to dig the syslogs out of brent:/scratch/logs/syslog. |
| Comment by Jian Yu [ 12/Sep/12 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

The issue still exists.

BTW, all of the syslogs in the above reports are empty. I checked the syslogs from the brent node but still found nothing useful for debugging. However, compared to the results on b2_3 build #16, although performance-sanity test_3 and sanity test_32n also hit the MDS reboot issue, there are error messages in the MDS console logs on build #17 (no such messages on build #16); please refer to

So, I'm not sure whether the above parallel-scale* and large-scale failures were caused by Lustre issues or not, although there were no specific error messages in their logs. |
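A rough sketch of the kind of scan used to compare the console logs of the two builds; the local directory layout and file names are hypothetical, only the error patterns come from the logs discussed in this ticket:

```bash
# Rough sketch: scan locally downloaded MDS console logs for the error
# patterns seen in this ticket. The console-logs/ directory layout is
# a made-up example.
for log in console-logs/b2_3-build17/*mds*.log; do
    echo "== $log =="
    grep -E 'LustreError|LBUG|Oops|Call Trace' "$log" || echo "(no error messages found)"
done
```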
| Comment by Jian Yu [ 12/Sep/12 ] |
|
Oleg, Bobi, do you have any ideas about whether the above MDS reboot issue is a Lustre issue or a test environment issue? Thanks. |
| Comment by Chris Gearing (Inactive) [ 13/Sep/12 ] |
|
From Skype:

[13/09/2012 08:21:29] Yu Jian: aha, parallel-scale test compilebench hung in manual run, the MDS is being rebooted |
| Comment by Chris Gearing (Inactive) [ 13/Sep/12 ] |
|
Just assigning this to you for clarity. |
| Comment by Jian Yu [ 13/Sep/12 ] |
|
Chris, I'm still creating new LU tickets based on the manual test runs. After all of the tests are done and the corresponding LU tickets are created, we can close this one. The remaining issue for TT is that the Oopses on the consoles were not gathered, and the syslogs on Maloo were empty. |
| Comment by Jian Yu [ 13/Sep/12 ] |
|
After running those failed tests manually, they turned out to be Lustre failures:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_3/17

parallel-scale-nfsv4 test_compilebench: |
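A minimal sketch of how such a manual re-run can be driven, assuming the lustre-tests scripts are installed in the usual location; the ONLY= filter follows the standard test-framework.sh convention, and the environment configuration (node lists, devices, NETTYPE) is assumed to be set up elsewhere:

```bash
# Minimal sketch: re-run only the compilebench subtest of parallel-scale-nfsv4
# from the installed test suite, keeping a copy of the output for the ticket.
# The install path is the usual lustre-tests location; cluster configuration
# is assumed to already be in place.
cd /usr/lib64/lustre/tests
ONLY=compilebench sh parallel-scale-nfsv4.sh 2>&1 | tee /tmp/parallel-scale-nfsv4.log
```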