[LU-6883] replay-single test 73a hang and timeout: umount Created: 20/Jul/15 Updated: 26/Aug/15 Resolved: 25/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
review-dne-part-2 in autotest |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
replay-single test 73a is hanging on umount and timesout. Logs for these failures all take place during review-dne-part-2 and are at: On the client console, we have an error: 14:37:16:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616) 14:37:16:Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname) 14:37:16:Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi 14:37:16:Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000 14:37:16:LustreError: 11-0: lustre-MDT0000-mdc-ffff880037d86400: operation obd_ping to node 10.1.5.72@tcp failed: rc = -107 Trouble seems to be on the MDS1: 13:37:27:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616) 13:37:27:Lustre: DEBUG MARKER: sync; sync; sync 13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno 13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly 13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) *** setting lustre-MDT0000 read-only *** 13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) Skipped 2 previous similar messages 13:37:27:Turning device dm-0 (0xfd00000) read-only 13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000 13:37:27:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000 13:37:27:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000302 13:37:27:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts 13:37:27:Lustre: DEBUG MARKER: umount -d /mnt/mds1 13:37:27:Lustre: Failing over lustre-MDT0000 13:37:27:Lustre: Skipped 6 previous similar messages 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) 13:37:27:Lustre: Skipped 7 previous similar messages 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping) 13:37:27:Lustre: Skipped 5 previous similar messages 13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages 13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1 13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages 13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5 13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages 13:37:27:LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) 13:37:27:Lustre: Skipped 7 previous similar messages 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) 13:37:27:Lustre: Skipped 1 previous similar message 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) 13:37:27:Lustre: Skipped 15 previous similar messages 14:37:08:********** Timeout by autotest system ********** From the MDS syslog: Jul 18 13:36:58 shadow-26vm12 mrshd[14211]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='/usr/sbin/lctl mark "umount -d /mnt/mds1";echo XXRETCODE:$?' Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: DEBUG MARKER: umount -d /mnt/mds1 Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: EXIT: mshell status=0 pid=14210 duration=0(sec) Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: START: mshell pid=14233 from=::ffff:10.1.5.70 Jul 18 13:36:58 shadow-26vm12 mrshd[14234]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d /mnt/mds1");echo XXRETCODE:$?' Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Failing over lustre-MDT0000 Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Skipped 6 previous similar messages Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping) Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 5 previous similar messages Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1 Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5 Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5 Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: Skipped 1 previous similar message Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: Skipped 15 previous similar messages Jul 18 13:37:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: Skipped 29 previous similar messages Jul 18 13:37:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: Skipped 49 previous similar messages Jul 18 13:37:45 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied Jul 18 13:38:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: Skipped 109 previous similar messages Jul 18 13:39:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: Skipped 207 previous similar messages Jul 18 13:39:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied Jul 18 13:41:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: Skipped 415 previous similar messages Jul 18 13:45:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: Skipped 817 previous similar messages Jul 18 13:54:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: Skipped 1645 previous similar messages Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.70@tcp (stopping) Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: Skipped 1920 previous similar messages Jul 18 14:11:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck? Jul 18 14:12:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied Jul 18 14:14:22 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 21/Jul/15 ] |
|
Another occurrences: https://testing.hpdd.intel.com/test_sessions/08854c48-2ce8-11e5-804c-5254006e85c2 |
| Comment by James Nunez (Inactive) [ 21/Jul/15 ] |
|
More failures for master during review-dne-part-2: |
| Comment by Bob Glossman (Inactive) [ 29/Jul/15 ] |
|
another: |
| Comment by Jian Yu [ 30/Jul/15 ] |
|
More failure instance on master branch: |
| Comment by James Nunez (Inactive) [ 30/Jul/15 ] |
|
A couple more: |
| Comment by Peter Jones [ 30/Jul/15 ] |
|
Lai Could you please look into this failure? This is causing several review failures per day atm Peter |
| Comment by Andreas Dilger [ 07/Aug/15 ] |
|
Lai, any update on this issue? |
| Comment by Lai Siyao [ 11/Aug/15 ] |
|
I couldn't reproduce in local test system, and because this is a timeout issue, the debug log doesn't contain any useful information. I've asked Di to help look into it. |
| Comment by Joseph Gmitter (Inactive) [ 25/Aug/15 ] |
|
Hi Di, |
| Comment by Di Wang [ 25/Aug/15 ] |
|
Hmm, there are not enough information for me there to know the exact reason. But it looks like this failures is caused by something in replay-single.sh 70b, and probably the recent fixes from |
| Comment by Peter Jones [ 25/Aug/15 ] |
|
Thanks Di! |