Details
-
Bug
-
Resolution: Cannot Reproduce
-
Blocker
-
None
-
Lustre 2.8.0
-
None
-
review-dne-part-2 in autotest
-
3
-
9223372036854775807
Description
replay-single test 73a is hanging on umount and timesout. Logs for these failures all take place during review-dne-part-2 and are at:
2015-07-17 16:05:47 - https://testing.hpdd.intel.com/test_sets/0c6e08a4-2ce8-11e5-804c-5254006e85c2
2015-07-17 20:34:00 - https://testing.hpdd.intel.com/test_sets/9ce10940-2d10-11e5-b883-5254006e85c2
2015-07-18 11:21:11 - https://testing.hpdd.intel.com/test_sets/de25482e-2d8a-11e5-a112-5254006e85c2
2015-07-18 12:32:17 - https://testing.hpdd.intel.com/test_sets/cf00fd84-2d8f-11e5-b883-5254006e85c2
2015-07-18 21:50:59 - https://testing.hpdd.intel.com/test_sets/7975eb6c-2dea-11e5-a0dd-5254006e85c2
On the client console, we have an error:
14:37:16:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616) 14:37:16:Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname) 14:37:16:Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi 14:37:16:Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000 14:37:16:LustreError: 11-0: lustre-MDT0000-mdc-ffff880037d86400: operation obd_ping to node 10.1.5.72@tcp failed: rc = -107
Trouble seems to be on the MDS1:
13:37:27:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616) 13:37:27:Lustre: DEBUG MARKER: sync; sync; sync 13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno 13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly 13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) *** setting lustre-MDT0000 read-only *** 13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) Skipped 2 previous similar messages 13:37:27:Turning device dm-0 (0xfd00000) read-only 13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000 13:37:27:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000 13:37:27:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000302 13:37:27:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts 13:37:27:Lustre: DEBUG MARKER: umount -d /mnt/mds1 13:37:27:Lustre: Failing over lustre-MDT0000 13:37:27:Lustre: Skipped 6 previous similar messages 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) 13:37:27:Lustre: Skipped 7 previous similar messages 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping) 13:37:27:Lustre: Skipped 5 previous similar messages 13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages 13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1 13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages 13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5 13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages 13:37:27:LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) 13:37:27:Lustre: Skipped 7 previous similar messages 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) 13:37:27:Lustre: Skipped 1 previous similar message 13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) 13:37:27:Lustre: Skipped 15 previous similar messages 14:37:08:********** Timeout by autotest system **********
From the MDS syslog:
Jul 18 13:36:58 shadow-26vm12 mrshd[14211]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='/usr/sbin/lctl mark "umount -d /mnt/mds1";echo XXRETCODE:$?' Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: DEBUG MARKER: umount -d /mnt/mds1 Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: EXIT: mshell status=0 pid=14210 duration=0(sec) Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: START: mshell pid=14233 from=::ffff:10.1.5.70 Jul 18 13:36:58 shadow-26vm12 mrshd[14234]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d /mnt/mds1");echo XXRETCODE:$?' Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Failing over lustre-MDT0000 Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Skipped 6 previous similar messages Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping) Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 5 previous similar messages Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1 Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5 Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5 Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: Skipped 1 previous similar message Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: Skipped 15 previous similar messages Jul 18 13:37:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: Skipped 29 previous similar messages Jul 18 13:37:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: Skipped 49 previous similar messages Jul 18 13:37:45 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied Jul 18 13:38:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: Skipped 109 previous similar messages Jul 18 13:39:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: Skipped 207 previous similar messages Jul 18 13:39:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied Jul 18 13:41:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: Skipped 415 previous similar messages Jul 18 13:45:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping) Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: Skipped 817 previous similar messages Jul 18 13:54:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck? Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping) Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: Skipped 1645 previous similar messages Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.70@tcp (stopping) Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: Skipped 1920 previous similar messages Jul 18 14:11:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck? Jul 18 14:12:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied Jul 18 14:14:22 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)