Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-6883

replay-single test 73a hang and timeout: umount

    XMLWordPrintable

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Blocker
    • None
    • Lustre 2.8.0
    • None
    • review-dne-part-2 in autotest
    • 3
    • 9223372036854775807

    Description

      replay-single test 73a is hanging on umount and timesout. Logs for these failures all take place during review-dne-part-2 and are at:
      2015-07-17 16:05:47 - https://testing.hpdd.intel.com/test_sets/0c6e08a4-2ce8-11e5-804c-5254006e85c2
      2015-07-17 20:34:00 - https://testing.hpdd.intel.com/test_sets/9ce10940-2d10-11e5-b883-5254006e85c2
      2015-07-18 11:21:11 - https://testing.hpdd.intel.com/test_sets/de25482e-2d8a-11e5-a112-5254006e85c2
      2015-07-18 12:32:17 - https://testing.hpdd.intel.com/test_sets/cf00fd84-2d8f-11e5-b883-5254006e85c2
      2015-07-18 21:50:59 - https://testing.hpdd.intel.com/test_sets/7975eb6c-2dea-11e5-a0dd-5254006e85c2

      On the client console, we have an error:

      14:37:16:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616)
      14:37:16:Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
      14:37:16:Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
      14:37:16:Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
      14:37:16:LustreError: 11-0: lustre-MDT0000-mdc-ffff880037d86400: operation obd_ping to node 10.1.5.72@tcp failed: rc = -107
      

      Trouble seems to be on the MDS1:

      13:37:27:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616)
      13:37:27:Lustre: DEBUG MARKER: sync; sync; sync
      13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
      13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
      13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) *** setting lustre-MDT0000 read-only ***
      13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) Skipped 2 previous similar messages
      13:37:27:Turning device dm-0 (0xfd00000) read-only
      13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      13:37:27:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      13:37:27:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000302
      13:37:27:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
      13:37:27:Lustre: DEBUG MARKER: umount -d /mnt/mds1
      13:37:27:Lustre: Failing over lustre-MDT0000
      13:37:27:Lustre: Skipped 6 previous similar messages
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      13:37:27:Lustre: Skipped 7 previous similar messages
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping)
      13:37:27:Lustre: Skipped 5 previous similar messages
      13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5
      13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages
      13:37:27:LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      13:37:27:Lustre: Skipped 7 previous similar messages
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      13:37:27:Lustre: Skipped 1 previous similar message
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      13:37:27:Lustre: Skipped 15 previous similar messages
      14:37:08:********** Timeout by autotest system **********
      

      From the MDS syslog:

      Jul 18 13:36:58 shadow-26vm12 mrshd[14211]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='/usr/sbin/lctl mark "umount -d /mnt/mds1";echo XXRETCODE:$?'
      Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: DEBUG MARKER: umount -d /mnt/mds1
      Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: EXIT: mshell status=0 pid=14210 duration=0(sec)
      Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: START: mshell pid=14233 from=::ffff:10.1.5.70
      Jul 18 13:36:58 shadow-26vm12 mrshd[14234]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d /mnt/mds1");echo XXRETCODE:$?'
      Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Failing over lustre-MDT0000
      Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Skipped 6 previous similar messages
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping)
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 5 previous similar messages
      Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5
      Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages
      Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: Skipped 1 previous similar message
      Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: Skipped 15 previous similar messages
      Jul 18 13:37:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: Skipped 29 previous similar messages
      Jul 18 13:37:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: Skipped 49 previous similar messages
      Jul 18 13:37:45 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied
      Jul 18 13:38:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: Skipped 109 previous similar messages
      Jul 18 13:39:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: Skipped 207 previous similar messages
      Jul 18 13:39:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied
      Jul 18 13:41:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: Skipped 415 previous similar messages
      Jul 18 13:45:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: Skipped 817 previous similar messages
      Jul 18 13:54:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: Skipped 1645 previous similar messages
      Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.70@tcp (stopping)
      Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: Skipped 1920 previous similar messages
      Jul 18 14:11:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 14:12:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied
      Jul 18 14:14:22 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      

      Attachments

        Issue Links

          Activity

            People

              laisiyao Lai Siyao
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: