[LU-6883] replay-single test 73a hang and timeout: umount

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • Affects Version/s: None
    • Fix Version/s: Lustre 2.8.0
    • Labels: None
    • Environment: review-dne-part-2 in autotest
    • Severity: 3

    Description

      replay-single test 73a hangs on umount and times out. These failures all occurred during review-dne-part-2; the logs are at:
      2015-07-17 16:05:47 - https://testing.hpdd.intel.com/test_sets/0c6e08a4-2ce8-11e5-804c-5254006e85c2
      2015-07-17 20:34:00 - https://testing.hpdd.intel.com/test_sets/9ce10940-2d10-11e5-b883-5254006e85c2
      2015-07-18 11:21:11 - https://testing.hpdd.intel.com/test_sets/de25482e-2d8a-11e5-a112-5254006e85c2
      2015-07-18 12:32:17 - https://testing.hpdd.intel.com/test_sets/cf00fd84-2d8f-11e5-b883-5254006e85c2
      2015-07-18 21:50:59 - https://testing.hpdd.intel.com/test_sets/7975eb6c-2dea-11e5-a0dd-5254006e85c2
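
      For orientation, the test's flow can be pieced together from the DEBUG MARKER lines in the logs below. The sketch here is a reconstruction, not a verbatim copy of replay-single.sh; replay_barrier and fail are real test-framework.sh helpers, but the exact test body and the meaning of fail_loc 0x302 are assumptions:

      # Rough reconstruction of the replay-single.sh test_73a flow, pieced
      # together from the DEBUG MARKER lines in this ticket. replay_barrier
      # and fail are test-framework.sh helpers; the exact test body and the
      # fail_loc semantics are assumptions, not copied from the source.

      # open(O_CREAT) then unlink, so replay must recreate an open-unlinked file
      mcreate /mnt/lustre/fsa-$(hostname)
      rm /mnt/lustre/fsa-$(hostname)

      # stop committing transactions on the MDT so the ops above stay uncommitted
      replay_barrier mds1

      # 0x80000000 is the "fail once" flag; 0x302 appears intended to force the
      # reconnect to happen before the open replay is processed (assumed semantics)
      do_facet mds1 "lctl set_param fail_loc=0x80000302"

      # fail over the MDT - this is the "umount -d /mnt/mds1" that never returns
      fail mds1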

      On the client console, we have an error:

      14:37:16:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616)
      14:37:16:Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
      14:37:16:Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
      14:37:16:Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
      14:37:16:LustreError: 11-0: lustre-MDT0000-mdc-ffff880037d86400: operation obd_ping to node 10.1.5.72@tcp failed: rc = -107
      

      The -107 on the client is ENOTCONN, which is expected while the MDT is stopping; the trouble seems to be on MDS1:

      13:37:27:Lustre: DEBUG MARKER: == replay-single test 73a: open(O_CREAT), unlink, replay, reconnect before open replay, close == 13:36:56 (1437226616)
      13:37:27:Lustre: DEBUG MARKER: sync; sync; sync
      13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
      13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
      13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) *** setting lustre-MDT0000 read-only ***
      13:37:27:LustreError: 14068:0:(osd_handler.c:1380:osd_ro()) Skipped 2 previous similar messages
      13:37:27:Turning device dm-0 (0xfd00000) read-only
      13:37:27:Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      13:37:27:Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
      13:37:27:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x80000302
      13:37:27:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
      13:37:27:Lustre: DEBUG MARKER: umount -d /mnt/mds1
      13:37:27:Lustre: Failing over lustre-MDT0000
      13:37:27:Lustre: Skipped 6 previous similar messages
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      13:37:27:Lustre: Skipped 7 previous similar messages
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping)
      13:37:27:Lustre: Skipped 5 previous similar messages
      13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      13:37:27:LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      13:37:27:LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5
      13:37:27:LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages
      13:37:27:LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      13:37:27:Lustre: Skipped 7 previous similar messages
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      13:37:27:Lustre: Skipped 1 previous similar message
      13:37:27:Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      13:37:27:Lustre: Skipped 15 previous similar messages
      14:37:08:********** Timeout by autotest system **********
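
      The first few markers above are the replay barrier being set: sync the device, stop assigning transaction numbers, flip the backing device read-only, then record the barrier marker. A minimal sketch of that sequence, assuming the standard test-framework.sh shape (the exact helper body may differ by version):

      # Sketch of the replay-barrier sequence visible in the markers above.
      # Assumes the usual test-framework.sh shape; exact body may differ.
      replay_barrier_sketch() {
          local facet=$1      # e.g. mds1
          local svc=$2        # e.g. lustre-MDT0000

          do_facet $facet "sync; sync; sync"
          # No new transaction numbers: everything after this point must be
          # replayed by clients after the failover.
          do_facet $facet "lctl --device $svc notransno"
          # No further writes reach the backing device (the
          # "Turning device dm-0 read-only" line above).
          do_facet $facet "lctl --device $svc readonly"
          do_facet $facet "lctl mark \"$facet REPLAY BARRIER on $svc\""
      }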
      

      From the MDS syslog:

      Jul 18 13:36:58 shadow-26vm12 mrshd[14211]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='/usr/sbin/lctl mark "umount -d /mnt/mds1";echo XXRETCODE:$?'
      Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: DEBUG MARKER: umount -d /mnt/mds1
      Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: EXIT: mshell status=0 pid=14210 duration=0(sec)
      Jul 18 13:36:58 shadow-26vm12 xinetd[1492]: START: mshell pid=14233 from=::ffff:10.1.5.70
      Jul 18 13:36:58 shadow-26vm12 mrshd[14234]: root@shadow-26vm10.shadow.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d /mnt/mds1");echo XXRETCODE:$?'
      Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Failing over lustre-MDT0000
      Jul 18 13:36:58 shadow-26vm12 kernel: Lustre: Skipped 6 previous similar messages
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.71@tcp (stopping)
      Jul 18 13:37:02 shadow-26vm12 kernel: Lustre: Skipped 5 previous similar messages
      Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880058e879c0 x1507036455088328/t0(0) o13->lustre-OST0001-osc-MDT0000@10.1.5.71@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
      Jul 18 13:37:03 shadow-26vm12 kernel: LustreError: 2870:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8800592c46c0 x1507036455088364/t0(0) o1000->lustre-MDT0001-osp-MDT0000@10.1.5.64@tcp:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(client.c:1144:ptlrpc_import_delay_req()) Skipped 7 previous similar messages
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) lustre-MDT0001-osp-MDT0000:osp_attr_get update error [0x240000403:0x1:0x0]: rc = -5
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(osp_object.c:586:osp_attr_get()) Skipped 2 previous similar messages
      Jul 18 13:37:04 shadow-26vm12 kernel: LustreError: 14256:0:(llog.c:180:llog_cancel_rec()) lustre-MDT0001-osp-MDT0000: fail to write header for llog #0x1:1073742851#00000000: rc = -5
      Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:37:05 shadow-26vm12 kernel: Lustre: Skipped 7 previous similar messages
      Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:12 shadow-26vm12 kernel: Lustre: Skipped 1 previous similar message
      Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:17 shadow-26vm12 kernel: Lustre: Skipped 15 previous similar messages
      Jul 18 13:37:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:37:25 shadow-26vm12 kernel: Lustre: Skipped 29 previous similar messages
      Jul 18 13:37:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:37:42 shadow-26vm12 kernel: Lustre: Skipped 49 previous similar messages
      Jul 18 13:37:45 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied
      Jul 18 13:38:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:38:15 shadow-26vm12 kernel: Lustre: Skipped 109 previous similar messages
      Jul 18 13:39:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:39:20 shadow-26vm12 kernel: Lustre: Skipped 207 previous similar messages
      Jul 18 13:39:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied
      Jul 18 13:41:18 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:41:30 shadow-26vm12 kernel: Lustre: Skipped 415 previous similar messages
      Jul 18 13:45:34 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
      Jul 18 13:45:47 shadow-26vm12 kernel: Lustre: Skipped 817 previous similar messages
      Jul 18 13:54:06 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 512 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.69@tcp (stopping)
      Jul 18 13:54:20 shadow-26vm12 kernel: Lustre: Skipped 1645 previous similar messages
      Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.70@tcp (stopping)
      Jul 18 14:04:20 shadow-26vm12 kernel: Lustre: Skipped 1920 previous similar messages
      Jul 18 14:11:10 shadow-26vm12 kernel: Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 5. Is it stuck?
      Jul 18 14:12:53 shadow-26vm12 ntpd[2234]: can't open /var/log/ntpstats/loopstats.20150718: Permission denied
      Jul 18 14:14:22 shadow-26vm12 kernel: Lustre: lustre-MDT0000: Not available for connect from 10.1.5.64@tcp (stopping)
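
      The pattern above - the umount thread waiting on obd_unlinked_exports with the obd refcount pinned at 5 while the warning interval doubles from 8 s up to 1024 s - means five export/device references were never dropped, and only stack traces of the threads holding them would say why. A hedged sketch of what to capture on the MDS while the umount is wedged (standard commands; output paths are just examples):

      # Run on the MDS while "umount -d /mnt/mds1" is hung. Standard
      # Linux/Lustre tooling; output paths are just examples.

      # Dump stacks of all kernel threads to dmesg, including whichever
      # thread still holds the export references.
      echo t > /proc/sysrq-trigger

      # Save the Lustre kernel debug buffer for later analysis.
      lctl dk > /tmp/lustre-umount-hang.dk

      # Snapshot the hung umount itself: state, wait channel, kernel stack.
      ps -eo pid,stat,wchan:32,cmd | grep '[u]mount'
      cat /proc/$(pgrep -x umount)/stack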
      

Activity

pjones Peter Jones added a comment -

Thanks Di!

di.wang Di Wang (Inactive) added a comment -

Hmm, there is not enough information here for me to know the exact reason, but it looks like this failure is caused by something in replay-single.sh test 70b, and the recent fixes from LU-6904 and LU-6924 probably already address it. Let's close this issue for now if there are no failures for several weeks, and reopen it if we see it again. Thanks.

jgmitter Joseph Gmitter (Inactive) added a comment -

Hi Di,
Have you had a chance to look into this issue? Any input you may have? We have not seen this particular failure in testing for several weeks.
Thanks.
Joe

laisiyao Lai Siyao added a comment -

I couldn't reproduce this in my local test system, and because this is a timeout issue, the debug log doesn't contain any useful information. I've asked Di to help look into it.
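
If anyone does catch this again, widening the debug mask and buffer before the failover should make the post-mortem dump more useful; a small sketch, assuming default lctl tunables:

lctl set_param debug=-1        # enable all debug subsystems/masks
lctl set_param debug_mb=512    # enlarge the kernel debug buffer (MB)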

adilger Andreas Dilger added a comment -

Lai, any update on this issue?

pjones Peter Jones added a comment -

Lai,

Could you please look into this failure? It is causing several review failures per day at the moment.

Peter

jamesanunez James Nunez (Inactive) added a comment -

A couple more:
2015-07-29 09:11:10 - https://testing.hpdd.intel.com/test_sets/a14860c0-361e-11e5-b91d-5254006e85c2
2015-07-29 15:16:00 - https://testing.hpdd.intel.com/test_sets/2a79fe4c-364e-11e5-830b-5254006e85c2

yujian Jian Yu added a comment -

Another failure instance on the master branch:
https://testing.hpdd.intel.com/test_sets/d3ae5ba6-3659-11e5-84a9-5254006e85c2

bogl Bob Glossman (Inactive) added a comment -

Another:
https://testing.hpdd.intel.com/test_sets/46c38a6e-3586-11e5-be21-5254006e85c2

jamesanunez James Nunez (Inactive) added a comment - - edited

More failures on master during review-dne-part-2:
2015-07-20 08:16:36 - https://testing.hpdd.intel.com/test_sets/7f9b89a2-2f02-11e5-92dd-5254006e85c2
2015-07-21 01:53:32 - https://testing.hpdd.intel.com/test_sets/052b8fe6-2fa7-11e5-97d6-5254006e85c2
2015-07-21 04:00:17 - https://testing.hpdd.intel.com/test_sets/63068e3c-2fa6-11e5-ad00-5254006e85c2
2015-07-23 20:25:45 - https://testing.hpdd.intel.com/test_sets/328ddfe4-31c7-11e5-84cf-5254006e85c2

People

  Assignee: laisiyao Lai Siyao
  Reporter: jamesanunez James Nunez (Inactive)
  Votes: 0
  Watchers: 9

Dates

  Created:
  Updated:
  Resolved: