LU-5768: replay-single test_52: Restart of mds1 failed: EIO

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.5.4
    • Fix Version/s: Lustre 2.5.4
    • Component/s: None
    • Severity: 3
    • Rank: 16190

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a370c858-56b6-11e4-851f-5254006e85c2.

      The sub-test test_52 failed with the following error:

      Restart of mds1 failed!
      
      == replay-single test 52: time out lock replay (3764) == 00:55:51 (1413593751)
      CMD: shadow-19vm12 sync; sync; sync
      Filesystem           1K-blocks    Used Available Use% Mounted on
      shadow-19vm12@tcp:/lustre
                            22169560 1069324  19973984   6% /mnt/lustre
      CMD: shadow-19vm10.shadow.whamcloud.com,shadow-19vm9 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
      CMD: shadow-19vm10.shadow.whamcloud.com,shadow-19vm9 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
      CMD: shadow-19vm12 /usr/sbin/lctl --device lustre-MDT0000 notransno
      CMD: shadow-19vm12 /usr/sbin/lctl --device lustre-MDT0000 readonly
      CMD: shadow-19vm12 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      CMD: shadow-19vm12 lctl set_param fail_loc=0x8000030c
      fail_loc=0x8000030c
      Failing mds1 on shadow-19vm12
      CMD: shadow-19vm12 grep -c /mnt/mds1' ' /proc/mounts
      Stopping /mnt/mds1 (opts:) on shadow-19vm12
      CMD: shadow-19vm12 umount -d /mnt/mds1
      CMD: shadow-19vm12 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      reboot facets: mds1
      Failover mds1 to shadow-19vm12
      00:56:11 (1413593771) waiting for shadow-19vm12 network 900 secs ...
      00:56:11 (1413593771) network interface is UP
      CMD: shadow-19vm12 hostname
      mount facets: mds1
      CMD: shadow-19vm12 test -b /dev/lvm-Role_MDS/P1
      Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/mds1
      CMD: shadow-19vm12 mkdir -p /mnt/mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/mds1
      shadow-19vm12: mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/mds1 failed: Input/output error
      shadow-19vm12: Is the MGS running?
      Start of /dev/lvm-Role_MDS/P1 on mds1 failed 5
       replay-single test_52: @@@@@@ FAIL: Restart of mds1 failed! 
      

      Info required for matching: replay-single 52
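
      Two values in the log above decode mechanically: the "failed 5" exit status is errno 5 (EIO, matching both the ticket title and mount.lustre's "Input/output error"), and the injected fail_loc encodes a fail site plus flag bits. A minimal Python sketch, assuming the libcfs CFS_FAIL_* convention that the top bit (0x80000000) requests a one-shot injection and the low bits select the fail site:

      ```python
      import errno
      import os

      # "Start of /dev/lvm-Role_MDS/P1 on mds1 failed 5": rc 5 is EIO,
      # matching the "Input/output error" reported by mount.lustre.
      rc = 5
      print(errno.errorcode[rc], "-", os.strerror(rc))

      # fail_loc=0x8000030c from the log. Assumption: libcfs-style encoding,
      # where the top bit marks a one-shot failure and the low bits name the site.
      CFS_FAIL_ONCE = 0x80000000  # assumed one-shot flag, per libcfs convention
      fail_loc = 0x8000030c
      site = fail_loc & 0x0000FFFF
      one_shot = bool(fail_loc & CFS_FAIL_ONCE)
      print(hex(site), "one-shot" if one_shot else "persistent")
      ```

      So the test arms a single-fire failure at site 0x30c before failing over mds1, which is what turns the subsequent lock replay into the timeout the test exercises.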

      Attachments

        Activity

          [LU-5768] replay-single test_52: Restart of mds1 failed: EIO
          pjones Peter Jones made changes -
          Labels Original: mq414
          pjones Peter Jones made changes -
          Fix Version/s New: Lustre 2.5.4 [ 11190 ]
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]
          pjones Peter Jones added a comment - This has been resolved by http://git.whamcloud.com/fs/lustre-release.git/commit/93423cc9114721f32e5c36e21a8b56d2a463125b
          bogl Bob Glossman (Inactive) added a comment - More seen on b2_5: https://testing.hpdd.intel.com/test_sets/8f810a18-5b86-11e4-8b14-5254006e85c2 https://testing.hpdd.intel.com/test_sets/1d38451e-5b7e-11e4-95e9-5254006e85c2 It does seem to be blocking current b2_5 test runs.
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Li Wei [ liwei ]

          liwei Li Wei (Inactive) added a comment - http://review.whamcloud.com/12390 (Revert the test part of d29c0438)
          green Oleg Drokin added a comment -

          I will try to revert just the test. This is my fault: I removed the master test, but did not notice the b2_5 patch also had this.

          My internet connection right now is far from good, so it might take a few days until I get to a good enough one.

          liwei Li Wei (Inactive) added a comment -

          Indeed. Another temporary workaround could be just reverting the test part of the patch, including the change to tgt_enqueue().

          In addition to this test, replay-single 73b suffers from the same problem on b2_5.
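
          (Aside on the "revert the test part" suggestion: one way to undo only the test file of a commit while keeping its code change is to check the pre-commit version of that file back out and commit it. A self-contained sketch in a throwaway repository; the file names and contents below are made up for illustration, not taken from d29c0438.)

          ```shell
          set -e
          dir=$(mktemp -d)
          cd "$dir"
          git init -q
          git config user.email demo@example.com
          git config user.name demo

          # Base state: one code file, one test file.
          mkdir -p lustre/tests
          echo "fix code v1" > module.c
          echo "test v1"     > lustre/tests/replay-single.sh
          git add -A && git commit -qm "base"

          # A commit that changes both code and test (like the patch under discussion).
          echo "fix code v2" > module.c
          echo "test v2"     > lustre/tests/replay-single.sh
          git add -A && git commit -qm "fix + test"

          # Revert only the test file to its pre-commit state, keeping the code fix.
          git checkout HEAD^ -- lustre/tests/replay-single.sh
          git commit -qm "Revert the test part only"

          cat module.c lustre/tests/replay-single.sh
          ```

          After the partial revert, module.c still carries the fix while the test file is back to its previous version.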

          simmonsja James A Simmons added a comment -

          Please don't revert. This patch fixes real issues for us at ORNL. Could we figure out a proper fix instead?
          yujian Jian Yu made changes -
          Labels New: mq414
          Priority Original: Minor [ 4 ] New: Blocker [ 1 ]

          People

            Assignee: Li Wei (Inactive)
            Reporter: Maloo
            Votes: 0
            Watchers: 7
