
replay-single test_52: Restart of mds1 failed: EIO

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.5.4
    • Fix Version/s: Lustre 2.5.4
    • Labels: None
    • Severity: 3
    • Rank: 16190

    Description

      This issue was created by maloo for Li Wei <liwei@whamcloud.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a370c858-56b6-11e4-851f-5254006e85c2.

      The sub-test test_52 failed with the following error:

      Restart of mds1 failed!
      
      == replay-single test 52: time out lock replay (3764) == 00:55:51 (1413593751)
      CMD: shadow-19vm12 sync; sync; sync
      Filesystem           1K-blocks    Used Available Use% Mounted on
      shadow-19vm12@tcp:/lustre
                            22169560 1069324  19973984   6% /mnt/lustre
      CMD: shadow-19vm10.shadow.whamcloud.com,shadow-19vm9 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
      CMD: shadow-19vm10.shadow.whamcloud.com,shadow-19vm9 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
      CMD: shadow-19vm12 /usr/sbin/lctl --device lustre-MDT0000 notransno
      CMD: shadow-19vm12 /usr/sbin/lctl --device lustre-MDT0000 readonly
      CMD: shadow-19vm12 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
      CMD: shadow-19vm12 lctl set_param fail_loc=0x8000030c
      fail_loc=0x8000030c
      Failing mds1 on shadow-19vm12
      CMD: shadow-19vm12 grep -c /mnt/mds1' ' /proc/mounts
      Stopping /mnt/mds1 (opts:) on shadow-19vm12
      CMD: shadow-19vm12 umount -d /mnt/mds1
      CMD: shadow-19vm12 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      reboot facets: mds1
      Failover mds1 to shadow-19vm12
      00:56:11 (1413593771) waiting for shadow-19vm12 network 900 secs ...
      00:56:11 (1413593771) network interface is UP
      CMD: shadow-19vm12 hostname
      mount facets: mds1
      CMD: shadow-19vm12 test -b /dev/lvm-Role_MDS/P1
      Starting mds1:   /dev/lvm-Role_MDS/P1 /mnt/mds1
      CMD: shadow-19vm12 mkdir -p /mnt/mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/mds1
      shadow-19vm12: mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/mds1 failed: Input/output error
      shadow-19vm12: Is the MGS running?
      Start of /dev/lvm-Role_MDS/P1 on mds1 failed 5
       replay-single test_52: @@@@@@ FAIL: Restart of mds1 failed! 
      

      Info required for matching: replay-single 52


        Activity

          [LU-5768] replay-single test_52: Restart of mds1 failed: EIO
          pjones Peter Jones added a comment - This has been resolved by http://git.whamcloud.com/fs/lustre-release.git/commit/93423cc9114721f32e5c36e21a8b56d2a463125b
          bogl Bob Glossman (Inactive) added a comment - more seen on b2_5:
          https://testing.hpdd.intel.com/test_sets/8f810a18-5b86-11e4-8b14-5254006e85c2
          https://testing.hpdd.intel.com/test_sets/1d38451e-5b7e-11e4-95e9-5254006e85c2
          Does seem to be blocking in current b2_5 test runs.

          liwei Li Wei (Inactive) added a comment - http://review.whamcloud.com/12390 (Revert the test part of d29c0438)
          green Oleg Drokin added a comment -

          I will try to revert just the test. This is my fault: I removed the test on master, but did not notice that the b2_5 patch also had it.

          My internet connection right now is far from good, so it might take a few days until I get to a good enough one.

          liwei Li Wei (Inactive) added a comment -

          Indeed. Another temporary workaround could be just reverting the test part of the patch, including the change to tgt_enqueue().

          In addition to this test, replay-single 73b suffers from the same problem on b2_5.

          simmonsja James A Simmons added a comment -

          Please don't revert. This patch fixes real issues for us at ORNL. Could we figure out a proper fix instead?
          yujian Jian Yu added a comment -

          This is a regression introduced by the following commit on the Lustre b2_5 branch:

          commit d29c0438bdf38e89d5638030b3770d7740121f8d
          Author: Vitaly Fertman <vitaly_fertman@xyratex.com>
          Date:   Mon Sep 29 19:42:32 2014 -0400
          
              LU-5579 ldlm: re-sent enqueue vs lock destroy race
              
              upon lock enqueue re-send, lock is pinned by ldlm_handle_enqueue0,
              however it may race with client eviction or even lock cancel (if
              a reply for the original RPC finally reached the client) and the
              lock cannot be found by cookie anymore:
              
               ASSERTION( lock != NULL ) failed: Invalid lock handle
              
              Signed-off-by: Vitaly Fertman <vitaly_fertman@xyratex.com>
              Change-Id: I9d8156bf78a1b83ac22ffaa1148feb43bef37b1a
              Xyratex-bug-id: MRP-2094
          

          This is blocking patch review testing on the Lustre b2_5 branch. Oleg, could you please revert it? Thanks!

          bogl Bob Glossman (Inactive) added a comment - another seen in b2_5: https://testing.hpdd.intel.com/test_sets/3bbdec78-590d-11e4-9a49-5254006e85c2

          People

            Assignee: liwei Li Wei (Inactive)
            Reporter: maloo Maloo
