[LU-5768] replay-single test_52: Restart of mds1 failed: EIO Created: 20/Oct/14 Updated: 04/Nov/14 Resolved: 04/Nov/14
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.4 |
| Fix Version/s: | Lustre 2.5.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Li Wei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 16190 |
| Description |
|
This issue was created by maloo for Li Wei <liwei@whamcloud.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/a370c858-56b6-11e4-851f-5254006e85c2.

The sub-test test_52 failed with the following error:

Restart of mds1 failed!

== replay-single test 52: time out lock replay (3764) == 00:55:51 (1413593751)
CMD: shadow-19vm12 sync; sync; sync
Filesystem 1K-blocks Used Available Use% Mounted on
shadow-19vm12@tcp:/lustre
22169560 1069324 19973984 6% /mnt/lustre
CMD: shadow-19vm10.shadow.whamcloud.com,shadow-19vm9 mcreate /mnt/lustre/fsa-\$(hostname); rm /mnt/lustre/fsa-\$(hostname)
CMD: shadow-19vm10.shadow.whamcloud.com,shadow-19vm9 if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-\$(hostname); rm /mnt/lustre2/fsa-\$(hostname); fi
CMD: shadow-19vm12 /usr/sbin/lctl --device lustre-MDT0000 notransno
CMD: shadow-19vm12 /usr/sbin/lctl --device lustre-MDT0000 readonly
CMD: shadow-19vm12 /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
CMD: shadow-19vm12 lctl set_param fail_loc=0x8000030c
fail_loc=0x8000030c
Failing mds1 on shadow-19vm12
CMD: shadow-19vm12 grep -c /mnt/mds1' ' /proc/mounts
Stopping /mnt/mds1 (opts:) on shadow-19vm12
CMD: shadow-19vm12 umount -d /mnt/mds1
CMD: shadow-19vm12 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
reboot facets: mds1
Failover mds1 to shadow-19vm12
00:56:11 (1413593771) waiting for shadow-19vm12 network 900 secs ...
00:56:11 (1413593771) network interface is UP
CMD: shadow-19vm12 hostname
mount facets: mds1
CMD: shadow-19vm12 test -b /dev/lvm-Role_MDS/P1
Starting mds1: /dev/lvm-Role_MDS/P1 /mnt/mds1
CMD: shadow-19vm12 mkdir -p /mnt/mds1; mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/mds1
shadow-19vm12: mount.lustre: mount /dev/mapper/lvm--Role_MDS-P1 at /mnt/mds1 failed: Input/output error
shadow-19vm12: Is the MGS running?
Start of /dev/lvm-Role_MDS/P1 on mds1 failed 5
replay-single test_52: @@@@@@ FAIL: Restart of mds1 failed!
Info required for matching: replay-single 52 |
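For readers unfamiliar with the fault-injection step in the log above: fail_loc=0x8000030c combines the one-shot flag 0x80000000 with fault site 0x30c, which makes the target drop exactly one LDLM lock reply so that the subsequent lock replay times out. A minimal standalone sketch of that decoding follows; the constant names are assumed from the b2_5-era libcfs_fail.h/obd_support.h, not quoted from this ticket.

{code:c}
/* Sketch: decode the fail_loc value set by replay-single test_52.
 * Constant names are assumptions based on libcfs_fail.h/obd_support.h,
 * not an authoritative copy of the Lustre headers. */
#include <stdio.h>

#define CFS_FAIL_ONCE       0x80000000u  /* fire only on the first hit */
#define CFS_FAIL_LOC_MASK   0x0000ffffu  /* low bits select the fault site */
#define OBD_FAIL_LDLM_REPLY 0x0000030cu  /* drop one LDLM lock reply */

int main(void)
{
    unsigned int fail_loc = 0x8000030c;  /* value from the test log above */
    unsigned int site = fail_loc & CFS_FAIL_LOC_MASK;

    printf("fault site: 0x%x (%s)\n", site,
           site == OBD_FAIL_LDLM_REPLY ? "OBD_FAIL_LDLM_REPLY" : "other");
    printf("one-shot:   %s\n", (fail_loc & CFS_FAIL_ONCE) ? "yes" : "no");
    return 0;
}
{code}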
| Comments |
| Comment by Li Wei (Inactive) [ 21/Oct/14 ] |
00000100:00100000:1.0:1413801598.098879:0:27934:0:(service.c:1874:ptlrpc_server_handle_req_in()) got req x1482469562502264
00000100:00100000:1.0:1413801598.098950:0:27934:0:(nrs_fifo.c:182:nrs_fifo_req_get()) NRS start fifo request from 12345-0@lo, seq: 2
00000100:00100000:1.0:1413801598.098961:0:27934:0:(service.c:2023:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_mgs_0002:4a248d11-7d6c-a58c-f094-139fe352b1dc+4:27909:x1482469562502264:12345-0@lo:101
00010000:00010000:1.0:1413801598.098973:0:27934:0:(ldlm_lockd.c:1179:ldlm_handle_enqueue0()) ### server-side enqueue handler START
00010000:00010000:1.0:1413801598.098987:0:27934:0:(ldlm_lockd.c:1267:ldlm_handle_enqueue0()) ### server-side enqueue handler, new lock created ns: MGS lock: ffff88005d9e6dc0/0x465b324ce97ba2f6 lrc: 2/0,0 mode: --/CR res: [0x65727473756c:0x0:0x0].0 rrc: 1 type: PLN flags: 0x40000000000000 nid: local remote: 0x465b324ce97ba2ef expref: -99 pid: 27934 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1413801598.099017:0:27934:0:(ldlm_lock.c:1089:ldlm_granted_list_add_lock()) ### About to add lock: ns: MGS lock: ffff88005d9e6dc0/0x465b324ce97ba2f6 lrc: 3/0,0 mode: CR/CR res: [0x65727473756c:0x0:0x0].0 rrc: 1 type: PLN flags: 0x50000000000000 nid: 0@lo remote: 0x465b324ce97ba2ef expref: 5 pid: 27934 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1413801598.099031:0:27934:0:(ldlm_lockd.c:1405:ldlm_handle_enqueue0()) ### server-side enqueue handler, sending reply(err=0, rc=0) ns: MGS lock: ffff88005d9e6dc0/0x465b324ce97ba2f6 lrc: 3/0,0 mode: CR/CR res: [0x65727473756c:0x0:0x0].0 rrc: 1 type: PLN flags: 0x40000000000000 nid: 0@lo remote: 0x465b324ce97ba2ef expref: 5 pid: 27934 timeout: 0 lvb_type: 0
00010000:00010000:1.0:1413801598.099040:0:27934:0:(ldlm_lockd.c:1448:ldlm_handle_enqueue0()) ### server-side enqueue handler END (lock ffff88005d9e6dc0, rc 0)
00010000:02000000:1.0:1413801598.099046:0:27934:0:(libcfs_fail.h:89:cfs_fail_check_set()) *** cfs_fail_loc=30c, val=2147483648***
00010000:00020000:1.0:1413801598.100035:0:27934:0:(ldlm_lib.c:2415:target_send_reply_msg()) @@@ dropping reply req@ffff88007d2f2050 x1482469562502264/t0(0) o101->4a248d11-7d6c-a58c-f094-139fe352b1dc@0@lo:0/0 lens 328/344 e 0 to 0 dl 1413801604 ref 1 fl Interpret:/0/0 rc 0/0

The request that was dropped was actually an attempt to acquire an MGS lock. Regardless of how this suddenly showed up (d29c0438 could be a factor), the test is not working properly at the moment. I suggest adding it to the exception list until a fix is ready. |
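A side note that makes the log above easier to read: the lock resource [0x65727473756c:0x0:0x0] is simply the filesystem name "lustre" packed byte-for-byte into a 64-bit resource id, which is how the MGC names its config locks (cf. mgc_fsname2resid() in the Lustre tree). The standalone sketch below reproduces the packing; it is illustrative, not the actual Lustre code.

{code:c}
/* Sketch: pack a fsname such as "lustre" into the 64-bit MGS lock
 * resource id seen in the debug log ([0x65727473756c:0x0:0x0]).
 * Illustrative stand-in for mgc_fsname2resid(); assumes a
 * little-endian host, as in the log above. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint64_t fsname2resid(const char *fsname)
{
    size_t len = strlen(fsname);
    uint64_t resid = 0;

    /* copy at most 8 name bytes; on a little-endian host the first
     * byte of the name becomes the least significant byte of the id */
    memcpy(&resid, fsname, len > 8 ? 8 : len);
    return resid;
}

int main(void)
{
    /* prints 65727473756c, matching the "res:" field in the log */
    printf("%llx\n", (unsigned long long)fsname2resid("lustre"));
    return 0;
}
{code}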
| Comment by Li Wei (Inactive) [ 21/Oct/14 ] |
|
http://review.whamcloud.com/12355 (Only disables the test; not a fix) |
| Comment by Bob Glossman (Inactive) [ 21/Oct/14 ] |
|
Another instance seen on b2_5: |
| Comment by Jian Yu [ 21/Oct/14 ] |
|
This is a regression introduced by the following commit on the Lustre b2_5 branch (the sketch below illustrates the race it describes):

commit d29c0438bdf38e89d5638030b3770d7740121f8d
Author: Vitaly Fertman <vitaly_fertman@xyratex.com>
Date: Mon Sep 29 19:42:32 2014 -0400

    LU-5579 ldlm: re-sent enqueue vs lock destroy race

    Upon lock enqueue re-send, the lock is pinned by ldlm_handle_enqueue0();
    however, it may race with client eviction or even lock cancel (if
    a reply for the original RPC finally reached the client) and the
    lock cannot be found by cookie anymore:

        ASSERTION( lock != NULL ) failed: Invalid lock handle

    Signed-off-by: Vitaly Fertman <vitaly_fertman@xyratex.com>
    Change-Id: I9d8156bf78a1b83ac22ffaa1148feb43bef37b1a
    Xyratex-bug-id: MRP-2094

This is blocking patch review testing on the Lustre b2_5 branch. Oleg, could you please revert it? Thanks! |
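To make the race in the quoted commit message concrete: on an enqueue re-send the server looks the lock up by its cookie, but a concurrent eviction or cancel may already have destroyed it, so the lookup can legitimately return NULL; asserting on the result (the pre-LU-5579 behaviour) crashes the server. The standalone sketch below models that pattern with hypothetical names; it is not the actual ldlm code.

{code:c}
/* Sketch of the re-sent enqueue vs lock destroy race from the commit
 * message. All names here are hypothetical stand-ins (handle2lock()
 * models ldlm_handle2lock()); this is not the actual Lustre code. */
#include <stdio.h>
#include <stddef.h>
#include <errno.h>

struct lock { unsigned long long cookie; };

/* returns NULL when the cookie no longer maps to a live lock */
static struct lock *handle2lock(struct lock **table, size_t n,
                                unsigned long long cookie)
{
    for (size_t i = 0; i < n; i++)
        if (table[i] != NULL && table[i]->cookie == cookie)
            return table[i];
    return NULL;
}

static int handle_resent_enqueue(struct lock **table, size_t n,
                                 unsigned long long cookie)
{
    struct lock *lock = handle2lock(table, n, cookie);

    if (lock == NULL) {
        /* raced with cancel/eviction; the old code asserted
         * lock != NULL here and crashed the server */
        return -ENOENT;
    }
    /* ... normal re-send processing would continue here ... */
    return 0;
}

int main(void)
{
    struct lock a = { .cookie = 0x465b324ce97ba2f6ULL };
    struct lock *table[] = { &a };

    printf("live lock:  rc=%d\n", handle_resent_enqueue(table, 1, a.cookie));
    table[0] = NULL;  /* a concurrent cancel destroyed the lock */
    printf("after race: rc=%d\n", handle_resent_enqueue(table, 1, a.cookie));
    return 0;
}
{code}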
| Comment by James A Simmons [ 21/Oct/14 ] |
|
Please don't revert. This patch fixes real issues for us at ORNL. Could we figure out a proper fix instead? |
| Comment by Li Wei (Inactive) [ 22/Oct/14 ] |
|
Indeed. Another temporary workaround would be to revert just the test part of the patch, including the change to tgt_enqueue(). Besides this test, replay-single test 73b suffers from the same problem on b2_5. |
| Comment by Oleg Drokin [ 22/Oct/14 ] |
|
I will try to revert just the test. This is my fault: I removed the test from master but did not notice that the b2_5 patch also carried it. My internet connection right now is very poor, so it might take a few days until I get to a good enough one. |
| Comment by Li Wei (Inactive) [ 22/Oct/14 ] |
|
http://review.whamcloud.com/12390 (Revert the test part of d29c0438) |
| Comment by Bob Glossman (Inactive) [ 24/Oct/14 ] |
|
More instances seen on b2_5. This does seem to be blocking current b2_5 test runs. |
| Comment by Peter Jones [ 04/Nov/14 ] |
|
This has been resolved by http://git.whamcloud.com/fs/lustre-release.git/commit/93423cc9114721f32e5c36e21a8b56d2a463125b |