Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: None
Affects Version/s: Lustre 2.4.0
Labels:
None

Severity:
3
Rank (Obsolete):
5248

Description

I started to hit tihs issue frequently on repeated runs of recovery-small.

Multiop starts to loop outputting the samme message over and over again:

short read: 0/1

I logged into the node and here's what I see

[root@centos6-7 ~]# dmesg | tail
[27409.796429] LustreError: 138-a: lustre-OST0001: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[27409.798509] LustreError: Skipped 2 previous similar messages
[27409.821171] Lustre: DEBUG MARKER: cancel_lru_locks osc start
[27410.021536] LustreError: 26667:0:(ldlm_lockd.c:2177:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1350470383 with bad export cookie 1895698488629392860
[27410.022477] LustreError: 26666:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
[27410.022985] LustreError: 167-0: lustre-OST0000-osc-ffff8800226a5bf0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[27410.023160] LustreError: 27694:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace lustre-OST0000-osc-ffff8800226a5bf0 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
[27410.023163] LustreError: 27694:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff88001ab28e78 (2/0/0/0) (rc: 1)
[27410.025030] LustreError: 26666:0:(ldlm_request.c:1795:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
[27410.092031] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
[root@centos6-7 ~]# ps ax | grep multi
27707 pts/0    R+   291:38 multiop /mnt/lustre/f.recovery-small.11 or
27878 pts/1    S+     0:00 grep multi

strace on this process shows

read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0

So I did another experiment:

[root@centos6-7 ~]# ls -la /proc/27707/fd
total 0
dr-x------ 2 root root  0 Oct 17 10:56 .
dr-xr-xr-x 7 root root  0 Oct 17 10:56 ..
lrwx------ 1 root root 64 Oct 17 11:37 0 -> /dev/pts/0
l-wx------ 1 root root 64 Oct 17 11:37 1 -> pipe:[1489661]
l-wx------ 1 root root 64 Oct 17 10:56 2 -> pipe:[1489624]
lr-x------ 1 root root 64 Oct 17 11:37 3 -> /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# ls -l /mnt/lustre/f.recovery-small.11
-rw-r--r-- 1 root root 0 Oct 17 06:39 /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
cat: /mnt/lustre/f.recovery-small.11: Input/output error
[root@centos6-7 ~]# touch /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# echo $?
0
[root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]#

yet at this point multiop still did not finish its loop.

At the very least I imagine we should actually make multiop terminate on error that repeats many times.

Attachments

Issue Links

is related to

LU-1612 Test failure on test suite parallel-scale-nfsv3, subtest test_iorssf

Resolved

Activity

People

Assignee:: WC Triage

Reporter:: Oleg Drokin

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 17/Oct/12 11:47 AM

Updated:: 19/Apr/13 8:37 PM

Resolved:: 03/Nov/12 1:38 AM