[LU-2205] recovery-small test 11 multiop eternal loop Created: 17/Oct/12  Updated: 19/Apr/13  Resolved: 03/Nov/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Oleg Drokin Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-1612 Test failure on test suite parallel-s... Resolved
Severity: 3
Rank (Obsolete): 5248

 Description   

I started to hit this issue frequently on repeated runs of recovery-small.

Multiop starts looping, printing the same message over and over again:

short read: 0/1

I logged into the node, and here is what I see:

[root@centos6-7 ~]# dmesg | tail
[27409.796429] LustreError: 138-a: lustre-OST0001: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[27409.798509] LustreError: Skipped 2 previous similar messages
[27409.821171] Lustre: DEBUG MARKER: cancel_lru_locks osc start
[27410.021536] LustreError: 26667:0:(ldlm_lockd.c:2177:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1350470383 with bad export cookie 1895698488629392860
[27410.022477] LustreError: 26666:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
[27410.022985] LustreError: 167-0: lustre-OST0000-osc-ffff8800226a5bf0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[27410.023160] LustreError: 27694:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace lustre-OST0000-osc-ffff8800226a5bf0 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
[27410.023163] LustreError: 27694:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff88001ab28e78 (2/0/0/0) (rc: 1)
[27410.025030] LustreError: 26666:0:(ldlm_request.c:1795:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
[27410.092031] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
[root@centos6-7 ~]# ps ax | grep multi
27707 pts/0    R+   291:38 multiop /mnt/lustre/f.recovery-small.11 or
27878 pts/1    S+     0:00 grep multi

strace on this process shows:

read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0
write(2, "short read: 0/1\n", 16)       = 16
read(3, "", 1)                          = 0
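
This is consistent with a read loop that treats a zero-byte read() (i.e. EOF) as a retryable short read. Below is a minimal sketch of that failure mode; read_one() is a hypothetical reconstruction of the loop implied by the strace output, not the actual multiop source:

/* Hypothetical reconstruction of the loop implied by the strace
 * output above; not the actual multiop source. read() returning 0
 * means EOF, but the loop reports it as a short read and retries,
 * so it never terminates. */
#include <stdio.h>
#include <unistd.h>

int read_one(int fd)
{
        char buf[1];

        for (;;) {
                ssize_t rc = read(fd, buf, sizeof(buf));

                if (rc == (ssize_t)sizeof(buf))
                        return 0;       /* got the byte we asked for */
                if (rc < 0)
                        return -1;      /* real I/O error */
                /* rc == 0: EOF, misreported as a retryable short read */
                fprintf(stderr, "short read: %zd/%zu\n", rc, sizeof(buf));
        }
}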

So I did another experiment:

[root@centos6-7 ~]# ls -la /proc/27707/fd
total 0
dr-x------ 2 root root  0 Oct 17 10:56 .
dr-xr-xr-x 7 root root  0 Oct 17 10:56 ..
lrwx------ 1 root root 64 Oct 17 11:37 0 -> /dev/pts/0
l-wx------ 1 root root 64 Oct 17 11:37 1 -> pipe:[1489661]
l-wx------ 1 root root 64 Oct 17 10:56 2 -> pipe:[1489624]
lr-x------ 1 root root 64 Oct 17 11:37 3 -> /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# ls -l /mnt/lustre/f.recovery-small.11
-rw-r--r-- 1 root root 0 Oct 17 06:39 /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
cat: /mnt/lustre/f.recovery-small.11: Input/output error
[root@centos6-7 ~]# touch /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# echo $?
0
[root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# 

Yet at this point multiop still had not exited its loop.

At the very least, I imagine we should make multiop terminate on an error that repeats many times.
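
A minimal sketch of that suggestion, reusing the hypothetical read_one() loop above: bound the number of retries so a persistent short read eventually turns into an error. MAX_SHORT_READS is an arbitrary illustrative limit, not anything from the multiop source:

/* Hypothetical variant of the read_one() sketch above: give up after
 * a bounded number of short reads instead of retrying forever. */
#include <stdio.h>
#include <unistd.h>

#define MAX_SHORT_READS 100

int read_one_bounded(int fd)
{
        char buf[1];
        int tries;

        for (tries = 0; tries < MAX_SHORT_READS; tries++) {
                ssize_t rc = read(fd, buf, sizeof(buf));

                if (rc == (ssize_t)sizeof(buf))
                        return 0;
                if (rc < 0)
                        return -1;      /* real I/O error */
                fprintf(stderr, "short read: %zd/%zu\n", rc, sizeof(buf));
        }
        return -1;                      /* error repeated too many times */
}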



 Comments   
Comment by Andreas Dilger [ 17/Oct/12 ]

This is fixed with my patch in http://review.whamcloud.com/4265 "LU-1538 tests: fix test cases when OST is full". The multiop script should return an error if it cannot read the requested number of bytes; otherwise there is no way for the caller to know something went wrong.
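
A hypothetical illustration of that behavior (the actual change is in the Gerrit patch linked above; this is not its code): report the short read once and propagate an error, so the calling test sees a failure instead of a hang:

/* Hypothetical illustration of the described fix, not the code from
 * change 4265: a single short read is reported and returned as an
 * error so the caller can fail the test in a meaningful way. */
#include <stdio.h>
#include <unistd.h>

int read_one_or_fail(int fd)
{
        char buf[1];
        ssize_t rc = read(fd, buf, sizeof(buf));

        if (rc != (ssize_t)sizeof(buf)) {
                fprintf(stderr, "short read: %zd/%zu\n", rc, sizeof(buf));
                return -1;      /* caller sees the failure */
        }
        return 0;
}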

Comment by Oleg Drokin [ 18/Oct/12 ]

Well, that just papers over the symptoms, doesn't it? I get an I/O error reading this apparently empty file (with cat), which cannot be right.

Comment by Oleg Drokin [ 18/Oct/12 ]

BTW, only the first cat gives an I/O error; subsequent cats don't.

Comment by Andreas Dilger [ 18/Oct/12 ]

It doesn't paper over the symptoms at all. Instead of multiop being stuck in a loop forever, it returns an error to the caller, so the test can fail in some meaningful manner.

Comment by Oleg Drokin [ 27/Oct/12 ]

Well, I guess so.
Now test 11 fails 100% of the time for me, though I believe it passed at least some of the time before.

Comment by Oleg Drokin [ 27/Oct/12 ]

Hm, I spoke too soon: it does pass from time to time, but apparently fails much more frequently now.

Comment by Oleg Drokin [ 03/Nov/12 ]

Fixed with change 4265, which has landed.
