[LU-2205] recovery-small test 11 multiop eternal loop Created: 17/Oct/12 Updated: 19/Apr/13 Resolved: 03/Nov/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Oleg Drokin | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5248 |
| Description |
|
I started to hit this issue frequently on repeated runs of recovery-small. Multiop starts to loop, printing the same message over and over again:

short read: 0/1

I logged into the node and here's what I see:

[root@centos6-7 ~]# dmesg | tail
[27409.796429] LustreError: 138-a: lustre-OST0001: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[27409.798509] LustreError: Skipped 2 previous similar messages
[27409.821171] Lustre: DEBUG MARKER: cancel_lru_locks osc start
[27410.021536] LustreError: 26667:0:(ldlm_lockd.c:2177:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1350470383 with bad export cookie 1895698488629392860
[27410.022477] LustreError: 26666:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway
[27410.022985] LustreError: 167-0: lustre-OST0000-osc-ffff8800226a5bf0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[27410.023160] LustreError: 27694:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace lustre-OST0000-osc-ffff8800226a5bf0 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
[27410.023163] LustreError: 27694:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff88001ab28e78 (2/0/0/0) (rc: 1)
[27410.025030] LustreError: 26666:0:(ldlm_request.c:1795:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107
[27410.092031] Lustre: DEBUG MARKER: cancel_lru_locks osc stop

[root@centos6-7 ~]# ps ax | grep multi
27707 pts/0 R+ 291:38 multiop /mnt/lustre/f.recovery-small.11 or
27878 pts/1 S+ 0:00 grep multi

strace on this process shows:

read(3, "", 1) = 0
write(2, "short read: 0/1\n", 16) = 16
read(3, "", 1) = 0
write(2, "short read: 0/1\n", 16) = 16
read(3, "", 1) = 0
write(2, "short read: 0/1\n", 16) = 16
read(3, "", 1) = 0
write(2, "short read: 0/1\n", 16) = 16
read(3, "", 1) = 0

So I did another experiment:

[root@centos6-7 ~]# ls -la /proc/27707/fd
total 0
dr-x------ 2 root root 0 Oct 17 10:56 .
dr-xr-xr-x 7 root root 0 Oct 17 10:56 ..
lrwx------ 1 root root 64 Oct 17 11:37 0 -> /dev/pts/0
l-wx------ 1 root root 64 Oct 17 11:37 1 -> pipe:[1489661]
l-wx------ 1 root root 64 Oct 17 10:56 2 -> pipe:[1489624]
lr-x------ 1 root root 64 Oct 17 11:37 3 -> /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# ls -l /mnt/lustre/f.recovery-small.11
-rw-r--r-- 1 root root 0 Oct 17 06:39 /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
cat: /mnt/lustre/f.recovery-small.11: Input/output error
[root@centos6-7 ~]# touch /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]# echo $?
0
[root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11
[root@centos6-7 ~]#

Yet at this point multiop still had not finished its loop. At the very least, I imagine we should make multiop terminate on an error that repeats many times. |
| Comments |
| Comment by Andreas Dilger [ 17/Oct/12 ] |
|
This is fixed with my patch in http://review.whamcloud.com/4265 |
| Comment by Oleg Drokin [ 18/Oct/12 ] |
|
Well, that just papers over the symptoms, doesn't it? I get an I/O error reading this apparently empty file (with cat), which cannot be right. |
| Comment by Oleg Drokin [ 18/Oct/12 ] |
|
By the way, only the first cat gives an I/O error; subsequent cats don't. |
| Comment by Andreas Dilger [ 18/Oct/12 ] |
|
It doesn't paper over the symptoms at all. Instead of multiop being stuck in a loop forever, it returns an error to the caller, so the test can fail in some meaningful manner. |
| Comment by Oleg Drokin [ 27/Oct/12 ] |
|
well, I guess so. |
| Comment by Oleg Drokin [ 27/Oct/12 ] |
|
Hm, spoke too soon. It does pass from time to time, but apparently fails much more frequently. |
| Comment by Oleg Drokin [ 03/Nov/12 ] |
|
Fixed now that change 4265 has landed.