Details
-
Bug
-
Resolution: Fixed
-
Blocker
-
None
-
Lustre 2.4.0
-
None
-
3
-
5248
Description
I started to hit tihs issue frequently on repeated runs of recovery-small.
Multiop starts to loop outputting the samme message over and over again:
short read: 0/1
I logged into the node and here's what I see
[root@centos6-7 ~]# dmesg | tail [27409.796429] LustreError: 138-a: lustre-OST0001: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107 [27409.798509] LustreError: Skipped 2 previous similar messages [27409.821171] Lustre: DEBUG MARKER: cancel_lru_locks osc start [27410.021536] LustreError: 26667:0:(ldlm_lockd.c:2177:ldlm_cancel_handler()) ldlm_cancel from 0@lo arrived at 1350470383 with bad export cookie 1895698488629392860 [27410.022477] LustreError: 26666:0:(ldlm_request.c:1169:ldlm_cli_cancel_req()) Got rc -107 from cancel RPC: canceling anyway [27410.022985] LustreError: 167-0: lustre-OST0000-osc-ffff8800226a5bf0: This client was evicted by lustre-OST0000; in progress operations using this service will fail. [27410.023160] LustreError: 27694:0:(ldlm_resource.c:761:ldlm_resource_complain()) Namespace lustre-OST0000-osc-ffff8800226a5bf0 resource refcount nonzero (1) after lock cleanup; forcing cleanup. [27410.023163] LustreError: 27694:0:(ldlm_resource.c:767:ldlm_resource_complain()) Resource: ffff88001ab28e78 (2/0/0/0) (rc: 1) [27410.025030] LustreError: 26666:0:(ldlm_request.c:1795:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -107 [27410.092031] Lustre: DEBUG MARKER: cancel_lru_locks osc stop [root@centos6-7 ~]# ps ax | grep multi 27707 pts/0 R+ 291:38 multiop /mnt/lustre/f.recovery-small.11 or 27878 pts/1 S+ 0:00 grep multi
strace on this process shows
read(3, "", 1) = 0 write(2, "short read: 0/1\n", 16) = 16 read(3, "", 1) = 0 write(2, "short read: 0/1\n", 16) = 16 read(3, "", 1) = 0 write(2, "short read: 0/1\n", 16) = 16 read(3, "", 1) = 0 write(2, "short read: 0/1\n", 16) = 16 read(3, "", 1) = 0
So I did another experiment:
[root@centos6-7 ~]# ls -la /proc/27707/fd total 0 dr-x------ 2 root root 0 Oct 17 10:56 . dr-xr-xr-x 7 root root 0 Oct 17 10:56 .. lrwx------ 1 root root 64 Oct 17 11:37 0 -> /dev/pts/0 l-wx------ 1 root root 64 Oct 17 11:37 1 -> pipe:[1489661] l-wx------ 1 root root 64 Oct 17 10:56 2 -> pipe:[1489624] lr-x------ 1 root root 64 Oct 17 11:37 3 -> /mnt/lustre/f.recovery-small.11 [root@centos6-7 ~]# ls -l /mnt/lustre/f.recovery-small.11 -rw-r--r-- 1 root root 0 Oct 17 06:39 /mnt/lustre/f.recovery-small.11 [root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11 cat: /mnt/lustre/f.recovery-small.11: Input/output error [root@centos6-7 ~]# touch /mnt/lustre/f.recovery-small.11 [root@centos6-7 ~]# echo $? 0 [root@centos6-7 ~]# cat /mnt/lustre/f.recovery-small.11 [root@centos6-7 ~]#
yet at this point multiop still did not finish its loop.
At the very least I imagine we should actually make multiop terminate on error that repeats many times.
Attachments
Issue Links
- is related to
-
LU-1612 Test failure on test suite parallel-scale-nfsv3, subtest test_iorssf
-
- Resolved
-