[LU-3998] ll_ost_io process hung after mounting 1.8.9 client on 2.4.1 server (after upgrade) Created: 23/Sep/13  Updated: 24/Mar/18  Resolved: 24/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Minh Diep Assignee: Amir Shehata (Inactive)
Resolution: Won't Fix Votes: 0
Labels: ptr, sdsc

Attachments: File lustre-log.1379967822.3577.txt.gz    
Severity: 3
Rank (Obsolete): 10697

 Description   

At the SDSC test cluster, here is what we did:

All clients and servers were running 1.8.9-wc1, with 1 MDS and 4 OSSes. Each OSS had only 1 OST.

1. Upgraded all servers to 2.4.1 (actually b2_4 build 40)
2. Booted the servers and mounted the MDT and OSSes... no issue
3. Reformatted 3 additional OSTs on each OSS and mounted them
4. Mounted the 1.8.9-wc1 clients, but they cannot access the FS

The stack dump below occurs on every OSS.

LNet: Service thread pid 3541 completed after 430.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 5 active RPCs
Lustre: Skipped 18 previous similar messages
LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880410b31400 x1446546855247995/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379965478 ref 1 fl Interpret:/0/0 rc 0/0
LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 32 previous similar messages
Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
Lustre: Skipped 31 previous similar messages
Lustre: rhino-OST0001: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) reconnecting
Lustre: Skipped 158 previous similar messages
Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 23 previous similar messages
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880182d8a800 x1446546855249432/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966105 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 63 previous similar messages
Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
Lustre: Skipped 64 previous similar messages
Lustre: rhino-OST0005: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 160 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 2 active RPCs
Lustre: Skipped 26 previous similar messages
LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88027a8b8000 x1446547676281180/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966444 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 57 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 57 previous similar messages
Lustre: rhino-OST000d: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 181 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 26 previous similar messages
LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88033304dc00 x1446547676282609/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966879 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 44 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 44 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 164 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 28 previous similar messages
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff8804451a4400 x1446547676284352/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379967635 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 76 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 76 previous similar messages
Lustre: rhino-OST0015: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 191 previous similar messages
LNet: Service thread pid 3577 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 3577, comm: ll_ost_io01_083


Call Trace:
 [<ffffffff81080fec>] ? lock_timer_base+0x3c/0x70
 [<ffffffff8150ef72>] schedule_timeout+0x192/0x2e0
 [<ffffffff81081100>] ? process_timeout+0x0/0x10
 [<ffffffffa03666d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
 [<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]
 [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81063330>] ? default_wake_function+0x0/0x20
 [<ffffffffa063f9b8>] ? __ptlrpc_prep_bulk_page+0x68/0x170 [ptlrpc]
 [<ffffffffa0cbc364>] ost_brw_write+0x1034/0x15d0 [ost]
 [<ffffffffa06111a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa0cc242b>] ost_handle+0x3ecb/0x48e0 [ost]
 [<ffffffffa03720f4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa06613c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
 [<ffffffffa0377e05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
 [<ffffffffa0658729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
 [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
 [<ffffffffa066275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
 [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1379967822.3577
The lustre-log dump is attached.

We will upgrade to the 2.4.1 GA release if this issue is fixed by the patches landed between build #40 and the GA version.



 Comments   
Comment by Oleg Drokin [ 25/Sep/13 ]

So writes take hundreds of seconds? I wonder why?

Comment by Peter Jones [ 25/Sep/13 ]

Amir

Could you please comment on this one?

Thanks

Peter

Comment by Minh Diep [ 08/Oct/13 ]

Amir,

After investigating, it turned out that one of the routers was configured as a client. What this means (IMHO) is that the client requests a reconnect, the server sends the reply via the 'bad' router, and the packet gets dropped. What shouldn't happen is the server's thread hanging.

Comment by Amir Shehata (Inactive) [ 09/Oct/13 ]

The way the current code works, even if a node is not configured as a router but is being used as one, the clients and servers that list it in their routes will think it's up. Clients and servers ping the routers at regular intervals, but the ping assumes that if the node is up then the router is up; it doesn't take into account whether the routing service is actually enabled. So servers and clients will continue using that node, while the node drops all messages not destined for it.
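
As a rough standalone illustration of that point (this is not the actual LNet code; the types and helpers below are made up for the example), the aliveness check and the forwarding decision look at different things:

/* Toy model: a node that answers pings but has forwarding disabled is
 * still treated as a usable router, yet it drops forwarded traffic. */
#include <stdbool.h>
#include <stdio.h>

struct router_state {
        bool answers_ping;        /* node is up and replies to pings */
        bool forwarding_enabled;  /* node is actually configured to route */
};

/* The periodic check only considers ping reachability. */
static bool router_considered_alive(const struct router_state *r)
{
        return r->answers_ping;   /* forwarding_enabled is never consulted */
}

/* Whether a message handed to this node for forwarding gets through. */
static bool message_forwarded(const struct router_state *r)
{
        return r->answers_ping && r->forwarding_enabled;
}

int main(void)
{
        struct router_state bad = { .answers_ping = true,
                                    .forwarding_enabled = false };

        printf("alive? %d  forwarded? %d\n",
               router_considered_alive(&bad), message_forwarded(&bad));
        /* Prints "alive? 1  forwarded? 0": the route stays in use while
         * every message routed through the node is silently dropped. */
        return 0;
}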

I wonder whether the server thread is hung or simply waiting for a response (ACK) to the message it just sent. From the attached stack trace it appears to be waiting on a timeout. I suspect that after the server sends a message it waits for a response, but since the message is dropped at the "bad router", no response is received and the server eventually times out.

Am I looking in the right place? Or is there another area that indicates the server's thread is hung?

Comment by Minh Diep [ 11/Nov/13 ]

Yes, Amir, I think your assessment is correct. The issue here is why the thread was hung. Shouldn't it have timed out and evicted/resent...?

Comment by Andreas Dilger [ 15/Nov/13 ]

It would be useful to know if this thread became "unstuck" after this stack was printed. The stack dump itself is not surprising if there was a failed router (this is just for debugging as printed at the top of the stack):

The thread might be hung, or it might only be slow and will resume later.

Comment by Amir Shehata (Inactive) [ 15/Nov/13 ]

According to the stack trace:
[<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]
This references:
do {
        long timeoutl = req->rq_deadline - cfs_time_current_sec();
        cfs_duration_t timeout = timeoutl <= 0 ?
                CFS_TICK : cfs_time_seconds(timeoutl);

        *lwi = LWI_TIMEOUT_INTERVAL(timeout, cfs_time_seconds(1),
                                    target_bulk_timeout, desc);
        rc = l_wait_event(desc->bd_waitq,              /* <--- */
                          !ptlrpc_server_bulk_active(desc) ||
                          exp->exp_failed ||
                          exp->exp_abort_active_req,
                          lwi);
        LASSERT(rc == 0 || rc == -ETIMEDOUT);
        /* Wait again if we changed deadline. */
} while ((rc == -ETIMEDOUT) &&
         (req->rq_deadline > cfs_time_current_sec()));

After talking with Andreas: the l_wait_event() waits for the bulk to be delivered to the client or for the client to be evicted. If, as in this case, the bulk is never delivered and the client is not evicted, then under the adaptive timeout algorithm the timeout can grow to a maximum of 900s. So it's quite possible for a thread to appear hung when it is really just waiting on the timeout.

Moreover, according to Andreas, the server doesn't perform any send retries.

As far as I understand the setup, there appears to be one-way communication: the client is able to send to the server, but the server is not able to send to the client. This can be caused by the misconfigured router. If the routes are ordered such that the "bad router" is first on the server side, it will continuously get picked when sending to the client; but on the client side the "good router" might be listed first and will always be picked, so the client can still reach the server. So we have asymmetric communication there.
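
A quick standalone sketch of that route-ordering effect (again, not actual LNet code; the selection logic and names are simplified assumptions for illustration):

/* Both sides pick the first route whose gateway looks alive, so list
 * ordering alone decides whether traffic goes through the bad router. */
#include <stdbool.h>
#include <stdio.h>

struct route {
        const char *gateway;
        bool        alive;     /* as reported by the pinger */
        bool        forwards;  /* actually configured to route */
};

static const struct route *pick_route(const struct route *routes, int n)
{
        for (int i = 0; i < n; i++)
                if (routes[i].alive)
                        return &routes[i];
        return NULL;
}

int main(void)
{
        /* The misconfigured node answers pings but drops forwarded traffic. */
        struct route bad  = { "bad router",  true, false };
        struct route good = { "good router", true, true  };

        struct route server_side[] = { bad, good };  /* bad router listed first */
        struct route client_side[] = { good, bad };  /* good router listed first */

        const struct route *s = pick_route(server_side, 2);
        const struct route *c = pick_route(client_side, 2);

        printf("server->client via %s, delivered: %d\n", s->gateway, s->forwards);
        printf("client->server via %s, delivered: %d\n", c->gateway, c->forwards);
        /* Server traffic always dies at the bad router while client traffic
         * gets through: exactly the one-way communication described above. */
        return 0;
}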

I believe this explains both aspects of the problem:
1. The client not being able to mount the FS
2. The server-side message that dumps the stack due to >200s of inactivity

Comment by Peter Jones [ 24/Mar/18 ]

SDSC long ago moved to more current versions, so I think this is no longer a concern.
