[LU-3998] ll_ost_io process hung after mounting 1.8.9 client on 2.4.1 server (after upgrade) Created: 23/Sep/13 Updated: 24/Mar/18 Resolved: 24/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Minh Diep | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | ptr, sdsc | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 10697 |
| Description |
|
At the SDSC test cluster, here is what we did. All clients and servers were running 1.8.9-wc1 with 1 MDS and 4 OSSs; each OSS has only 1 OST.

1. Upgrade all servers to 2.4.1 (actually b2_4 build 40).

The stack dump below happens on all OSSs.

LNet: Service thread pid 3541 completed after 430.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 5 active RPCs
Lustre: Skipped 18 previous similar messages
LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff880410b31400 x1446546855247995/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379965478 ref 1 fl Interpret:/0/0 rc 0/0
LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 32 previous similar messages
Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
Lustre: Skipped 31 previous similar messages
Lustre: rhino-OST0001: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) reconnecting
Lustre: Skipped 158 previous similar messages
Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 23 previous similar messages
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff880182d8a800 x1446546855249432/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966105 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 63 previous similar messages
Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
Lustre: Skipped 64 previous similar messages
Lustre: rhino-OST0005: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 160 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 2 active RPCs
Lustre: Skipped 26 previous similar messages
LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff88027a8b8000 x1446547676281180/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966444 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 57 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 57 previous similar messages
Lustre: rhino-OST000d: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 181 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 26 previous similar messages
LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff88033304dc00 x1446547676282609/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966879 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 44 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 44 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 164 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 28 previous similar messages
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff8804451a4400 x1446547676284352/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379967635 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 76 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 76 previous similar messages
Lustre: rhino-OST0015: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 191 previous similar messages
LNet: Service thread pid 3577 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 3577, comm: ll_ost_io01_083
Call Trace:
[<ffffffff81080fec>] ? lock_timer_base+0x3c/0x70
[<ffffffff8150ef72>] schedule_timeout+0x192/0x2e0
[<ffffffff81081100>] ? process_timeout+0x0/0x10
[<ffffffffa03666d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]
[<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff81063330>] ? default_wake_function+0x0/0x20
[<ffffffffa063f9b8>] ? __ptlrpc_prep_bulk_page+0x68/0x170 [ptlrpc]
[<ffffffffa0cbc364>] ost_brw_write+0x1034/0x15d0 [ost]
[<ffffffffa06111a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[<ffffffffa0cc242b>] ost_handle+0x3ecb/0x48e0 [ost]
[<ffffffffa03720f4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[<ffffffffa06613c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa0377e05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
[<ffffffffa0658729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff81055ad3>] ? __wake_up+0x53/0x70
[<ffffffffa066275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
[<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
LustreError: dumping log to /tmp/lustre-log.1379967822.3577

Attached is the lustre-log. We will upgrade to the 2.4.1 GA if this issue is fixed in the patches between build #40 and the GA version. |
| Comments |
| Comment by Oleg Drokin [ 25/Sep/13 ] |
|
So writes take hundreds of seconds? I wonder why? |
| Comment by Peter Jones [ 25/Sep/13 ] |
|
Amir, could you please comment on this one? Thanks, Peter |
| Comment by Minh Diep [ 08/Oct/13 ] |
|
Amir, after investigating, it turned out that one of the routers was configured as a client. What this means (IMHO) is that the client requests a reconnect, the server sends the reply to the 'bad' router, and the packet gets dropped. What shouldn't be happening is the server's thread hanging. |
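A minimal sketch of the failure mode described above, under the assumption that the misbehaving node delivers only traffic addressed to itself. All names (struct gateway, gateway_accepts, the NID values) are hypothetical and are not LNet source; this is illustrative only.

```c
/* Hypothetical sketch, not actual LNet code: a node used as a gateway but
 * configured as a plain client accepts messages addressed to itself and
 * drops anything it would have to forward onward. */
#include <stdbool.h>
#include <stdio.h>

struct gateway {
    int  nid;                 /* stand-in for the node's LNet NID */
    bool forwarding_enabled;  /* true only if configured as a router */
};

/* Returns true if the message makes it past the gateway. */
static bool gateway_accepts(const struct gateway *gw, int dst_nid)
{
    if (dst_nid == gw->nid)
        return true;                  /* local delivery always works */
    return gw->forwarding_enabled;    /* otherwise dropped when not a router */
}

int main(void)
{
    struct gateway bad_router = { .nid = 1, .forwarding_enabled = false };

    /* the server's reconnect reply to the client (dst_nid 42) never arrives */
    printf("reply delivered: %s\n",
           gateway_accepts(&bad_router, 42) ? "yes" : "no");
    return 0;
}
```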
| Comment by Amir Shehata (Inactive) [ 09/Oct/13 ] |
|
The way the current code works is that even if a node is not configured as a router but is being used as one, the clients and servers using it in their routes will think it is up. Clients and servers ping the routers at regular intervals, but the ping assumes that if the node is up then the router is up; it doesn't take into consideration whether the routing service is actually enabled. So servers and clients will continue using that node, but that node will drop all messages not destined for it. I wonder if the server thread is hung or simply waiting for a response (ACK) to the message it just sent. From the attached stack trace it appears that it is waiting on a timeout. I suspect that after the server sends a message it waits for a response, but since the message is dropped at the "bad router", no response is received and the server eventually times out. Am I looking at the right place? Or is there another area that indicates the server's thread is hung? |
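A short illustration of the gap described in this comment, assuming the health check asks only "does the node answer a ping?" and never "is forwarding enabled?". The identifiers (struct route_entry, router_considered_alive) are hypothetical and do not come from the LNet router checker; this is a sketch of the described behaviour, not the implementation.

```c
/* Hypothetical sketch, not the LNet router checker: a node that answers
 * pings is treated as a healthy gateway even if routing is disabled. */
#include <stdbool.h>
#include <stdio.h>

struct route_entry {
    const char *name;        /* illustrative label standing in for a gateway NID */
    bool responds_to_ping;   /* node is up and answers pings */
    bool forwarding_enabled; /* routing service actually configured */
};

/* What the periodic router ping effectively tests. */
static bool router_considered_alive(const struct route_entry *gw)
{
    return gw->responds_to_ping;   /* forwarding_enabled is never consulted */
}

int main(void)
{
    struct route_entry bad = { "bad-router", true, false };

    /* the gateway stays "alive" and keeps being selected, even though it
     * will drop every message it is asked to forward */
    printf("%s alive per ping check: %s\n", bad.name,
           router_considered_alive(&bad) ? "yes" : "no");
    return 0;
}
```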
| Comment by Minh Diep [ 11/Nov/13 ] |
|
Yes, Amir, I think your assessment is correct. The issue here is why the thread was hung. Shouldn't it have timed out and evicted/resent...? |
| Comment by Andreas Dilger [ 15/Nov/13 ] |
|
It would be useful to know whether this thread became "unstuck" after this stack was printed. The stack dump itself is not surprising if there was a failed router; as the message at the top of the stack notes, it is dumped just for debugging.
|
| Comment by Amir Shehata (Inactive) [ 15/Nov/13 ] |
|
According to the stack trace:

while ((rc == -ETIMEDOUT) &&

After talking with Andreas: the l_wait_event() waits for the bulk to be delivered to the client or for the client to be evicted. If, as in this case, the bulk is never delivered and the client is not evicted, then under the adaptive timeout algorithm the timeout can grow to a maximum of 900s. So it's quite possible for a thread to appear hung when it is really just waiting out the timeout. Moreover, according to Andreas, the server doesn't perform any send retries. As far as I understand the setup, there appears to be one-way communication: the client is able to send to the server, but the server is not able to send to the client. This can be caused by the misconfigured router. If the routes are ordered such that the "bad router" is first on the server side, it will continuously get picked when sending to the client; but on the client side the "good router" might be listed first and will always be picked, so the client is able to send to the server. So we have asymmetric communication there. I believe this explains the aspects of the problem. |
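A simplified sketch of the wait pattern described in this comment, assuming a thread that retries the wait on -ETIMEDOUT until an adaptive-timeout deadline passes and never resends. The helpers (struct bulk_state, wait_for_bulk, bulk_io_wait_sketch) are hypothetical stand-ins, not the actual target_bulk_io() code in ldlm_lib.c.

```c
/* Simplified, hypothetical sketch of the wait loop described above; it is
 * not the real ldlm_lib.c code.  The thread only looks hung: it is waiting
 * out an adaptive-timeout deadline with no resend attempted. */
#include <errno.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

struct bulk_state {
    bool   completed;  /* bulk delivered to the client */
    bool   evicted;    /* client evicted by the server */
    time_t deadline;   /* adaptive-timeout deadline (can grow toward 900s) */
};

/* Stand-in for l_wait_event(): 0 on completion/eviction, -ETIMEDOUT when the
 * per-iteration timeout expires.  The kernel sleeps on a waitqueue; here we
 * simply poll once a second. */
static int wait_for_bulk(const struct bulk_state *b, int timeout_sec)
{
    for (int i = 0; i < timeout_sec; i++) {
        if (b->completed || b->evicted)
            return 0;
        sleep(1);
    }
    return -ETIMEDOUT;
}

static int bulk_io_wait_sketch(const struct bulk_state *b)
{
    int rc;

    do {
        int remaining = (int)(b->deadline - time(NULL));
        rc = wait_for_bulk(b, remaining > 0 ? remaining : 1);
        /* while rc keeps coming back as -ETIMEDOUT and the deadline has not
         * yet passed, the service thread appears stuck in this loop */
    } while (rc == -ETIMEDOUT && time(NULL) < b->deadline);

    return rc;
}

int main(void)
{
    /* a bulk that is never delivered and a deadline a few seconds away:
     * the call blocks until the deadline, then gives up with -ETIMEDOUT */
    struct bulk_state b = { false, false, time(NULL) + 5 };

    return bulk_io_wait_sketch(&b) == -ETIMEDOUT ? 0 : 1;
}
```

With a deadline hundreds of seconds out, the loop above would keep the thread waiting long enough to trigger the "service thread was inactive" watchdog message seen in the description, without the thread actually being stuck.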
| Comment by Peter Jones [ 24/Mar/18 ] |
|
SDSC have long since moved to more current versions, so I think that this is no longer a concern