Description
On the SDSC test cluster, here is what we did:
All clients and servers were running 1.8.9-wc1, with 1 MDS and 4 OSSes. Each OSS had only 1 OST.
1. Upgraded all servers to 2.4.1 (actually b2_4 build 40).
2. Booted the servers and mounted the MDT and OSTs; no issues.
3. Reformatted 3 additional OSTs on each OSS and mounted them.
4. Mounted the 1.8.9-wc1 clients, but they cannot access the filesystem.
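For reference, the steps above can be sketched as shell commands. The device paths, mount points, MGS NID, and OST index below are assumptions for illustration (only the fsname "rhino" comes from the log messages), not the actual SDSC configuration:

```shell
# Hedged sketch of the reproduction steps; device paths, mount points,
# the MGS NID (10.2.0.1@o2ib), and the OST index are assumed values,
# not the real SDSC configuration.

# Step 3: reformat one additional OST on an OSS (repeat per OST):
mkfs.lustre --ost --fsname=rhino --index=21 \
            --mgsnode=10.2.0.1@o2ib /dev/sdb

# Step 2: mount the targets on the servers:
mount -t lustre /dev/sda /mnt/lustre/mdt       # on the MDS
mount -t lustre /dev/sdb /mnt/lustre/ost0015   # on each OSS

# Step 4: mount the filesystem on a 1.8.9-wc1 client:
mount -t lustre 10.2.0.1@o2ib:/rhino /mnt/rhino
```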
The stack dump below occurs on all OSSes:
LNet: Service thread pid 3541 completed after 430.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 5 active RPCs
Lustre: Skipped 18 previous similar messages
LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff880410b31400 x1446546855247995/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379965478 ref 1 fl Interpret:/0/0 rc 0/0
LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 32 previous similar messages
Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
Lustre: Skipped 31 previous similar messages
Lustre: rhino-OST0001: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) reconnecting
Lustre: Skipped 158 previous similar messages
Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 23 previous similar messages
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff880182d8a800 x1446546855249432/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966105 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 63 previous similar messages
Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
Lustre: Skipped 64 previous similar messages
Lustre: rhino-OST0005: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 160 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 2 active RPCs
Lustre: Skipped 26 previous similar messages
LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff88027a8b8000 x1446547676281180/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966444 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 57 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 57 previous similar messages
Lustre: rhino-OST000d: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 181 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 26 previous similar messages
LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff88033304dc00 x1446547676282609/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966879 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 44 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 44 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 164 previous similar messages
Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
Lustre: Skipped 28 previous similar messages
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET req@ffff8804451a4400 x1446547676284352/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379967635 ref 1 fl Interpret:/2/0 rc 0/0
LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 76 previous similar messages
Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
Lustre: Skipped 76 previous similar messages
Lustre: rhino-OST0015: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
Lustre: Skipped 191 previous similar messages
LNet: Service thread pid 3577 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 3577, comm: ll_ost_io01_083
Call Trace:
[<ffffffff81080fec>] ? lock_timer_base+0x3c/0x70
[<ffffffff8150ef72>] schedule_timeout+0x192/0x2e0
[<ffffffff81081100>] ? process_timeout+0x0/0x10
[<ffffffffa03666d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
[<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]
[<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff81063330>] ? default_wake_function+0x0/0x20
[<ffffffffa063f9b8>] ? __ptlrpc_prep_bulk_page+0x68/0x170 [ptlrpc]
[<ffffffffa0cbc364>] ost_brw_write+0x1034/0x15d0 [ost]
[<ffffffffa06111a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
[<ffffffffa0cc242b>] ost_handle+0x3ecb/0x48e0 [ost]
[<ffffffffa03720f4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
[<ffffffffa06613c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa0377e05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
[<ffffffffa0658729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[<ffffffff81055ad3>] ? __wake_up+0x53/0x70
[<ffffffffa066275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
[<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
LustreError: dumping log to /tmp/lustre-log.1379967822.3577

The lustre-log is attached.
We will upgrade to the 2.4.1 GA release if this issue is fixed by the patches landed between build #40 and the GA version.