Lustre / LU-3998

ll_ost_io process hung after mounting 1.8.9 client on 2.4.1 server (after upgrade)

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.1
    • 3
    • 10697

    Description

      At the SDSC test cluster, here is what we did:

      All clients and servers were running 1.8.9-wc1, with 1 MDS and 4 OSSes. Each OSS had only 1 OST.

      1. Upgraded all servers to 2.4.1 (actually b2_4 build 40)
      2. Booted the servers and mounted the MDT and OSTs... no issue
      3. Reformatted 3 additional OSTs on each OSS and mounted them
      4. Mounted the 1.8.9-wc1 clients, but they could not access the FS

      The stack dump below appears on all OSSes.

      LNet: Service thread pid 3541 completed after 430.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 5 active RPCs
      Lustre: Skipped 18 previous similar messages
      LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880410b31400 x1446546855247995/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379965478 ref 1 fl Interpret:/0/0 rc 0/0
      LustreError: 6262:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 32 previous similar messages
      Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
      Lustre: Skipped 31 previous similar messages
      Lustre: rhino-OST0001: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) reconnecting
      Lustre: Skipped 158 previous similar messages
      Lustre: rhino-OST0015: Client 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 23 previous similar messages
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff880182d8a800 x1446546855249432/t0(0) o4->53e70e6e-2fe3-c152-77d3-3aed9ca43fc1@10.2.255.254@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966105 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 63 previous similar messages
      Lustre: rhino-OST0015: Bulk IO write error with 53e70e6e-2fe3-c152-77d3-3aed9ca43fc1 (at 10.2.255.254@o2ib), client will retry: rc -110
      Lustre: Skipped 64 previous similar messages
      Lustre: rhino-OST0005: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 160 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 2 active RPCs
      Lustre: Skipped 26 previous similar messages
      LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88027a8b8000 x1446547676281180/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966444 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 6258:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 57 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 57 previous similar messages
      Lustre: rhino-OST000d: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 181 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 26 previous similar messages
      LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff88033304dc00 x1446547676282609/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379966879 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 8027:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 44 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 44 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 164 previous similar messages
      Lustre: rhino-OST0011: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) refused reconnection, still busy with 1 active RPCs
      Lustre: Skipped 28 previous similar messages
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) @@@ Reconnect on bulk GET  req@ffff8804451a4400 x1446547676284352/t0(0) o4->9fb61921-0ff1-2363-1676-c8360f84f18d@10.2.255.252@o2ib:0/0 lens 448/448 e 0 to 0 dl 1379967635 ref 1 fl Interpret:/2/0 rc 0/0
      LustreError: 3520:0:(ldlm_lib.c:2711:target_bulk_io()) Skipped 76 previous similar messages
      Lustre: rhino-OST0011: Bulk IO write error with 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib), client will retry: rc -110
      Lustre: Skipped 76 previous similar messages
      Lustre: rhino-OST0015: Client 9fb61921-0ff1-2363-1676-c8360f84f18d (at 10.2.255.252@o2ib) reconnecting
      Lustre: Skipped 191 previous similar messages
      LNet: Service thread pid 3577 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      Pid: 3577, comm: ll_ost_io01_083
      
      
      Call Trace:
       [<ffffffff81080fec>] ? lock_timer_base+0x3c/0x70
       [<ffffffff8150ef72>] schedule_timeout+0x192/0x2e0
       [<ffffffff81081100>] ? process_timeout+0x0/0x10
       [<ffffffffa03666d1>] cfs_waitq_timedwait+0x11/0x20 [libcfs]
       [<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]
       [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81063330>] ? default_wake_function+0x0/0x20
       [<ffffffffa063f9b8>] ? __ptlrpc_prep_bulk_page+0x68/0x170 [ptlrpc]
       [<ffffffffa0cbc364>] ost_brw_write+0x1034/0x15d0 [ost]
       [<ffffffffa06111a0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
       [<ffffffffa0cc242b>] ost_handle+0x3ecb/0x48e0 [ost]
       [<ffffffffa03720f4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
       [<ffffffffa06613c8>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa0377e05>] ? lc_watchdog_touch+0xd5/0x170 [libcfs]
       [<ffffffffa0658729>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffffa03762d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
       [<ffffffff81055ad3>] ? __wake_up+0x53/0x70
       [<ffffffffa066275e>] ptlrpc_main+0xace/0x1700 [ptlrpc]
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0ca>] child_rip+0xa/0x20
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffffa0661c90>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      
      LustreError: dumping log to /tmp/lustre-log.1379967822.3577
      Attached is the lustre-log.
      

      We will upgrade to the 2.4.1 GA release if this issue is fixed in the patches between build #40 and the GA version.

      Attachments

        Activity

          [LU-3998] ll_ost_io process hung after mounting 1.8.9 client on 2.4.1 server (after upgrade)
          pjones Peter Jones added a comment -

          SDSC have long ago moved to more current versions, so I think that this is no longer a concern.


          ashehata Amir Shehata (Inactive) added a comment -

          According to the stack trace:

           [<ffffffffa061b608>] target_bulk_io+0x3b8/0x910 [ptlrpc]

          This references the following loop in target_bulk_io():

          do {
                  long timeoutl = req->rq_deadline - cfs_time_current_sec();
                  cfs_duration_t timeout = timeoutl <= 0 ?
                                           CFS_TICK : cfs_time_seconds(timeoutl);

                  *lwi = LWI_TIMEOUT_INTERVAL(timeout, cfs_time_seconds(1),
                                              target_bulk_timeout, desc);
                  rc = l_wait_event(desc->bd_waitq,                /* <--- */
                                    !ptlrpc_server_bulk_active(desc) ||
                                    exp->exp_failed ||
                                    exp->exp_abort_active_req, lwi);
                  LASSERT(rc == 0 || rc == -ETIMEDOUT);
                  /* Wait again if we changed deadline. */
          } while ((rc == -ETIMEDOUT) &&
                   (req->rq_deadline > cfs_time_current_sec()));

          After talking with Andreas: the l_wait_event() waits either for the bulk to be delivered to the client or for the client to be evicted. If, as in this case, the bulk is never delivered and the client is not evicted, then under the adaptive timeout algorithm the timeout can grow to a maximum of 900s. So it's quite possible for a thread to appear hung when it is really just waiting on the timeout.
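
          To make the timing concrete, here is a minimal toy simulation of that loop (not Lustre code; the 10s tick and the assumption that the bulk never completes and the client is never evicted are illustrative). It shows how a thread can sit in repeated timed waits well past the 200s watchdog while its deadline, which adaptive timeouts can push out toward the ~900s mentioned above, has not yet expired:

          /* Toy illustration only -- not Lustre code. */
          #include <stdio.h>
          #include <stdbool.h>

          #define MAX_WAIT       900   /* the ~900s maximum mentioned above */
          #define WATCHDOG_LIMIT 200   /* inactivity limit that dumps the stack */

          /* Stand-ins for the real wakeup conditions; in this ticket the bulk
           * never completes and the client is never evicted. */
          static bool bulk_completed(void) { return false; }
          static bool client_evicted(void) { return false; }

          int main(void)
          {
                  long now = 0;
                  long deadline = MAX_WAIT; /* deadline already pushed out */
                  long tick = 10;           /* pretend each timed wait lasts 10s */

                  while (now < deadline && !bulk_completed() && !client_evicted()) {
                          now += tick;      /* another timed wait times out */
                          if (now == WATCHDOG_LIMIT)
                                  printf("%lds: watchdog fires and dumps the stack, "
                                         "but the thread is only waiting\n", now);
                  }
                  printf("%lds: wait loop finally gives up (deadline reached)\n", now);
                  return 0;
          }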

          Moreover, according to Andreas, the server doesn't perform any send retries.

          As far as I understand the setup, it appears that one-way communication is happening: the client is able to send to the server, but the server is not able to send to the client. This can be caused by a misconfigured router. If the routes are ordered such that the "bad router" is first on the server side, it will continuously get picked when sending to the client; on the client side the "good router" might be listed first, so it will always be picked, and thus the client is able to send to the server. So we have asymmetric communication there.

          I believe this explains both aspects of the problem:
          1. The client is not able to mount the FS.
          2. The message that dumps the server stack on the server side after >200s of inactivity.
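
          As a toy illustration of that theory (not LNet code; the router names and the selection logic below are made up), each side simply uses the first router in its own configured order, so the two directions can take different paths:

          /* Toy model only -- the NIDs and data structures are hypothetical. */
          #include <stdio.h>
          #include <stdbool.h>

          struct router {
                  const char *nid;
                  bool        forwards;  /* does this node actually route traffic? */
          };

          /* The server happens to list the broken router first,
           * while the client lists the good one first. */
          static const struct router server_routes[] = { { "badrtr@o2ib",  false },
                                                         { "goodrtr@o2ib", true  } };
          static const struct router client_routes[] = { { "goodrtr@o2ib", true  },
                                                         { "badrtr@o2ib",  false } };

          /* Both routers answer pings, so both look "up"; each peer therefore
           * just keeps using the first entry in its own list. */
          static const struct router *pick(const struct router *routes)
          {
                  return &routes[0];
          }

          int main(void)
          {
                  const struct router *c2s = pick(client_routes);
                  const struct router *s2c = pick(server_routes);

                  printf("client -> server via %s: %s\n", c2s->nid,
                         c2s->forwards ? "delivered" : "dropped");
                  printf("server -> client via %s: %s\n", s2c->nid,
                         s2c->forwards ? "delivered" : "dropped");
                  return 0;
          }

          With that ordering the client-to-server direction works while server-to-client replies are silently dropped, which would match the repeated bulk GET timeouts in the logs above.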


          adilger Andreas Dilger added a comment -

          It would be useful to know whether this thread became "unstuck" after this stack was printed. The stack dump itself is not surprising if there was a failed router (it is just for debugging, as printed at the top of the stack trace):

          The thread might be hung, or it might only be slow and will resume later.

          mdiep Minh Diep added a comment -

          Yes, Amir, I think your assessment is correct. The issue here is why the thread was hung. Shouldn't it have timed out and evicted/resent...?

          ashehata Amir Shehata (Inactive) added a comment - - edited

          The way the current code works, even if a node that is not configured as a router is being used as a router, the clients and servers that list it in their routes will think it is up. Clients and servers can ping the routers at regular intervals, but the ping assumes that if the node is up then the router is up; it does not take into consideration whether the routing service is actually on. So servers and clients will continue using that node, but that node will drop all messages not destined for it.

          I wonder if the server thread is hung or simply waiting for a response (ACK) to the message it just sent. From the attached stack trace it appears that it is waiting on a timeout. I suspect that after the server sends a message it waits for a response, but since the message is dropped at the "bad router", no response is received and the server eventually times out.

          Am I looking at the right place? Or is there another area where it indicates that the server's thread is hung?

          mdiep Minh Diep added a comment -

          Amir,

          After investigating, it turned out that one of the routers was configured as a client. What this means (IMHO) is that the client requests a reconnect, the server sends the reply to the 'bad' router, and the packet gets dropped. What shouldn't be happening is the server's thread hanging.

          pjones Peter Jones added a comment -

          Amir

          Could you please comment on this one?

          Thanks

          Peter

          green Oleg Drokin added a comment -

          So writes take hundreds of seconds? I wonder why?


          People

            ashehata Amir Shehata (Inactive)
            mdiep Minh Diep
            Votes: 0
            Watchers: 8
