[LU-659] Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release) - Whamcloud Community JIRA

Details

Type: Epic
Resolution: Cannot Reproduce
Priority: Critical
Fix Version/s: None
Affects Version/s: None
Labels:
- o2iblnd
Environment:

Hide
Redsky Cluster - Oracle (Sun) C48 Blade servers ~2700 nodes, running LLNL Chaos version 4.4.3 (TOSS 1.4.3) with lustre version 1.8.5.0-5. Blades consist of 2 nahalem 4 core procs with 12GB mem. Networking is QDR-IB only using a 3D torus routing algorithm. Storage is software RAID 6 8+2 (mdadm) running on Oracle (Sun) J4400 JBODs. More specifically:

OFED1.5
Kernel 2.6.18-107 + two patches:
Lustre kernel patch (raid10-soft-lockups.patch)
OFED kernel patch (mad-qp-size-tunable.patch)
Lustre-1.8.5.0-3chaos, which seems to be Lustre-1.8.5.0-3 + five patches:
ff2ef0c ~~LU-337~~ Fix alloc mask in alloc_qinfo()
f9e0e36 ~~LU-234~~ OOM killer causes node hang.
09eb8f9 ~~LU-286~~ racer: general protection fault.
f5a9068 ~~LU-274~~ Update LVB from disk when glimpse callback return error
c4d695f Add IP to error message when peer's IB port is not privileged.

Show
Redsky Cluster - Oracle (Sun) C48 Blade servers ~2700 nodes, running LLNL Chaos version 4.4.3 (TOSS 1.4.3) with lustre version 1.8.5.0-5. Blades consist of 2 nahalem 4 core procs with 12GB mem. Networking is QDR-IB only using a 3D torus routing algorithm. Storage is software RAID 6 8+2 (mdadm) running on Oracle (Sun) J4400 JBODs. More specifically: OFED1.5 Kernel 2.6.18-107 + two patches: Lustre kernel patch (raid10-soft-lockups.patch) OFED kernel patch (mad-qp-size-tunable.patch) Lustre-1.8.5.0-3chaos, which seems to be Lustre-1.8.5.0-3 + five patches: ff2ef0c LU-337 Fix alloc mask in alloc_qinfo() f9e0e36 LU-234 OOM killer causes node hang. 09eb8f9 LU-286 racer: general protection fault. f5a9068 LU-274 Update LVB from disk when glimpse callback return error c4d695f Add IP to error message when peer's IB port is not privileged.

Epic:
- hang
- server
- timeout
Rank (Obsolete):
10101

Description

Since upgrading from TOSS-1.3.4 we have been experiencing MAJOR problems with stability. The one common denominator seems to be latency between lustre clients and servers. Both servers and clients are dumping lots of syslog/dmesg messages about timeouts and heavy IO loads.In particular we are seeing large volumes of messages like this from clients:

2011-09-02 10:18:49 rs249 INFO: task xsolver:23906 blocked for more than 120 seconds. <kern.err>
2011-09-02 10:18:49 rs249 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <kern.err>
2011-09-02 10:18:49 rs249 xsolver D ffff81000101d640 0 23906 23896 23915 23905 (NOTLB) <kern.warning>
2011-09-02 10:18:49 rs249 ffff810162bc9be8 0000000000000046 0000000000000000 0000000000400000 <kern.warning>
2011-09-02 10:18:49 rs249 ffff8101d9eff000 0000000000000007 ffff8101eb2ce7f0 ffff81020554c7f0 <kern.warning>
2011-09-02 10:18:49 rs249 0000460dc26b8914 000000000000b9f5 ffff8101eb2ce9d8 00000003de55f1e8 <kern.warning>
2011-09-02 10:18:49 rs249 Call Trace: <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff8002960b>] sync_page+0x0/0x42 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80066812>] io_schedule+0x3f/0x63 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80029649>] sync_page+0x3e/0x42 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80041222>] __lock_page+0x64/0x6b <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff800140ac>] find_lock_page+0x69/0xa3 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff888c468d>] :lustre:ll_file_readv+0xbcd/0x2100 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff888f60f8>] :lustre:ll_stats_ops_tally+0x48/0xf0 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff888c5bde>] :lustre:ll_file_read+0x1e/0x20 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff8000b80f>] vfs_read+0xcc/0x172 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80011fef>] sys_read+0x47/0x6f <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80060116>] system_call+0x7e/0x83 <kern.warning>

and:

2011-09-02 03:36:28 rs2166 >
2011-09-02 03:36:28 rs2166 nds that need a credit) <kern.e 2011-09-02 03:36:28 rs2166 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 10339d4e1 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 10339d3ea <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 1033a9074 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 1033a851c <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 1033b31b1 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 1033b2a31 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 1033bff6d <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 1033bf8fc <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 1033d0ce5 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 Lustre: 2011-09-02 03:36:28 rs2166 n.err> LustreError: 4284:0:(o2iblnd_cb.c:2984:kiblnd_check_txs()) Timed out tx: tx_queue, 106 seconds <kern.err
LustreError: 4284:0:(o2iblnd_cb.c:3001:kiblnd_conn_timed_out()) Timed out RDMA on queue ibc_tx_queue (se
rr>
Lustre: conn[29] ffff81037c66e1c0 [version 12] -> 10.1.36.9@o2ib: <kern.info>
state 3 nposted 1/1 cred 0 o_cred 0 r_cred 8 <kern.info>
ready 0 scheduled -1 comms_err 0 last_send 1033d0ce5 <kern.info>
early_rxs: <kern.info>
tx_queue_nocred: <kern.info>
tx_queue_rsrvd: <kern.info>
tx_queue: <kern.info>
ffffc20010790d10 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3368 msg !- type d1 cred 2 aqt
ffffc2001078fd70 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3369 msg !- type d1 cred 0 aqt
ffffc20010790798 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c4 msg !- type d1 cred 1 aqt
ffffc2001078eb78 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c5 msg !- type d1 cred 0 aqt
ffffc2001078ffc8 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3dd9 msg !- type d1 cred 2 aqt
ffffc2001078dca0 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3ddb msg !- type d1 cred 0 aqt
ffffc20010791350 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bb msg !- type d1 cred 1 aqt
ffffc2001078fb18 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bc msg !- type d1 cred 1 aqt
active_txs: <kern.info>
ffffc20010790860 snd 1 q 0 w 0 rc 0 dl 1033e9385 cookie 0xd4fcd msg – type d0 cred 0 aqt
rxs: <kern.info>
ffff81037defd000 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd068 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd0d0 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd138 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd1a0 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd208 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd270 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd2d8 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd340 status 0 msg_type d0 cred 0 <kern.info>
ffff81037defd3a8 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd410 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd478 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd4e0 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd548 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd5b0 status 0 msg_type d1 cred 0 <kern.info>
ffff81037defd618 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd680 status 0 msg_type d1 cred 2 <kern.info>
ffff81037defd6e8 status 0 msg_type d1 cred 0 <kern.info>
ib_qp: qp_state 3 cur_qp_state 3 mtu 4 mig_state 0 <kern.info>
ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
ib_qp_cap: swr 4096 rwr 32 ssge 1 rsge 1 inline 0 <kern.info>
ib_ah_attr : dlid 21 sl 0 s_p_bits 0 rate 2 flags 0 port 1 <kern.info>
ib_ah_attr(alt): dlid 12433 sl 14 s_p_bits 24 rate 0 flags 1 port 1 <kern.info>
ib_qp: pkey 0 alt_pkey 17 en 3 sq 0 <kern.info>
ib_qp: max_rd 1 max_dest 1 min_rnr 27 port 1 <kern.info>
ib_qp: timeout 19 retry 5 rnr_re 6 alt_port 1 alt_timeout 14 <kern.info>
LustreError: 4284:0:(o2iblnd_cb.c:3079:kiblnd_check_conns()) Timed out RDMA with 10.1.36.9@o2ib (0) <ker

On the server side we are seeing lot of messages similar to this:

2011-09-02 05:30:59 oss-scratch14 Lustre: Service thread pid 11707 was inactive for 600.00s. The thread might be hung, or it might
only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
2011-09-02 05:30:59 oss-scratch14 Pid: 11707, comm: ll_ost_io_318 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 Call Trace: <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80066031>] thread_return+0x5e/0xf6 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b67>] default_wake_function+0xd/0xf <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800532b5>] __wake_up_locked+0x13/0x15 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8029d1c9>] __down_trylock+0x1c/0x5a <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

and:

2011-09-01 19:47:14 oss-scratch14 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 LustreError: dumping log to /lustre-tmp/oss-scratch14.1314928034.10857 <kern.alert>
2011-09-01 19:47:14 oss-scratch14 Lustre: scratch1-OST0034: slow start_page_write 600s due to heavy IO load <kern.warning>
2011-09-01 19:47:14 oss-scratch14 Lustre: Service thread pid 11729 was inactive for 600.00s. The thread might be hung, or it might
only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
2011-09-01 19:47:14 oss-scratch14 Pid: 11729, comm: ll_ost_io_340 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 Call Trace: <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff80067b32>] __down+0xc5/0xd9 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

and:

2011-09-02 10:28:09 oss-scratch14 Lustre: Skipped 11 previous similar messages <kern.warning>
2011-09-02 10:28:09 oss-scratch14 Lustre: scratch1-OST0034: Client fb914147-23d5-58f1-2781-03d1a9f96701 (at 10.1.3.226@o2ib) refuse
d reconnection, still busy with 1 active RPCs <kern.warning>

When a particular server starts getting bound we can see the load average go up to greater that 700 (on an 8 core server).

Another point: We also implemented quotas as a means to quickly see disk usage (to replace du as the defacto method), but when problems presented themselves we removed quotas from the configuration (i.e., used tunefs.lustre to reset the parameters). There is a question as to whether the quota files that may persist on the block devices may be contributing. I did see some discussion on lustre-discuss with regard to these files but there was no real information on impact if quotas were disabled or how to remove them it there was.

I have also attached one of the many lustre dumps. Hopefully this information is sufficient to at least set a starting point for analyzing the problem.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

dmesg
121 kB
03/Oct/11 2:57 PM
oss-scratch16.1314981155.10959
27 kB
02/Sep/11 2:36 PM
slabinfo
17 kB
03/Oct/11 2:57 PM

Issue Links

Trackbacks

Lustre 1.8.x known issues tracker While testing against Lustre b18 branch, we would hit known bugs which were already reported in Lustre Bugzilla https://bugzilla.lustre.org/. In order to move away from relying on Bugzilla, we would create a JIRA

Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates