
Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)

Details

    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Since upgrading from TOSS-1.3.4 we have been experiencing major stability problems. The one common denominator appears to be latency between Lustre clients and servers. Both servers and clients are dumping large numbers of syslog/dmesg messages about timeouts and heavy IO load. In particular, we are seeing large volumes of messages like this from clients:

      2011-09-02 10:18:49 rs249 INFO: task xsolver:23906 blocked for more than 120 seconds. <kern.err>
      2011-09-02 10:18:49 rs249 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <kern.err>
      2011-09-02 10:18:49 rs249 xsolver D ffff81000101d640 0 23906 23896 23915 23905 (NOTLB) <kern.warning>
      2011-09-02 10:18:49 rs249 ffff810162bc9be8 0000000000000046 0000000000000000 0000000000400000 <kern.warning>
      2011-09-02 10:18:49 rs249 ffff8101d9eff000 0000000000000007 ffff8101eb2ce7f0 ffff81020554c7f0 <kern.warning>
      2011-09-02 10:18:49 rs249 0000460dc26b8914 000000000000b9f5 ffff8101eb2ce9d8 00000003de55f1e8 <kern.warning>
      2011-09-02 10:18:49 rs249 Call Trace: <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8002960b>] sync_page+0x0/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066812>] io_schedule+0x3f/0x63 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80029649>] sync_page+0x3e/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80041222>] __lock_page+0x64/0x6b <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800140ac>] find_lock_page+0x69/0xa3 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c468d>] :lustre:ll_file_readv+0xbcd/0x2100 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888f60f8>] :lustre:ll_stats_ops_tally+0x48/0xf0 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c5bde>] :lustre:ll_file_read+0x1e/0x20 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8000b80f>] vfs_read+0xcc/0x172 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80011fef>] sys_read+0x47/0x6f <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80060116>] system_call+0x7e/0x83 <kern.warning>

      and:

      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:2984:kiblnd_check_txs()) Timed out tx: tx_queue, 106 seconds <kern.err>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3001:kiblnd_conn_timed_out()) Timed out RDMA on queue ibc_tx_queue (sends that need a credit) <kern.err>
      2011-09-02 03:36:28 rs2166 Lustre: conn[29] ffff81037c66e1c0 [version 12] -> 10.1.36.9@o2ib: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: state 3 nposted 1/1 cred 0 o_cred 0 r_cred 8 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ready 0 scheduled -1 comms_err 0 last_send 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: early_rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_nocred: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_rsrvd: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790d10 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3368 msg !- type d1 cred 2 aqt 10339d4e1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fd70 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3369 msg !- type d1 cred 0 aqt 10339d3ea <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790798 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c4 msg !- type d1 cred 1 aqt 1033a9074 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078eb78 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c5 msg !- type d1 cred 0 aqt 1033a851c <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078ffc8 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3dd9 msg !- type d1 cred 2 aqt 1033b31b1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078dca0 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3ddb msg !- type d1 cred 0 aqt 1033b2a31 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010791350 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bb msg !- type d1 cred 1 aqt 1033bff6d <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fb18 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bc msg !- type d1 cred 1 aqt 1033bf8fc <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: active_txs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790860 snd 1 q 0 w 0 rc 0 dl 1033e9385 cookie 0xd4fcd msg -- type d0 cred 0 aqt 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd000 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd068 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd0d0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd138 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd1a0 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd208 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd270 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd2d8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd340 status 0 msg_type d0 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd3a8 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd410 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd478 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd4e0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd548 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd5b0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd618 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd680 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd6e8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qp_state 3 cur_qp_state 3 mtu 4 mig_state 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp_cap: swr 4096 rwr 32 ssge 1 rsge 1 inline 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr : dlid 21 sl 0 s_p_bits 0 rate 2 flags 0 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr(alt): dlid 12433 sl 14 s_p_bits 24 rate 0 flags 1 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: pkey 0 alt_pkey 17 en 3 sq 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: max_rd 1 max_dest 1 min_rnr 27 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: timeout 19 retry 5 rnr_re 6 alt_port 1 alt_timeout 14 <kern.info>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3079:kiblnd_check_conns()) Timed out RDMA with 10.1.36.9@o2ib (0) <kern.err>

      On the server side we are seeing a lot of messages similar to this:

      2011-09-02 05:30:59 oss-scratch14 Lustre: Service thread pid 11707 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Pid: 11707, comm: ll_ost_io_318 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Call Trace: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80066031>] thread_return+0x5e/0xf6 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b67>] default_wake_function+0xd/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800532b5>] __wake_up_locked+0x13/0x15 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8029d1c9>] __down_trylock+0x1c/0x5a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 LustreError: dumping log to /lustre-tmp/oss-scratch14.1314928034.10857 <kern.alert>
      2011-09-01 19:47:14 oss-scratch14 Lustre: scratch1-OST0034: slow start_page_write 600s due to heavy IO load <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Lustre: Service thread pid 11729 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Pid: 11729, comm: ll_ost_io_340 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Call Trace: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80067b32>] __down+0xc5/0xd9 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-02 10:28:09 oss-scratch14 Lustre: Skipped 11 previous similar messages <kern.warning>
      2011-09-02 10:28:09 oss-scratch14 Lustre: scratch1-OST0034: Client fb914147-23d5-58f1-2781-03d1a9f96701 (at 10.1.3.226@o2ib) refused reconnection, still busy with 1 active RPCs <kern.warning>

      When a particular server starts getting bound up, we can see the load average go up to greater than 700 (on an 8-core server).

      Another point: we also implemented quotas as a means to quickly see disk usage (to replace du as the de facto method), but when problems presented themselves we removed quotas from the configuration (i.e., used tunefs.lustre to reset the parameters). There is a question as to whether quota files that may persist on the block devices could be contributing. I did see some discussion on lustre-discuss with regard to these files, but there was no real information on their impact if quotas were disabled, or on how to remove them if there was.
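
      For reference, a minimal sketch of how the stored quota parameters could be inspected and, if necessary, cleared with tunefs.lustre; the /dev/md0 device path and the ost.quota_type parameter name are illustrative, and the target must be unmounted:

      # Show the parameters currently stored on an OST/MDT target
      tunefs.lustre --dryrun /dev/md0     # check whether quota_type still appears in the "Parameters:" line
      # If a stale quota parameter is still present, erase the old parameters and
      # re-apply only the ones that should remain, for example:
      #   tunefs.lustre --erase-params --param="<params to keep>" /dev/md0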

      I have also attached one of the many Lustre debug dumps. Hopefully this information is sufficient to at least provide a starting point for analyzing the problem.

      Attachments

        1. dmesg
          121 kB
        2. oss-scratch16.1314981155.10959
          27 kB
        3. slabinfo
          17 kB


          Activity

            [LU-659] Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)

            adilger Andreas Dilger added a comment -

            If there is no swap on the nodes, then the kernel has to work much harder in case of memory pressure. If you are booting diskless I would still recommend having a small swap partition on the server nodes.

            The pauses during boot may well relate to this, if the boot code is doing any upcalls to load configs or firmware.
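
            A minimal sketch of adding a small local swap partition on an OSS, assuming a spare partition on the local disk is available (/dev/sdX9 is a placeholder):

            mkswap /dev/sdX9       # initialize the partition as swap
            swapon /dev/sdX9       # enable it immediately
            # make it persistent across reboots, via /etc/fstab or the nodeup scripts:
            echo "/dev/sdX9  swap  swap  defaults  0 0" >> /etc/fstab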
            jamervi Joe Mervini added a comment -

            Andreas, at this point we have decided to pull back on any modifications to the system in favor of trying to isolate the problem by removing the local software raid lustre scratch file system from the equation and having users run out of our site file system. I think for now we should plan on taking this up again Tuesday, as this is a long holiday weekend for us. But I was wondering if you had any opinion on the blocked tasks during boot up? One thing that we failed to mention that is relevant to the environment is that the system is essentially a diskless configuration where the OS is delivered to all nodes - with the exception of bootnodes - via NFS over IB.


            adilger Andreas Dilger added a comment -

            Joe, if you build your own Lustre packages, you could first try running configure with the "--disable-lru-resize" option (along with any other options you normally use) and installing this on some/all of the OSS nodes.
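
            A rough sketch of what that build sequence might look like, assuming a locally built tree (source path and kernel location are placeholders):

            cd lustre-1.8.5
            ./configure --with-linux=/usr/src/kernels/<kernel-version> --disable-lru-resize
            make rpms        # build RPMs to install on the OSS nodes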
            pjones Peter Jones added a comment -

            Niu

            Could you please help out with this one?

            Peter


            adilger Andreas Dilger added a comment -

            Joe, I'm assuming that since you run the chaos release this is the same code that is being run at LLNL? Is this problem specific to a particular user job, or is it happening regardless of the type of job being run?

            If I provided you with a Lustre patch to avoid the allocations recursing into the DLM, can you build it, or do you normally get pre-built RPMs?

            The other question is whether it would be worthwhile to try the 1.8.6.wc1 release on a handful of OSS nodes to see if it avoids the problem?
            jamervi Joe Mervini added a comment -

            The scratch file system was only usable for a short period of time after I posted my last remarks (I was away for several hours but others who were monitoring the system reported hung clients and load averages on the servers in the hundreds.) So I think it is safe to say that turning off read cache provided no relief.

            jamervi Joe Mervini added a comment -

            Andreas, the short answer is: yes, the OSSs were still hanging. Today I preemptively restarted all the OSS servers, disabling read_cache_enable at mount time. We'll see if there are any positive effects.

            One thing that I noticed that I hadn't before is that I am also seeing blocked tasks on the OSS nodes during boot (below). Since we do not start either lustre or mdadm automatically (we do it via nodeup scripts), I personally cannot see how either is contributing to these messages being generated. But that is beyond my understanding of both linux and lustre internals.

            INFO: task modprobe:1778 blocked for more than 120 seconds.
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            modprobe D ffff810001004440 0 1778 1742 (NOTLB)
            ffff8101dcf0fd78 0000000000000046 ffff81025e5eefff ffff8101de3de678
            0000000000000001 0000000000000007 ffff8101ff54b040 ffffffff80324b60
            00000055e1c307ae 0000000000002c68 ffff8101ff54b228 0000000080091f8c
            Call Trace:
            [<ffffffff80067b32>] __down+0xc5/0xd9
            [<ffffffff80093b5a>] default_wake_function+0x0/0xf
            [<ffffffff800677dd>] __down_failed+0x35/0x3a
            [<ffffffff801dc223>] __driver_attach+0x0/0xb6
            [<ffffffff801dc2c2>] __driver_attach+0x9f/0xb6
            [<ffffffff801dc223>] __driver_attach+0x0/0xb6
            [<ffffffff801db9e9>] bus_for_each_dev+0x49/0x7a
            [<ffffffff801dc051>] driver_attach+0x1c/0x1e
            [<ffffffff801db5ee>] bus_add_driver+0x78/0x116
            [<ffffffff801dc4fc>] driver_register+0x8f/0x93
            [<ffffffff8016dff8>] __pci_register_driver+0x5a/0xb7
            [<ffffffff882f4017>] :pata_acpi:pacpi_init+0x17/0x19
            [<ffffffff800aeb64>] sys_init_module+0xb0/0x1f5
            [<ffffffff80060116>] system_call+0x7e/0x83

            INFO: task modprobe:1901 blocked for more than 120 seconds.
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            modprobe D ffff810205462040 0 1901 1809 (NOTLB)
            ffff8101dce3fd58 0000000000000046 00000002fe821000 0000000000000002
            00000006dce3fd48 0000000000000007 ffff8101fe24c7f0 ffff8102056bb7f0
            00000055e16baed4 0000000000002556 ffff8101fe24c9d8 0000000600000096
            Call Trace:
            [<ffffffff80067b32>] __down+0xc5/0xd9
            [<ffffffff80093b5a>] default_wake_function+0x0/0xf
            [<ffffffff800677dd>] __down_failed+0x35/0x3a
            [<ffffffff801dc2c2>] __driver_attach+0x9f/0xb6
            [<ffffffff801dc223>] __driver_attach+0x0/0xb6
            [<ffffffff801db9e9>] bus_for_each_dev+0x49/0x7a
            [<ffffffff801dc051>] driver_attach+0x1c/0x1e
            [<ffffffff801db5ee>] bus_add_driver+0x78/0x116
            [<ffffffff801dc4fc>] driver_register+0x8f/0x93
            [<ffffffff8016dff8>] __pci_register_driver+0x5a/0xb7
            [<ffffffff883d5074>] :i7core_edac:i7core_init+0x74/0x98
            [<ffffffff800aeb64>] sys_init_module+0xb0/0x1f5
            [<ffffffff80060116>] system_call+0x7e/0x83


            adilger Andreas Dilger added a comment -

            Joe, when you write "did not provide any positive results" does that mean the OSS nodes are still hanging? Had they already dumped stacks (watchdog stack dumps due to inactivity, with ldlm_pools_srv_shrink() on the stack)? If that is the case, then disabling read cache would be too late, and would not "unblock" the OSS. Any OSS nodes that are already hung need to be rebooted, and the "lctl set_param" (or equivalent "echo 0 >") needs to be run right after boot.
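
            A minimal sketch of that early-boot step on each OSS; parameter names are as in 1.8.x, and the writethrough line is an optional addition:

            # disable the OSS read cache right after boot, before load builds up
            lctl set_param obdfilter.*.read_cache_enable=0
            lctl set_param obdfilter.*.writethrough_cache_enable=0    # optionally skip cache on writes too
            # equivalent /proc form:
            #   for f in /proc/fs/lustre/obdfilter/*/read_cache_enable; do echo 0 > $f; done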
            jamervi Joe Mervini added a comment -

            Andreas - I meant to mention in my last post that I did turn off read_cache_enable by poking the proc entry via "cat 0 > /proc/fs/lustre/obdfilter/<OST>/read_cache_enable". I am assuming this accomplishes the same thing as "lctl set_param obdfilter.*.readcache_enable=0". Doing this on the live file system did not provide any positive results.

            I also checked the state of /proc/fs/lustre/ldlm/namespaces/filter-<OST>/max_nolock_bytes on each of the 4 OSTs and they are all set to 0.

            pjones Peter Jones added a comment -

            Johann,

            Can you please comment?

            Thanks

            Peter


            adilger Andreas Dilger added a comment -

            Doh, 1.8.3, if I actually open my eyes...

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: jamervi Joe Mervini
              Votes: 0
              Watchers: 16
