Lustre / LU-659

Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)

Details

    • Epic
    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Since upgrading from TOSS-1.3.4 we have been experiencing MAJOR problems with stability. The one common denominator seems to be latency between Lustre clients and servers. Both servers and clients are dumping lots of syslog/dmesg messages about timeouts and heavy IO loads. In particular, we are seeing large volumes of messages like this from clients:

      2011-09-02 10:18:49 rs249 INFO: task xsolver:23906 blocked for more than 120 seconds. <kern.err>
      2011-09-02 10:18:49 rs249 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <kern.err>
      2011-09-02 10:18:49 rs249 xsolver D ffff81000101d640 0 23906 23896 23915 23905 (NOTLB) <kern.warning>
      2011-09-02 10:18:49 rs249 ffff810162bc9be8 0000000000000046 0000000000000000 0000000000400000 <kern.warning>
      2011-09-02 10:18:49 rs249 ffff8101d9eff000 0000000000000007 ffff8101eb2ce7f0 ffff81020554c7f0 <kern.warning>
      2011-09-02 10:18:49 rs249 0000460dc26b8914 000000000000b9f5 ffff8101eb2ce9d8 00000003de55f1e8 <kern.warning>
      2011-09-02 10:18:49 rs249 Call Trace: <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8002960b>] sync_page+0x0/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066812>] io_schedule+0x3f/0x63 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80029649>] sync_page+0x3e/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80041222>] __lock_page+0x64/0x6b <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800140ac>] find_lock_page+0x69/0xa3 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c468d>] :lustre:ll_file_readv+0xbcd/0x2100 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888f60f8>] :lustre:ll_stats_ops_tally+0x48/0xf0 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c5bde>] :lustre:ll_file_read+0x1e/0x20 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8000b80f>] vfs_read+0xcc/0x172 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80011fef>] sys_read+0x47/0x6f <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80060116>] system_call+0x7e/0x83 <kern.warning>

      and:

      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:2984:kiblnd_check_txs()) Timed out tx: tx_queue, 106 seconds <kern.err>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3001:kiblnd_conn_timed_out()) Timed out RDMA on queue ibc_tx_queue (sends that need a credit) <kern.err>
      2011-09-02 03:36:28 rs2166 Lustre: conn[29] ffff81037c66e1c0 [version 12] -> 10.1.36.9@o2ib: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: state 3 nposted 1/1 cred 0 o_cred 0 r_cred 8 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ready 0 scheduled -1 comms_err 0 last_send 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: early_rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_nocred: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_rsrvd: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790d10 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3368 msg !- type d1 cred 2 aqt 10339d4e1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fd70 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3369 msg !- type d1 cred 0 aqt 10339d3ea <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790798 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c4 msg !- type d1 cred 1 aqt 1033a9074 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078eb78 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c5 msg !- type d1 cred 0 aqt 1033a851c <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078ffc8 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3dd9 msg !- type d1 cred 2 aqt 1033b31b1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078dca0 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3ddb msg !- type d1 cred 0 aqt 1033b2a31 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010791350 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bb msg !- type d1 cred 1 aqt 1033bff6d <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fb18 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bc msg !- type d1 cred 1 aqt 1033bf8fc <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: active_txs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790860 snd 1 q 0 w 0 rc 0 dl 1033e9385 cookie 0xd4fcd msg – type d0 cred 0 aqt 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd000 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd068 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd0d0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd138 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd1a0 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd208 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd270 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd2d8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd340 status 0 msg_type d0 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd3a8 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd410 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd478 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd4e0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd548 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd5b0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd618 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd680 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd6e8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qp_state 3 cur_qp_state 3 mtu 4 mig_state 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp_cap: swr 4096 rwr 32 ssge 1 rsge 1 inline 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr : dlid 21 sl 0 s_p_bits 0 rate 2 flags 0 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr(alt): dlid 12433 sl 14 s_p_bits 24 rate 0 flags 1 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: pkey 0 alt_pkey 17 en 3 sq 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: max_rd 1 max_dest 1 min_rnr 27 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: timeout 19 retry 5 rnr_re 6 alt_port 1 alt_timeout 14 <kern.info>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3079:kiblnd_check_conns()) Timed out RDMA with 10.1.36.9@o2ib (0) <kern.err>

      On the server side we are seeing a lot of messages similar to this:

      2011-09-02 05:30:59 oss-scratch14 Lustre: Service thread pid 11707 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Pid: 11707, comm: ll_ost_io_318 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Call Trace: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80066031>] thread_return+0x5e/0xf6 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b67>] default_wake_function+0xd/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800532b5>] __wake_up_locked+0x13/0x15 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8029d1c9>] __down_trylock+0x1c/0x5a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 LustreError: dumping log to /lustre-tmp/oss-scratch14.1314928034.10857 <kern.alert>
      2011-09-01 19:47:14 oss-scratch14 Lustre: scratch1-OST0034: slow start_page_write 600s due to heavy IO load <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Lustre: Service thread pid 11729 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Pid: 11729, comm: ll_ost_io_340 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Call Trace: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80067b32>] __down+0xc5/0xd9 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-02 10:28:09 oss-scratch14 Lustre: Skipped 11 previous similar messages <kern.warning>
      2011-09-02 10:28:09 oss-scratch14 Lustre: scratch1-OST0034: Client fb914147-23d5-58f1-2781-03d1a9f96701 (at 10.1.3.226@o2ib) refused reconnection, still busy with 1 active RPCs <kern.warning>

      When a particular server starts getting bound up we can see the load average go up to greater than 700 (on an 8-core server).

      Another point: we also implemented quotas as a means to quickly see disk usage (to replace du as the de facto method), but when the problems presented themselves we removed quotas from the configuration (i.e., used tunefs.lustre to reset the parameters). There is a question as to whether the quota files that may persist on the block devices are contributing. I did see some discussion on lustre-discuss with regard to these files, but there was no real information on their impact when quotas are disabled, or on how to remove them if there is one.
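
      For reference, a rough sketch of how the on-disk parameters can be inspected and reset with tunefs.lustre; the device path and MGS NID below are placeholders rather than values from this system, and clearing parameters this way does not by itself remove any quota files already created on the OSTs:

      # Show what is currently stored on an OST device without changing anything:
      tunefs.lustre --dryrun /dev/mdX

      # One way to drop the 1.8-style quota_type setting: erase all stored parameters,
      # re-add the ones still required, and regenerate the config logs on next mount.
      tunefs.lustre --erase-params --param mgsnode=<mgs-nid>@o2ib --writeconf /dev/mdX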

      I have also attached one of the many lustre dumps. Hopefully this information is sufficient to at least set a starting point for analyzing the problem.

      Attachments

        1. dmesg
          121 kB
        2. oss-scratch16.1314981155.10959
          27 kB
        3. slabinfo
          17 kB


          Activity

            [LU-659] Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)
            mdiep Minh Diep added a comment -

            Just FYI, we no longer see the following watchdog stack when we disable the OSS read cache:

            2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
            2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
            2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
            2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
            2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
            2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
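
            For anyone following along, a minimal sketch of how the OSS read cache can be turned off on a 1.8 OSS; the tunable names below (read_cache_enable, writethrough_cache_enable) are the usual obdfilter proc entries, so verify they exist on your build before relying on this:

            # Run on each OSS. Disables the read cache and the writethrough cache
            # for every OST served by this node (1.8-style obdfilter tunables).
            lctl set_param obdfilter.*.read_cache_enable=0
            lctl set_param obdfilter.*.writethrough_cache_enable=0

            # Equivalent direct /proc form, if preferred:
            for f in /proc/fs/lustre/obdfilter/*/read_cache_enable; do echo 0 > $f; done
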

            pjones Peter Jones added a comment -

            ok Joe. Thanks for the update. We'll close this ticket for now and either reopen it or open a new ticket if you have some more information in the future.

            jamervi Joe Mervini added a comment -

            Man - after a long day you don't notice all your typos until you see them posted. Hopefully you get the gist of my message. If not, or if there are any questions, just let me know.

            jamervi Joe Mervini added a comment -

            Sorry for the slow response. Basically we had been running for most of the life of this file system with read caching turned off because of earlier 1.8 issues. We re-enabled it for a brief time before we started encountering these problems, and part of our troubleshooting effort was to return the system to its previous environmental conditions (with the exception of changes due to the OS and other upgrades mentioned above). It didn't have any impact.

            We now believe that a combination of user jobs/code (most likely single node) and software RAID is creating a deadlock condition that impacts the entire system catastrophically. Because it renders the system unusable, we have opted to sideline this particular file system and are using our global scratch file system to handle that need.

            I think for the time being we should do the same with this ticket. We have seen evidence that the same problem might be occurring in another fashion, and it may be something for which we can create a reproducer. If we can, I will either reopen this case or create a new one, depending on how you would like to approach it.

            Thanks.


            johann Johann Lombardi (Inactive) added a comment -

            > Have you tried disabling the OSS read cache?

            I suggested this a while back (05/Sep/11 8:01 AM), however I don't know whether or not it has been tried.

            mdiep Minh Diep added a comment -

            Hi Joe,

            Have you tried disabling the OSS read cache?

            jamervi Joe Mervini added a comment -

            Thanks Oleg. I'll give that a try. Also, I know that killing off OST threads is not a good idea, but I was hoping that if I were able to, I could find the node or job that was creating the problem.

            green Oleg Drokin added a comment -

            It does look like high memory pressure on the node.
            In the backtraces you provided I can only see ll_ost_io_243 in the memory-pressure path, but not ll_ost_io_94 or kjournald, so I don't know what the other two are doing.

            From the slabinfo it looks like you have 400k locks or so too.
            Any chance you can limit your clients to a certain preset number of locks and see what the impact would be?

            Just echo some number like 600 to /proc/fs/lustre/ldlm/namespaces/*osc*/lru_size on all clients to achieve that.

            Also, it is a bad idea to try to kill OST threads, so please don't do it (it won't work anyway).
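
            As a concrete illustration of the suggestion above (a sketch only; pdsh and the "clients" host group are site-specific assumptions, not part of this report):

            # Cap each client-side osc lock namespace at 600 locks.
            # On a single client:
            lctl set_param ldlm.namespaces.*osc*.lru_size=600

            # Pushed to all clients with a parallel shell such as pdsh:
            pdsh -g clients 'for n in /proc/fs/lustre/ldlm/namespaces/*osc*/lru_size; do echo 600 > $n; done'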

            jamervi Joe Mervini added a comment -

            We are experiencing another interruption due to high load averages on another one of our file systems. The parameters are essentially the same with the exception that this file system is running RAID10 OSTs vs. RAID6.

            The predominant factors are that the load is above 150 and that a couple of ll_ost_io threads and a kjournald thread are eating up a large percentage of the CPU:

            top - 12:40:17 up 31 days, 4:50, 3 users, load average: 151.00, 150.86, 150.49
            Tasks: 1176 total, 2 running, 1174 sleeping, 0 stopped, 0 zombie
            Cpu(s): 0.0%us, 23.7%sy, 0.0%ni, 75.5%id, 0.6%wa, 0.0%hi, 0.1%si, 0.0%st
            Mem: 11814704k total, 11753832k used, 60872k free, 355376k buffers
            Swap: 0k total, 0k used, 0k free, 8289304k cached

            PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
            7747 root 11 -5 0 0 0 D 63.1 0.0 452:00.23 kjournald
            8004 root 15 0 0 0 0 D 62.8 0.0 352:06.50 ll_ost_io_94
            10329 root 15 0 0 0 0 R 61.5 0.0 397:11.57 ll_ost_io_243
            24954 root 16 0 11664 2096 888 S 0.7 0.0 0:02.97 top
            8083 root 10 -5 0 0 0 S 0.6 0.0 240:03.48 md2_raid10
            7569 root 10 -5 0 0 0 S 0.5 0.0 239:52.87 md1_raid10

            I have captured the outputs from sysrq -t and -m as well as the slabinfo and have them attached. I have tried unsuccessfully to kill the ll_ost_io threads. I saw some mention of an OOM killer but it looks like that is only in 2.1. At this point I need to reboot the system to restore services. Hopefully the information provided can give some insight into what is happening.
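
            For reference, a sketch of how those captures are typically taken (assuming sysrq is enabled via sysctl kernel.sysrq=1; the output file names are just examples):

            # Dump task states (sysrq-t) and memory info (sysrq-m) into the kernel log,
            # then save the log and a slab snapshot:
            echo t > /proc/sysrq-trigger
            echo m > /proc/sysrq-trigger
            dmesg > /tmp/oss-sysrq.txt
            cat /proc/slabinfo > /tmp/oss-slabinfo.txt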


            adilger Andreas Dilger added a comment -

            > Is it possible to assign LNET its own P_KEY when running over IB? The idea is that if we isolate LNET traffic to its own lane and QOS we can see whether it is causing contention on the fabric.

            I believe this is possible, and there was some discussion/patches about this a year ago on lustre-discuss, IIRC. I don't know what the state of that is today, however. I've CC'd Liang to see if he recalls the details.
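
            One way this is sometimes set up, sketched here under the assumption that the subnet manager already defines a separate partition (e.g. P_Key 0x8001) with its own QoS/SL settings, is to bind an IPoIB child interface to that partition and point LNet at it:

            # Create an IPoIB child interface bound to partition 0x8001 (placeholder value)
            # and give it an address on a dedicated IPoIB subnet:
            echo 0x8001 > /sys/class/net/ib0/create_child
            ifconfig ib0.8001 192.168.36.9 netmask 255.255.255.0 up   # example addressing only

            # Then tell LNet to use that interface (in /etc/modprobe.conf or modprobe.d):
            # options lnet networks="o2ib0(ib0.8001)"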


            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: jamervi Joe Mervini
              Votes: 0
              Watchers: 16
