LU-659

Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)

Details

    • Epic
    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Since upgrading from TOSS-1.3.4 we have been experiencing MAJOR problems with stability. The one common denominator seems to be latency between Lustre clients and servers. Both servers and clients are dumping lots of syslog/dmesg messages about timeouts and heavy IO loads. In particular, we are seeing large volumes of messages like this from clients:

      2011-09-02 10:18:49 rs249 INFO: task xsolver:23906 blocked for more than 120 seconds. <kern.err>
      2011-09-02 10:18:49 rs249 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <kern.err>
      2011-09-02 10:18:49 rs249 xsolver D ffff81000101d640 0 23906 23896 23915 23905 (NOTLB) <kern.warning>
      2011-09-02 10:18:49 rs249 ffff810162bc9be8 0000000000000046 0000000000000000 0000000000400000 <kern.warning>
      2011-09-02 10:18:49 rs249 ffff8101d9eff000 0000000000000007 ffff8101eb2ce7f0 ffff81020554c7f0 <kern.warning>
      2011-09-02 10:18:49 rs249 0000460dc26b8914 000000000000b9f5 ffff8101eb2ce9d8 00000003de55f1e8 <kern.warning>
      2011-09-02 10:18:49 rs249 Call Trace: <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8002960b>] sync_page+0x0/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066812>] io_schedule+0x3f/0x63 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80029649>] sync_page+0x3e/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80041222>] __lock_page+0x64/0x6b <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800140ac>] find_lock_page+0x69/0xa3 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c468d>] :lustre:ll_file_readv+0xbcd/0x2100 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888f60f8>] :lustre:ll_stats_ops_tally+0x48/0xf0 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c5bde>] :lustre:ll_file_read+0x1e/0x20 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8000b80f>] vfs_read+0xcc/0x172 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80011fef>] sys_read+0x47/0x6f <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80060116>] system_call+0x7e/0x83 <kern.warning>

      and:

      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:2984:kiblnd_check_txs()) Timed out tx: tx_queue, 106 seconds <kern.err>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3001:kiblnd_conn_timed_out()) Timed out RDMA on queue ibc_tx_queue (sends that need a credit) <kern.err>
      2011-09-02 03:36:28 rs2166 Lustre: conn[29] ffff81037c66e1c0 [version 12] -> 10.1.36.9@o2ib: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: state 3 nposted 1/1 cred 0 o_cred 0 r_cred 8 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ready 0 scheduled -1 comms_err 0 last_send 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: early_rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_nocred: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_rsrvd: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790d10 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3368 msg !- type d1 cred 2 aqt 10339d4e1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fd70 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3369 msg !- type d1 cred 0 aqt 10339d3ea <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790798 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c4 msg !- type d1 cred 1 aqt 1033a9074 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078eb78 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c5 msg !- type d1 cred 0 aqt 1033a851c <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078ffc8 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3dd9 msg !- type d1 cred 2 aqt 1033b31b1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078dca0 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3ddb msg !- type d1 cred 0 aqt 1033b2a31 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010791350 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bb msg !- type d1 cred 1 aqt 1033bff6d <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fb18 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bc msg !- type d1 cred 1 aqt 1033bf8fc <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: active_txs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790860 snd 1 q 0 w 0 rc 0 dl 1033e9385 cookie 0xd4fcd msg – type d0 cred 0 aqt 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd000 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd068 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd0d0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd138 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd1a0 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd208 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd270 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd2d8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd340 status 0 msg_type d0 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd3a8 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd410 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd478 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd4e0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd548 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd5b0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd618 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd680 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd6e8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qp_state 3 cur_qp_state 3 mtu 4 mig_state 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp_cap: swr 4096 rwr 32 ssge 1 rsge 1 inline 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr : dlid 21 sl 0 s_p_bits 0 rate 2 flags 0 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr(alt): dlid 12433 sl 14 s_p_bits 24 rate 0 flags 1 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: pkey 0 alt_pkey 17 en 3 sq 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: max_rd 1 max_dest 1 min_rnr 27 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: timeout 19 retry 5 rnr_re 6 alt_port 1 alt_timeout 14 <kern.info>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3079:kiblnd_check_conns()) Timed out RDMA with 10.1.36.9@o2ib (0) <kern.err>

      On the server side we are seeing a lot of messages similar to this:

      2011-09-02 05:30:59 oss-scratch14 Lustre: Service thread pid 11707 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Pid: 11707, comm: ll_ost_io_318 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Call Trace: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80066031>] thread_return+0x5e/0xf6 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b67>] default_wake_function+0xd/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800532b5>] __wake_up_locked+0x13/0x15 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8029d1c9>] __down_trylock+0x1c/0x5a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 LustreError: dumping log to /lustre-tmp/oss-scratch14.1314928034.10857 <kern.alert>
      2011-09-01 19:47:14 oss-scratch14 Lustre: scratch1-OST0034: slow start_page_write 600s due to heavy IO load <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Lustre: Service thread pid 11729 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Pid: 11729, comm: ll_ost_io_340 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Call Trace: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80067b32>] __down+0xc5/0xd9 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-02 10:28:09 oss-scratch14 Lustre: Skipped 11 previous similar messages <kern.warning>
      2011-09-02 10:28:09 oss-scratch14 Lustre: scratch1-OST0034: Client fb914147-23d5-58f1-2781-03d1a9f96701 (at 10.1.3.226@o2ib) refused reconnection, still busy with 1 active RPCs <kern.warning>

      When a particular server starts getting bound up we can see the load average climb to greater than 700 (on an 8-core server).

      Another point: we also implemented quotas as a means to quickly see disk usage (to replace du as the de facto method), but when the problems presented themselves we removed quotas from the configuration (i.e., used tunefs.lustre to reset the parameters). There is a question as to whether the quota files that may persist on the block devices could be contributing. I did see some discussion on lustre-discuss regarding these files, but there was no real information on their impact when quotas are disabled, or on how to remove them if there is an impact.
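
      As a side note on the leftover-quota question, here is a minimal sketch (not from this ticket) of how the on-disk parameters could be inspected without changing anything; the device path is hypothetical:

        # Run on a server with the target unmounted; --dryrun only prints the stored
        # configuration, so any lingering quota_type parameters left behind by the
        # earlier setup would show up in the "Parameters:" line without touching the disk.
        tunefs.lustre --dryrun /dev/sdX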

      I have also attached one of the many lustre dumps. Hopefully this information is sufficient to at least set a starting point for analyzing the problem.

      Attachments

        1. dmesg (121 kB)
        2. oss-scratch16.1314981155.10959 (27 kB)
        3. slabinfo (17 kB)


          Activity

            [LU-659] Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)
            jamervi Joe Mervini added a comment -

            During our dedicated application time we were able to recreate the high load and Lustre hang condition with a workload limited to 4 user jobs, with the suspicion that one or both of two codes is the trigger. Other fallout from this hang condition that wasn't mentioned before is that we began having problems booting servers and client nodes via gPXE over IB, and when the problem first arose, a complete system restart, including a flush of the subnet manager, was required to get the system operational again.

            Today we are going to try to duplicate the condition on our test platform. If we are successful, we will try to zero in on which code is the culprit. Otherwise we suspect it might be a scaling issue.

            One question that came up in a meeting this morning: is it possible to assign LNET its own P_KEY when running over IB? The idea is that if we isolate LNET traffic to its own lane and QOS we can see whether it is causing contention on the fabric.
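
            For illustration only, a rough sketch of how that kind of isolation is usually done, i.e. binding LNET to an IPoIB child interface on a dedicated partition rather than through any LNET-native P_KEY option; the pkey value, interface name, and address are hypothetical, and the partition itself would have to be defined on the subnet manager:

                # Create an IPoIB child interface tied to partition key 0x8001
                # (full-membership bit set) and bring it up on its own subnet.
                echo 0x8001 > /sys/class/net/ib0/create_child
                ifconfig ib0.8001 10.2.36.9 netmask 255.255.0.0 up

                # Then point LNET at that interface in /etc/modprobe.conf so the
                # o2ib traffic stays on the dedicated partition:
                #   options lnet networks="o2ib0(ib0.8001)"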


            smonk Steve Monk (Inactive) added a comment -

            FYI: To allow our users to get some work done, we are limiting the use of this Red Sky local Lustre file system in favor of our site Lustre file system. We plan on testing the fixes mentioned next week as part of a maintenance window and will report back accordingly.
            Thanks,
            Steve


            niu Niu Yawei (Inactive) added a comment -

            > the MDS was getting buried by multiple user jobs (same user) running chgrp recursively on a directory with ~4.5M files/directories

            Hi, Joe

            How many clients are performing such jobs? Are they working on the same directory? What are the client's and the MDS's memory sizes? Did you run the same jobs and hit the same problem before upgrading?

            Now I'm suspecting that with lru-resize enabled the client can cache many more DLM locks; although the MDS can tell the client to cancel locks when it is getting overloaded, the client might not be able to respond quickly enough to relieve the situation.

            If possible, it would be best to get the full stack trace on the MDS and the statistics from /proc/fs/lustre/ldlm/namespaces/xxx/* and pool on both the client and the MDS when you hit the problem again. Thank you.
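
            A small collection sketch for the data being asked for above, assuming the stock proc layout on 1.8; the output path and the exact file names under each namespace are assumptions:

                # Snapshot the DLM namespace and pool statistics on a client or the MDS.
                OUT=/tmp/ldlm.$(hostname).$(date +%s); mkdir -p $OUT
                for f in /proc/fs/lustre/ldlm/namespaces/*/lock_count \
                         /proc/fs/lustre/ldlm/namespaces/*/lru_size \
                         /proc/fs/lustre/ldlm/namespaces/*/pool/state; do
                    echo "== $f" >> $OUT/ldlm.txt
                    cat "$f" >> $OUT/ldlm.txt 2>/dev/null
                done

                # Full stack traces of all tasks on the MDS end up in dmesg/syslog.
                echo t > /proc/sysrq-trigger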

            jamervi Joe Mervini added a comment -

            The other LU is LU-667


            niu Niu Yawei (Inactive) added a comment -

            > Niu, we are starting to look at user(s) jobs as the source of the problem. We had a similar problem that occurred on another one of our Lustre file systems where the MDS was getting buried by multiple user jobs (same user) running chgrp recursively on a directory with ~4.5M files/directories (this was reported separately in another LU).

            What's the LU number?


            niu Niu Yawei (Inactive) added a comment -

            > One other observation: It appears that we have been kind of chasing our tails with the "...blocked for more than 120 seconds" messages we were seeing in dmesg and syslogs. Come to find out that this particular error message is new with the new kernel (we checked the source of the previous kernel and this reporting didn't exist). Based on the volume of questions posted on the web, it seems like this is a commonly observed condition. The question is, is there reason for concern?

            When a process hasn't been scheduled for a long time (usually because of a deadlock or because the system is overloaded), the kernel prints this kind of message to warn the user that something is wrong.
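
            For reference, the warning is controlled by an ordinary sysctl, as the message itself hints; a sketch of raising or silencing it on a client (the 600s value is just an example):

                # Raise the threshold so long-but-legitimate I/O waits stop tripping
                # the warning...
                echo 600 > /proc/sys/kernel/hung_task_timeout_secs
                # ...or disable the check entirely, as the message itself suggests.
                echo 0 > /proc/sys/kernel/hung_task_timeout_secs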

            jamervi Joe Mervini added a comment -

            One other observation: It appears that we have been kind of chasing our tails with the "...blocked for more than 120 seconds" messages we were seeing in dmesg and syslogs. Come to find out that this particular error message is new with the new kernel (We checked through the source of the previous kernel and this reporting didn't exist). Based on the volume of questions posted on the web, it seems like this is a commonly observed condition. The question is, is there reason for concern?

            jamervi Joe Mervini added a comment -

            Niu, we are starting to look at user(s) jobs as the source of the problem. We had a similar problem that occurred on another one of our Lustre file systems where the MDS was getting buried by multiple user jobs (same user) running chgrp recursively on a directory with ~4.5M files/directories (this was reported separately in another LU).

            WRT rebuilding lustre with lru-resize disabled, we have not gone down that path yet. Because of the complexity of this particular environment (diskless, IB only network, etc.) it is more complicated than a simple drop-in of the new bits on the servers. However, we have implemented the non-invasive changes that Johann has recommended and will start testing today.


            niu Niu Yawei (Inactive) added a comment -

            Hi, Joe

            Any news on the problem isolation? I think it's worthwhile to try disabling lru-resize as Andreas suggested, and it would be great if we could catch the full stack on the MDS when you hit the problem again.

            Thanks
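
            If it helps, a sketch of the usual runtime way to switch off LRU resizing without rebuilding, by pinning a static LRU size on the clients; the lock count here is purely illustrative:

                # A non-zero lru_size switches the namespaces from dynamic (lru-resize)
                # to a fixed LRU of that many locks; setting it back to 0 restores
                # dynamic resizing. Run on the clients.
                lctl set_param ldlm.namespaces.*.lru_size=800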


            morrone Christopher Morrone (Inactive) added a comment -

            FYI, if you are using our tree and having o2iblnd problems, we have an additional couple of histograms that might be useful in:

            /proc/sys/lnet/o2iblnd/stats
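
            If that file is present, a trivial sampling loop like the following (interval and log path arbitrary) would make it easier to line the histograms up with the RDMA timeout messages:

                # Timestamped snapshots of the o2iblnd histograms for correlation
                # with the "Timed out RDMA" messages in syslog.
                while true; do
                    date
                    cat /proc/sys/lnet/o2iblnd/stats
                    sleep 30
                done >> /var/tmp/o2iblnd-stats.log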


            johann Johann Lombardi (Inactive) added a comment - edited

            > Two potential short-term workarounds for this:
            >
            > disable the read cache on the OSS nodes, to avoid memory pressure when
            > allocating the pages from ost_brw_write(). They currently do not use
            > GFP_NOFS, because otherwise there is no memory pressure on the OSS to
            > free up pages on the OSSes.
            > lctl set_param obdfilter.*.readcache_enable=0

            Actually this just disables the cache for bulk reads, but leaves pages in the page cache for bulk writes.

            e.g.:
            # lctl get_param obdfilter.*.*cache_enable
            obdfilter.lustre-OST0001.read_cache_enable=4294967295
            obdfilter.lustre-OST0001.writethrough_cache_enable=4294967295
            # lctl set_param obdfilter.*.read_cache_enable=0
            obdfilter.lustre-OST0001.read_cache_enable=0
            # lctl get_param obdfilter.*.*cache_enable
            obdfilter.lustre-OST0001.read_cache_enable=0
            obdfilter.lustre-OST0001.writethrough_cache_enable=4294967295
            

            To disable both, you should run:

            # lctl set_param obdfilter.*.*cache_enable=0
            

            For pages which are already in cache, you can free them with "echo 1 > /proc/sys/vm/drop_caches"
            After that, you should really no longer see any memory pressure on the OSSs.

            Could you please run those two commands and check that the OSSs are never under memory pressure again (with top, free, vmstat, ...)?
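
            One possible way to capture that over time on an OSS, as a background sketch (log path and interval are arbitrary):

                # Log memory and VM activity once a minute so any renewed memory
                # pressure after the cache change shows up with a timestamp.
                while true; do
                    { date; free -m; vmstat 1 5; } >> /var/tmp/oss-mem.log
                    sleep 60
                done &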

            > disable lockless IO on the OSSes for the time being. Check if this is
            > enabled first, to see if this is really the culprit (though from the
            > stack I would say yes):
            > lctl get_param ldlm.namespaces.*.max_nolock_bytes
            >
            > and disable it via:
            > lctl set_param ldlm.namespaces.*.max_nolock_bytes=0

            Unless this has been explicitly enabled, lockless I/Os are disabled by default, BUT are still used for direct I/Os.


            People

              niu Niu Yawei (Inactive)
              jamervi Joe Mervini
              Votes: 0
              Watchers: 16
