
Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)

Details

    • Resolution: Cannot Reproduce
    • Priority: Critical

    Description

      Since upgrading from TOSS-1.3.4 we have been experiencing major stability problems. The one common denominator appears to be latency between Lustre clients and servers. Both servers and clients are dumping large numbers of syslog/dmesg messages about timeouts and heavy IO load. In particular, we are seeing large volumes of messages like this from clients:

      2011-09-02 10:18:49 rs249 INFO: task xsolver:23906 blocked for more than 120 seconds. <kern.err>
      2011-09-02 10:18:49 rs249 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <kern.err>
      2011-09-02 10:18:49 rs249 xsolver D ffff81000101d640 0 23906 23896 23915 23905 (NOTLB) <kern.warning>
      2011-09-02 10:18:49 rs249 ffff810162bc9be8 0000000000000046 0000000000000000 0000000000400000 <kern.warning>
      2011-09-02 10:18:49 rs249 ffff8101d9eff000 0000000000000007 ffff8101eb2ce7f0 ffff81020554c7f0 <kern.warning>
      2011-09-02 10:18:49 rs249 0000460dc26b8914 000000000000b9f5 ffff8101eb2ce9d8 00000003de55f1e8 <kern.warning>
      2011-09-02 10:18:49 rs249 Call Trace: <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8002960b>] sync_page+0x0/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066812>] io_schedule+0x3f/0x63 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80029649>] sync_page+0x3e/0x42 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80041222>] __lock_page+0x64/0x6b <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff800140ac>] find_lock_page+0x69/0xa3 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c468d>] :lustre:ll_file_readv+0xbcd/0x2100 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888f60f8>] :lustre:ll_stats_ops_tally+0x48/0xf0 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff888c5bde>] :lustre:ll_file_read+0x1e/0x20 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff8000b80f>] vfs_read+0xcc/0x172 <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80011fef>] sys_read+0x47/0x6f <kern.warning>
      2011-09-02 10:18:49 rs249 [<ffffffff80060116>] system_call+0x7e/0x83 <kern.warning>

      and:

      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:2984:kiblnd_check_txs()) Timed out tx: tx_queue, 106 seconds <kern.err>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3001:kiblnd_conn_timed_out()) Timed out RDMA on queue ibc_tx_queue (sends that need a credit) <kern.err>
      2011-09-02 03:36:28 rs2166 Lustre: conn[29] ffff81037c66e1c0 [version 12] -> 10.1.36.9@o2ib: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: state 3 nposted 1/1 cred 0 o_cred 0 r_cred 8 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ready 0 scheduled -1 comms_err 0 last_send 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: early_rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_nocred: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue_rsrvd: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: tx_queue: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790d10 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3368 msg !- type d1 cred 2 aqt 10339d4e1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fd70 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3369 msg !- type d1 cred 0 aqt 10339d3ea <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790798 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c4 msg !- type d1 cred 1 aqt 1033a9074 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078eb78 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c5 msg !- type d1 cred 0 aqt 1033a851c <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078ffc8 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3dd9 msg !- type d1 cred 2 aqt 1033b31b1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078dca0 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3ddb msg !- type d1 cred 0 aqt 1033b2a31 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010791350 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bb msg !- type d1 cred 1 aqt 1033bff6d <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fb18 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bc msg !- type d1 cred 1 aqt 1033bf8fc <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: active_txs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790860 snd 1 q 0 w 0 rc 0 dl 1033e9385 cookie 0xd4fcd msg -- type d0 cred 0 aqt 1033d0ce5 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: rxs: <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd000 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd068 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd0d0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd138 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd1a0 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd208 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd270 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd2d8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd340 status 0 msg_type d0 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd3a8 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd410 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd478 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd4e0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd548 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd5b0 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd618 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd680 status 0 msg_type d1 cred 2 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd6e8 status 0 msg_type d1 cred 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qp_state 3 cur_qp_state 3 mtu 4 mig_state 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp_cap: swr 4096 rwr 32 ssge 1 rsge 1 inline 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr : dlid 21 sl 0 s_p_bits 0 rate 2 flags 0 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr(alt): dlid 12433 sl 14 s_p_bits 24 rate 0 flags 1 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: pkey 0 alt_pkey 17 en 3 sq 0 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: max_rd 1 max_dest 1 min_rnr 27 port 1 <kern.info>
      2011-09-02 03:36:28 rs2166 Lustre: ib_qp: timeout 19 retry 5 rnr_re 6 alt_port 1 alt_timeout 14 <kern.info>
      2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3079:kiblnd_check_conns()) Timed out RDMA with 10.1.36.9@o2ib (0) <kern.err>

      On the server side we are seeing a lot of messages similar to this:

      2011-09-02 05:30:59 oss-scratch14 Lustre: Service thread pid 11707 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Pid: 11707, comm: ll_ost_io_318 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 Call Trace: <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80066031>] thread_return+0x5e/0xf6 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b67>] default_wake_function+0xd/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800532b5>] __wake_up_locked+0x13/0x15 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8029d1c9>] __down_trylock+0x1c/0x5a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-02 05:30:59 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 LustreError: dumping log to /lustre-tmp/oss-scratch14.1314928034.10857 <kern.alert>
      2011-09-01 19:47:14 oss-scratch14 Lustre: scratch1-OST0034: slow start_page_write 600s due to heavy IO load <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Lustre: Service thread pid 11729 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Pid: 11729, comm: ll_ost_io_340 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 Call Trace: <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80067b32>] __down+0xc5/0xd9 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
      2011-09-01 19:47:14 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

      and:

      2011-09-02 10:28:09 oss-scratch14 Lustre: Skipped 11 previous similar messages <kern.warning>
      2011-09-02 10:28:09 oss-scratch14 Lustre: scratch1-OST0034: Client fb914147-23d5-58f1-2781-03d1a9f96701 (at 10.1.3.226@o2ib) refused reconnection, still busy with 1 active RPCs <kern.warning>

      When a particular server starts getting bound up, we can see the load average go up to greater than 700 (on an 8-core server).

      Another point: we also implemented quotas as a means to quickly see disk usage (to replace du as the de facto method), but when problems presented themselves we removed quotas from the configuration (i.e., used tunefs.lustre to reset the parameters). There is a question as to whether quota files that may persist on the block devices could be contributing. I did see some discussion on lustre-discuss with regard to these files, but there was no real information on their impact if quotas were disabled, or on how to remove them if there was.
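
      For reference, a minimal sketch of how the stored quota parameters could be inspected and, if necessary, cleared with tunefs.lustre; the /dev/md0 device path and the ost.quota_type parameter name are illustrative, and the target must be unmounted:

      # Show the parameters currently stored on an OST/MDT target
      tunefs.lustre --dryrun /dev/md0     # check whether quota_type still appears in the "Parameters:" line
      # If a stale quota parameter is still present, erase the old parameters and
      # re-apply only the ones that should remain, for example:
      #   tunefs.lustre --erase-params --param="<params to keep>" /dev/md0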

      I have also attached one of the many Lustre debug dumps. Hopefully this information is sufficient to at least provide a starting point for analyzing the problem.

      Attachments

        1. dmesg
          121 kB
        2. oss-scratch16.1314981155.10959
          27 kB
        3. slabinfo
          17 kB


          Activity

            [LU-659] Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release)

            adilger Andreas Dilger added a comment -

            If there is no swap on the nodes, then the kernel has to work much harder in case of memory pressure. If you are booting diskless I would still recommend having a small swap partition on the server nodes.

            The pauses during boot may well relate to this, if the boot code is doing any upcalls to load configs or firmware.
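
            A minimal sketch of adding a small local swap partition on an OSS, assuming a spare partition on the local disk is available (/dev/sdX9 is a placeholder):

            mkswap /dev/sdX9       # initialize the partition as swap
            swapon /dev/sdX9       # enable it immediately
            # make it persistent across reboots, via /etc/fstab or the nodeup scripts:
            echo "/dev/sdX9  swap  swap  defaults  0 0" >> /etc/fstab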
            jamervi Joe Mervini added a comment -

            Andreas, at this point we have decided to pull back on any modifications to the system in favor of trying to isolate the problem by removing the local software raid lustre scratch file system from the equation and having users run out of our site file system. I think for now we should plan on taking this up again Tuesday, as this is a long holiday weekend for us. But I was wondering if you had any opinion on the blocked tasks during boot up? One thing that we failed to mention that is relevant to the environment is that the system is essentially a diskless configuration where the OS is delivered to all nodes - with the exception of bootnodes - via NFS over IB.


            adilger Andreas Dilger added a comment -

            Joe, if you build your own Lustre packages, you could first try running configure with the "--disable-lru-resize" option (along with any other options you normally use) and installing this on some/all of the OSS nodes.
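
            A rough sketch of what that build sequence might look like, assuming a locally built tree (source path and kernel location are placeholders):

            cd lustre-1.8.5
            ./configure --with-linux=/usr/src/kernels/<kernel-version> --disable-lru-resize
            make rpms        # build RPMs to install on the OSS nodes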
            pjones Peter Jones added a comment -

            Niu

            Could you please help out with this one?

            Peter


            adilger Andreas Dilger added a comment -

            Joe, I'm assuming that since you run the chaos release this is the same code that is being run at LLNL? Is this problem specific to a particular user job, or is it happening regardless of the type of job being run?

            If I provided you with a Lustre patch to avoid the allocations recursing into the DLM, can you build it, or do you normally get pre-built RPMs?

            The other question is whether it would be worthwhile to try the 1.8.6.wc1 release on a handful of OSS nodes to see if it avoids the problem?
            jamervi Joe Mervini added a comment -

            The scratch file system was only usable for a short period of time after I posted my last remarks (I was away for several hours but others who were monitoring the system reported hung clients and load averages on the servers in the hundreds.) So I think it is safe to say that turning off read cache provided no relief.

            jamervi Joe Mervini added a comment -

            Andreas, the short answer is: yes, the OSSs were still hanging. Today I preemptively restarted all the OSS servers, disabling read_cache_enable at mount time. We'll see if there are any positive effects.

            One thing that I noticed that I hadn't before is that I am also seeing blocked tasks on the OSS nodes during boot (below). Since we do not start either lustre or mdadm automatically (we do it via nodeup scripts), I personally cannot see how either is contributing to these messages being generated. But that is beyond my understanding of both linux and lustre internals.

            INFO: task modprobe:1778 blocked for more than 120 seconds.
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            modprobe D ffff810001004440 0 1778 1742 (NOTLB)
            ffff8101dcf0fd78 0000000000000046 ffff81025e5eefff ffff8101de3de678
            0000000000000001 0000000000000007 ffff8101ff54b040 ffffffff80324b60
            00000055e1c307ae 0000000000002c68 ffff8101ff54b228 0000000080091f8c
            Call Trace:
            [<ffffffff80067b32>] __down+0xc5/0xd9
            [<ffffffff80093b5a>] default_wake_function+0x0/0xf
            [<ffffffff800677dd>] __down_failed+0x35/0x3a
            [<ffffffff801dc223>] __driver_attach+0x0/0xb6
            [<ffffffff801dc2c2>] __driver_attach+0x9f/0xb6
            [<ffffffff801dc223>] __driver_attach+0x0/0xb6
            [<ffffffff801db9e9>] bus_for_each_dev+0x49/0x7a
            [<ffffffff801dc051>] driver_attach+0x1c/0x1e
            [<ffffffff801db5ee>] bus_add_driver+0x78/0x116
            [<ffffffff801dc4fc>] driver_register+0x8f/0x93
            [<ffffffff8016dff8>] __pci_register_driver+0x5a/0xb7
            [<ffffffff882f4017>] :pata_acpi:pacpi_init+0x17/0x19
            [<ffffffff800aeb64>] sys_init_module+0xb0/0x1f5
            [<ffffffff80060116>] system_call+0x7e/0x83

            INFO: task modprobe:1901 blocked for more than 120 seconds.
            "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            modprobe D ffff810205462040 0 1901 1809 (NOTLB)
            ffff8101dce3fd58 0000000000000046 00000002fe821000 0000000000000002
            00000006dce3fd48 0000000000000007 ffff8101fe24c7f0 ffff8102056bb7f0
            00000055e16baed4 0000000000002556 ffff8101fe24c9d8 0000000600000096
            Call Trace:
            [<ffffffff80067b32>] __down+0xc5/0xd9
            [<ffffffff80093b5a>] default_wake_function+0x0/0xf
            [<ffffffff800677dd>] __down_failed+0x35/0x3a
            [<ffffffff801dc2c2>] __driver_attach+0x9f/0xb6
            [<ffffffff801dc223>] __driver_attach+0x0/0xb6
            [<ffffffff801db9e9>] bus_for_each_dev+0x49/0x7a
            [<ffffffff801dc051>] driver_attach+0x1c/0x1e
            [<ffffffff801db5ee>] bus_add_driver+0x78/0x116
            [<ffffffff801dc4fc>] driver_register+0x8f/0x93
            [<ffffffff8016dff8>] __pci_register_driver+0x5a/0xb7
            [<ffffffff883d5074>] :i7core_edac:i7core_init+0x74/0x98
            [<ffffffff800aeb64>] sys_init_module+0xb0/0x1f5
            [<ffffffff80060116>] system_call+0x7e/0x83


            adilger Andreas Dilger added a comment -

            Joe, when you write "did not provide any positive results" does that mean the OSS nodes are still hanging? Had they already dumped stacks (watchdog stack dumps due to inactivity, with ldlm_pools_srv_shrink() on the stack)? If that is the case, then disabling read cache would be too late, and would not "unblock" the OSS. Any OSS nodes that are already hung need to be rebooted, and the "lctl set_param" (or equivalent "echo 0 >") needs to be run right after boot.
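
            A minimal sketch of that early-boot step on each OSS; parameter names are as in 1.8.x, and the writethrough line is an optional addition:

            # disable the OSS read cache right after boot, before load builds up
            lctl set_param obdfilter.*.read_cache_enable=0
            lctl set_param obdfilter.*.writethrough_cache_enable=0    # optionally skip cache on writes too
            # equivalent /proc form:
            #   for f in /proc/fs/lustre/obdfilter/*/read_cache_enable; do echo 0 > $f; done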
            jamervi Joe Mervini added a comment -

            Andreas - I meant to mention in my last post that I did turn off read_cache_enable by poking the proc entry via "cat 0 > /proc/fs/lustre/obdfilter/<OST>/read_cache_enable". I am assuming this accomplishes the same thing as "lctl set_param obdfilter.*.readcache_enable=0". Doing this on the live file system did not provide any positive results.

            I also checked the state of /proc/fs/lustre/ldlm/namespaces/filter-<OST>/max_nolock_bytes on each of the 4 OSTs and they are all set to 0.

            pjones Peter Jones added a comment -

            Johann,

            Can you please comment?

            Thanks

            Peter


            adilger Andreas Dilger added a comment -

            Doh, 1.8.3, if I actually open my eyes...

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: jamervi Joe Mervini
              Votes: 0
              Watchers: 16
