[LU-659] Experiencing heavy IO load, client eviction and RPC timeouts after upgrade to lustre-1.8.5.0-5 (chaos release) Created: 02/Sep/11  Updated: 16/Nov/11  Resolved: 27/Oct/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Epic Priority: Critical
Reporter: Joe Mervini Assignee: Niu Yawei (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: o2iblnd
Environment:

Redsky Cluster - Oracle (Sun) C48 Blade servers, ~2700 nodes, running LLNL Chaos version 4.4.3 (TOSS 1.4.3) with Lustre version 1.8.5.0-5. Blades consist of two 4-core Nehalem processors with 12GB memory. Networking is QDR IB only, using a 3D torus routing algorithm. Storage is software RAID 6 8+2 (mdadm) running on Oracle (Sun) J4400 JBODs. More specifically:

OFED1.5
Kernel 2.6.18-107 + two patches:
Lustre kernel patch (raid10-soft-lockups.patch)
OFED kernel patch (mad-qp-size-tunable.patch)
Lustre-1.8.5.0-3chaos, which seems to be Lustre-1.8.5.0-3 + five patches:
ff2ef0c LU-337 Fix alloc mask in alloc_qinfo()
f9e0e36 LU-234 OOM killer causes node hang.
09eb8f9 LU-286 racer: general protection fault.
f5a9068 LU-274 Update LVB from disk when glimpse callback return error
c4d695f Add IP to error message when peer's IB port is not privileged.


Attachments: File dmesg     File oss-scratch16.1314981155.10959     File slabinfo    
Epic: hang, server, timeout
Rank (Obsolete): 10101

 Description   

Since upgrading from TOSS-1.3.4 we have been experiencing MAJOR problems with stability. The one common denominator seems to be latency between Lustre clients and servers. Both servers and clients are dumping lots of syslog/dmesg messages about timeouts and heavy IO loads. In particular, we are seeing large volumes of messages like this from clients:

2011-09-02 10:18:49 rs249 INFO: task xsolver:23906 blocked for more than 120 seconds. <kern.err>
2011-09-02 10:18:49 rs249 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. <kern.err>
2011-09-02 10:18:49 rs249 xsolver D ffff81000101d640 0 23906 23896 23915 23905 (NOTLB) <kern.warning>
2011-09-02 10:18:49 rs249 ffff810162bc9be8 0000000000000046 0000000000000000 0000000000400000 <kern.warning>
2011-09-02 10:18:49 rs249 ffff8101d9eff000 0000000000000007 ffff8101eb2ce7f0 ffff81020554c7f0 <kern.warning>
2011-09-02 10:18:49 rs249 0000460dc26b8914 000000000000b9f5 ffff8101eb2ce9d8 00000003de55f1e8 <kern.warning>
2011-09-02 10:18:49 rs249 Call Trace: <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff8002960b>] sync_page+0x0/0x42 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80066812>] io_schedule+0x3f/0x63 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80029649>] sync_page+0x3e/0x42 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80066975>] __wait_on_bit_lock+0x42/0x78 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80041222>] __lock_page+0x64/0x6b <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff800a822d>] wake_bit_function+0x0/0x2a <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff800140ac>] find_lock_page+0x69/0xa3 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff888c468d>] :lustre:ll_file_readv+0xbcd/0x2100 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff888f60f8>] :lustre:ll_stats_ops_tally+0x48/0xf0 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff888c5bde>] :lustre:ll_file_read+0x1e/0x20 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff8000b80f>] vfs_read+0xcc/0x172 <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80011fef>] sys_read+0x47/0x6f <kern.warning>
2011-09-02 10:18:49 rs249 [<ffffffff80060116>] system_call+0x7e/0x83 <kern.warning>

and:

2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:2984:kiblnd_check_txs()) Timed out tx: tx_queue, 106 seconds <kern.err>
2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3001:kiblnd_conn_timed_out()) Timed out RDMA on queue ibc_tx_queue (sends that need a credit) <kern.err>
2011-09-02 03:36:28 rs2166 Lustre: conn[29] ffff81037c66e1c0 [version 12] -> 10.1.36.9@o2ib: <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: state 3 nposted 1/1 cred 0 o_cred 0 r_cred 8 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ready 0 scheduled -1 comms_err 0 last_send 1033d0ce5 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: early_rxs: <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: tx_queue_nocred: <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: tx_queue_rsrvd: <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: tx_queue: <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790d10 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3368 msg !- type d1 cred 2 aqt 10339d4e1 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fd70 snd 0 q 1 w 0 rc 0 dl 1033b6b9a cookie 0xd3369 msg !- type d1 cred 0 aqt 10339d3ea <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790798 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c4 msg !- type d1 cred 1 aqt 1033a9074 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078eb78 snd 0 q 1 w 0 rc 0 dl 1033c177a cookie 0xd38c5 msg !- type d1 cred 0 aqt 1033a851c <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078ffc8 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3dd9 msg !- type d1 cred 2 aqt 1033b31b1 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078dca0 snd 0 q 1 w 0 rc 0 dl 1033ccb2a cookie 0xd3ddb msg !- type d1 cred 0 aqt 1033b2a31 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc20010791350 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bb msg !- type d1 cred 1 aqt 1033bff6d <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc2001078fb18 snd 0 q 1 w 0 rc 0 dl 1033d86aa cookie 0xd43bc msg !- type d1 cred 1 aqt 1033bf8fc <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: active_txs: <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffffc20010790860 snd 1 q 0 w 0 rc 0 dl 1033e9385 cookie 0xd4fcd msg – type d0 cred 0 aqt 1033d0ce5 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: rxs: <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd000 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd068 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd0d0 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd138 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd1a0 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd208 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd270 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd2d8 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd340 status 0 msg_type d0 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd3a8 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd410 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd478 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd4e0 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd548 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd5b0 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd618 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd680 status 0 msg_type d1 cred 2 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ffff81037defd6e8 status 0 msg_type d1 cred 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qp_state 3 cur_qp_state 3 mtu 4 mig_state 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_qp: qkey 3411518464 rq_psn 10726368 sq_psn 13326244 dest_qp_num 2623659 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_qp_cap: swr 4096 rwr 32 ssge 1 rsge 1 inline 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr : dlid 21 sl 0 s_p_bits 0 rate 2 flags 0 port 1 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_ah_attr(alt): dlid 12433 sl 14 s_p_bits 24 rate 0 flags 1 port 1 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_qp: pkey 0 alt_pkey 17 en 3 sq 0 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_qp: max_rd 1 max_dest 1 min_rnr 27 port 1 <kern.info>
2011-09-02 03:36:28 rs2166 Lustre: ib_qp: timeout 19 retry 5 rnr_re 6 alt_port 1 alt_timeout 14 <kern.info>
2011-09-02 03:36:28 rs2166 LustreError: 4284:0:(o2iblnd_cb.c:3079:kiblnd_check_conns()) Timed out RDMA with 10.1.36.9@o2ib (0) <kern.err>

On the server side we are seeing lot of messages similar to this:

2011-09-02 05:30:59 oss-scratch14 Lustre: Service thread pid 11707 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
2011-09-02 05:30:59 oss-scratch14 Pid: 11707, comm: ll_ost_io_318 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 Call Trace: <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80066031>] thread_return+0x5e/0xf6 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b67>] default_wake_function+0xd/0xf <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80091f8c>] __wake_up_common+0x3e/0x68 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800532b5>] __wake_up_locked+0x13/0x15 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8029d1c9>] __down_trylock+0x1c/0x5a <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
2011-09-02 05:30:59 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

and:

2011-09-01 19:47:14 oss-scratch14 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 LustreError: dumping log to /lustre-tmp/oss-scratch14.1314928034.10857 <kern.alert>
2011-09-01 19:47:14 oss-scratch14 Lustre: scratch1-OST0034: slow start_page_write 600s due to heavy IO load <kern.warning>
2011-09-01 19:47:14 oss-scratch14 Lustre: Service thread pid 11729 was inactive for 600.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: <kern.warning>
2011-09-01 19:47:14 oss-scratch14 Pid: 11729, comm: ll_ost_io_340 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 Call Trace: <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff80067b32>] __down+0xc5/0xd9 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff80093b5a>] default_wake_function+0x0/0xf <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800677dd>] __down_failed+0x35/0x3a <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887df498>] .text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800227da>] __up_read+0x7a/0x83 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8001662f>] alloc_pages_current+0x9f/0xa8 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800ce803>] __page_cache_alloc+0x6d/0x71 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800265fe>] find_or_create_page+0x37/0x7b <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af34a8>] filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88af569a>] filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae1f9>] lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88745c91>] class_handle2object+0xe1/0x170 [obdclass] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ae192>] lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88a9ff77>] ost_brw_write+0xf67/0x2410 [ost] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887ef928>] ptlrpc_send_reply+0x5f8/0x610 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3eb0>] lustre_msg_check_version_v2+0x10/0x30 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f4642>] lustre_msg_check_version+0x22/0x80 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88aa4053>] ost_handle+0x2c33/0x5690 [ost] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8015f9f8>] __next_cpu+0x19/0x28 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8007a442>] smp_send_reschedule+0x4a/0x50 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff887f3cf5>] lustre_msg_get_opc+0x35/0xf0 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8880342e>] ptlrpc_server_handle_request+0x96e/0xdc0 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803b8a>] ptlrpc_wait_event+0x30a/0x320 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88804b06>] ptlrpc_main+0xf66/0x1110 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8006101d>] child_rip+0xa/0x11 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff88803ba0>] ptlrpc_main+0x0/0x1110 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff80061013>] child_rip+0x0/0x11 <kern.warning>

and:

2011-09-02 10:28:09 oss-scratch14 Lustre: Skipped 11 previous similar messages <kern.warning>
2011-09-02 10:28:09 oss-scratch14 Lustre: scratch1-OST0034: Client fb914147-23d5-58f1-2781-03d1a9f96701 (at 10.1.3.226@o2ib) refused reconnection, still busy with 1 active RPCs <kern.warning>

When a particular server starts getting bound up, we can see the load average climb to greater than 700 (on an 8-core server).

Another point: we also implemented quotas as a means to quickly see disk usage (to replace du as the de facto method), but when problems presented themselves we removed quotas from the configuration (i.e., used tunefs.lustre to reset the parameters). There is a question as to whether the quota files that may persist on the block devices could be contributing. I saw some discussion on lustre-discuss with regard to these files, but there was no real information on their impact when quotas are disabled or on how to remove them if there was an impact.

I have also attached one of the many lustre dumps. Hopefully this information is sufficient to at least set a starting point for analyzing the problem.



 Comments   
Comment by Oleg Drokin [ 02/Sep/11 ]

Did you always get this many IB errors?
Do you have read-only cache enabled on the OSTs? The OST trace shows things are stuck in memory allocation, and I remember there were some fixes in that area (i.e. both the cache and the in-memory callbacks) since 1.8.5.

Comment by Joe Mervini [ 02/Sep/11 ]


We have read_cache_enable set to 1 (enabled) if that is what you mean. Early on (1.8.0.1) we had this disabled because of CPU soft lockups but re-enabled it when we went to 1.8.3.

Comment by Joe Mervini [ 02/Sep/11 ]

Oleg, did you mean that we should disable read_cache_enable?

Comment by Cliff White (Inactive) [ 02/Sep/11 ]

I would suggest turning off the read cache to see if it helps.

Comment by Andreas Dilger [ 02/Sep/11 ]

Joe, do you know what else changed in your environment as part of the TOSS 1.3.4 -> 1.4.3 upgrade? Did this include a new major version of OFED? I'm just wondering whether there are perhaps some complications at the IB layer that are causing the timeouts and reconnection? What was the previous kernel version?

IIRC (not being a networking expert, and only having heard about this tangentially), the Sun IB topology for Red Sky was non-standard and involved changes to the subnet manager to implement the torus topology? Maybe I'm recalling some other site, but I thought I would inquire.

Secondly, with the older setup, were there limits put on the number of OST threads active at one time that might have been lost with the upgrade (e.g. "options ost ost_num_threads=128" in /etc/modprobe.conf or similar)? How many OSTs per OSS are there? I know Kevin had done this at some of the J4400 sites in order to avoid overloading the OSTs, with the optimum performance being seen at about 32 threads per OST.
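For reference, a minimal sketch of the module option being referred to (the file location and value are illustrative, and the setting only takes effect when the ost module is loaded):

# /etc/modprobe.conf (or a file under /etc/modprobe.d/)
options ost ost_num_threads=128
# reboot the OSS, or reload the Lustre modules, for the new thread count to apply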

Comment by Joe Mervini [ 02/Sep/11 ]

Andreas, There were many changes to the environment as a result of the upgrade. I believe this also included changes to OFED. Also you are correct in your recollection of our IB configuration. Matthew Bohnsack is going to respond more authoritatively on that and the network topology and torus configuration.

Our configuration is 4 OSTs/OSS. With regard to the number of threads, we had always had the option of reducing the thread count in our back pocket. But we never changed it because the file system was fairly tolerant with the 512 thread pool and we wanted maximum performance. But we can always pull that trigger if you all determine that is the best course of action.

Comment by Matthew Bohnsack [ 02/Sep/11 ]

Andreas,

Before upgrade:

  • Distro: chaos-release-4.3-4.ch4.3
  • Kernel: 2.6.18-93.2redsky_chaos
  • OFED: 1.4
  • Lustre: 1.8.3.0-7

After upgrade:

  • Distro: chaos-release-4.4-3.ch4.4
  • Kernel: 2.6.18-107redsky_chaos
  • OFED: 1.5
  • Lustre: 1.8.5.0-5

There could be complications at the IB layer causing timeouts and reconnections, but we're not seeing them at the lower IB layers. We monitor the state of the IB fabric very carefully, and while these Lustre problems are occurring, a) the SM is stable, and b) we see almost zero symbol errors, receive errors, etc. on the fabric. We see zero IB counters that would indicate congestion or deadlock.

Red Sky is arranged as a 3D torus and this involved changes to the SM to make things work. In addition to the SM being aware of the torus, we needed to make sure that IB ULPs did path record queries that honored the bits set in the returned service level (SL). For example, MPI didn't do this correctly before we fixed it. None of this has been an issue for us for a couple of years, but who knows, something may have come up in the most recent OFED/kernel/Lustre that is biting us.

There's a published presentation on the Red Sky torus implementation, if you're interested. It covers things in more detail: http://www.openfabrics.org/archives/spring2010sonoma/Wednesday/9.00%20Marcus%20Epperson%20Red%20Sky/OFED%202010%20RedSky%20IB.pdf

I think Joe answered your OST thread question.

Comment by Andreas Dilger [ 02/Sep/11 ]

Looking at this more closely, it isn't clear that the server thread count is the real culprit, though I was confused to see the server load at 700, even though the maximum thread count is 512. That may be caused by the clients timing out and trying to reconnect via a separate thread, while all of the IO threads are blocked.

The stack trace shows a potential deadlock, possibly caused by "lockless IO", where the OST thread is grabbing the DLM lock on behalf of the client, but the memory pressure is causing the page allocations to recurse into the DLM and getting stuck on the lock the thread is already holding (or some other combination of A/B/C/.../A lock deadlock).

__down_failed+0x35/0x3a <kern.warning>
.text.lock.ldlm_pool+0x55/0x7d [ptlrpc] <kern.warning>
ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
shrink_slab+0xd3/0x15c <kern.warning>
alloc_pages_current+0x9f/0xa8 <kern.warning>
__page_cache_alloc+0x6d/0x71 <kern.warning>
find_or_create_page+0x37/0x7b <kern.warning>
filter_get_page+0x38/0x70 [obdfilter] <kern.warning>
filter_preprw+0x146a/0x1d30 [obdfilter] <kern.warning>
lock_handle_addref+0x9/0x10 [ptlrpc] <kern.warning>
lock_res_and_lock+0xc2/0xe0 [ptlrpc] <kern.warning>
ost_brw_write+0xf67/0x2410 [ost] <kern.warning>

Two potential short-term workarounds for this:

  • disable the read cache on the OSS nodes, to avoid memory pressure when
    allocating the pages from ost_brw_write(). They currently do not use
    GFP_NOFS, because otherwise there is no memory pressure on the OSS to
    free up pages on the OSSes.

lctl set_param obdfilter.*.readcache_enable=0

  • disable lockless IO on the OSSes for the time being. Check if this is
    enabled first, to see if this is really the culprit (though from the
    stack I would say yes):

lctl get_param ldlm.namespaces.*.max_nolock_bytes

and disable it via:
lctl set_param ldlm.namespaces.*.max_nolock_bytes=0

Comment by Andreas Dilger [ 02/Sep/11 ]

Hmm, I'm not 100% positive about the max_nolock_bytes=0 setting, since this appears to be the default. Disabling the read cache should avoid this problem I think.

Comment by Andreas Dilger [ 02/Sep/11 ]

Also, I don't see it mentioned anywhere what the previous version of Lustre was?

Comment by Andreas Dilger [ 02/Sep/11 ]

Doh, 1.8.3, if I actually open my eyes...

Comment by Peter Jones [ 02/Sep/11 ]

Johann,

Can you please comment?

Thanks

Peter

Comment by Joe Mervini [ 02/Sep/11 ]

Andreas - I meant to mention in my last post that I did turn off read_cache_enable by poking the proc entry via "echo 0 > /proc/fs/lustre/obdfilter/<OST>/read_cache_enable". I am assuming this accomplishes the same thing as "lctl set_param obdfilter.*.readcache_enable=0". Doing this on the live file system did not provide any positive results.

I also checked the state of /proc/fs/lustre/ldlm/namespaces/filter-<OST>/max_nolock_bytes on each of the 4 OSTs and they are all set to 0.
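For reference, a sketch of the two forms of this tunable, which should be equivalent (the parameter name follows Johann's comment further down; <OST> is a placeholder for each obdfilter target):

# on each OSS
lctl set_param obdfilter.*.read_cache_enable=0
# equivalent via procfs, per target:
echo 0 > /proc/fs/lustre/obdfilter/<OST>/read_cache_enable
# verify:
lctl get_param obdfilter.*.read_cache_enable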

Comment by Andreas Dilger [ 03/Sep/11 ]

Joe, when you write "did not provide any positive results" does that mean the OSS nodes are still hanging? Had they already dumped stacks (watchdog stack dumps due to inactivity, with ldlm_pools_srv_shrink() on the stack)? If that is the case, then disabling read cache would be too late, and would not "unblock" the OSS. Any OSS nodes that are already hung need to be rebooted, and the "lctl set_param" (or equivalent "echo 0 >") needs to be run right after boot.

Comment by Joe Mervini [ 03/Sep/11 ]

Andreas, the short answer is: yes, the OSSs were still hanging. Today I preemptively restarted all the OSS servers, disabling read_cache_enable at mount time. We'll see if there are any positive effects.

One thing that I noticed that I hadn't before is that I am also seeing blocked tasks on the OSS nodes during boot time (below). Since we do not start either Lustre or mdadm automatically (we do it via nodeup scripts), I personally cannot see how either is contributing to these messages being generated. But that is beyond my understanding of both Linux and Lustre internals.

INFO: task modprobe:1778 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobe D ffff810001004440 0 1778 1742 (NOTLB)
ffff8101dcf0fd78 0000000000000046 ffff81025e5eefff ffff8101de3de678
0000000000000001 0000000000000007 ffff8101ff54b040 ffffffff80324b60
00000055e1c307ae 0000000000002c68 ffff8101ff54b228 0000000080091f8c
Call Trace:
[<ffffffff80067b32>] __down+0xc5/0xd9
[<ffffffff80093b5a>] default_wake_function+0x0/0xf
[<ffffffff800677dd>] __down_failed+0x35/0x3a
[<ffffffff801dc223>] __driver_attach+0x0/0xb6
[<ffffffff801dc2c2>] __driver_attach+0x9f/0xb6
[<ffffffff801dc223>] __driver_attach+0x0/0xb6
[<ffffffff801db9e9>] bus_for_each_dev+0x49/0x7a
[<ffffffff801dc051>] driver_attach+0x1c/0x1e
[<ffffffff801db5ee>] bus_add_driver+0x78/0x116
[<ffffffff801dc4fc>] driver_register+0x8f/0x93
[<ffffffff8016dff8>] __pci_register_driver+0x5a/0xb7
[<ffffffff882f4017>] :pata_acpi:pacpi_init+0x17/0x19
[<ffffffff800aeb64>] sys_init_module+0xb0/0x1f5
[<ffffffff80060116>] system_call+0x7e/0x83

INFO: task modprobe:1901 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
modprobe D ffff810205462040 0 1901 1809 (NOTLB)
ffff8101dce3fd58 0000000000000046 00000002fe821000 0000000000000002
00000006dce3fd48 0000000000000007 ffff8101fe24c7f0 ffff8102056bb7f0
00000055e16baed4 0000000000002556 ffff8101fe24c9d8 0000000600000096
Call Trace:
[<ffffffff80067b32>] __down+0xc5/0xd9
[<ffffffff80093b5a>] default_wake_function+0x0/0xf
[<ffffffff800677dd>] __down_failed+0x35/0x3a
[<ffffffff801dc2c2>] __driver_attach+0x9f/0xb6
[<ffffffff801dc223>] __driver_attach+0x0/0xb6
[<ffffffff801db9e9>] bus_for_each_dev+0x49/0x7a
[<ffffffff801dc051>] driver_attach+0x1c/0x1e
[<ffffffff801db5ee>] bus_add_driver+0x78/0x116
[<ffffffff801dc4fc>] driver_register+0x8f/0x93
[<ffffffff8016dff8>] __pci_register_driver+0x5a/0xb7
[<ffffffff883d5074>] :i7core_edac:i7core_init+0x74/0x98
[<ffffffff800aeb64>] sys_init_module+0xb0/0x1f5
[<ffffffff80060116>] system_call+0x7e/0x83

Comment by Joe Mervini [ 03/Sep/11 ]

The scratch file system was only usable for a short period of time after I posted my last remarks (I was away for several hours but others who were monitoring the system reported hung clients and load averages on the servers in the hundreds.) So I think it is safe to say that turning off read cache provided no relief.

Comment by Andreas Dilger [ 04/Sep/11 ]

Joe, I'm assuming that since you run the chaos release that this is the same code that is being run at LLNL? Is this problem specific to a particular user job, or is it happening regardless of the type of job being run?

If I provided you with a Lustre patch to avoid the allocations recursing into the DLM, can you build it, or do you normally get pre-built RPMs?

The other question is if it would be worthwhile to try the 1.8.6.wc1 release on a handful of OSS nodes to see if it avoids the problem?

Comment by Peter Jones [ 04/Sep/11 ]

Niu

Could you please help out with this one?

Peter

Comment by Andreas Dilger [ 04/Sep/11 ]

Joe, if you build your own Lustre packages, you could first try running configure with the "--disable-lru-resize" option (along with any other options you normally use) and installing this on some/all of the OSS nodes.
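A rough sketch of such a build, assuming the usual RPM-based workflow and that the kernel source path below matches your tree (both are assumptions):

cd lustre-1.8.5
./configure --disable-lru-resize --with-linux=/usr/src/kernels/2.6.18-107redsky_chaos   # plus your usual configure options
make rpms
# then install the resulting lustre-modules RPM on some or all of the OSS nodes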

Comment by Joe Mervini [ 04/Sep/11 ]

Andreas, at this point we have decided to pull back on any modifications to the system in favor of trying to isolate the problem by removing the local software RAID Lustre scratch file system from the equation and having users run out of our site file system. I think for now we should plan on taking this up again Tuesday, as this is a long holiday weekend for us. But I was wondering if you had any opinion on the blocked tasks during boot-up? One thing that we failed to mention that is relevant to the environment is that the system is essentially a diskless configuration where the OS is delivered to all nodes - with the exception of bootnodes - via NFS over IB.

Comment by Andreas Dilger [ 04/Sep/11 ]

If there is no swap on the nodes, then the kernel has to work much harder in case of memory pressure. If you are booting diskless I would still recommend having a small swap partition on the server nodes.

The pauses during boot may well relate to this, if the boot code is doing any upcalls to load configs or firmware.
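Purely as an illustration of the swap suggestion above, assuming a spare local partition exists on each server (the device name here is hypothetical):

mkswap /dev/sdy2        # hypothetical spare local partition
swapon /dev/sdy2
swapon -s               # confirm the swap space is active
# add a matching entry to /etc/fstab (or the nodeup script) so it persists across reboots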

Comment by Johann Lombardi (Inactive) [ 05/Sep/11 ]

> Two potential short-term workarounds for this:
>
> disable the read cache on the OSS nodes, to avoid memory pressure when
> allocating the pages from ost_brw_write(). They currently do not use
> GFP_NOFS, because otherwise there is no memory pressure on the OSS to
> free up pages on the OSSes.
> lctl set_param obdfilter.*.readcache_enable=0

Actually this just disables the cache for bulk reads, but leaves pages in the page cache for bulk writes.

e.g.:
# lctl get_param obdfilter.*.*cache_enable
obdfilter.lustre-OST0001.read_cache_enable=4294967295
obdfilter.lustre-OST0001.writethrough_cache_enable=4294967295
# lctl set_param obdfilter.*.read_cache_enable=0
obdfilter.lustre-OST0001.read_cache_enable=0
# lctl get_param obdfilter.*.*cache_enable
obdfilter.lustre-OST0001.read_cache_enable=0
obdfilter.lustre-OST0001.writethrough_cache_enable=4294967295

To disable both, you should run:

# lctl set_param obdfilter.*.*cache_enable=0

For pages which are already in cache, you can free them with "echo 1 > /proc/sys/vm/drop_caches"
After that, you should really no longer see any memory pressure on the OSSs.

Could you please run those two commands and check that the OSSs are never under memory pressure again (with top, free, vmstat, ...)?
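A possible way to do that check, once both caches are disabled and the existing cached pages have been dropped (the log path and sampling interval are arbitrary):

lctl set_param obdfilter.*.*cache_enable=0
echo 1 > /proc/sys/vm/drop_caches
# then record memory behaviour on each OSS over time, e.g.:
vmstat 5 | tee /var/tmp/oss-vmstat.log &
free -m
grep -i -e memfree -e dirty -e writeback /proc/meminfo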

> disable lockless IO on the OSSes for the time being. Check if this is
> enabled first, to see if this is really the culprit (though from the
> stack I would say yes):
> lctl get_param ldlm.namespaces.*.max_nolock_bytes
>
> and disable it via:
> lctl set_param ldlm.namespaces.*.max_nolock_bytes=0

Unless this has been explicitly enabled, lockless I/Os are disabled by default, BUT are still used for direct I/Os.

Comment by Christopher Morrone [ 06/Sep/11 ]

FYI, if you are using our tree and having o2iblnd problems, we have an additional couple of histograms that might be useful in:

/proc/sys/lnet/o2iblnd/stats
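If it helps, that file could be sampled periodically on a few affected nodes while the timeouts are occurring, e.g. (interval and log path are arbitrary):

while true; do date; cat /proc/sys/lnet/o2iblnd/stats; sleep 60; done >> /var/tmp/o2iblnd-stats.log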

Comment by Niu Yawei (Inactive) [ 07/Sep/11 ]

Hi, Joe

Any news on the problem isolation? I think it's worthwhile to try disabling lru-resize as Andreas suggested, and it would be great if we can catch the full stack on the MDS when you hit the problem again.

Thanks

Comment by Joe Mervini [ 08/Sep/11 ]

Niu, we are starting to look at user jobs as the source of the problem. We had a similar problem on another one of our Lustre file systems, where the MDS was getting buried by multiple jobs from the same user running chgrp recursively on a directory with ~4.5M files/directories (this was reported separately in another LU).

WRT rebuilding Lustre with lru-resize disabled, we have not gone down that path yet. Because of the complexity of this particular environment (diskless, IB-only network, etc.), it is more complicated than a simple drop-in of the new bits on the servers. However, we have implemented the non-invasive changes that Johann recommended and will start testing today.

Comment by Joe Mervini [ 08/Sep/11 ]

One other observation: it appears that we have been kind of chasing our tails with the "...blocked for more than 120 seconds" messages we were seeing in dmesg and syslogs. It turns out that this particular error message is new with the new kernel (we checked the source of the previous kernel and this reporting didn't exist). Based on the volume of questions posted on the web, it seems like this is a commonly observed condition. The question is, is there reason for concern?

Comment by Niu Yawei (Inactive) [ 08/Sep/11 ]

One other observation: it appears that we have been kind of chasing our tails with the "...blocked for more than 120 seconds" messages we were seeing in dmesg and syslogs. It turns out that this particular error message is new with the new kernel (we checked the source of the previous kernel and this reporting didn't exist). Based on the volume of questions posted on the web, it seems like this is a commonly observed condition. The question is, is there reason for concern?

When a process has not been scheduled for a long time (usually because of a deadlock or because the system is overloaded), the kernel prints this kind of message to warn the user that something is wrong.
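As the message itself notes, the warning threshold (not the underlying stall) is tunable through procfs, e.g.:

echo 0 > /proc/sys/kernel/hung_task_timeout_secs      # disable the warning entirely
echo 600 > /proc/sys/kernel/hung_task_timeout_secs    # or raise the threshold to reduce noise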

Comment by Niu Yawei (Inactive) [ 08/Sep/11 ]

Niu, we are starting to look at user jobs as the source of the problem. We had a similar problem on another one of our Lustre file systems, where the MDS was getting buried by multiple jobs from the same user running chgrp recursively on a directory with ~4.5M files/directories (this was reported separately in another LU).

What's the LU number?

Comment by Joe Mervini [ 08/Sep/11 ]

The other LU is LU-667

Comment by Niu Yawei (Inactive) [ 08/Sep/11 ]

the MDS was getting buried by multiple jobs from the same user running chgrp recursively on a directory with ~4.5M files/directories

Hi, Joe

How many clients are performing such jobs? Are they working on the same directory? What are the client memory size and the MDS memory size? Did you run the same jobs and hit the same problem before upgrading?

Now I'm suspecting that with lru-resize enabled, clients can cache many more DLM locks; although the MDS can ask clients to cancel locks when it is getting overloaded, the clients might not be able to respond quickly enough to relieve the situation.

If possible, it would be best to get the full stack trace on the MDS and the statistics from /proc/fs/lustre/ldlm/namespaces/xxx/* and pool on both the clients and the MDS when you hit the problem again. Thank you.
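A possible collection sequence when the problem recurs (the parameter names below are the 1.8 procfs entries; the output paths are arbitrary):

# full task stack traces on the MDS (they appear in dmesg / the console log):
echo t > /proc/sysrq-trigger
# DLM namespace and pool statistics, on the MDS and on a few affected clients:
lctl get_param ldlm.namespaces.*.lock_count ldlm.namespaces.*.lru_size > /var/tmp/ldlm-ns.txt
lctl get_param ldlm.namespaces.*.pool.* > /var/tmp/ldlm-pool.txt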

Comment by Steve Monk (Inactive) [ 14/Sep/11 ]

FYI: to allow our users to get some work done, we are limiting the use of this Red Sky local Lustre file system in favor of our site Lustre file system. We plan on testing the fixes mentioned next week as part of a maintenance window and will report back accordingly.
Thanks,
Steve

Comment by Joe Mervini [ 26/Sep/11 ]

During our dedicated application time we were able to recreate the high load and Lustre hang condition with the workload limited to 4 user jobs, with the suspicion that one or both of 2 codes are the trigger. Other fallout from this hang condition that wasn't mentioned before is that we began having problems booting servers and client nodes via gPXE over IB, and when the problem first arose, a complete system restart, including a flush of the subnet manager, was required to get the system operational again.

Today we are going to try to duplicate the condition on our test platform. If we are successful, we will try to zero in on which code is the culprit. Otherwise we suspect it might be a scaling issue.

One question that came up in a meeting this morning: is it possible to assign LNET its own P_KEY when running over IB? The idea is that if we isolate LNET traffic to its own lane and QoS, we can see whether it is causing contention on the fabric.

Comment by Andreas Dilger [ 26/Sep/11 ]

> Is it possible to assign LNET its own P_KEY when running over IB? The idea is that if we isolate LNET traffic to its own lane and QoS, we can see whether it is causing contention on the fabric.

I believe this is possible; there was some discussion (and patches) about this a year ago on lustre-discuss, IIRC. I don't know what the state of that is today, however. I've CC'd Liang to see if he recalls the details.

Comment by Joe Mervini [ 03/Oct/11 ]

We are experiencing another interruption due to high load averages on another one of our file systems. The parameters are essentially the same with the exception that this file system is running RAID10 OSTs vs. RAID6.

The predominant factors are that the load is above 150 and that a couple of ll_ost_io threads and a kjournald thread are eating up a large percentage of the CPU:

top - 12:40:17 up 31 days, 4:50, 3 users, load average: 151.00, 150.86, 150.49
Tasks: 1176 total, 2 running, 1174 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 23.7%sy, 0.0%ni, 75.5%id, 0.6%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 11814704k total, 11753832k used, 60872k free, 355376k buffers
Swap: 0k total, 0k used, 0k free, 8289304k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7747 root 11 -5 0 0 0 D 63.1 0.0 452:00.23 kjournald
8004 root 15 0 0 0 0 D 62.8 0.0 352:06.50 ll_ost_io_94
10329 root 15 0 0 0 0 R 61.5 0.0 397:11.57 ll_ost_io_243
24954 root 16 0 11664 2096 888 S 0.7 0.0 0:02.97 top
8083 root 10 -5 0 0 0 S 0.6 0.0 240:03.48 md2_raid10
7569 root 10 -5 0 0 0 S 0.5 0.0 239:52.87 md1_raid10

I have captured the output from sysrq -t and -m as well as the slabinfo and have attached them. I have tried unsuccessfully to kill the ll_ost_io threads. I saw some mention of an OOM killer, but it looks like that is only in 2.1. At this point I need to reboot the system to restore services. Hopefully the information provided can give some insight into what is happening.

Comment by Oleg Drokin [ 03/Oct/11 ]

It does look like high memory pressure on the node.
In the backtraces you provided I can only see ll_ost_io_243 in the memory-pressure path, but not io_94 or kjournald, so I don't know what the other two are doing.

From the slabinfo it looks like you have 400k locks or so too.
Any chance you can limit your clients to a certain preset number of locks and see what the impact would be?

Just echo some number like 600 to /proc/fs/lustre/ldlm/namespaces/osc/lru_size on all clients to achieve that.

Also, it is a bad idea to try to kill OST threads, so please don't do it (it won't work anyway).
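One way to push Oleg's lru_size setting out to every client (pdsh and the host group name are assumptions, and the namespace wildcard may need adjusting for your mounts):

pdsh -g clients 'for f in /proc/fs/lustre/ldlm/namespaces/*osc*/lru_size; do echo 600 > $f; done'
# spot-check a client afterwards:
lctl get_param ldlm.namespaces.*osc*.lru_size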

Comment by Joe Mervini [ 03/Oct/11 ]

Thanks, Oleg. I'll give that a try. Also, I know that killing off OST threads is not a good idea, but I was hoping that if I were able to, I could find the node or job that was creating the problem.

Comment by Minh Diep [ 21/Oct/11 ]

Hi Joe,

Have you tried disabling OSS read cache?

Comment by Johann Lombardi (Inactive) [ 26/Oct/11 ]

> Have you tried disabling OSS read cache?

I suggested this a while back (05/Sep/11 8:01 AM); however, I don't know whether or not it has been tried.

Comment by Joe Mervini [ 26/Oct/11 ]

Sorry for the slow response. Basically, we had been running with read caching turned off for most of the life of this file system because of earlier 1.8 issues. We re-enabled it for a brief time before we started encountering these problems, and part of our troubleshooting effort was to return the system to those environmental conditions (with the exception of changes due to the OS and other upgrades mentioned above). It didn't have any impact.

We now believe that a combination of user jobs/code (most likely single-node) and software RAID is creating a deadlock condition that impacts the entire system catastrophically. Because it renders the system unusable, we have opted to sideline this particular file system and are using our global scratch file system to handle that need.

I think for the time being we should do the same with this ticket. We have seen evidence that the same problem might be occurring in another fashion, and we may be able to create a reproducer for it. If we can, I will either reopen this case or create a new one, depending on how you would like to approach it.

Thanks.

Comment by Joe Mervini [ 26/Oct/11 ]

Man - after a long day you don't notice all your typos until you see it posted. Hopefully you get the gist of my message. If not or if there are any questions just let me know.

Comment by Peter Jones [ 27/Oct/11 ]

ok Joe. Thanks for the update. We'll close this ticket for now and either reopen it or open a new ticket if you have some more information in the future.

Comment by Minh Diep [ 16/Nov/11 ]

Just FYI, we no longer see the following watchdog trace when we disable the OSS read cache:

2011-09-01 19:47:14 oss-scratch14 [<ffffffff887dd3f2>] ldlm_pools_srv_shrink+0x12/0x20 [ptlrpc] <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800409ea>] shrink_slab+0xd3/0x15c <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d487f>] zone_reclaim+0x25f/0x306 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff800d0af1>] __rmqueue+0x47/0xcb <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000a96a>] get_page_from_freelist+0xb6/0x411 <kern.warning>
2011-09-01 19:47:14 oss-scratch14 [<ffffffff8000f5ad>] __alloc_pages+0x78/0x30e <kern.warning>
