[LU-3598] Failure on test suite sanity-benchmark test_iozone: page allocation failure Created: 16/Jul/13  Updated: 13/Feb/14  Resolved: 13/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0, Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

server and client: lustre-master build #1560


Issue Links:
Duplicate
duplicates LU-4357 page allocation failure. mode:0x40 ca... Resolved
Related
is related to LU-3585 Client panic during IOR single file p... Resolved
Severity: 3
Rank (Obsolete): 9130

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/02803404-ed9b-11e2-8e3a-52540035b04c.

The sub-test test_iozone failed with the following error:

test failed to respond and timed out

client console:

12:52:41:Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone == 12:52:37 (1373831557)
12:52:41:Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1850592kB available, using 3844936kB file size
12:52:41:Lustre: DEBUG MARKER: min OST has 1850592kB available, using 3844936kB file size
12:54:03:ptlrpcd_1: page allocation failure. order:1, mode:0x40
12:54:03:Pid: 28900, comm: ptlrpcd_1 Not tainted 2.6.32-358.11.1.el6.x86_64 #1
12:54:03:Call Trace:
12:54:03: [<ffffffff8112c157>] ? __alloc_pages_nodemask+0x757/0x8d0
12:54:03: [<ffffffffa06627df>] ? ptlrpc_set_add_new_req+0xcf/0x150 [ptlrpc]
12:54:03: [<ffffffff81166b02>] ? kmem_getpages+0x62/0x170
12:54:03: [<ffffffff8116771a>] ? fallback_alloc+0x1ba/0x270
12:54:03: [<ffffffff8116716f>] ? cache_grow+0x2cf/0x320
12:54:03: [<ffffffff81167499>] ? ____cache_alloc_node+0x99/0x160
12:54:03: [<ffffffffa048fd57>] ? LNetMDAttach+0x157/0x5a0 [lnet]
12:54:03: [<ffffffff81168269>] ? __kmalloc+0x189/0x220
12:54:03: [<ffffffffa048fd57>] ? LNetMDAttach+0x157/0x5a0 [lnet]
12:54:03: [<ffffffffa0670b55>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
12:54:03: [<ffffffffa069ea48>] ? sptlrpc_import_sec_ref+0x28/0x50 [ptlrpc]
12:54:03: [<ffffffffa0671fa2>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
12:54:03: [<ffffffffa06a65b4>] ? sptlrpc_req_refresh_ctx+0x164/0x970 [ptlrpc]
12:54:03: [<ffffffffa0890971>] ? osc_io_unplug+0x11/0x20 [osc]
12:54:03: [<ffffffffa06676bb>] ? ptlrpc_send_new_req+0x45b/0x7a0 [ptlrpc]
12:54:03: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
12:54:03: [<ffffffffa066b3a8>] ? ptlrpc_check_set+0x878/0x1b20 [ptlrpc]
12:54:03: [<ffffffffa06976cb>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
12:54:03: [<ffffffff8109705c>] ? remove_wait_queue+0x3c/0x50
12:54:03: [<ffffffffa0697b50>] ? ptlrpcd+0x190/0x380 [ptlrpc]
12:54:03: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
12:54:03: [<ffffffffa06979c0>] ? ptlrpcd+0x0/0x380 [ptlrpc]
12:54:03: [<ffffffff81096936>] ? kthread+0x96/0xa0
12:54:03: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
12:54:03: [<ffffffff810968a0>] ? kthread+0x0/0xa0
12:54:03: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
12:54:03:Mem-Info:
12:54:03:Node 0 DMA per-cpu:
12:54:03:CPU    0: hi:    0, btch:   1 usd:   0
12:54:03:Node 0 DMA32 per-cpu:
12:54:03:CPU    0: hi:  186, btch:  31 usd: 159


 Comments   
Comment by Keith Mannthey (Inactive) [ 16/Jul/13 ]

The allocation failure itself is not fatal; the problem is how the rest of the code deals with it.

12:54:03:128789 pages non-shared
12:54:03:LNetError: 28900:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.52/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
12:54:03:LNetError: 28900:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 10894919 total bytes allocated by lnet
12:54:03:BUG: unable to handle kernel NULL pointer dereference at 00000000000000b8
12:54:03:IP: [<ffffffffa0670d5a>] ptlrpc_register_bulk+0x46a/0x9d0 [ptlrpc]
12:54:03:PGD 5e7df067 PUD 38cce067 PMD 0 
12:54:03:Oops: 0000 [#1] SMP 

This looks to be the same as LU-3585 "Client panic during IOR single file per process". It seems LNet is not handling the failed allocation very well.

Comment by Keith Mannthey (Inactive) [ 16/Jul/13 ]

Same LNet error message.

Comment by Keith Mannthey (Inactive) [ 16/Jul/13 ]

Let's just track this in the earlier LU.

Comment by Sarah Liu [ 27/Sep/13 ]

Hit this error again during tag-2.4.93 testing:
https://maloo.whamcloud.com/test_sets/9eeff4f8-2727-11e3-88c6-52540035b04c

22:09:10:Lustre: DEBUG MARKER: == sanity-benchmark test iozone: iozone == 22:09:03 (1380172143)
22:09:10:Lustre: DEBUG MARKER: /usr/sbin/lctl mark min OST has 1850320kB available, using 3844920kB file size
22:09:11:Lustre: DEBUG MARKER: min OST has 1850320kB available, using 3844920kB file size
22:11:24:ptlrpcd_1: page allocation failure. order:1, mode:0x40
22:11:25:Pid: 2459, comm: ptlrpcd_1 Not tainted 2.6.32-358.18.1.el6.x86_64 #1
22:11:25:Call Trace:
22:11:26: [<ffffffff8112c257>] ? __alloc_pages_nodemask+0x757/0x8d0
22:11:27: [<ffffffff81166d92>] ? kmem_getpages+0x62/0x170
22:11:27: [<ffffffff811679aa>] ? fallback_alloc+0x1ba/0x270
22:11:27: [<ffffffff811673ff>] ? cache_grow+0x2cf/0x320
22:11:27: [<ffffffff81167729>] ? ____cache_alloc_node+0x99/0x160
22:11:28: [<ffffffffa048ced7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
22:11:28: [<ffffffff811684f9>] ? __kmalloc+0x189/0x220
22:11:28: [<ffffffffa048ced7>] ? LNetMDAttach+0x157/0x5a0 [lnet]
22:11:28: [<ffffffffa066dae5>] ? ptlrpc_register_bulk+0x265/0x9d0 [ptlrpc]
22:11:29: [<ffffffffa069a608>] ? sptlrpc_import_sec_ref+0x28/0x50 [ptlrpc]
22:11:30: [<ffffffffa066ef42>] ? ptl_send_rpc+0x232/0xc40 [ptlrpc]
22:11:30: [<ffffffffa06a1ebc>] ? sptlrpc_req_refresh_ctx+0x15c/0x930 [ptlrpc]
22:11:30: [<ffffffffa0664674>] ? ptlrpc_send_new_req+0x454/0x790 [ptlrpc]
22:11:30: [<ffffffffa06683e8>] ? ptlrpc_check_set+0x888/0x1b40 [ptlrpc]
22:11:31: [<ffffffffa069322b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
22:11:31: [<ffffffffa069374b>] ? ptlrpcd+0x20b/0x370 [ptlrpc]
22:11:32: [<ffffffff81063410>] ? default_wake_function+0x0/0x20
22:11:33: [<ffffffffa0693540>] ? ptlrpcd+0x0/0x370 [ptlrpc]
22:11:33: [<ffffffff81096a36>] ? kthread+0x96/0xa0
22:11:33: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
22:11:33: [<ffffffff810969a0>] ? kthread+0x0/0xa0
22:11:33: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
22:11:33:Mem-Info:
22:11:33:Node 0 DMA per-cpu:
22:11:34:CPU    0: hi:    0, btch:   1 usd:   0
22:11:34:Node 0 DMA32 per-cpu:
22:11:35:CPU    0: hi:  186, btch:  31 usd:  33
22:11:35:active_anon:6072 inactive_anon:6123 isolated_anon:0
22:11:35: active_file:29973 inactive_file:288526 isolated_file:0
22:11:35: unevictable:0 dirty:36357 writeback:16384 unstable:0
22:11:35: free:16750 slab_reclaimable:9827 slab_unreclaimable:110251
22:11:36: mapped:3569 shmem:38 pagetables:1027 bounce:0
22:11:36:Node 0 DMA free:8272kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:5748kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15324kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:40kB slab_unreclaimable:1672kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
22:11:36:lowmem_reserve[]: 0 2003 2003 2003
22:11:37:Node 0 DMA32 free:58728kB min:44720kB low:55900kB high:67080kB active_anon:24288kB inactive_anon:24492kB active_file:119892kB inactive_file:1148356kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052064kB mlocked:0kB dirty:145428kB writeback:65536kB mapped:14276kB shmem:152kB slab_reclaimable:39268kB slab_unreclaimable:439332kB kernel_stack:1648kB pagetables:4108kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
22:11:38:lowmem_reserve[]: 0 0 0 0
22:11:38:Node 0 DMA: 2*4kB 3*8kB 1*16kB 1*32kB 4*64kB 4*128kB 5*256kB 2*512kB 1*1024kB 2*2048kB 0*4096kB = 8272kB
22:11:38:Node 0 DMA32: 13044*4kB 25*8kB 11*16kB 3*32kB 3*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 58728kB
22:11:38:318554 total pagecache pages
22:11:38:0 pages in swap cache
22:11:38:Swap cache stats: add 0, delete 0, find 0/0
22:11:38:Free swap  = 4128760kB
22:11:39:Total swap = 4128760kB
22:11:39:524284 pages RAM
22:11:40:43669 pages reserved
22:11:40:342147 pages shared
22:11:41:154367 pages non-shared
22:11:41:LNetError: 2459:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: out of memory at /var/lib/jenkins/workspace/lustre-master/arch/x86_64/build_type/client/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-2.4.93/lnet/include/lnet/lib-lnet.h:457 (tried to alloc '(md)' = 4208)
22:11:41:LNetError: 2459:0:(lib-lnet.h:457:lnet_md_alloc()) LNET: 10962695 total bytes allocated by lnet
22:11:41:LustreError: 2459:0:(niobuf.c:376:ptlrpc_register_bulk()) lustre-OST0004-osc-ffff88005dc12000: LNetMDAttach failed x1447210203467200/0: rc = -12
Comment by Sarah Liu [ 01/Oct/13 ]

A similar error occurred on a SLES11 SP2 client:

https://maloo.whamcloud.com/test_sets/0194542a-2915-11e3-b598-52540035b04c

Comment by Jian Yu [ 09/Jan/14 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_5/9/
Distro/Arch: RHEL6.4/x86_64(server), SLES11SP3/x86_64(client)

The same failure occurred:
https://maloo.whamcloud.com/test_sets/2b6e3350-7826-11e3-9a46-52540035b04c

Comment by Jian Yu [ 07/Feb/14 ]

Another instance on the Lustre b2_5 branch:
https://maloo.whamcloud.com/test_sets/4bc449ba-8fe5-11e3-8d88-52540035b04c

Comment by Andreas Dilger [ 13/Feb/14 ]

These instances also show "mode:0x40" when the allocation fails, matching LU-4357.

Generated at Sat Feb 10 01:35:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.