[LU-3851] Single-client scale test: ptlrpcd page allocation failure Created: 28/Aug/13  Updated: 02/Jun/14  Resolved: 13/Feb/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Hyperion/LLNL


Issue Links:
Duplicate
duplicates LU-4357 page allocation failure. mode:0x40 ca... Resolved
Severity: 3
Rank (Obsolete): 9971

 Description   

Running iorfpp while scaling from 1 to 64 clients produced page allocation failures:

Aug 27 15:40:56 iwc1 kernel: ptlrpcd_8: page allocation failure. order:1, mode:0x40
Aug 27 15:40:56 iwc1 kernel: ptlrpcd_10: page allocation failure. order:1, mode:0x40
Aug 27 15:40:56 iwc1 kernel: Pid: 5769, comm: ptlrpcd_10 Not tainted 2.6.32-358.14.1.el6.x86_64 #1
Aug 27 15:40:56 iwc1 kernel: Call Trace:
Aug 27 15:40:56 iwc1 kernel: [<ffffffff8112c197>] ? __alloc_pages_nodemask+0x757/0x8d0
Aug 27 15:40:56 iwc1 kernel: [<ffffffff81166b42>] ? kmem_getpages+0x62/0x170
Aug 27 15:40:56 iwc1 kernel: [<ffffffff8116775a>] ? fallback_alloc+0x1ba/0x270
Aug 27 15:40:56 iwc1 kernel: [<ffffffff811671af>] ? cache_grow+0x2cf/0x320
Aug 27 15:40:56 iwc1 kernel: [<ffffffff811674d9>] ? ____cache_alloc_node+0x99/0x160
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0842a98>] ? ptlrpc_new_bulk+0x48/0x270 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffff811682a9>] ? __kmalloc+0x189/0x220
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0842a98>] ? ptlrpc_new_bulk+0x48/0x270 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0842d18>] ? ptlrpc_prep_bulk_imp+0x58/0x190 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa09f0084>] ? osc_brw_prep_request+0x294/0x11e0 [osc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0a0446c>] ? osc_req_attr_set+0x16c/0x5b0 [osc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0717941>] ? cl_req_attr_set+0xd1/0x230 [obdclass]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa09f6210>] ? osc_build_rpc+0x870/0x1840 [osc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0a1052d>] ? osc_io_unplug0+0x16ad/0x1f20 [osc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffff812811d6>] ? vsnprintf+0x336/0x5e0
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0a12aa1>] ? osc_io_unplug+0x11/0x20 [osc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa09e4286>] ? brw_queue_work+0x36/0xd0 [osc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa083be77>] ? work_interpreter+0x27/0x90 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0844c5c>] ? ptlrpc_check_set+0x2ac/0x1b20 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa087006b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa08704f0>] ? ptlrpcd+0x190/0x380 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffff81063330>] ? default_wake_function+0x0/0x20
Aug 27 15:40:56 iwc1 kernel: [<ffffffffa0870360>] ? ptlrpcd+0x0/0x380 [ptlrpc]
Aug 27 15:40:56 iwc1 kernel: [<ffffffff81096956>] ? kthread+0x96/0xa0
Aug 27 15:40:56 iwc1 kernel: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Aug 27 15:40:56 iwc1 kernel: [<ffffffff810968c0>] ? kthread+0x0/0xa0
Aug 27 15:40:56 iwc1 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Aug 27 15:40:56 iwc1 kernel: Mem-Info:
Aug 27 15:40:56 iwc1 kernel: Node 0 DMA per-cpu:
Aug 27 15:40:56 iwc1 kernel: CPU    0: hi:    0, btch:   1 usd:   0
Aug 27 15:40:56 iwc1 kernel: CPU    1: hi:    0, btch:   1 usd:   0
Aug 27 15:40:56 iwc1 kernel: CPU    2: hi:    0, btch:   1 usd:   0
Aug 27 15:40:56 iwc1 kernel: CPU    3: hi:    0, btch:   1 usd:   0
Aug 27 15:40:56 iwc1 kernel: CPU    4: hi:    0, btch:   1 usd:   0
Aug 27 15:40:56 iwc1 kernel: CPU    5: hi:    0, btch:   1 usd:   0



 Comments   
Comment by Keith Mannthey (Inactive) [ 28/Aug/13 ]

Cliff, can you attach the whole console log? These errors are OK in themselves, but it sounded like there was a crash soon after; is that right? I didn't see a vmcore from today for iwc1.

Comment by Andreas Dilger [ 04/Sep/13 ]

Please also attach the contents of /proc/slabinfo and /proc/meminfo.

Comment by Andreas Dilger [ 13/Feb/14 ]

This will be fixed in LU-4357. It is definitely the same problem, since it is showing the "0x40" allocation flag, which is only __GFP_IO.

Comment by Andreas Dilger [ 13/Feb/14 ]

Shows mode:0x40 == __GFP_IO only; __GFP_WAIT is missing, as in LU-4357.

Generated at Sat Feb 10 01:37:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.