[LU-3366] Test failure obdfilter-survey, subtest test_1c: oom-killer Created: 20/May/13  Updated: 01/Jun/17  Resolved: 01/Jun/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2, Lustre 2.5.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-5773 obdfilter-survey test 1c: oom occurre... Resolved
Severity: 3
Rank (Obsolete): 8329

 Description   

This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4ac8e14a-bf36-11e2-88e0-52540035b04c.

The sub-test test_1c failed with the following error:

test failed to respond and timed out

I see the following several times in the OSS console log:

06:25:50:lctl invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0

OSS stack trace looks like:

08:48:13: [<ffffffff81195731>] sys_ioctl+0x81/0xa0
08:48:13: [<ffffffff810dc645>] ? __audit_syscall_exit+0x265/0x290
08:48:13: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
08:48:13:INFO: task lctl:11345 blocked for more than 120 seconds.
08:48:13:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
08:48:13:lctl          D 0000000000000000     0 11345  11038 0x00000080
08:48:13: ffff880031179588 0000000000000082 ffff8800ffffffff 000088f8cde20d22
08:48:13: ffff88007a65b380 ffff880073cded70 0000000001306642 ffffffffae6a97e1
08:48:13: ffff880053b15ab8 ffff880031179fd8 000000000000fb88 ffff880053b15ab8
08:48:13:Call Trace:
08:48:13: [<ffffffff810a1ac9>] ? ktime_get_ts+0xa9/0xe0
08:48:13: [<ffffffff8150e723>] io_schedule+0x73/0xc0
08:48:13: [<ffffffff8125ea08>] get_request_wait+0x108/0x1d0
08:48:13: [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
08:48:13: [<ffffffff81255c8b>] ? elv_merge+0x1cb/0x200
08:48:13: [<ffffffff8125eb4d>] blk_queue_bio+0x7d/0x5a0
08:48:13: [<ffffffff8125d1fe>] generic_make_request+0x26e/0x550
08:48:13: [<ffffffff8111c713>] ? mempool_alloc+0x63/0x140
08:48:13: [<ffffffff8125d56d>] submit_bio+0x8d/0x120
08:48:13: [<ffffffffa103e39e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
08:48:13: [<ffffffffa168adac>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
08:48:13: [<ffffffffa168b1cc>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs]
08:48:13: [<ffffffffa001702c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
08:48:13: [<ffffffffa00172d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
08:48:13: [<ffffffffa168d788>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs]
08:48:13: [<ffffffffa0494a43>] ofd_preprw_read+0x253/0x7f0 [ofd]
08:48:13: [<ffffffffa049574a>] ofd_preprw+0x76a/0x13c0 [ofd]
08:48:13: [<ffffffffa05734eb>] echo_client_iocontrol+0x207b/0x3bd0 [obdecho]
08:48:13: [<ffffffff81143767>] ? handle_pte_fault+0xf7/0xb50
08:48:13: [<ffffffffa103247f>] class_handle_ioctl+0x12cf/0x1e90 [obdclass]
08:48:13: [<ffffffffa101a2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
08:48:13: [<ffffffff81195012>] vfs_ioctl+0x22/0xa0
08:48:13: [<ffffffff8103c7b8>] ? pvclock_clocksource_read+0x58/0xd0
08:48:13: [<ffffffff811951b4>] do_vfs_ioctl+0x84/0x580
08:48:13: [<ffffffff81195731>] sys_ioctl+0x81/0xa0
08:48:13: [<ffffffff810dc645>] ? __audit_syscall_exit+0x265/0x290
08:48:13: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
08:48:13:
08:48:13:<ConMan> Console [wtm-14vm8] disconnected from <wtm-14:6007> at 05-17 08:47.

Info required for matching: obdfilter-survey 1c



 Comments   
Comment by Jian Yu [ 21/May/13 ]

Lustre Branch: master (tag 2.4.50)

Another instance: https://maloo.whamcloud.com/test_sets/d369af0a-c154-11e2-8769-52540035b04c

Comment by Oleg Drokin [ 21/May/13 ]

It seems the run is on less than 2G RAM on all nodes. We should not OOM, but on the other hand it probably does not make much sense to run a performance benchmark in a small-memory setup, so it might be worth disabling the test on low memory.

In addition, somebody should dig into what it was that used all the RAM on the OSS.
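A low-memory guard along these lines could be added before the survey runs. This is only a minimal sketch: the 2 GiB threshold matches the "less than 2G RAM" observation above, but the function name, variable names, and skip message are illustrative assumptions, not Lustre test-framework.sh's real skip helpers.

```shell
# Hypothetical low-memory guard for obdfilter-survey test_1c.
# Threshold and names are illustrative; Lustre's test-framework.sh
# has its own skip conventions that this only approximates.
MIN_MEM_KB=$((2 * 1024 * 1024))   # 2 GiB, in kB as /proc/meminfo reports

# Succeed (return 0) when the node has enough RAM for the survey.
enough_memory_for_survey() {
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    [ "$mem_kb" -ge "$MIN_MEM_KB" ]
}

if enough_memory_for_survey; then
    echo "running test_1c (${mem_kb} kB RAM)"
else
    echo "SKIP test_1c: only ${mem_kb} kB RAM, need ${MIN_MEM_KB} kB"
fi
```

A real fix would hook this into the test's setup so the subtest is reported as skipped rather than silently bypassed.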

Comment by Jian Yu [ 17/Jan/14 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_5/13/
Lustre server build: http://build.whamcloud.com/job/lustre-b2_4/70/ (2.4.2)

The same issue occurred while running obdfilter-survey test 1c:
https://maloo.whamcloud.com/test_sets/cc27c530-7ed5-11e3-8a9b-52540035b04c

Console log on OSS:

Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch == 23:34:40 (1389857680)
Lustre: DEBUG MARKER: lctl dl | grep obdfilter
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
Lustre: Echo OBD driver; http://www.lustre.org/
INFO: task lctl:23681 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl          D 0000000000000001     0 23681  23655 0x00000080
 ffff880046571588 0000000000000086 0000000000000000 00001e43189c6042
 ffff88007d773e00 ffff880076e4cad0 0000000000651b0a ffffffffad288d24
 ffff880051459098 ffff880046571fd8 000000000000fb88 ffff880051459098
Call Trace:
 [<ffffffff8150ed93>] io_schedule+0x73/0xc0
 [<ffffffff8125ed08>] get_request_wait+0x108/0x1d0
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8125ee6b>] blk_queue_bio+0x9b/0x5d0
 [<ffffffff8125d51e>] generic_make_request+0x25e/0x520
 [<ffffffff8111c763>] ? mempool_alloc+0x63/0x140
 [<ffffffff8125d86d>] submit_bio+0x8d/0x120
 [<ffffffffa05c041e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
 [<ffffffffa0ce2ffc>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
 [<ffffffffa0ce341c>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs]
 [<ffffffffa0d4702c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
 [<ffffffffa0d472d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
 [<ffffffffa0ce59d8>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs]
 [<ffffffffa0db5bd3>] ofd_preprw_read+0x253/0x7f0 [ofd]
 [<ffffffffa0db68ea>] ofd_preprw+0x77a/0x1480 [ofd]
 [<ffffffffa06ec473>] echo_client_iocontrol+0x2003/0x3b40 [obdecho]
 [<ffffffff81281876>] ? vsnprintf+0x336/0x5e0
 [<ffffffffa05b449f>] class_handle_ioctl+0x12ff/0x1ec0 [obdclass]
 [<ffffffffa059c2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
 [<ffffffff81195382>] vfs_ioctl+0x22/0xa0
 [<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
 [<ffffffff81195524>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81195aa1>] sys_ioctl+0x81/0xa0
 [<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Generated at Sat Feb 10 01:33:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.