[LU-3366] Test failure obdfilter-survey, subtest test_1c: oom-killer Created: 20/May/13 Updated: 01/Jun/17 Resolved: 01/Jun/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.2, Lustre 2.5.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8329 |
| Description |
|
This issue was created by maloo for James Nunez <james.a.nunez@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/4ac8e14a-bf36-11e2-88e0-52540035b04c.

The sub-test test_1c failed with the following error:
I see the following several times in the OSS console log:

06:25:50:lctl invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0

OSS stack trace looks like:

08:48:13: [<ffffffff81195731>] sys_ioctl+0x81/0xa0
08:48:13: [<ffffffff810dc645>] ? __audit_syscall_exit+0x265/0x290
08:48:13: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
08:48:13:INFO: task lctl:11345 blocked for more than 120 seconds.
08:48:13:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
08:48:13:lctl D 0000000000000000 0 11345 11038 0x00000080
08:48:13: ffff880031179588 0000000000000082 ffff8800ffffffff 000088f8cde20d22
08:48:13: ffff88007a65b380 ffff880073cded70 0000000001306642 ffffffffae6a97e1
08:48:13: ffff880053b15ab8 ffff880031179fd8 000000000000fb88 ffff880053b15ab8
08:48:13:Call Trace:
08:48:13: [<ffffffff810a1ac9>] ? ktime_get_ts+0xa9/0xe0
08:48:13: [<ffffffff8150e723>] io_schedule+0x73/0xc0
08:48:13: [<ffffffff8125ea08>] get_request_wait+0x108/0x1d0
08:48:13: [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
08:48:13: [<ffffffff81255c8b>] ? elv_merge+0x1cb/0x200
08:48:13: [<ffffffff8125eb4d>] blk_queue_bio+0x7d/0x5a0
08:48:13: [<ffffffff8125d1fe>] generic_make_request+0x26e/0x550
08:48:13: [<ffffffff8111c713>] ? mempool_alloc+0x63/0x140
08:48:13: [<ffffffff8125d56d>] submit_bio+0x8d/0x120
08:48:13: [<ffffffffa103e39e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
08:48:13: [<ffffffffa168adac>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
08:48:13: [<ffffffffa168b1cc>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs]
08:48:13: [<ffffffffa001702c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
08:48:13: [<ffffffffa00172d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
08:48:13: [<ffffffffa168d788>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs]
08:48:13: [<ffffffffa0494a43>] ofd_preprw_read+0x253/0x7f0 [ofd]
08:48:13: [<ffffffffa049574a>] ofd_preprw+0x76a/0x13c0 [ofd]
08:48:13: [<ffffffffa05734eb>] echo_client_iocontrol+0x207b/0x3bd0 [obdecho]
08:48:13: [<ffffffff81143767>] ? handle_pte_fault+0xf7/0xb50
08:48:13: [<ffffffffa103247f>] class_handle_ioctl+0x12cf/0x1e90 [obdclass]
08:48:13: [<ffffffffa101a2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
08:48:13: [<ffffffff81195012>] vfs_ioctl+0x22/0xa0
08:48:13: [<ffffffff8103c7b8>] ? pvclock_clocksource_read+0x58/0xd0
08:48:13: [<ffffffff811951b4>] do_vfs_ioctl+0x84/0x580
08:48:13: [<ffffffff81195731>] sys_ioctl+0x81/0xa0
08:48:13: [<ffffffff810dc645>] ? __audit_syscall_exit+0x265/0x290
08:48:13: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
08:48:13:
08:48:13:<ConMan> Console [wtm-14vm8] disconnected from <wtm-14:6007> at 05-17 08:47.

Info required for matching: obdfilter-survey 1c |
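For anyone trying to reproduce this outside autotest, a minimal rerun of just this subtest might look like the following. This is only a sketch: it assumes the standard lustre/tests harness where ONLY= selects individual subtests, and oss1 is a placeholder for the OSS hostname.

# From a configured lustre/tests checkout, rerun only subtest 1c.
# ONLY= is the usual test-framework.sh subtest selector.
cd lustre/tests
ONLY=1c sh obdfilter-survey.sh

# In parallel, watch the OSS system log for the oom-killer and
# hung-task messages quoted above (oss1 is a placeholder hostname).
ssh oss1 "tail -f /var/log/messages" | \
	grep -i -e 'oom-killer' -e 'blocked for more than'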
| Comments |
| Comment by Jian Yu [ 21/May/13 ] |
|
Lustre Branch: master (tag 2.4.50)

Another instance: https://maloo.whamcloud.com/test_sets/d369af0a-c154-11e2-8769-52540035b04c |
| Comment by Oleg Drokin [ 21/May/13 ] |
|
So it seems the run is on less than 2 GB of RAM on all nodes. We should not OOM, but on the other hand it probably does not make much sense to run a performance benchmark in a small-memory setup, so it might be worth disabling the test on low memory. In addition, somebody should dig into what it was that used all the RAM on the OSS. |
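A guard along these lines could implement that skip. This is only a sketch, not an actual fix: it assumes the test-framework.sh helpers do_node, facet_active_host, and skip_env behave as in the standard lustre/tests harness, and the 2 GB threshold is purely illustrative.

# Sketch: skip the big-batch survey when the OSS node is short on RAM.
# MIN_MEM_KB (2 GB here) is an arbitrary illustrative threshold.
MIN_MEM_KB=$((2 * 1024 * 1024))

# Read MemTotal (in kB) from the active OSS for the first OST facet.
oss_mem_kb=$(do_node $(facet_active_host ost1) \
	"awk '/^MemTotal:/ { print \$2 }' /proc/meminfo")

if [ "$oss_mem_kb" -lt "$MIN_MEM_KB" ]; then
	skip_env "OSS has only ${oss_mem_kb} kB RAM, too little for test 1c"
	return 0
fi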
| Comment by Jian Yu [ 17/Jan/14 ] |
|
Lustre client build: http://build.whamcloud.com/job/lustre-b2_5/13/

The same issue occurred while running obdfilter-survey test 1c. Console log on OSS:

Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch == 23:34:40 (1389857680)
Lustre: DEBUG MARKER: lctl dl | grep obdfilter
Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
Lustre: Echo OBD driver; http://www.lustre.org/
INFO: task lctl:23681 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
lctl D 0000000000000001 0 23681 23655 0x00000080
ffff880046571588 0000000000000086 0000000000000000 00001e43189c6042
ffff88007d773e00 ffff880076e4cad0 0000000000651b0a ffffffffad288d24
ffff880051459098 ffff880046571fd8 000000000000fb88 ffff880051459098
Call Trace:
[<ffffffff8150ed93>] io_schedule+0x73/0xc0
[<ffffffff8125ed08>] get_request_wait+0x108/0x1d0
[<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8125ee6b>] blk_queue_bio+0x9b/0x5d0
[<ffffffff8125d51e>] generic_make_request+0x25e/0x520
[<ffffffff8111c763>] ? mempool_alloc+0x63/0x140
[<ffffffff8125d86d>] submit_bio+0x8d/0x120
[<ffffffffa05c041e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
[<ffffffffa0ce2ffc>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
[<ffffffffa0ce341c>] osd_do_bio+0x3dc/0x800 [osd_ldiskfs]
[<ffffffffa0d4702c>] ? fsfilt_map_nblocks+0xcc/0xf0 [fsfilt_ldiskfs]
[<ffffffffa0d472d5>] ? fsfilt_ldiskfs_map_inode_pages+0x85/0x90 [fsfilt_ldiskfs]
[<ffffffffa0ce59d8>] osd_read_prep+0x338/0x3b0 [osd_ldiskfs]
[<ffffffffa0db5bd3>] ofd_preprw_read+0x253/0x7f0 [ofd]
[<ffffffffa0db68ea>] ofd_preprw+0x77a/0x1480 [ofd]
[<ffffffffa06ec473>] echo_client_iocontrol+0x2003/0x3b40 [obdecho]
[<ffffffff81281876>] ? vsnprintf+0x336/0x5e0
[<ffffffffa05b449f>] class_handle_ioctl+0x12ff/0x1ec0 [obdclass]
[<ffffffffa059c2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
[<ffffffff81195382>] vfs_ioctl+0x22/0xa0
[<ffffffff8103c7d8>] ? pvclock_clocksource_read+0x58/0xd0
[<ffffffff81195524>] do_vfs_ioctl+0x84/0x580
[<ffffffff81195aa1>] sys_ioctl+0x81/0xa0
[<ffffffff810dc685>] ? __audit_syscall_exit+0x265/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b |
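To follow up on the open question of what consumed the RAM, a crude sampler like the one below could be left running on the OSS during the survey. This is only a sketch: the log path is arbitrary, and reading /proc/slabinfo requires root.

# Sketch: sample memory and slab usage on the OSS once per second so
# the state just before the OOM is captured in /tmp/oss-mem.log.
while true; do
	date
	grep -e '^MemFree' -e '^Buffers' -e '^Cached' -e '^Slab' /proc/meminfo
	# Largest slab caches by active object count (column 2 of
	# /proc/slabinfo); header lines sort to the bottom and drop out.
	sort -rn -k2 /proc/slabinfo | head -10
	sleep 1
done >> /tmp/oss-mem.log 2>&1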