[LU-5773] obdfilter-survey test 1c: oom occurred on OSS Created: 20/Oct/14 Updated: 12/May/16 Resolved: 16/Jul/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0, Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jian Yu | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 22pl, MB, mq115 | ||
| Environment: |
Lustre build: https://build.hpdd.intel.com/job/lustre-master/2684 |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 16206 |
| Description |
|
While running obdfilter-survey test 1c, oom failure occurred on OSS:

21:17:56:Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch == 02:50:56 (1412823056)
21:17:56:Lustre: DEBUG MARKER: lctl dl | grep obdfilter
21:17:56:Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
21:17:56:Lustre: Echo OBD driver; http://www.lustre.org/
21:17:56:hrtimer: interrupt took 7516 ns
21:17:56:lctl invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
21:17:56:lctl cpuset=/ mems_allowed=0
21:17:56:Pid: 19467, comm: lctl Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
21:17:56:Call Trace:
21:17:56: [<ffffffff810d07b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
21:17:56: [<ffffffff81122b80>] ? dump_header+0x90/0x1b0
21:17:56: [<ffffffff81122cee>] ? check_panic_on_oom+0x4e/0x80
21:17:56: [<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
21:17:56: [<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
21:17:56: [<ffffffff81167dea>] ? alloc_pages_vma+0x9a/0x150
21:17:56: [<ffffffff811499dd>] ? do_wp_page+0xfd/0x920
21:17:56: [<ffffffff8133e4f5>] ? misc_open+0x1d5/0x330
21:17:56: [<ffffffff8114a9fd>] ? handle_pte_fault+0x2cd/0xb00
21:17:56: [<ffffffff8118d495>] ? chrdev_open+0x125/0x230
21:17:56: [<ffffffff811ab840>] ? mntput_no_expire+0x30/0x110
21:17:56: [<ffffffff8118d370>] ? chrdev_open+0x0/0x230
21:17:56: [<ffffffff811863bf>] ? __dentry_open+0x23f/0x360
21:17:56: [<ffffffff812284ef>] ? security_inode_permission+0x1f/0x30
21:17:56: [<ffffffff8114b45a>] ? handle_mm_fault+0x22a/0x300
21:17:56: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
21:17:56: [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
21:17:56: [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
21:17:56: [<ffffffff8152c615>] ? page_fault+0x25/0x30

Maloo report: https://testing.hpdd.intel.com/test_sets/973e0216-4fcd-11e4-8e65-5254006e85c2 |
| Comments |
| Comment by Jian Yu [ 20/Oct/14 ] |
|
Another instance on the master branch: https://testing.hpdd.intel.com/test_sets/ce9211ca-4bb9-11e4-b821-5254006e85c2 |
| Comment by Andreas Dilger [ 21/Oct/14 ] |
|
The patch to fix this was landed under |
| Comment by Jian Yu [ 11/Nov/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/100/

The same oom failure occurred on OSS while running obdfilter-survey test 1c:

automount invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
automount cpuset=/ mems_allowed=0
Pid: 10749, comm: automount Not tainted 2.6.32-431.29.2.el6_lustre.gf7f3864.x86_64 #1
Call Trace:
[<ffffffff810d07b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[<ffffffff81122b80>] ? dump_header+0x90/0x1b0
[<ffffffff81122cee>] ? check_panic_on_oom+0x4e/0x80
[<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
[<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
[<ffffffff81167dea>] ? alloc_pages_vma+0x9a/0x150
[<ffffffff8114ae6d>] ? handle_pte_fault+0x73d/0xb00
[<ffffffff811f4570>] ? proc_delete_inode+0x0/0x80
[<ffffffff8128b96e>] ? number+0x2ee/0x320
[<ffffffff8114b45a>] ? handle_mm_fault+0x22a/0x300
[<ffffffff8128b96e>] ? number+0x2ee/0x320
[<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
[<ffffffff8128c210>] ? string+0x40/0x100
[<ffffffff8128d776>] ? vsnprintf+0x336/0x5e0
[<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
[<ffffffff8152c615>] ? page_fault+0x25/0x30
[<ffffffff8128e71e>] ? copy_user_generic_unrolled+0x3e/0xb0
[<ffffffff811aedb2>] ? seq_read+0x2d2/0x400
[<ffffffff81189a95>] ? vfs_read+0xb5/0x1a0
[<ffffffff81189bd1>] ? sys_read+0x51/0x90
[<ffffffff810e204e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Maloo report: https://testing.hpdd.intel.com/test_sets/4bf3d7bc-6874-11e4-acbe-5254006e85c2

The patch for |
| Comment by Jian Yu [ 13/Nov/14 ] |
|
Another instance on the master branch: |
| Comment by Jian Yu [ 02/Dec/14 ] |
|
More instances on the Lustre b2_5 branch: |
| Comment by Jian Yu [ 08/Dec/14 ] |
|
Another instance on the master branch: |
| Comment by Andreas Dilger [ 09/Dec/14 ] |
|
I found something strange in the OST logs - hundreds of lctl processes are running on the node, like it is a fork bomb:

11:04:49:[28448] 0 28448 3820 230 0 0 0 lctl
11:04:49:[28449] 0 28449 3820 229 0 0 0 lctl
11:04:49:[28450] 0 28450 3820 228 1 0 0 lctl
11:04:49:[28451] 0 28451 3820 230 0 0 0 lctl
11:04:49:[28452] 0 28452 3820 229 1 0 0 lctl
[repeats] |
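One way to quantify an observation like this is to tally the OOM killer's task dump by command name. The following is a minimal sketch, assuming the dump lines look like the excerpt in this comment (an optional console timestamp, `[pid]`, numeric columns, and the command name as the last field, with rss in pages as the fifth field from the end). The function name `count_oom_tasks` is hypothetical and not part of Lustre.

```shell
# Hypothetical helper: tally OOM-killer task-dump lines by command name.
# Reads a console log on stdin and prints "<count> <total_rss_pages> <name>",
# with the most numerous command first.
count_oom_tasks() {
	awk '
	# Task-dump lines contain "[pid]" and end with:
	#   rss cpu oom_adj oom_score_adj name
	# Call-trace lines such as "[<ffffffff...>]" do not match the pid pattern.
	/\[ *[0-9]+\]/ && NF >= 8 && $(NF-4) ~ /^[0-9]+$/ {
		count[$NF]++
		rss[$NF] += $(NF-4)
	}
	END {
		for (name in count)
			printf "%d %d %s\n", count[name], rss[name], name
	}' | sort -rn
}
```

A typical invocation would be `count_oom_tasks < oss-console.log | head`, where `oss-console.log` is a placeholder for a saved OSS console log.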
| Comment by Jodi Levi (Inactive) [ 09/Dec/14 ] |
|
Niu, |
| Comment by Andreas Dilger [ 09/Dec/14 ] |
|
It appears that http://review.whamcloud.com/11971, which changed the obdfilter-survey script, landed on Oct 5th, and this bug was filed on Oct 20th. |
| Comment by James Nunez (Inactive) [ 09/Dec/14 ] |
|
Patch http://review.whamcloud.com/#/c/11971/ modified the ost-survey script, not obdfilter-survey. Is there still a connection between this ticket and the 11971 patch? |
| Comment by Niu Yawei (Inactive) [ 10/Dec/14 ] |
|
Looks like a duplicate of |
| Comment by Jian Yu [ 10/Dec/14 ] |
|
Another instance on the Lustre b2_5 branch: |
| Comment by Jodi Levi (Inactive) [ 12/Dec/14 ] |
|
Do we need to back port http://review.whamcloud.com/#/c/11971/ to other branches? |
| Comment by Niu Yawei (Inactive) [ 15/Dec/14 ] |
|
Jodi, it seems to me that patch isn't related to this failure. |
| Comment by Niu Yawei (Inactive) [ 15/Dec/14 ] |
|
Do we still have a real machine with more memory in our auto-test system? I presume this failure wouldn't occur on such a system. Perhaps we should reduce the OST/thread count for test_1c so that it can run on the 2 GB VMs? |
| Comment by Andreas Dilger [ 15/Dec/14 ] |
|
Niu or Yu Jian, could you please look into a patch to change obdfilter-survey to reduce the thread count when running in a low-memory VM, so it doesn't hit this OOM? We still want to run this test during autotest to make sure that the test script doesn't break, but it only needs to run basic functionality/stress tests since the performance numbers from a VM are useless. |
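The kind of change requested here could look roughly like the following. This is a minimal sketch under stated assumptions, not the patch that was eventually landed: the helper name `cap_survey_threads`, the 4 GB threshold, and the cap of 8 threads are all hypothetical values chosen for illustration.

```shell
# Hypothetical helper: choose an upper thread limit based on total RAM.
# $1 = requested thread count
# $2 = total memory in kB (optional; defaults to MemTotal from /proc/meminfo)
cap_survey_threads() {
	local requested=$1
	local mem_kb=${2:-$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)}
	local low_mem_kb=$((4 * 1024 * 1024))	# assumed threshold: 4 GB
	local low_mem_cap=8			# assumed cap for small VMs

	# Only lower the count; never raise it above what was requested.
	if [ "$mem_kb" -lt "$low_mem_kb" ] && [ "$requested" -gt "$low_mem_cap" ]; then
		echo "$low_mem_cap"
	else
		echo "$requested"
	fi
}
```

A caller might apply it as `thrhi=$(cap_survey_threads "$thrhi")` before invoking the survey, assuming `thrhi` is the variable that carries the maximum thread count. The test still runs on small VMs and exercises the script, but with a lighter load.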
| Comment by Niu Yawei (Inactive) [ 16/Dec/14 ] |
|
Ok, I'm going to cook a patch soon. |
| Comment by Niu Yawei (Inactive) [ 16/Dec/14 ] |
| Comment by Gerrit Updater [ 17/Dec/14 ] |
|
Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13101 |
| Comment by Gerrit Updater [ 21/Dec/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13078/ |