[LU-5773] obdfilter-survey test 1c: oom occurred on OSS Created: 20/Oct/14  Updated: 12/May/16  Resolved: 16/Jul/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0, Lustre 2.5.3
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Jian Yu Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: 22pl, MB, mq115
Environment:

Lustre build: https://build.hpdd.intel.com/job/lustre-master/2684
Distro/Arch: RHEL6.5/x86_64


Issue Links:
Duplicate
duplicates LU-5079 conf-sanity test_47 timeout Resolved
duplicates LU-3366 Test failure obdfilter-survey, subtes... Resolved
is duplicated by LU-6004 obdfilter-survey test_2a: obdfilter i... Resolved
is duplicated by LU-5920 obdfilter-survey test_1c: OST OOM Closed
Related
is related to LU-4768 ost-survey hangs on client 2.4 Resolved
is related to LU-6064 obdfilter-survey test_1c: test failed... Resolved
Severity: 3
Rank (Obsolete): 16206

 Description   

While running obdfilter-survey test 1c, an OOM failure occurred on the OSS:

21:17:56:Lustre: DEBUG MARKER: == obdfilter-survey test 1c: Object Storage Targets survey, big batch == 02:50:56 (1412823056)
21:17:56:Lustre: DEBUG MARKER: lctl dl | grep obdfilter
21:17:56:Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
21:17:56:Lustre: Echo OBD driver; http://www.lustre.org/
21:17:56:hrtimer: interrupt took 7516 ns
21:17:56:lctl invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
21:17:56:lctl cpuset=/ mems_allowed=0
21:17:56:Pid: 19467, comm: lctl Not tainted 2.6.32-431.29.2.el6_lustre.x86_64 #1
21:17:56:Call Trace:
21:17:56: [<ffffffff810d07b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
21:17:56: [<ffffffff81122b80>] ? dump_header+0x90/0x1b0
21:17:56: [<ffffffff81122cee>] ? check_panic_on_oom+0x4e/0x80
21:17:56: [<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
21:17:56: [<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
21:17:56: [<ffffffff81167dea>] ? alloc_pages_vma+0x9a/0x150
21:17:56: [<ffffffff811499dd>] ? do_wp_page+0xfd/0x920
21:17:56: [<ffffffff8133e4f5>] ? misc_open+0x1d5/0x330
21:17:56: [<ffffffff8114a9fd>] ? handle_pte_fault+0x2cd/0xb00
21:17:56: [<ffffffff8118d495>] ? chrdev_open+0x125/0x230
21:17:56: [<ffffffff811ab840>] ? mntput_no_expire+0x30/0x110
21:17:56: [<ffffffff8118d370>] ? chrdev_open+0x0/0x230
21:17:56: [<ffffffff811863bf>] ? __dentry_open+0x23f/0x360
21:17:56: [<ffffffff812284ef>] ? security_inode_permission+0x1f/0x30
21:17:56: [<ffffffff8114b45a>] ? handle_mm_fault+0x22a/0x300
21:17:56: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
21:17:56: [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
21:17:56: [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
21:17:56: [<ffffffff8152c615>] ? page_fault+0x25/0x30

Maloo report: https://testing.hpdd.intel.com/test_sets/973e0216-4fcd-11e4-8e65-5254006e85c2



 Comments   
Comment by Jian Yu [ 20/Oct/14 ]

Another instance on the master branch: https://testing.hpdd.intel.com/test_sets/ce9211ca-4bb9-11e4-b821-5254006e85c2

Comment by Andreas Dilger [ 21/Oct/14 ]

The patch to fix this was landed under LU-5079.

Comment by Jian Yu [ 11/Nov/14 ]

Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/100/
Distro/Arch: RHEL6.5/x86_64 + SLES11SP3/x86_64 (Server + Client)

The same OOM failure occurred on the OSS while running obdfilter-survey test 1c:

automount invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
automount cpuset=/ mems_allowed=0
Pid: 10749, comm: automount Not tainted 2.6.32-431.29.2.el6_lustre.gf7f3864.x86_64 #1
Call Trace:
 [<ffffffff810d07b1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
 [<ffffffff81122b80>] ? dump_header+0x90/0x1b0
 [<ffffffff81122cee>] ? check_panic_on_oom+0x4e/0x80
 [<ffffffff811233db>] ? out_of_memory+0x1bb/0x3c0
 [<ffffffff8112fd5f>] ? __alloc_pages_nodemask+0x89f/0x8d0
 [<ffffffff81167dea>] ? alloc_pages_vma+0x9a/0x150
 [<ffffffff8114ae6d>] ? handle_pte_fault+0x73d/0xb00
 [<ffffffff811f4570>] ? proc_delete_inode+0x0/0x80
 [<ffffffff8128b96e>] ? number+0x2ee/0x320
 [<ffffffff8114b45a>] ? handle_mm_fault+0x22a/0x300
 [<ffffffff8128b96e>] ? number+0x2ee/0x320
 [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
 [<ffffffff8128c210>] ? string+0x40/0x100
 [<ffffffff8128d776>] ? vsnprintf+0x336/0x5e0
 [<ffffffff8152f25e>] ? do_page_fault+0x3e/0xa0
 [<ffffffff8152c615>] ? page_fault+0x25/0x30
 [<ffffffff8128e71e>] ? copy_user_generic_unrolled+0x3e/0xb0
 [<ffffffff811aedb2>] ? seq_read+0x2d2/0x400
 [<ffffffff81189a95>] ? vfs_read+0xb5/0x1a0
 [<ffffffff81189bd1>] ? sys_read+0x51/0x90
 [<ffffffff810e204e>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

Maloo report: https://testing.hpdd.intel.com/test_sets/4bf3d7bc-6874-11e4-acbe-5254006e85c2

The patch for LU-5079 was included in this build, so this is not a duplicate issue.

Comment by Jian Yu [ 13/Nov/14 ]

Another instance on the master branch:
https://testing.hpdd.intel.com/test_sets/e6ec8538-6b45-11e4-88ff-5254006e85c2

Comment by Jian Yu [ 02/Dec/14 ]

More instances on the Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/12401f30-7a4e-11e4-b9fd-5254006e85c2
https://testing.hpdd.intel.com/test_sets/393c5232-7a55-11e4-807e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ee460ed2-7980-11e4-aa22-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f55941c6-6a58-11e4-b203-5254006e85c2
https://testing.hpdd.intel.com/test_sets/87b3b698-5cdd-11e4-8561-5254006e85c2
https://testing.hpdd.intel.com/test_sets/bce60eac-4eef-11e4-872e-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6702eeee-7d55-11e4-943c-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6502c6d6-7d33-11e4-943c-5254006e85c2

Comment by Jian Yu [ 08/Dec/14 ]

Another instance on the master branch:
https://testing.hpdd.intel.com/test_sets/19e98088-7e99-11e4-ab67-5254006e85c2

Comment by Andreas Dilger [ 09/Dec/14 ]

I found something strange in the OST logs: hundreds of lctl processes were running on the node, as if it were a fork bomb:

11:04:49:[28448]     0 28448     3820      230   0       0             0 lctl
11:04:49:[28449]     0 28449     3820      229   0       0             0 lctl
11:04:49:[28450]     0 28450     3820      228   1       0             0 lctl
11:04:49:[28451]     0 28451     3820      230   0       0             0 lctl
11:04:49:[28452]     0 28452     3820      229   1       0             0 lctl
[repeats]
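For reference, a runaway-process pattern like this can be confirmed directly on the OSS. The snippet below is a hypothetical diagnostic (not from the ticket), assuming a standard procps `ps`:

```shell
# Count running lctl processes to confirm the fork-bomb pattern
# seen in the OOM killer's task dump.
count=$(ps -C lctl --no-headers 2>/dev/null | wc -l)
echo "lctl processes: $count"
```

On the failing node this would report hundreds of entries; on a healthy OSS it should be zero or close to it.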
Comment by Jodi Levi (Inactive) [ 09/Dec/14 ]

Niu,
Could you please have a look at this one?
Thank you!

Comment by Andreas Dilger [ 09/Dec/14 ]

It appears that http://review.whamcloud.com/11971 changed the obdfilter-survey script; it landed on Oct 5th, and this bug was filed on Oct 20th.

Comment by James Nunez (Inactive) [ 09/Dec/14 ]

Patch http://review.whamcloud.com/#/c/11971/ modified the ost-survey script, not obdfilter-survey. Is there still a connection between this ticket and the 11971 patch?

Comment by Niu Yawei (Inactive) [ 10/Dec/14 ]

This looks like a duplicate of LU-3366. Perhaps the VM (OSS) can't handle such a test (7 OSTs, 128 brw threads for each OST)?
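The arithmetic behind this concern can be sketched as follows, assuming roughly 1 MB of I/O buffer per in-flight brw thread (an illustrative figure, not a measured one):

```shell
# Back-of-envelope memory estimate for the test_1c configuration.
osts=7
threads_per_ost=128
mb_per_thread=1   # assumed per-thread bulk I/O buffer, for illustration
total_mb=$((osts * threads_per_ost * mb_per_thread))
echo "worst-case buffer footprint: ${total_mb} MB"
```

That is on the order of 900 MB of I/O buffers alone, which leaves little headroom on a 2 GB VM once the OS and Lustre service threads are accounted for.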

Comment by Jian Yu [ 10/Dec/14 ]

Another instance on the Lustre b2_5 branch:
https://testing.hpdd.intel.com/test_sets/dc5dabda-8074-11e4-a434-5254006e85c2

Comment by Jodi Levi (Inactive) [ 12/Dec/14 ]

Do we need to back port http://review.whamcloud.com/#/c/11971/ to other branches?

Comment by Niu Yawei (Inactive) [ 15/Dec/14 ]

Jodi, it seems to me that patch isn't related to this failure.

Comment by Niu Yawei (Inactive) [ 15/Dec/14 ]

Do we still have a real machine with more memory in our autotest system? I presume such a failure wouldn't occur there. Perhaps we should reduce the OST/thread count for test_1c to make it runnable on the 2 GB VMs?

Comment by Andreas Dilger [ 15/Dec/14 ]

Niu or Yu Jian, could you please look into a patch to change obdfilter-survey to reduce the thread count when running in a low-memory VM, so it doesn't hit this OOM? We still want to run this test during autotest to make sure that the test script doesn't break, but it only needs to run basic functionality/stress tests, since the performance numbers from a VM are useless.
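A minimal sketch of the suggested scaling, using the standard obdfilter-survey parameters thrhi (max threads) and nobjhi (max objects); the 4 GB threshold and the chosen values are assumptions for illustration, not the values the eventual patch landed:

```shell
# Pick obdfilter-survey limits based on total system memory.
mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
if [ "$mem_kb" -lt 4194304 ]; then
    # Low-memory VM (< 4 GB): light functional pass only.
    thrhi=4
    nobjhi=1
else
    thrhi=16
    nobjhi=2
fi
echo "selected thrhi=$thrhi nobjhi=$nobjhi"
# Actual survey run (requires an OSS with obdfilter devices):
# case=disk thrhi=$thrhi nobjhi=$nobjhi sh /usr/bin/obdfilter-survey
```

The point is only that the survey stays exercised on every run while its memory footprint is bounded on small VMs.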

Comment by Niu Yawei (Inactive) [ 16/Dec/14 ]

OK, I'm going to cook up a patch soon.

Comment by Niu Yawei (Inactive) [ 16/Dec/14 ]

http://review.whamcloud.com/13078

Comment by Gerrit Updater [ 17/Dec/14 ]

Niu Yawei (yawei.niu@intel.com) uploaded a new patch: http://review.whamcloud.com/13101
Subject: LU-5773 test: reduce thread count
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 113e635c1b68afa7698ec3b894647526ce2fef79

Comment by Gerrit Updater [ 21/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13078/
Subject: LU-5773 test: reduce thread count
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d1a717ed189a1245af1f96ecb701cd869956ef75

Generated at Sat Feb 10 01:54:24 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.