[LU-10056] sanity test_60a invokes oom-killer in subtest 7f and times out Created: 02/Oct/17  Updated: 27/Jan/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-7329 sanity test_60a timeouts with “* invo... Resolved
is related to LU-7883 sanity test_60a invokes oom-killer in... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Console log on MDS:

 Lustre: 32199:0:(llog_test.c:1018:llog_test_7_sub()) 7_sub: records are not aligned, written 64071 from 64767
 Lustre: 32199:0:(llog_test.c:1124:llog_test_7()) 7f: test llog_changelog_user_rec
 sssd_ssh invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
 sssd_ssh cpuset=/ mems_allowed=0
 CPU: 0 PID: 665 Comm: sssd_ssh Tainted: P           OE  ------------   3.10.0-693.1.1.el7_lustre.x86_64 #1
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
  ffff88003690dee0 000000007415b4cd ffff88007a0b39f0 ffffffff816a3d6d
  ffff88007a0b3a80 ffffffff8169f186 ffff88007a0b3ae8 ffff88007a0b3a40
  ffffffff816b04dc ffffffff81a6ea00 0000000000000000 0000000000000000
 Call Trace:
  [<ffffffff816a3d6d>] dump_stack+0x19/0x1b
  [<ffffffff8169f186>] dump_header+0x90/0x229
  [<ffffffff816b04dc>] ? notifier_call_chain+0x4c/0x70
  [<ffffffff810b6ab8>] ? __blocking_notifier_call_chain+0x58/0x70
  [<ffffffff8118653e>] check_panic_on_oom+0x2e/0x60
  [<ffffffff8118695b>] out_of_memory+0x23b/0x4f0
  [<ffffffff8169fc8a>] __alloc_pages_slowpath+0x5d6/0x724
  [<ffffffff8118cd85>] __alloc_pages_nodemask+0x405/0x420
  [<ffffffff811d412f>] alloc_pages_vma+0xaf/0x1f0
  [<ffffffff811c3830>] ? end_swap_bio_write+0x80/0x80
  [<ffffffff811c453d>] read_swap_cache_async+0xed/0x160
  [<ffffffff811c4658>] swapin_readahead+0xa8/0x110
  [<ffffffff811b235b>] handle_mm_fault+0xadb/0xfa0
  [<ffffffff8109ea4c>] ? signal_setup_done+0x3c/0x60
  [<ffffffff816affb4>] __do_page_fault+0x154/0x450
  [<ffffffff816b02e5>] do_page_fault+0x35/0x90
  [<ffffffff816ac508>] page_fault+0x28/0x30

Maloo reports:
https://testing.hpdd.intel.com/test_sessions/2a0ab571-1893-4650-bdbf-4b24c85e8367
https://testing.hpdd.intel.com/test_sessions/9e63bf6c-094f-4c49-8823-de2ad48b5302



 Comments   
Comment by Bruno Faccini (Inactive) [ 24/Oct/17 ]

+1 at https://testing.hpdd.intel.com/test_sets/68e44f30-b87d-11e7-9abd-52540065bddc

I have done some debugging on the associated MDS crash dump caused by the OOM. It looks like, once again, the kmalloc-512 kmem_cache's slabs consume almost all available memory (>1.2GB out of 1.6GB), as in LU-7329 and LU-7883, but this time at an earlier step (llog_test_7()) than llog_test_10() as before!

Could it be that something in the auto-test VMs/OS/daemons/... configuration has changed? If so, we may need to apply the same fix (dt_sync() calls to flush journal callbacks) to the earlier llog_test sub-tests, not only to llog_test_10().
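
[Editor's note] A minimal sketch of what such a change could look like, assuming the earlier sub-tests have (or can be given) the same struct lu_env *env and struct dt_device *dt that llog_test_10() uses for its dt_sync() calls. The helper name llog_test_7_sub_flush() is purely illustrative and does not exist in llog_test.c:

 /* Illustrative sketch only -- not the actual llog_test.c change.
  * Assumes 'env' (struct lu_env *) and 'dt' (struct dt_device *)
  * are available in the sub-test, as they are in llog_test_10().
  */
 static int llog_test_7_sub_flush(const struct lu_env *env,
 				 struct dt_device *dt)
 {
 	int rc;
 
 	/* Force a journal commit so that committed-transaction
 	 * callbacks run and the llog chunks pinned in kmalloc-512
 	 * slabs are freed, instead of accumulating until OOM. */
 	rc = dt_sync(env, dt);
 	if (rc)
 		CERROR("7_sub: dt_sync() failed: rc = %d\n", rc);
 
 	return rc;
 }

Calling such a flush at the end of each 7_* sub-test (mirroring the LU-7329/LU-7883 approach in llog_test_10()) would keep the slab footprint bounded between sub-tests.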

Comment by Andreas Dilger [ 08/Nov/17 ]

This has been hit a few more times in the past 4 weeks.

Comment by Bob Glossman (Inactive) [ 27/Jan/18 ]

Another occurrence on b2_10:
https://testing.hpdd.intel.com/test_sets/975629ee-0385-11e8-a7cd-52540065bddc
