[LU-13162] parallel-scale test_statahead: mdsrate invoked oom-killer Created: 21/Jan/20  Updated: 23/Feb/21  Resolved: 17/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Yang Sheng
Resolution: Duplicate Votes: 0
Labels: rhel8
Environment:

RHEL 8.1 client + RHEL 7.7 server


Issue Links:
Related
is related to LU-12830 RHEL8.3 and ZFS: oom on OSS Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for jianyu <yujian@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/0925c456-3ba8-11ea-bb75-52540065bddc

test_statahead failed with the following error:

+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe -machinefile /tmp/auster.machines -np 64 /usr/lib64/openmpi/bin/mdsrate --mknod --dir /mnt/lustre/dstatahead --nfiles 160711 --filefmt 'f%%d' "
[1579521814.463727] [trevis-12vm6:7395 :0]            cpu.c:52   UCX  WARN  CPU does not support invariant TSC, time may be unstable
[1579522133.063761] [trevis-12vm7:26045:0]            cpu.c:52   UCX  WARN  CPU does not support invariant TSC, time may be unstable

Clients crashed:

[69145.465824] Lustre: DEBUG MARKER: == parallel-scale test statahead: statahead test, multiple clients =================================== 12:03:31 (1579521811)
[69145.622411] Lustre: lustre-OST0000-osc-ffff9d88618a4800: reconnect after 7127s idle
[69166.377333] Lustre: lustre-OST0000-osc-ffff9d88618a4800: disconnect after 21s idle
[69173.846404] mdsrate invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
[69173.848578] mdsrate cpuset=/ mems_allowed=0
[69173.849368] CPU: 1 PID: 7399 Comm: mdsrate Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.3.1.el8_1.x86_64 #1
[69173.851539] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[69173.852562] Call Trace:
[69173.853154]  dump_stack+0x5c/0x80
[69173.853821]  dump_header+0x6e/0x27a
[69173.854515]  ? notifier_call_chain+0x47/0x70
[69173.855409]  out_of_memory.cold.32+0xa/0x80
[69173.856169]  __alloc_pages_slowpath+0xc0f/0xce0
[69173.856982]  __alloc_pages_nodemask+0x245/0x280
[69173.857837]  alloc_pages_vma+0x74/0x1d0
[69173.858574]  do_anonymous_page+0x90/0x370
[69173.859325]  __handle_mm_fault+0x66e/0x6b0
[69173.860069]  handle_mm_fault+0xda/0x200
[69173.860764]  __get_user_pages+0x255/0x7c0
[69173.861541]  ? _cond_resched+0x15/0x30
[69173.862230]  get_user_pages+0x3e/0x50
[69173.862898]  get_user_pages_longterm+0x34/0x190
[69173.863772]  ib_umem_get+0x2ee/0x520 [ib_core]
[69173.864602]  mlx4_ib_reg_user_mr+0x71/0x1e0 [mlx4_ib]
[69173.865519]  ib_uverbs_reg_mr+0x143/0x240 [ib_uverbs]
[69173.866428]  ? __blk_mq_run_hw_queue+0x51/0xd0
[69173.867218]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs]
[69173.868475]  ib_uverbs_run_method+0x20c/0x7a0 [ib_uverbs]
[69173.869438]  ? __switch_to_asm+0x35/0x70
[69173.870146]  ? uverbs_disassociate_api+0x100/0x100 [ib_uverbs]
[69173.871155]  ? __switch_to_asm+0x41/0x70
[69173.871858]  ? __switch_to_asm+0x35/0x70
[69173.872566]  ib_uverbs_cmd_verbs+0x189/0x380 [ib_uverbs]
[69173.873490]  ? __switch_to_asm+0x41/0x70
[69173.874196]  ? __switch_to_asm+0x35/0x70
[69173.874898]  ? __switch_to_asm+0x41/0x70
[69173.875600]  ? __switch_to_asm+0x35/0x70
[69173.876315]  ? __switch_to+0x115/0x480
[69173.876999]  ? finish_task_switch+0x76/0x2b0
[69173.877765]  ? free_swap_slot+0x9a/0xf0
[69173.878446]  ? wp_page_reuse+0x4d/0x60
[69173.879128]  ? __raw_spin_unlock+0x5/0x10
[69173.879844]  ? do_wp_page+0x217/0x310
[69173.880501]  ? __handle_mm_fault+0x67e/0x6b0
[69173.881267]  ib_uverbs_ioctl+0xa3/0x100 [ib_uverbs]
[69173.882139]  do_vfs_ioctl+0xa4/0x630
[69173.882806]  ? __x64_sys_madvise+0x4a6/0x790
[69173.883573]  ? syscall_trace_enter+0x1d3/0x2c0
[69173.884356]  ksys_ioctl+0x60/0x90
[69173.884963]  __x64_sys_ioctl+0x16/0x20
[69173.885640]  do_syscall_64+0x5b/0x1b0
[69173.886311]  entry_SYSCALL_64_after_hwframe+0x65/0xca

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
parallel-scale test_statahead - trevis-12vm6, trevis-12vm7 crashed during parallel-scale test_statahead



 Comments   
Comment by Peter Jones [ 21/Jan/20 ]

Yang Sheng

Could you please investigate?

Thanks

Peter

Comment by Yang Sheng [ 03/Feb/20 ]

Hi, Yujian,

Do you know whether 'panic_on_oom is enabled' is the default on RHEL 8.1, or did we set it intentionally on our test systems?

Thanks,
YangSheng
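One way to answer this on a given client is to read the `vm.panic_on_oom` sysctl directly. The sketch below assumes a Linux node where `/proc/sys/vm/panic_on_oom` is readable; the helper names `read_panic_on_oom` and `describe_panic_on_oom` are illustrative. The upstream kernel default is 0, so a non-zero value would suggest the setting was changed by the distribution or by the test environment.

```python
from pathlib import Path

# Meanings of vm.panic_on_oom as described in the kernel's sysctl documentation.
PANIC_ON_OOM_MEANINGS = {
    0: "no panic: the OOM killer kills a process and the system continues",
    1: "panic on OOM, except when the OOM is limited by mempolicy or cpusets",
    2: "always panic on OOM, even for mempolicy, cpuset, or memcg limits",
}


def read_panic_on_oom(path="/proc/sys/vm/panic_on_oom"):
    """Read the current vm.panic_on_oom value from procfs."""
    return int(Path(path).read_text().strip())


def describe_panic_on_oom(value):
    """Map a vm.panic_on_oom value to a short description."""
    return PANIC_ON_OOM_MEANINGS.get(value, "unknown value")


if __name__ == "__main__":
    value = read_panic_on_oom()
    print(f"vm.panic_on_oom = {value}: {describe_panic_on_oom(value)}")
```

The same value can be read with `sysctl vm.panic_on_oom` on the command line.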

Comment by Yang Sheng [ 11/Feb/20 ]

Duplicate of LU-11424.
