[LU-11410] recovery-mds-scale test failover_mds crashes with ‘ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE)’ Created: 20/Sep/18  Updated: 26/Jan/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.10.6, Lustre 2.12.1, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.9
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: sles12, suse
Environment:

SLES12 SP3 clients


Issue Links:
Related
is related to LU-11724 recovery tests crash with ‘page alloc... Open
is related to LU-12067 recovery-mds-scale test failover_mds ... Open
Severity: 3

 Description   

A client crashes in recovery-mds-scale test_failover_mds. The kernel crash log for https://testing.whamcloud.com/test_sets/3e369a98-b8bf-11e8-a7de-52540065bddc shows:

[  766.998879] Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
[  782.297602] Lustre: Evicted from MGS (at MGC10.9.6.25@tcp_1) after server handle changed from 0x66e2519c6be9cc2 to 0x89680f107ea4b814
[  782.299262] Lustre: MGC10.9.6.25@tcp: Connection restored to MGC10.9.6.25@tcp_1 (at 10.9.6.26@tcp)
[  782.362888] LustreError: 13367:0:(client.c:3000:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88006690c940 x1611617063142688/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007bb5e800@10.9.6.26@tcp:12/10 lens 952/560 e 0 to 0 dl 1536957913 ref 2 fl Interpret:RP/4/0 rc 301/301
[  845.630602] ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
[  845.630613] ntpd cpuset=/ mems_allowed=0
[  845.630628] CPU: 1 PID: 1461 Comm: ntpd Tainted: G           OE   N  4.4.143-94.47-default #1
[  845.630629] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  845.630634]  0000000000000000 ffffffff8132ad80 ffff88007be939a0 0000000000000000
[  845.630637]  ffffffff8120935e 0000000000000000 0000000000000000 0000000000000000
[  845.630639]  0000000000000000 ffffffff810a0927 ffffffff81e9aa20 0000000000000000
[  845.630639] Call Trace:
[  845.630702]  [<ffffffff81019ac9>] dump_trace+0x59/0x340
[  845.630711]  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
[  845.630714]  [<ffffffff8101ac71>] show_stack+0x21/0x40
[  845.630727]  [<ffffffff8132ad80>] dump_stack+0x5c/0x7c
[  845.630748]  [<ffffffff8120935e>] dump_header+0x82/0x215
[  845.630762]  [<ffffffff81198079>] check_panic_on_oom+0x29/0x50
[  845.630770]  [<ffffffff8119821a>] out_of_memory+0x17a/0x4a0
[  845.630777]  [<ffffffff8119cc48>] __alloc_pages_nodemask+0xaf8/0xb70
[  845.630786]  [<ffffffff811e6cc4>] alloc_pages_vma+0xa4/0x220
[  845.630799]  [<ffffffff811d70f0>] __read_swap_cache_async+0xf0/0x150
[  845.630805]  [<ffffffff811d7164>] read_swap_cache_async+0x14/0x30
[  845.630808]  [<ffffffff811d727d>] swapin_readahead+0xfd/0x190
[  845.630814]  [<ffffffff811c3771>] handle_pte_fault+0x12b1/0x1670
[  845.630820]  [<ffffffff811c56aa>] handle_mm_fault+0x2fa/0x640
[  845.630828]  [<ffffffff81067d7a>] __do_page_fault+0x23a/0x4b0
[  845.630838]  [<ffffffff8106809c>] trace_do_page_fault+0x3c/0x120
[  845.630850]  [<ffffffff8161da62>] async_page_fault+0x32/0x60
[  845.633602] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
[  845.633602] 
[  845.633603] Leftover inexact backtrace:
               
[  845.633621]  [<ffffffff81338d61>] ? __clear_user+0x21/0x50
[  845.633624]  [<ffffffff810230f2>] ? copy_fpstate_to_sigframe+0x112/0x1a0
[  845.633625]  [<ffffffff810176d1>] ? do_signal+0x511/0x5b0
[  845.633627]  [<ffffffff81067d9a>] ? __do_page_fault+0x25a/0x4b0
[  845.633634]  [<ffffffff8107bf4e>] ? exit_to_usermode_loop+0x70/0xc2
[  845.633638]  [<ffffffff81003ae5>] ? syscall_return_slowpath+0x85/0xa0
[  845.633644]  [<ffffffff8161aa3a>] ? int_ret_from_sys_call+0x25/0xa3
[  845.633661] Mem-Info:
[  845.633668] active_anon:25 inactive_anon:41 isolated_anon:0
                active_file:69559 inactive_file:371186 isolated_file:0
                unevictable:20 dirty:67 writeback:850 unstable:0
                slab_reclaimable:2733 slab_unreclaimable:8762
                mapped:7090 shmem:26 pagetables:966 bounce:0
                free:13103 free_pcp:0 free_cma:0
[  845.633676] Node 0 DMA free:7736kB min:376kB low:468kB high:560kB active_anon:100kB inactive_anon:104kB active_file:744kB inactive_file:6464kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:96kB mapped:256kB shmem:104kB slab_reclaimable:20kB slab_unreclaimable:284kB kernel_stack:32kB pagetables:12kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:46916 all_unreclaimable? yes
[  845.633678] lowmem_reserve[]: 0 1843 1843 1843 1843
[  845.633683] Node 0 DMA32 free:44676kB min:44676kB low:55844kB high:67012kB active_anon:0kB inactive_anon:60kB active_file:277492kB inactive_file:1478280kB unevictable:80kB isolated(anon):0kB isolated(file):0kB present:2080744kB managed:1900772kB mlocked:80kB dirty:268kB writeback:3304kB mapped:28104kB shmem:0kB slab_reclaimable:10912kB slab_unreclaimable:34764kB kernel_stack:2608kB pagetables:3852kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11784060 all_unreclaimable? yes
[  845.633686] lowmem_reserve[]: 0 0 0 0 0
[  845.633694] Node 0 DMA: 8*4kB (UME) 5*8kB (ME) 3*16kB (UE) 2*32kB (UE) 2*64kB (U) 2*128kB (UE) 2*256kB (ME) 3*512kB (UME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7736kB
[  845.633700] Node 0 DMA32: 916*4kB (UME) 559*8kB (UME) 471*16kB (UME) 268*32kB (UME) 162*64kB (UE) 65*128kB (UM) 7*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44728kB
[  845.633713] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  845.633720] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  845.633721] 13843 total pagecache pages
[  845.633722] 0 pages in swap cache
[  845.633722] Swap cache stats: add 10178, delete 10178, find 72/95
[  845.633723] Free swap  = 14297524kB
[  845.633725] Total swap = 14338044kB
[  845.633726] 524184 pages RAM
[  845.633726] 0 pages HighMem/MovableOnly
[  845.633726] 45015 pages reserved
[  845.633727] 0 pages hwpoisoned
[  845.633727] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  845.633824] [  361]     0   361    10933      686      24       3     1623             0 systemd-journal
[  845.633827] [  400]   495   400    13124      940      29       3      116             0 rpcbind
[  845.633836] [  404]     0   404     9268      710      21       3      218         -1000 systemd-udevd
[  845.633839] [  479]     0   479     4814      612      14       3       58             0 irqbalance
[  845.633847] [  484]   499   484    13452      876      26       3      146          -900 dbus-daemon
[  845.633849] [  528]     0   528    25126     1025      50       3      207             0 sssd
[  845.633851] [  533]     0   533    32270     1881      64       3      288             0 sssd_be
[  845.633857] [  538]     0   538     7447     1060      19       3      260             0 wickedd-dhcp6
[  845.633860] [  548]     0   548    25845     1531      54       4      180             0 sssd_nss
[  845.633862] [  549]     0   549    20713     1198      45       3      175             0 sssd_pam
[  845.633863] [  550]     0   550    19112     1161      43       3      166             0 sssd_ssh
[  845.633870] [  552]     0   552     7448     1028      18       3      256             0 wickedd-auto4
[  845.633876] [  553]     0   553     7448     1085      20       3      265             0 wickedd-dhcp4
[  845.633880] [  556]     0   556    84318      905      38       3      269             0 rsyslogd
[  845.633927] [  761]     0   761     7480     1056      20       3      287             0 wickedd
[  845.633934] [  764]     0   764     7455     1032      18       3      276             0 wickedd-nanny
[  845.633939] [ 1418]     0  1418     2141      422      10       3       40             0 xinetd
[  845.633944] [ 1461]    74  1461     8408      974      17       3      164             0 ntpd
[  845.633953] [ 1473]    74  1473     9461      590      18       3      153             0 ntpd
[  845.633957] [ 1477]     0  1477    16586     1539      35       3      180         -1000 sshd
[  845.633964] [ 1492]   493  1492    55352      616      20       3      231             0 munged
[  845.633972] [ 1517]     0  1517     1664      438       8       3       30             0 agetty
[  845.633977] [ 1518]     0  1518     1664      419       9       3       29             0 agetty
[  845.633981] [ 1534]     0  1534   147220     1574      59       3      347             0 automount
[  845.633986] [ 1570]     0  1570     5513      629      16       3       64             0 systemd-logind
[  845.633988] [ 1809]     0  1809     8861      822      20       3      109             0 master
[  845.633990] [ 1820]    51  1820    12439     1046      25       3      108             0 pickup
[  845.633992] [ 1823]    51  1823    12536     1354      26       3      174             0 qmgr
[  845.633994] [ 1864]     0  1864     5197      532      18       3      150             0 cron
[  845.634043] [15714]     0 15714    17465      669      35       3      174             0 in.mrshd
[  845.634047] [15715]     0 15715     2894      572      10       3       77             0 bash
[  845.634051] [15720]     0 15720     2894      427      10       3       78             0 bash
[  845.634053] [15721]     0 15721     3034      585      12       3      219             0 run_dd.sh
[  845.634057] [16387]    51 16387    12675     1323      25       3      335             0 trivial-rewrite
[  845.634059] [16388]    51 16388    16918     1732      35       3      244             0 smtp
[  845.634064] [16434]     0 16434     1062      182       8       3       26             0 dd
[  845.634068] [16437]    51 16437    12447     1022      24       3      109             0 bounce
[  845.634074] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
               
[  845.634075] CPU: 1 PID: 1461 Comm: ntpd Tainted: G           OE   N  4.4.143-94.47-default #1
[  845.634076] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  845.634078]  0000000000000000 ffffffff8132ad80 ffffffff81a28298 ffff88007be938c8
[  845.634079]  ffffffff81191f31 0000000000000010 ffff88007be938d8 ffff88007be93878
[  845.634081]  000000000000309f ffffffff81a2c56b 0000000000000000 0000000000000000
[  845.634081] Call Trace:
[  845.634087]  [<ffffffff81019ac9>] dump_trace+0x59/0x340
[  845.634090]  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
[  845.634092]  [<ffffffff8101ac71>] show_stack+0x21/0x40
[  845.634095]  [<ffffffff8132ad80>] dump_stack+0x5c/0x7c
[  845.634101]  [<ffffffff81191f31>] panic+0xd2/0x232
[  845.634104]  [<ffffffff811980a0>] check_panic_on_oom+0x50/0x50
[  845.634106]  [<ffffffff8119821a>] out_of_memory+0x17a/0x4a0
[  845.634112]  [<ffffffff8119cc48>] __alloc_pages_nodemask+0xaf8/0xb70
[  845.634116]  [<ffffffff811e6cc4>] alloc_pages_vma+0xa4/0x220
[  845.634119]  [<ffffffff811d70f0>] __read_swap_cache_async+0xf0/0x150
[  845.634123]  [<ffffffff811d7164>] read_swap_cache_async+0x14/0x30
[  845.634125]  [<ffffffff811d727d>] swapin_readahead+0xfd/0x190
[  845.634128]  [<ffffffff811c3771>] handle_pte_fault+0x12b1/0x1670
[  845.634132]  [<ffffffff811c56aa>] handle_mm_fault+0x2fa/0x640
[  845.634135]  [<ffffffff81067d7a>] __do_page_fault+0x23a/0x4b0
[  845.634139]  [<ffffffff8106809c>] trace_do_page_fault+0x3c/0x120
[  845.634141]  [<ffffffff8161da62>] async_page_fault+0x32/0x60
[  845.636233] DWARF2 unwinder stuck at async_page_fault+0x32/0x60
[  845.636233] 
[  845.636234] Leftover inexact backtrace:
               
[  845.636236]  [<ffffffff81338d61>] ? __clear_user+0x21/0x50
[  845.636238]  [<ffffffff810230f2>] ? copy_fpstate_to_sigframe+0x112/0x1a0
[  845.636239]  [<ffffffff810176d1>] ? do_signal+0x511/0x5b0
[  845.636241]  [<ffffffff81067d9a>] ? __do_page_fault+0x25a/0x4b0
[  845.636243]  [<ffffffff8107bf4e>] ? exit_to_usermode_loop+0x70/0xc2
[  845.636246]  [<ffffffff81003ae5>] ? syscall_return_slowpath+0x85/0xa0
[  845.636248]  [<ffffffff8161aa3a>] ? int_ret_from_sys_call+0x25/0xa3

The client (vm3) console log shows where the vmcore is saved:

[  782.297602] Lustre: Evicted from MGS (at MGC10.9.6.25@tcp_1) after server handle changed from 0x66e2519c6be9cc2 to 0x89680f107ea4b814
[  782.299262] Lustre: MGC10.9.6.25@tcp: Connection restored to MGC10.9.6.25@tcp_1 (at 10.9.6.26@tcp)
[  782.362888] LustreError: 13367:0:(client.c:3000:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff88006690c940 x1611617063142688/t4294967305(4294967305) o101->lustre-MDT0000-mdc-ffff88007bb5e800@10.9.6.26@tcp:12/10 lens 952/560 e 0 to 0 dl 1536957913 ref 2 fl Interpret:RP/4/0 rc 301/301
[  845.630602] ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
[  845.630613] ntpd cpuset=/ mems_allowed=0
[  845.630628] CPU: 1 PID: 1461 Comm: ntpd Tainted: G           OE   N  4.4.143-94.47-default #1
[ [    2.826867] RPC: Registered named UNIX socket transport module.
[    2.826869] RPC: Registered udp transport module.
…
The dumpfile is saved to /mnt/trevis-2.trevis.whamcloud.com/export/scratch/dumps/trevis-45vm3.trevis.whamcloud.com/10.9.6.21-2018-09-14-13:48/vmcore.

makedumpfile Completed.
-------------------------------------------------------------------------------

All failures occur in testing with SLES12 SP3 servers and clients, or with SLES12 SP3 clients and CentOS 7 servers.

We’ve seen this crash a few times in the past:
https://testing.whamcloud.com/test_sets/9a10e716-b882-11e8-b86b-52540065bddc
https://testing.whamcloud.com/test_sets/74705770-b9f5-11e8-8c12-52540065bddc
https://testing.whamcloud.com/test_sets/6f5b3134-bc38-11e8-a7de-52540065bddc
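
To make the Mem-Info dump in the first crash log easier to read, here is a minimal Python sketch that converts the page counters to MiB and shows how much of the managed memory sits on the file LRU lists. The counter values are copied from the log above. The 4 kB page size is an assumption (suggested by the `916*4kB` buddy-list entries), and the helper names `parse_counters` and `pages_to_mib` are hypothetical. This is only an arithmetic reading of the dump, not a diagnosis.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as the "916*4kB" buddy lists suggest

# Counters copied from the first Mem-Info dump in the description.
MEM_INFO = """
active_anon:25 inactive_anon:41 isolated_anon:0
active_file:69559 inactive_file:371186 isolated_file:0
unevictable:20 dirty:67 writeback:850 unstable:0
slab_reclaimable:2733 slab_unreclaimable:8762
mapped:7090 shmem:26 pagetables:966 bounce:0
free:13103 free_pcp:0 free_cma:0
"""
TOTAL_PAGES = 524184     # "524184 pages RAM"
RESERVED_PAGES = 45015   # "45015 pages reserved"


def parse_counters(text):
    """Turn 'name:value' pairs from a Mem-Info dump into a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)", text)}


def pages_to_mib(pages):
    """Convert a page count to MiB using the assumed page size."""
    return pages * PAGE_KB / 1024


counters = parse_counters(MEM_INFO)
file_pages = counters["active_file"] + counters["inactive_file"]
anon_pages = counters["active_anon"] + counters["inactive_anon"]
managed_pages = TOTAL_PAGES - RESERVED_PAGES
file_share = file_pages / managed_pages

print(f"file LRU pages: {file_pages} ({pages_to_mib(file_pages):.1f} MiB)")
print(f"anon LRU pages: {anon_pages} ({pages_to_mib(anon_pages):.2f} MiB)")
print(f"free pages:     {counters['free']} ({pages_to_mib(counters['free']):.1f} MiB)")
print(f"file share of managed memory: {file_share:.1%}")
```

Under these assumptions, the file LRU lists hold about 1.7 GiB, roughly 92% of the non-reserved pages, while anonymous memory is almost entirely swapped out and swap is still nearly all free.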



 Comments   
Comment by James Nunez (Inactive) [ 29/Apr/19 ]

Another OOM in recovery-mds-scale test failover_mds: https://testing.whamcloud.com/test_sets/b81d5294-6692-11e9-8bb1-52540065bddc

From the console log of the client (vm3) running dd:

[ 2025.771743] Lustre: DEBUG MARKER: mds1 has failed over 2 times, and counting...
[ 2025.911466] Lustre: lustre-MDT0000-mdc-ffff8a12b785f800: Connection restored to 10.9.5.124@tcp (at 10.9.5.124@tcp)
[ 2119.724896] irqbalance invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null),  order=0, oom_score_adj=0
[ 2119.727062] irqbalance cpuset=/ mems_allowed=0
[ 2119.727923] CPU: 1 PID: 465 Comm: irqbalance Tainted: G           OE      4.12.14-95.13-default #1 SLE12-SP4
[ 2119.729670] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2119.730728] Call Trace:
[ 2119.731265]  dump_stack+0x5a/0x75
[ 2119.731928]  dump_header+0x9c/0x238
[ 2119.732628]  ? notifier_call_chain+0x47/0x70
[ 2119.733455]  ? __blocking_notifier_call_chain+0x51/0x60
[ 2119.734432]  out_of_memory+0x44b/0x490
[ 2119.735165]  __alloc_pages_slowpath+0x7e5/0xa0d
[ 2119.736017]  __alloc_pages_nodemask+0x1e9/0x210
[ 2119.736881]  alloc_pages_vma+0x92/0x200
[ 2119.737633]  __read_swap_cache_async+0x140/0x210
[ 2119.738515]  read_swap_cache_async+0x14/0x30
[ 2119.739335]  swapin_readahead+0x107/0x1f0
[ 2119.740111]  do_swap_page+0x2b8/0x8b0
[ 2119.740830]  ? __switch_to_asm+0x34/0x70
[ 2119.741595]  ? __switch_to_asm+0x40/0x70
[ 2119.742364]  ? __switch_to+0x10c/0x4a0
[ 2119.743099]  __handle_mm_fault+0x783/0xef0
[ 2119.743882]  handle_mm_fault+0xc4/0x1d0
[ 2119.744639]  __do_page_fault+0x1f3/0x4c0
[ 2119.745401]  trace_do_page_fault+0x40/0x120
[ 2119.746204]  ? async_page_fault+0x2f/0x50
[ 2119.746971]  async_page_fault+0x45/0x50
[ 2119.747721] RIP: 0002:0x55859b2a4ba0
[ 2119.748428] RSP: 000a:000055859b2a4b8c EFLAGS: 7ffd17f0ec30
[ 2119.748449] Mem-Info:
[ 2119.750018] active_anon:0 inactive_anon:0 isolated_anon:0
[ 2119.750018]  active_file:279857 inactive_file:156646 isolated_file:192
[ 2119.750018]  unevictable:20 dirty:6162 writeback:0 unstable:0
[ 2119.750018]  slab_reclaimable:3259 slab_unreclaimable:9056
[ 2119.750018]  mapped:2383 shmem:0 pagetables:940 bounce:0
[ 2119.750018]  free:13061 free_pcp:15 free_cma:0
[ 2119.755522] Node 0 active_anon:0kB inactive_anon:0kB active_file:1119428kB inactive_file:626584kB unevictable:80kB isolated(anon):0kB isolated(file):768kB mapped:9532kB dirty:24648kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[ 2119.760194] Node 0 DMA free:7640kB min:380kB low:472kB high:564kB active_anon:0kB inactive_anon:0kB active_file:8244kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:24kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 2119.765249] lowmem_reserve[]: 0 1822 1822 1822 1822
[ 2119.766189] Node 0 DMA32 free:44604kB min:44672kB low:55840kB high:67008kB active_anon:0kB inactive_anon:0kB active_file:1111184kB inactive_file:626500kB unevictable:80kB writepending:24648kB present:2080744kB managed:1885860kB mlocked:80kB slab_reclaimable:13036kB slab_unreclaimable:36200kB kernel_stack:2176kB pagetables:3760kB bounce:0kB free_pcp:60kB local_pcp:60kB free_cma:0kB
[ 2119.771819] lowmem_reserve[]: 0 0 0 0 0
[ 2119.772579] Node 0 DMA: 6*4kB (UM) 6*8kB (UM) 3*16kB (U) 5*32kB (U) 5*64kB (UM) 3*128kB (UM) 2*256kB (UM) 2*512kB (UM) 1*1024kB (M) 0*2048kB 1*4096kB (E) = 7640kB
[ 2119.775106] Node 0 DMA32: 1131*4kB (UME) 608*8kB (UME) 527*16kB (UME) 341*32kB (UME) 162*64kB (UM) 43*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44604kB
[ 2119.777668] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 2119.779217] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2119.780702] 11034 total pagecache pages
[ 2119.781467] 0 pages in swap cache
[ 2119.782135] Swap cache stats: add 15894, delete 15894, find 3269/5072
[ 2119.783293] Free swap  = 14295036kB
[ 2119.783980] Total swap = 14338044kB
[ 2119.784672] 524184 pages RAM
[ 2119.785271] 0 pages HighMem/MovableOnly
[ 2119.786018] 48742 pages reserved
[ 2119.786677] 0 pages hwpoisoned
[ 2119.787298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2119.788805] [  359]     0   359    10934      518      24       3     1130             0 systemd-journal
[ 2119.790480] [  373]     0   373     3008        1      12       3     1452             0 haveged
[ 2119.792100] [  380]     0   380    10408      451      23       3      239         -1000 systemd-udevd
[ 2119.793748] [  381]   495   381    13124        1      30       3      126             0 rpcbind
[ 2119.795354] [  441]   499   441    10913        0      25       3      153          -900 dbus-daemon
[ 2119.796967] [  461]     0   461     7469        5      19       3      269             0 wickedd-dhcp6
[ 2119.798612] [  462]     0   462    28597       61      59       3      247             0 sssd
[ 2119.800144] [  465]     0   465     4814      211      14       3       58             0 irqbalance
[ 2119.801754] [  466]     0   466     7470        4      20       3      272             0 wickedd-dhcp4
[ 2119.803406] [  467]     0   467     7469        2      20       3      269             0 wickedd-auto4
[ 2119.805054] [  478]     0   478     7500        1      20       3      312             0 wickedd
[ 2119.806615] [  512]     0   512     7476        0      20       3      277             0 wickedd-nanny
[ 2119.808263] [  517]     0   517    84318      115      39       3      306             0 rsyslogd
[ 2119.809845] [  521]     0   521    34903      526      68       3      331             0 sssd_be
[ 2119.811408] [  531]     0   531    26453      553      56       3      234             0 sssd_nss
[ 2119.813008] [  532]     0   532    27004      135      56       3      226             0 sssd_pam
[ 2119.814598] [  533]     0   533    25887      136      54       4      209             0 sssd_ssh
[ 2119.816188] [ 1219]     0  1219     2141      265      10       3       41             0 xinetd
[ 2119.817729] [ 1247]     0  1247    16601        1      37       3      179         -1000 sshd
[ 2119.819244] [ 1250]    74  1250     5882      327      17       3      162             0 ntpd
[ 2119.820752] [ 1253]    74  1253     6935        1      18       3      153             0 ntpd
[ 2119.822279] [ 1276]   493  1276    55367      396      19       3      245             0 munged
[ 2119.823848] [ 1336]     0  1336   163871      340      62       4      374             0 automount
[ 2119.825490] [ 1370]     0  1370     1665        1       9       3       27             0 agetty
[ 2119.827047] [ 1372]     0  1372     1665        1       9       3       29             0 agetty
[ 2119.828599] [ 1401]     0  1401     5514        1      16       3       80             0 systemd-logind
[ 2119.830287] [ 1575]     0  1575     8863       58      21       3      127             0 master
[ 2119.831840] [ 1588]    51  1588     9900      189      24       3      109             0 pickup
[ 2119.833401] [ 1589]    51  1589     9997      212      23       3      169             0 qmgr
[ 2119.834921] [ 1613]     0  1613     5198      273      15       3      153             0 cron
[ 2119.836474] [19713]     0 19713    14926        1      35       3      175             0 in.mrshd
[ 2119.838057] [19714]     0 19714     2894        0      11       3       78             0 bash
[ 2119.839572] [19719]     0 19719     2894        0      11       3       79             0 bash
[ 2119.841100] [19720]     0 19720     3034      356      11       3      215             0 run_dd.sh
[ 2119.842724] [21185]     0 21185     1062      300       8       3       33             0 dd
[ 2119.844231] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[ 2119.844231] 
Comment by Alena Nikitenko [ 03/Dec/21 ]

A similar OOM, but on CentOS 7.9, in the recovery-random-scale test set on 2.12.8: https://testing.whamcloud.com/test_sets/22e46ed4-50a5-4a25-b830-c798ce17b9e6

[  872.358394] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-124vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
[  874.874723] Lustre: DEBUG MARKER: onyx-124vm3.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 1475
[  911.214431] ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
[  911.225802] ntpd cpuset=/ mems_allowed=0
[  911.226442] CPU: 1 PID: 496 Comm: ntpd Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.45.1.el7.x86_64 #1
[  911.228117] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  911.228997] Call Trace:
[  911.229462]  [<ffffffff81b83539>] dump_stack+0x19/0x1b
[  911.230260]  [<ffffffff81b7e5d8>] dump_header+0x90/0x229
[  911.231088]  [<ffffffff81b90b6f>] ? notifier_call_chain+0x4f/0x70
[  911.232048]  [<ffffffff814cc228>] ? __blocking_notifier_call_chain+0x58/0x70
[  911.233143]  [<ffffffff815c273e>] check_panic_on_oom+0x2e/0x60
[  911.234044]  [<ffffffff815c2ab4>] out_of_memory+0x194/0x500
[  911.234908]  [<ffffffff815c9854>] __alloc_pages_nodemask+0xad4/0xbe0
[  911.235890]  [<ffffffff8161cc49>] alloc_pages_vma+0xa9/0x200
[  911.236772]  [<ffffffff8160a1e5>] __read_swap_cache_async+0x115/0x190
[  911.237758]  [<ffffffff8160a286>] read_swap_cache_async+0x26/0x60
[  911.238699]  [<ffffffff8160a46b>] swapin_readahead+0x1ab/0x210
[  911.239621]  [<ffffffff8178dcd2>] ? radix_tree_lookup_slot+0x22/0x50
[  911.240604]  [<ffffffff815bd91e>] ? __find_get_page+0x1e/0xa0
[  911.241495]  [<ffffffff815f288f>] do_swap_page+0x23f/0x7c0
[  911.242366]  [<ffffffff816655dd>] ? core_sys_select+0x26d/0x340
[  911.243283]  [<ffffffff815f6627>] handle_mm_fault+0xaa7/0xfb0
[  911.244178]  [<ffffffff81629015>] ? kmem_cache_alloc+0x35/0x1f0
[  911.245107]  [<ffffffff81a39b99>] ? sk_prot_alloc+0x39/0x190
[  911.245978]  [<ffffffff81b90653>] __do_page_fault+0x213/0x500
[  911.246866]  [<ffffffff81b90a26>] trace_do_page_fault+0x56/0x150
[  911.247791]  [<ffffffff81b8ffa2>] do_async_page_fault+0x22/0xf0
[  911.248698]  [<ffffffff81b8c7a8>] async_page_fault+0x28/0x30
[  911.249569] Mem-Info:
[  911.249937] active_anon:1722 inactive_anon:1749 isolated_anon:0
[  911.249937]  active_file:54406 inactive_file:579983 isolated_file:64
[  911.249937]  unevictable:0 dirty:0 writeback:0 unstable:0
[  911.249937]  slab_reclaimable:3354 slab_unreclaimable:5635
[  911.249937]  mapped:5514 shmem:2168 pagetables:1086 bounce:0
[  911.249937]  free:13926 free_pcp:19 free_cma:0
[  911.254845] Node 0 DMA free:10912kB min:260kB low:324kB high:388kB active_anon:36kB inactive_anon:84kB active_file:256kB inactive_file:3876kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:88kB shmem:84kB slab_reclaimable:56kB slab_unreclaimable:84kB kernel_stack:48kB pagetables:60kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7171 all_unreclaimable? yes
[  911.261249] lowmem_reserve[]: 0 2668 2668 2668
[  911.262095] Node 0 DMA32 free:44792kB min:44792kB low:55988kB high:67188kB active_anon:6852kB inactive_anon:6912kB active_file:217368kB inactive_file:2316156kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:3129320kB managed:2735424kB mlocked:0kB dirty:0kB writeback:0kB mapped:21968kB shmem:8588kB slab_reclaimable:13360kB slab_unreclaimable:22456kB kernel_stack:2432kB pagetables:4284kB unstable:0kB bounce:0kB free_pcp:76kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5594703 all_unreclaimable? yes
[  911.269030] lowmem_reserve[]: 0 0 0 0
[  911.269765] Node 0 DMA: 4*4kB (UE) 4*8kB (UE) 5*16kB (UEM) 5*32kB (UE) 4*64kB (UM) 1*128kB (E) 2*256kB (EM) 3*512kB (UEM) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 10912kB
[  911.272831] Node 0 DMA32: 243*4kB (UEM) 378*8kB (UEM) 523*16kB (UE) 291*32kB (UEM) 152*64kB (UEM) 86*128kB (UEM) 9*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44716kB
[  911.275753] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  911.277073] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  911.278353] 14131 total pagecache pages
[  911.278950] 538 pages in swap cache
[  911.279502] Swap cache stats: add 11707, delete 11169, find 1362/1879
[  911.280485] Free swap  = 2711036kB
[  911.281018] Total swap = 2753532kB
[  911.281555] 786328 pages RAM
[  911.282006] 0 pages HighMem/MovableOnly
[  911.282605] 98495 pages reserved
[  911.283106] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  911.284298] [  346]     0   346     9860      774      24       59             0 systemd-journal
[  911.285626] [  367]     0   367    29161      241      27       78             0 lvmetad
[  911.286856] [  370]     0   370    11413      370      23      121         -1000 systemd-udevd
[  911.288154] [  463]     0   463    13883      161      28      100         -1000 auditd
[  911.289377] [  489]     0   489     6596      398      19       42             0 systemd-logind
[  911.290687] [  492]   999   492   153058     1234      62     1843             0 polkitd
[  911.291910] [  493]    81   493    14560      529      32       94          -900 dbus-daemon
[  911.293188] [  495]    32   495    17314      208      38      140             0 rpcbind
[  911.294413] [  496]    38   496    11825      457      29      153             0 ntpd
[  911.295599] [  500]     0   500   118583     1821      87      867             0 NetworkManager
[  911.296923] [  501]     0   501     5385      272      16       41             0 irqbalance
[  911.298199] [  508]     0   508    48801      195      36      130             0 gssproxy
[  911.299444] [  839]     0   839    28246      840      58      259         -1000 sshd
[  911.300640] [  841]     0   841   143570     1595      99     2726             0 tuned
[  911.301835] [  847]     0   847    54100      742      42      673             0 rsyslogd
[  911.303075] [  848]   997   848    56473      403      22      128             0 munged
[  911.304286] [  859]     0   859     6792      196      19       63             0 xinetd
[  911.305505] [  861]    29   861    10610      222      26      209             0 rpc.statd
[  911.306756] [  912]     0   912   155891     1007      79      907             0 automount
[  911.308001] [  921]     0   921    31595      240      21      154             0 crond
[  911.309199] [  927]     0   927     6477      189      18       52             0 atd
[  911.310380] [  941]     0   941    27551      181      10       33             0 agetty
[  911.311605] [  942]     0   942    27551      184      11       32             0 agetty
[  911.312827] [ 1247]     0  1247    22447      282      44      256             0 master
[  911.314040] [ 1258]    89  1258    22473      747      44      251             0 pickup
[  911.315249] [ 1259]    89  1259    22490      751      44      253             0 qmgr
[  911.316452] [23060]     0 23060    21124      457      48      206             0 in.mrshd
[  911.317691] [23065]     0 23065    28320      334      13       70             0 bash
[  911.318885] [23129]     0 23129    28320       97      11       71             0 bash
[  911.320078] [23130]     0 23130    28390      391      14       75             0 run_dd.sh
[  911.321331] [24014]     0 24014    27024      155      12        0             0 dd
[  911.322500] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[  911.322500] 
[  911.323962] CPU: 1 PID: 496 Comm: ntpd Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.45.1.el7.x86_64 #1
[  911.325626] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  911.326494] Call Trace:
[  911.326882]  [<ffffffff81b83539>] dump_stack+0x19/0x1b
[  911.327670]  [<ffffffff81b7d241>] panic+0xe8/0x21f
[  911.328407]  [<ffffffff815c2765>] check_panic_on_oom+0x55/0x60
[  911.329288]  [<ffffffff815c2ab4>] out_of_memory+0x194/0x500
[  911.330134]  [<ffffffff815c9854>] __alloc_pages_nodemask+0xad4/0xbe0
[  911.331094]  [<ffffffff8161cc49>] alloc_pages_vma+0xa9/0x200
[  911.331963]  [<ffffffff8160a1e5>] __read_swap_cache_async+0x115/0x190
[  911.332935]  [<ffffffff8160a286>] read_swap_cache_async+0x26/0x60
[  911.333860]  [<ffffffff8160a46b>] swapin_readahead+0x1ab/0x210
[  911.334750]  [<ffffffff8178dcd2>] ? radix_tree_lookup_slot+0x22/0x50
[  911.335712]  [<ffffffff815bd91e>] ? __find_get_page+0x1e/0xa0
[  911.336580]  [<ffffffff815f288f>] do_swap_page+0x23f/0x7c0
[  911.337416]  [<ffffffff816655dd>] ? core_sys_select+0x26d/0x340
[  911.338308]  [<ffffffff815f6627>] handle_mm_fault+0xaa7/0xfb0
[  911.339181]  [<ffffffff81629015>] ? kmem_cache_alloc+0x35/0x1f0
[  911.340088]  [<ffffffff81a39b99>] ? sk_prot_alloc+0x39/0x190
[  911.340948]  [<ffffffff81b90653>] __do_page_fault+0x213/0x500
[  911.341821]  [<ffffffff81b90a26>] trace_do_page_fault+0x56/0x150
[  911.342732]  [<ffffffff81b8ffa2>] do_async_page_fault+0x22/0xf0
[  911.343636]  [<ffffffff81b8c7a8>] async_page_fault+0x28/0x30 
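
To compare the OOM signatures reported so far in this ticket, here is a minimal Python sketch that extracts the invoking process, gfp_mask, and allocation order from `invoked oom-killer` lines. The sample lines are copied from the three console logs above. The regular expression and the helper name `oom_events` are hypothetical and only cover the line formats seen in this ticket.

```python
import re

# Lines copied from the three console logs in this ticket.
LOG_LINES = [
    "[  845.630602] ntpd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0",
    "[ 2119.724896] irqbalance invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null),  order=0, oom_score_adj=0",
    "[  911.214431] ntpd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0",
    "[  911.322500] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled",
]

# Hypothetical pattern: matches only the oom-killer line formats shown above.
OOM_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>0x[0-9a-f]+).*?order=(?P<order>\d+)"
)


def oom_events(lines):
    """Yield (timestamp, comm, gfp_mask, order) for each oom-killer line."""
    for line in lines:
        match = OOM_RE.match(line)
        if match:
            yield (
                float(match["ts"]),
                match["comm"],
                int(match["gfp"], 16),
                int(match["order"]),
            )


events = list(oom_events(LOG_LINES))
for ts, comm, gfp, order in events:
    print(f"{ts:10.3f}s  comm={comm:<11} gfp_mask={gfp:#x}  order={order}")
```

In all three samples the allocation is order 0 and the invoking process is a small system daemon (ntpd or irqbalance) rather than the dd workload.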
Comment by Sarah Liu [ 15/Jun/22 ]

+2
https://testing.whamcloud.com/test_sessions/cefc5b87-5ded-4277-ae0e-c73867bd444a
https://testing.whamcloud.com/test_sessions/2c59b847-3d0d-4b71-8431-e0aa2b59e306

Generated at Sat Feb 10 02:43:38 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.