Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: Lustre 2.10.0
Fix Version: None
Labels: autotest
Severity: 3
Description
sanity test_51f times out with the usual message 'test failed to respond and timed out'. The only information in the test_log is:
== sanity test 51f: check many open files limit ====================================================== 08:37:34 (1494923854)
MDT0 numfree=711956, max=100000
changed ulimit from 1024 to 100020
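For context, the core of what test 51f exercises can be sketched in shell. This is a simplified illustration, not the actual Lustre test script: the real test raises the open-file limit past 100000 (the "changed ulimit from 1024 to 100020" line above) and then uses the Lustre createmany utility to hold that many files open at once.

```shell
#!/bin/bash
# Simplified sketch of the ulimit step in sanity test_51f (illustration
# only, not the real test script): raise the soft open-file limit as far
# as the hard limit allows, then report the change in the same style as
# the test_log line. The real test then runs createmany to open files up
# to that limit.
old_soft=$(ulimit -Sn)
hard=$(ulimit -Hn)
ulimit -Sn "$hard"
echo "changed ulimit from $old_soft to $(ulimit -Sn)"
```

Opening on the order of 100000 files is what drives the client's memory consumption in this test, which is why the failure below shows up as memory exhaustion.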
There are two cases where sanity test 51f times out with an out-of-memory (OOM) panic on the client. The stack trace below is from the client console:
08:38:17:[ 1565.371695] Lustre: DEBUG MARKER: == sanity test 51f: check many open files limit ====================================================== 08:37:34 (1494923854)
08:38:17:[ 1602.443283] automount invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
08:38:17:[ 1602.446098] automount cpuset=/ mems_allowed=0
08:38:17:[ 1602.448513] CPU: 0 PID: 1010 Comm: automount Tainted: G OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
08:38:17:[ 1602.451566] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
08:38:17:
08:38:17:[ 1602.454058] ffff8800794eaf10 000000003457cee2 ffff88007b9e79f8 ffffffff81686ac3
08:38:17:[ 1602.459021] ffff88007b9e7a88 ffffffff81681a6e ffff88007b9e7af0 ffff88007b9e7a48
08:38:17:[ 1602.461680] ffffffff81692b6c ffffffff81a33cc0 0000000000000000 0000000000000000
08:38:17:[ 1602.464276] Call Trace:
08:38:17:[ 1602.466395] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
08:38:17:[ 1602.468694] [<ffffffff81681a6e>] dump_header+0x8e/0x225
08:38:17:[ 1602.470982] [<ffffffff81692b6c>] ? notifier_call_chain+0x4c/0x70
08:38:17:[ 1602.473299] [<ffffffff810b6d98>] ? __blocking_notifier_call_chain+0x58/0x70
08:38:17:[ 1602.475696] [<ffffffff81184e1e>] check_panic_on_oom+0x2e/0x60
08:38:17:[ 1602.477970] [<ffffffff8118523b>] out_of_memory+0x23b/0x4f0
08:38:17:[ 1602.480183] [<ffffffff81682577>] __alloc_pages_slowpath+0x5d7/0x725
08:38:17:[ 1602.482651] [<ffffffff8118b5c5>] __alloc_pages_nodemask+0x405/0x420
08:38:17:[ 1602.484867] [<ffffffff811d272a>] alloc_pages_vma+0x9a/0x150
08:38:17:[ 1602.487039] [<ffffffff811c351b>] read_swap_cache_async+0xeb/0x160
08:38:17:[ 1602.489192] [<ffffffff811c3638>] swapin_readahead+0xa8/0x110
08:38:17:[ 1602.491293] [<ffffffff811b189c>] handle_mm_fault+0xb1c/0xfe0
08:38:17:[ 1602.493385] [<ffffffff81692644>] __do_page_fault+0x154/0x450
08:38:17:[ 1602.495536] [<ffffffff81692975>] do_page_fault+0x35/0x90
08:38:17:[ 1602.497619] [<ffffffff8168eb88>] page_fault+0x28/0x30
08:38:17:[ 1602.499598] Mem-Info:
08:38:17:[ 1602.501313] active_anon:350 inactive_anon:385 isolated_anon:0
08:38:17:[ 1602.501313] active_file:0 inactive_file:525 isolated_file:0
08:38:17:[ 1602.501313] unevictable:0 dirty:0 writeback:142 unstable:0
08:38:17:[ 1602.501313] slab_reclaimable:4058 slab_unreclaimable:36202
08:38:17:[ 1602.501313] mapped:0 shmem:52 pagetables:1613 bounce:0
08:38:17:[ 1602.501313] free:12916 free_pcp:7 free_cma:0
08:38:17:[ 1602.512345] Node 0 DMA free:7020kB min:416kB low:520kB high:624kB active_anon:400kB inactive_anon:640kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15892kB mlocked:0kB dirty:0kB writeback:36kB mapped:0kB shmem:228kB slab_reclaimable:108kB slab_unreclaimable:712kB kernel_stack:48kB pagetables:20kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:15736 all_unreclaimable? yes
08:38:17:[ 1602.521429] lowmem_reserve[]: 0 1663 1663 1663
08:38:17:[ 1602.523300] Node 0 DMA32 free:44644kB min:44636kB low:55792kB high:66952kB active_anon:1000kB inactive_anon:900kB active_file:0kB inactive_file:2100kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080756kB managed:1705000kB mlocked:0kB dirty:0kB writeback:532kB mapped:0kB shmem:0kB slab_reclaimable:16124kB slab_unreclaimable:144096kB kernel_stack:2656kB pagetables:6432kB unstable:0kB bounce:0kB free_pcp:28kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
08:38:17:[ 1602.534026] lowmem_reserve[]: 0 0 0 0
08:38:17:[ 1602.535888] Node 0 DMA: 39*4kB (EM) 32*8kB (UEM) 25*16kB (UM) 12*32kB (UEM) 3*64kB (EM) 10*128kB (UM) 3*256kB (EM) 3*512kB (UEM) 2*1024kB (UE) 0*2048kB 0*4096kB = 7020kB
08:38:17:[ 1602.540391] Node 0 DMA32: 995*4kB (UEM) 623*8kB (UEM) 319*16kB (UEM) 861*32kB (UEM) 46*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44564kB
08:38:17:[ 1602.544824] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
08:38:17:[ 1602.547006] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
08:38:17:[ 1602.549124] 975 total pagecache pages
08:38:17:[ 1602.550875] 375 pages in swap cache
08:38:17:[ 1602.552582] Swap cache stats: add 26733, delete 26358, find 248/330
08:38:17:[ 1602.554487] Free swap = 3566956kB
08:38:17:[ 1602.556164] Total swap = 3671036kB
08:38:17:[ 1602.557826] 524185 pages RAM
08:38:17:[ 1602.559435] 0 pages HighMem/MovableOnly
08:38:17:[ 1602.561132] 93962 pages reserved
08:38:17:[ 1602.562747] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
08:38:17:[ 1602.564748] [ 343] 0 343 9303 1 20 92 0 systemd-journal
08:38:17:[ 1602.566845] [ 365] 0 365 29723 0 26 85 0 lvmetad
08:38:17:[ 1602.568879] [ 382] 0 382 10895 1 21 134 -1000 systemd-udevd
08:38:17:[ 1602.570936] [ 460] 0 460 13854 0 28 108 -1000 auditd
08:38:17:[ 1602.572986] [ 492] 0 492 4793 18 14 37 0 irqbalance
08:38:17:[ 1602.575027] [ 493] 998 493 134941 0 53 2131 0 polkitd
08:38:17:[ 1602.577041] [ 496] 81 496 9197 0 20 129 -900 dbus-daemon
08:38:17:[ 1602.579085] [ 509] 997 509 28962 0 28 98 0 chronyd
08:38:17:[ 1602.581105] [ 559] 0 559 67227 0 43 657 0 gssproxy
08:38:17:[ 1602.583118] [ 603] 0 603 64343 0 74 266 0 sssd
08:38:17:[ 1602.585096] [ 604] 0 604 90858 0 118 570 0 sssd_be
08:38:17:[ 1602.587095] [ 605] 0 605 127883 0 68 1074 0 NetworkManager
08:38:17:[ 1602.589144] [ 612] 0 612 30625 1 30 207 0 rpc.gssd
08:38:17:[ 1602.591150] [ 619] 0 619 65752 5 84 227 0 sssd_nss
08:38:17:[ 1602.593151] [ 620] 0 620 60653 0 73 238 0 sssd_pam
08:38:17:[ 1602.595417] [ 621] 0 621 59047 5 71 221 0 sssd_ssh
08:38:17:[ 1602.598098] [ 622] 0 622 67553 0 83 330 0 sssd_pac
08:38:17:[ 1602.600145] [ 635] 0 635 6048 1 16 77 0 systemd-logind
08:38:17:[ 1602.602201] [ 641] 0 641 28220 1 58 3120 0 dhclient
08:38:17:[ 1602.604259] [ 859] 0 859 138287 62 87 2611 0 tuned
08:38:17:[ 1602.606279] [ 860] 0 860 77477 0 46 285 0 rsyslogd
08:38:17:[ 1602.608319] [ 861] 0 861 26370 1 54 248 -1000 sshd
08:38:17:[ 1602.610259] [ 868] 0 868 6776 1 17 62 0 xinetd
08:38:17:[ 1602.612235] [ 875] 995 875 56858 0 23 263 0 munged
08:38:17:[ 1602.614120] [ 989] 0 989 22780 0 43 262 0 master
08:38:17:[ 1602.615980] [ 991] 89 991 25347 0 45 253 0 pickup
08:38:17:[ 1602.617764] [ 992] 89 992 25364 0 47 254 0 qmgr
08:38:17:[ 1602.619536] [ 1007] 0 1007 31555 0 18 155 0 crond
08:38:17:[ 1602.621255] [ 1009] 0 1009 149996 0 63 498 0 automount
08:38:17:[ 1602.623021] [ 1012] 0 1012 6461 0 17 51 0 atd
08:38:17:[ 1602.624633] [ 1043] 0 1043 27509 1 10 32 0 agetty
08:38:17:[ 1602.626296] [ 1045] 0 1045 27509 1 12 32 0 agetty
08:38:17:[ 1602.627878] [ 2247] 0 2247 40560 0 79 347 0 sshd
08:38:17:[ 1602.629439] [ 2249] 0 2249 28283 0 14 58 0 run_test.sh
08:38:17:[ 1602.630986] [ 2461] 0 2461 29424 28 15 671 0 bash
08:38:17:[ 1602.632479] [22429] 0 22429 29424 23 13 675 0 bash
08:38:17:[ 1602.633907] [22430] 0 22430 26975 0 9 27 0 tee
08:38:17:[ 1602.635313] [22600] 0 22600 29793 39 14 1016 0 bash
08:38:17:[ 1602.636706] [17680] 0 17680 29793 59 14 999 0 bash
08:38:17:[ 1602.638107] [17681] 0 17681 26975 1 9 26 0 tee
08:38:17:[ 1602.639460] [17855] 0 17855 2076 9 9 19 0 createmany
08:38:17:[ 1602.640891] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
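Note that the trace goes through check_panic_on_oom() and ends in a kernel panic rather than an ordinary OOM kill: the last log line says the system-wide panic_on_oom sysctl is enabled on the client, so any OOM condition takes the whole node down and leaves this trace in the console. A minimal sketch for inspecting that knob (reading works unprivileged; setting it requires root):

```shell
# Read the system-wide OOM behavior: 0 = kill a task and continue,
# 1 or 2 = panic the machine on OOM (as seen in the trace above).
# Setting it, e.g. "sysctl -w vm.panic_on_oom=1", requires root.
val=$(cat /proc/sys/vm/panic_on_oom 2>/dev/null || echo "unavailable")
echo "vm.panic_on_oom=$val"
```

With panic_on_oom enabled, the question is not why automount was chosen by the oom-killer but why client memory was exhausted at all while createmany was running.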
Logs for these two cases are at:
2017-05-16 - https://testing.hpdd.intel.com/test_sets/4f3d47bc-3a42-11e7-b81e-5254006e85c2
2017-05-16 - https://testing.hpdd.intel.com/test_sets/6d1be1a2-3a43-11e7-b861-5254006e85c2