[LU-6914] replay-single test 80b hangs with MDS OOM Created: 27/Jul/15 Updated: 10/Oct/21 Resolved: 10/Oct/21
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | review-dne-part-2 in autotest |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
Replay-single test 80b hangs and times out in review-dne-part-2. Logs are at https://testing.hpdd.intel.com/test_sets/1791cbe4-32b5-11e5-8214-5254006e85c2 .

From the test_log, we see there's a problem with MDT0001:

onyx-43vm6: CMD: onyx-43vm6.onyx.hpdd.intel.com lctl get_param -n at_max
onyx-43vm5: CMD: onyx-43vm5.onyx.hpdd.intel.com lctl get_param -n at_max
onyx-43vm6: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT
onyx-43vm5: rpc : @@@@@@ FAIL: can't put import for mdc.lustre-MDT0001-mdc-*.mds_server_uuid into FULL state after 1475 sec, have REPLAY_WAIT
onyx-43vm5: Trace dump:
onyx-43vm5: = /usr/lib64/lustre/tests/test-framework.sh:4727:error_noexit()
onyx-43vm5: = /usr/lib64/lustre/tests/test-framework.sh:4758:error()
onyx-43vm5: = /usr/lib64/lustre/tests/test-framework.sh:5830:_wait_import_state()
onyx-43vm5: = /usr/lib64/lustre/tests/test-framework.sh:5849:wait_import_state()
onyx-43vm5: = /usr/lib64/lustre/tests/test-framework.sh:5858:wait_import_state_mount()
onyx-43vm5: = rpc.sh:20:main()

Looking at the MDT0001 console log, we see that the oom-killer was invoked:

20:37:02:mdt_out00_000 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
20:37:02:mdt_out00_000 cpuset=/ mems_allowed=0
20:37:03:Pid: 17922, comm: mdt_out00_000 Not tainted 2.6.32-504.23.4.el6_lustre.ge09a9ac.x86_64 #1
20:37:03:Call Trace:
20:37:03: [<ffffffff810d4671>] ? cpuset_print_task_mems_allowed+0x91/0xb0
20:37:03: [<ffffffff81127930>] ? dump_header+0x90/0x1b0
20:37:05: [<ffffffff81127a9e>] ? check_panic_on_oom+0x4e/0x80
20:37:06: [<ffffffff8112818b>] ? out_of_memory+0x1bb/0x3c0
20:37:07: [<ffffffff81134b2f>] ? __alloc_pages_nodemask+0x89f/0x8d0
20:37:07: [<ffffffff8115e239>] ? __vmalloc_area_node+0xb9/0x190
20:37:07: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
20:37:07: [<ffffffff8115e10d>] ? __vmalloc_node+0xad/0x120
20:37:07: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
20:37:07: [<ffffffff8115e529>] ? vzalloc_node+0x29/0x30
20:37:07: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
20:37:08: [<ffffffffa08334cd>] ? ptlrpc_main+0x126d/0x1920 [ptlrpc]
20:37:08: [<ffffffffa0832260>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
20:37:08: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
20:37:08: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
20:37:08: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
20:37:08: [<ffffffff8100c280>] ? child_rip+0x0/0x20
20:37:09:Mem-Info:
20:37:09:Node 0 DMA per-cpu:
20:37:09:CPU 0: hi: 0, btch: 1 usd: 0
20:37:09:CPU 1: hi: 0, btch: 1 usd: 0
20:37:09:Node 0 DMA32 per-cpu:
20:37:10:CPU 0: hi: 186, btch: 31 usd: 30
20:37:10:CPU 1: hi: 186, btch: 31 usd: 0
20:37:10:active_anon:925 inactive_anon:986 isolated_anon:0
20:37:11: active_file:134 inactive_file:165 isolated_file:0
20:37:11: unevictable:0 dirty:1 writeback:992 unstable:0
20:37:11: free:13246 slab_reclaimable:1952 slab_unreclaimable:389562
20:37:11: mapped:0 shmem:1 pagetables:1394 bounce:0
20:37:11:Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7220kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
20:37:11:lowmem_reserve[]: 0 2004 2004 2004
20:37:11:Node 0 DMA32 free:44640kB min:44720kB low:55900kB high:67080kB active_anon:3700kB inactive_anon:3944kB active_file:536kB inactive_file:656kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:4kB writeback:3968kB mapped:0kB shmem:4kB slab_reclaimable:7808kB slab_unreclaimable:1551028kB kernel_stack:2216kB pagetables:5576kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:13568 all_unreclaimable? yes
20:37:11:lowmem_reserve[]: 0 0 0 0
20:37:12:Node 0 DMA: 1*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8348kB
20:37:12:Node 0 DMA32: 1042*4kB 627*8kB 428*16kB 228*32kB 97*64kB 46*128kB 20*256kB 6*512kB 1*1024kB 0*2048kB 0*4096kB = 44640kB
20:37:12:1345 total pagecache pages
20:37:12:1043 pages in swap cache
20:37:12:Swap cache stats: add 5108, delete 4065, find 79/118
20:37:13:Free swap = 4109748kB
20:37:13:Total swap = 4128764kB
20:37:15:524284 pages RAM
20:37:16:43706 pages reserved
20:37:16:375 pages shared
20:37:16:91700 pages non-shared
20:37:17:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
20:37:17:[ 420] 0 420 2729 1 0 -17 -1000 udevd
20:37:17:[ 703] 0 703 2728 0 0 -17 -1000 udevd
20:37:17:[ 833] 0 833 2662 0 0 -17 -1000 udevd
20:37:18:[ 1072] 0 1072 2280 6 1 0 0 dhclient
20:37:18:[ 1125] 0 1125 6905 15 0 -17 -1000 auditd
20:37:18:[ 1155] 0 1155 63855 133 1 0 0 rsyslogd
20:37:18:[ 1185] 0 1185 4560 21 0 0 0 irqbalance
20:37:18:[ 1201] 32 1201 4744 15 0 0 0 rpcbind
20:37:19:[ 1215] 0 1215 52788 43 0 0 0 sssd
20:37:19:[ 1216] 0 1216 70940 62 0 0 0 sssd_be
20:37:19:[ 1217] 0 1217 53458 36 0 0 0 sssd_nss
20:37:19:[ 1218] 0 1218 50504 35 0 0 0 sssd_pam
20:37:19:[ 1219] 0 1219 49990 34 0 0 0 sssd_ssh
20:37:19:[ 1220] 0 1220 55084 33 0 0 0 sssd_pac
20:37:19:[ 1239] 29 1239 6357 1 1 0 0 rpc.statd
20:37:19:[ 1357] 81 1357 5878 1 1 0 0 dbus-daemon
20:37:20:[ 1374] 0 1374 47233 1 1 0 0 cupsd
20:37:20:[ 1412] 0 1412 1020 0 0 0 0 acpid
20:37:20:[ 1422] 68 1422 10484 76 0 0 0 hald
20:37:20:[ 1423] 0 1423 5100 1 1 0 0 hald-runner
20:37:20:[ 1455] 0 1455 5630 1 1 0 0 hald-addon-inpu
20:37:20:[ 1465] 68 1465 4502 1 1 0 0 hald-addon-acpi
20:37:20:[ 1485] 0 1485 169286 55 0 0 0 automount
20:37:20:[ 1533] 0 1533 26827 0 1 0 0 rpc.rquotad
20:37:21:[ 1538] 0 1538 5417 0 1 0 0 rpc.mountd
20:37:21:[ 1579] 0 1579 5773 1 0 0 0 rpc.idmapd
20:37:21:[ 1612] 496 1612 56787 71 0 0 0 munged
20:37:21:[ 1630] 0 1630 16553 0 1 -17 -1000 sshd
20:37:22:[ 1639] 0 1639 5429 18 1 0 0 xinetd
20:37:22:[ 1725] 0 1725 20734 21 0 0 0 master
20:37:22:[ 1743] 89 1743 20797 16 0 0 0 qmgr
20:37:22:[ 1750] 0 1750 29215 20 0 0 0 crond
20:37:22:[ 1763] 0 1763 5276 0 1 0 0 atd
20:37:22:[ 1791] 0 1791 16058 3 1 0 0 certmonger
20:37:22:[ 1829] 0 1829 1020 2 1 0 0 agetty
20:37:22:[ 1831] 0 1831 1016 1 0 0 0 mingetty
20:37:22:[ 1833] 0 1833 1016 1 0 0 0 mingetty
20:37:22:[ 1835] 0 1835 1016 1 0 0 0 mingetty
20:37:22:[ 1837] 0 1837 1016 1 0 0 0 mingetty
20:37:22:[ 1839] 0 1839 1016 3 0 0 0 mingetty
20:37:22:[ 1841] 0 1841 1016 3 1 0 0 mingetty
20:37:22:[ 2874] 38 2874 8205 6 0 0 0 ntpd
20:37:22:[ 900] 89 900 20754 218 0 0 0 pickup
20:37:22:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
20:37:23:
20:37:23:Pid: 17922, comm: mdt_out00_000 Not tainted 2.6.32-504.23.4.el6_lustre.ge09a9ac.x86_64 #1
20:37:23:Call Trace:
20:37:23: [<ffffffff81529bbc>] ? panic+0xa7/0x16f
20:37:23: [<ffffffff811279a1>] ? dump_header+0x101/0x1b0
20:37:23: [<ffffffff81127acc>] ? check_panic_on_oom+0x7c/0x80
20:37:23: [<ffffffff8112818b>] ? out_of_memory+0x1bb/0x3c0
20:37:24: [<ffffffff81134b2f>] ? __alloc_pages_nodemask+0x89f/0x8d0
20:37:24: [<ffffffff8115e239>] ? __vmalloc_area_node+0xb9/0x190
20:37:24: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
20:37:24: [<ffffffff8115e10d>] ? __vmalloc_node+0xad/0x120
20:37:24: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
20:37:24: [<ffffffff8115e529>] ? vzalloc_node+0x29/0x30
20:37:24: [<ffffffffa08306ae>] ? ptlrpc_grow_req_bufs+0x33e/0x7e0 [ptlrpc]
20:37:24: [<ffffffffa08334cd>] ? ptlrpc_main+0x126d/0x1920 [ptlrpc]
20:37:24: [<ffffffffa0832260>] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
20:37:25: [<ffffffff8109e78e>] ? kthread+0x9e/0xc0
20:37:25: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
20:37:25: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
20:37:25: [<ffffffff8100c280>] ? child_rip+0x0/0x20
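For context on the FAIL message above: wait_import_state() in test-framework.sh polls the mdc.lustre-MDT0001-mdc-*.mds_server_uuid parameter and compares the connection state it reports against the expected FULL until at_max-derived timeout. A minimal sketch of that classification step, using a simulated parameter line since no live Lustre system is assumed here (the "UUID<tab>STATE" layout of the parameter output is an assumption, not taken from this log):

```shell
# Sketch of how an import state check could classify the stuck import.
# On a live system the line would come from:
#   lctl get_param -n mdc.lustre-MDT0001-mdc-*.mds_server_uuid
# Here we simulate it; the UUID<tab>STATE layout is an assumption.
sample_line=$(printf 'lustre-MDT0001_UUID\tREPLAY_WAIT')

# Extract the state field (cut splits on tab by default).
state=$(printf '%s\n' "$sample_line" | cut -f2)

if [ "$state" = "FULL" ]; then
    echo "import recovered"
else
    echo "import stuck in $state"
fi
```

With the REPLAY_WAIT state seen in this failure, the sketch reports "import stuck in REPLAY_WAIT", which matches why the rpc.sh check kept failing until the 1475-second timeout.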