Details
Type: Bug
Priority: Minor
Resolution: Duplicate
Affects Version/s: Lustre 2.9.0
Labels: autotest review-dne-part-2
Severity: 3
Description
sanity-scrub test_5 times out in review-dne-part-2 testing.
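For anyone trying to reproduce this outside autotest, the subtest can be run on its own with the standard Lustre test framework. The invocation below is only a sketch; the tests directory and MDSCOUNT value are assumptions about the local setup, not taken from this run:

# Run only sanity-scrub test_5 from a Lustre tests tree (paths/MDS count are placeholders)
cd /usr/lib64/lustre/tests
MDSCOUNT=4 ONLY=5 bash sanity-scrub.sh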
The last thing seen in the test_log is:
CMD: onyx-34vm3 /usr/sbin/lctl get_param -n osd-ldiskfs.lustre-MDT0003.oi_scrub | awk '/^status/ { print \$2 }'
Starting client: onyx-34vm1.onyx.hpdd.intel.com: -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
CMD: onyx-34vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-34vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
fail_val=3
fail_loc=0x190
fail_val=3
fail_loc=0x190
  File: `/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800'
  Size: 0         Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 144115205322834802  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-04-14 07:21:59.000000000 -0700
Modify: 2016-04-14 07:21:20.000000000 -0700
Change: 2016-04-14 07:22:00.000000000 -0700
  File: `/mnt/lustre/d5.sanity-scrub/mds2/f5.sanity-scrub800'
  Size: 0         Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d    Inode: 162129603832316783  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-04-14 07:22:08.000000000 -0700
Modify: 2016-04-14 07:21:25.000000000 -0700
Change: 2016-04-14 07:22:08.000000000 -0700
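For reference when re-running this by hand, the fault injection the test relies on can be set, watched, and cleared directly on the server nodes. The commands below are a sketch based on the test_log commands above; 0x190 is read here as the OI scrub delay fail_loc (an interpretation, not confirmed in this ticket), and the MDT name is simply the one from this run:

# Slow down OI scrub iterations on the servers (values copied from the test_log above)
lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
lctl set_param fail_val=3 fail_loc=0x190

# Watch scrub status on one MDT, as the test does
lctl get_param -n osd-ldiskfs.lustre-MDT0003.oi_scrub | awk '/^status/ { print $2 }'

# Clear the fault injection afterwards
lctl set_param fail_val=0 fail_loc=0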
Most of the recent failures show the following in the MDS console log:
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:12
18:17:02:Lustre: Skipped 1 previous similar message
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
18:17:02:Lustre: lustre-MDT0000: trigger OI scrub by RPC for [0x200000404:0x372:0x0], rc = 0 [2]
18:17:02:Lustre: lustre-MDT0002-o: trigger OI scrub by RPC for [0x280000404:0x1:0x0], rc = 0 [1]
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 11 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 2:13
18:17:02:Lustre: Skipped 6 previous similar messages
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 21 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
18:17:02:Lustre: Skipped 197 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:08
18:17:02:Lustre: Skipped 12 previous similar messages
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 43 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
…
18:34:04:Lustre: lustre-MDT0002: Recovery already passed deadline 46:15, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
18:34:04:Lustre: Skipped 299 previous similar messages
18:34:04:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
18:34:04:Lustre: Skipped 299 previous similar messages
18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460597515/real 1460597515] req@ffff88005ca7a080 x1531531004454316/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460597517 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 99 previous similar messages
18:34:04:mdt_out00_000 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
18:34:04:mdt_out00_000 cpuset=/ mems_allowed=0
18:34:04:Pid: 24628, comm: mdt_out00_000 Not tainted 2.6.32-573.22.1.el6_lustre.x86_64 #1
18:34:04:Call Trace:
18:34:04: [<ffffffff810d7581>] ? cpuset_print_task_mems_allowed+0x91/0xb0
18:34:04: [<ffffffff8112ad80>] ? dump_header+0x90/0x1b0
18:34:04: [<ffffffff8112aeee>] ? check_panic_on_oom+0x4e/0x80
18:34:04: [<ffffffff8112b5db>] ? out_of_memory+0x1bb/0x3c0
18:34:04: [<ffffffff8113801c>] ? __alloc_pages_nodemask+0x93c/0x950
18:34:04: [<ffffffff81161a69>] ? __vmalloc_area_node+0xb9/0x190
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffff8116193d>] ? __vmalloc_node+0xad/0x120
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffff81161d59>] ? vzalloc_node+0x29/0x30
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffffa07c0c9d>] ? ptlrpc_main+0x114d/0x1800 [ptlrpc]
18:34:04: [<ffffffffa07bfb50>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
18:34:04: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
18:34:04: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
18:34:04: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
18:34:04: [<ffffffff8100c280>] ? child_rip+0x0/0x20
18:34:04:Mem-Info:
18:34:04:Node 0 DMA per-cpu:
18:34:04:CPU 0: hi: 0, btch: 1 usd: 0
18:34:04:CPU 1: hi: 0, btch: 1 usd: 0
18:34:04:Node 0 DMA32 per-cpu:
18:34:04:CPU 0: hi: 186, btch: 31 usd: 0
18:34:04:CPU 1: hi: 186, btch: 31 usd: 0
18:34:04:active_anon:0 inactive_anon:10 isolated_anon:0
18:34:04: active_file:137 inactive_file:100 isolated_file:0
18:34:04: unevictable:0 dirty:7 writeback:0 unstable:0
18:34:04: free:13252 slab_reclaimable:1920 slab_unreclaimable:331018
18:34:04: mapped:0 shmem:0 pagetables:1420 bounce:0
18:34:04:Node 0 DMA free:8340kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7180kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
18:34:04:lowmem_reserve[]: 0 2004 2004 2004
18:34:04:Node 0 DMA32 free:44668kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:40kB active_file:548kB inactive_file:396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:28kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7680kB slab_unreclaimable:1316892kB kernel_stack:3920kB pagetables:5680kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:250 all_unreclaimable? no
18:34:04:lowmem_reserve[]: 0 0 0 0
18:34:04:Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8344kB
18:34:04:Node 0 DMA32: 1491*4kB 364*8kB 143*16kB 81*32kB 59*64kB 40*128kB 18*256kB 12*512kB 7*1024kB 2*2048kB 0*4096kB = 44668kB
18:34:04:237 total pagecache pages
18:34:04:8 pages in swap cache
18:34:04:Swap cache stats: add 21677, delete 21669, find 6474/8836
18:34:04:Free swap = 4105924kB
18:34:04:Total swap = 4128764kB
18:34:04:524284 pages RAM
18:34:04:43737 pages reserved
18:34:04:257 pages shared
18:34:04:142747 pages non-shared
18:34:04:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
18:34:04:[  456]   0   456   2722    0  0 -17 -1000 udevd
18:34:04:[  763]   0   763   2702    0  0 -17 -1000 udevd
18:34:04:[  915]   0   915   2701    0  1 -17 -1000 udevd
18:34:04:[ 1175]   0  1175   2280    1  1   0     0 dhclient
18:34:04:[ 1238]   0  1238  23289    1  0 -17 -1000 auditd
18:34:04:[ 1272]   0  1272  63903    3  0   0     0 rsyslogd
18:34:04:[ 1306]   0  1306   4561    1  0   0     0 irqbalance
18:34:04:[ 1324]  32  1324   4744    1  0   0     0 rpcbind
18:34:04:[ 1340]   0  1340  52834    1  1   0     0 sssd
18:34:04:[ 1341]   0  1341  76810    1  0   0     0 sssd_be
18:34:04:[ 1342]   0  1342  54194    1  0   0     0 sssd_nss
18:34:04:[ 1343]   0  1343  50551    1  1   0     0 sssd_pam
18:34:04:[ 1344]   0  1344  51904    1  0   0     0 sssd_ssh
18:34:04:[ 1345]   0  1345  56336    1  1   0     0 sssd_pac
18:34:04:[ 1367]  29  1367   6357    1  1   0     0 rpc.statd
18:34:04:[ 1401]  81  1401   5878    1  0   0     0 dbus-daemon
18:34:04:[ 1422]   0  1422  47238    1  1   0     0 cupsd
18:34:04:[ 1466]   0  1466   1020    0  1   0     0 acpid
18:34:04:[ 1478]  68  1478  10495    1  0   0     0 hald
18:34:04:[ 1479]   0  1479   5100    1  1   0     0 hald-runner
18:34:04:[ 1511]   0  1511   5630    1  1   0     0 hald-addon-inpu
18:34:04:[ 1518]  68  1518   4502    1  1   0     0 hald-addon-acpi
18:34:04:[ 1545]   0  1545 169286    1  1   0     0 automount
18:34:04:[ 1599]   0  1599  26827    0  0   0     0 rpc.rquotad
18:34:04:[ 1604]   0  1604   5417    0  0   0     0 rpc.mountd
18:34:04:[ 1649]   0  1649   5774    1  0   0     0 rpc.idmapd
18:34:04:[ 1686] 496  1686  56788    1  1   0     0 munged
18:34:04:[ 8281]   0  8281   5429    1  1   0     0 xinetd
18:34:04:[ 8373]   0  8373  20741    1  0   0     0 master
18:34:04:[ 8384]  89  8384  20804    1  0   0     0 qmgr
18:34:04:[ 8402]   0  8402  29216    1  0   0     0 crond
18:34:04:[ 8420]   0  8420   5276    0  1   0     0 atd
18:34:04:[ 8452]   0  8452  16119    1  0   0     0 certmonger
18:34:04:[ 8488]   0  8488   1020    1  0   0     0 agetty
18:34:04:[ 8492]   0  8492   1016    1  1   0     0 mingetty
18:34:04:[ 8496]   0  8496   1016    1  1   0     0 mingetty
18:34:04:[ 8502]   0  8502   1016    1  0   0     0 mingetty
18:34:04:[ 8506]   0  8506   1016    1  0   0     0 mingetty
18:34:04:[ 8508]   0  8508   1016    1  0   0     0 mingetty
18:34:04:[ 8511]   0  8511   1016    1  0   0     0 mingetty
18:34:04:[ 9627]   0  9627  16557    0  1 -17 -1000 sshd
18:34:04:[10443]  38 10443   8207    1  0   0     0 ntpd
18:34:04:[19865]  89 19865  20761    1  0   0     0 pickup
18:34:04:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
18:34:04:
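The console log suggests MDT0002 recovery never completes while the scrub delay is injected, and the MDS eventually panics on OOM (panic_on_oom is enabled) with most memory in unreclaimable slab allocated via ptlrpc_grow_req_bufs(). A few standard commands that would help confirm the stuck-recovery / memory-growth picture on the MDS while the test is hung; this is only a sketch, none of it is taken from this run:

# Per-MDT recovery state; look for status: RECOVERING and the connected/completed client counts
lctl get_param mdt.*.recovery_status

# Unreclaimable slab growth, which the OOM report above shows at roughly 1.3 GB
grep -E 'MemFree|SUnreclaim' /proc/meminfo
slabtop -o -s c | head -20

# Whether the node is configured to panic rather than kill tasks on OOM
cat /proc/sys/vm/panic_on_oom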
Some recent failures with logs are at:
https://testing.hpdd.intel.com/test_sets/d08cd624-02b3-11e6-9e5d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f4316e66-0284-11e6-9e5d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7ea41ce2-023f-11e6-83bc-5254006e85c2
Issue Links
- is related to LU-7836 MDSes crashed with oom-killer (Resolved)