LU-8070: sanity-scrub test_5 oom-killer and times out

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.9.0
    • None
    • autotest review-dne-part-2
    • 3

    Description

      sanity-scrub test 5 times out in review-dne-part-2.

      The last thing seen in the test_log is

      CMD: onyx-34vm3 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0003.oi_scrub |
      			awk '/^status/ { print \$2 }'
      Starting client: onyx-34vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
      CMD: onyx-34vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
      CMD: onyx-34vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
      CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
      CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
      fail_val=3
      fail_loc=0x190
      fail_val=3
      fail_loc=0x190
        File: `/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 144115205322834802  Links: 1
      Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2016-04-14 07:21:59.000000000 -0700
      Modify: 2016-04-14 07:21:20.000000000 -0700
      Change: 2016-04-14 07:22:00.000000000 -0700
        File: `/mnt/lustre/d5.sanity-scrub/mds2/f5.sanity-scrub800'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 162129603832316783  Links: 1
      Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2016-04-14 07:22:08.000000000 -0700
      Modify: 2016-04-14 07:21:25.000000000 -0700
      Change: 2016-04-14 07:22:08.000000000 -0700
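
      For context, the fail_val=3 / fail_loc=0x190 setting seen above is the fault-injection hook this test arms to slow down the OI scrub (0x190 appears to correspond to OBD_FAIL_OSD_SCRUB_DELAY in Lustre trees of this era), so each scrub iteration sleeps fail_val seconds. The following is only a minimal user-space sketch of that style of fault injection, not the real obd_support.h/cfs_fail machinery:

      /* Illustrative sketch only: mimics how an armed fail_loc site injects
       * an N-second delay per iteration, as the test does with
       *   lctl set_param fail_val=3 fail_loc=0x190
       */
      #include <stdio.h>
      #include <unistd.h>

      static unsigned int cfs_fail_loc;   /* would be set via lctl set_param */
      static unsigned int cfs_fail_val;

      #define FAIL_OSD_SCRUB_DELAY 0x190  /* hypothetical stand-in constant */

      static void fail_timeout(unsigned int site)
      {
              if (cfs_fail_loc != site)
                      return;
              printf("*** cfs_fail_loc=%x, sleeping %us ***\n",
                     site, cfs_fail_val);
              sleep(cfs_fail_val);        /* the injected per-iteration delay */
      }

      int main(void)
      {
              cfs_fail_loc = FAIL_OSD_SCRUB_DELAY;    /* what the test sets */
              cfs_fail_val = 3;

              for (int i = 0; i < 2; i++)
                      fail_timeout(FAIL_OSD_SCRUB_DELAY); /* one scrub step */
              return 0;
      }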
      

      Most of the recent failures show the following in the MDS console log:

      18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:12
      18:17:02:Lustre: Skipped 1 previous similar message
      18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
      18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
      18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
      18:17:02:Lustre: lustre-MDT0000: trigger OI scrub by RPC for [0x200000404:0x372:0x0], rc = 0 [2]
      18:17:02:Lustre: lustre-MDT0002-o: trigger OI scrub by RPC for [0x280000404:0x1:0x0], rc = 0 [1]
      18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
      18:17:02:Lustre: Skipped 11 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 2:13
      18:17:02:Lustre: Skipped 6 previous similar messages
      18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
      18:17:02:Lustre: Skipped 21 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
      18:17:02:Lustre: Skipped 197 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:08
      18:17:02:Lustre: Skipped 12 previous similar messages
      18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
      18:17:02:Lustre: Skipped 43 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      …
      18:34:04:Lustre: lustre-MDT0002: Recovery already passed deadline 46:15, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      18:34:04:Lustre: Skipped 299 previous similar messages
      18:34:04:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
      18:34:04:Lustre: Skipped 299 previous similar messages
      18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460597515/real 1460597515]  req@ffff88005ca7a080 x1531531004454316/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460597517 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 99 previous similar messages
      18:34:04:mdt_out00_000 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
      18:34:04:mdt_out00_000 cpuset=/ mems_allowed=0
      18:34:04:Pid: 24628, comm: mdt_out00_000 Not tainted 2.6.32-573.22.1.el6_lustre.x86_64 #1
      18:34:04:Call Trace:
      18:34:04: [<ffffffff810d7581>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      18:34:04: [<ffffffff8112ad80>] ? dump_header+0x90/0x1b0
      18:34:04: [<ffffffff8112aeee>] ? check_panic_on_oom+0x4e/0x80
      18:34:04: [<ffffffff8112b5db>] ? out_of_memory+0x1bb/0x3c0
      18:34:04: [<ffffffff8113801c>] ? __alloc_pages_nodemask+0x93c/0x950
      18:34:04: [<ffffffff81161a69>] ? __vmalloc_area_node+0xb9/0x190
      18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      18:34:04: [<ffffffff8116193d>] ? __vmalloc_node+0xad/0x120
      18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      18:34:04: [<ffffffff81161d59>] ? vzalloc_node+0x29/0x30
      18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      18:34:04: [<ffffffffa07c0c9d>] ? ptlrpc_main+0x114d/0x1800 [ptlrpc]
      18:34:04: [<ffffffffa07bfb50>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      18:34:04: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
      18:34:04: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      18:34:04: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      18:34:04: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      18:34:04:Mem-Info:
      18:34:04:Node 0 DMA per-cpu:
      18:34:04:CPU    0: hi:    0, btch:   1 usd:   0
      18:34:04:CPU    1: hi:    0, btch:   1 usd:   0
      18:34:04:Node 0 DMA32 per-cpu:
      18:34:04:CPU    0: hi:  186, btch:  31 usd:   0
      18:34:04:CPU    1: hi:  186, btch:  31 usd:   0
      18:34:04:active_anon:0 inactive_anon:10 isolated_anon:0
      18:34:04: active_file:137 inactive_file:100 isolated_file:0
      18:34:04: unevictable:0 dirty:7 writeback:0 unstable:0
      18:34:04: free:13252 slab_reclaimable:1920 slab_unreclaimable:331018
      18:34:04: mapped:0 shmem:0 pagetables:1420 bounce:0
      18:34:04:Node 0 DMA free:8340kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7180kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      18:34:04:lowmem_reserve[]: 0 2004 2004 2004
      18:34:04:Node 0 DMA32 free:44668kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:40kB active_file:548kB inactive_file:396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:28kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7680kB slab_unreclaimable:1316892kB kernel_stack:3920kB pagetables:5680kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:250 all_unreclaimable? no
      18:34:04:lowmem_reserve[]: 0 0 0 0
      18:34:04:Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8344kB
      18:34:04:Node 0 DMA32: 1491*4kB 364*8kB 143*16kB 81*32kB 59*64kB 40*128kB 18*256kB 12*512kB 7*1024kB 2*2048kB 0*4096kB = 44668kB
      18:34:04:237 total pagecache pages
      18:34:04:8 pages in swap cache
      18:34:04:Swap cache stats: add 21677, delete 21669, find 6474/8836
      18:34:04:Free swap  = 4105924kB
      18:34:04:Total swap = 4128764kB
      18:34:04:524284 pages RAM
      18:34:04:43737 pages reserved
      18:34:04:257 pages shared
      18:34:04:142747 pages non-shared
      18:34:04:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      18:34:04:[  456]     0   456     2722        0   0     -17         -1000 udevd
      18:34:04:[  763]     0   763     2702        0   0     -17         -1000 udevd
      18:34:04:[  915]     0   915     2701        0   1     -17         -1000 udevd
      18:34:04:[ 1175]     0  1175     2280        1   1       0             0 dhclient
      18:34:04:[ 1238]     0  1238    23289        1   0     -17         -1000 auditd
      18:34:04:[ 1272]     0  1272    63903        3   0       0             0 rsyslogd
      18:34:04:[ 1306]     0  1306     4561        1   0       0             0 irqbalance
      18:34:04:[ 1324]    32  1324     4744        1   0       0             0 rpcbind
      18:34:04:[ 1340]     0  1340    52834        1   1       0             0 sssd
      18:34:04:[ 1341]     0  1341    76810        1   0       0             0 sssd_be
      18:34:04:[ 1342]     0  1342    54194        1   0       0             0 sssd_nss
      18:34:04:[ 1343]     0  1343    50551        1   1       0             0 sssd_pam
      18:34:04:[ 1344]     0  1344    51904        1   0       0             0 sssd_ssh
      18:34:04:[ 1345]     0  1345    56336        1   1       0             0 sssd_pac
      18:34:04:[ 1367]    29  1367     6357        1   1       0             0 rpc.statd
      18:34:04:[ 1401]    81  1401     5878        1   0       0             0 dbus-daemon
      18:34:04:[ 1422]     0  1422    47238        1   1       0             0 cupsd
      18:34:04:[ 1466]     0  1466     1020        0   1       0             0 acpid
      18:34:04:[ 1478]    68  1478    10495        1   0       0             0 hald
      18:34:04:[ 1479]     0  1479     5100        1   1       0             0 hald-runner
      18:34:04:[ 1511]     0  1511     5630        1   1       0             0 hald-addon-inpu
      18:34:04:[ 1518]    68  1518     4502        1   1       0             0 hald-addon-acpi
      18:34:04:[ 1545]     0  1545   169286        1   1       0             0 automount
      18:34:04:[ 1599]     0  1599    26827        0   0       0             0 rpc.rquotad
      18:34:04:[ 1604]     0  1604     5417        0   0       0             0 rpc.mountd
      18:34:04:[ 1649]     0  1649     5774        1   0       0             0 rpc.idmapd
      18:34:04:[ 1686]   496  1686    56788        1   1       0             0 munged
      18:34:04:[ 8281]     0  8281     5429        1   1       0             0 xinetd
      18:34:04:[ 8373]     0  8373    20741        1   0       0             0 master
      18:34:04:[ 8384]    89  8384    20804        1   0       0             0 qmgr
      18:34:04:[ 8402]     0  8402    29216        1   0       0             0 crond
      18:34:04:[ 8420]     0  8420     5276        0   1       0             0 atd
      18:34:04:[ 8452]     0  8452    16119        1   0       0             0 certmonger
      18:34:04:[ 8488]     0  8488     1020        1   0       0             0 agetty
      18:34:04:[ 8492]     0  8492     1016        1   1       0             0 mingetty
      18:34:04:[ 8496]     0  8496     1016        1   1       0             0 mingetty
      18:34:04:[ 8502]     0  8502     1016        1   0       0             0 mingetty
      18:34:04:[ 8506]     0  8506     1016        1   0       0             0 mingetty
      18:34:04:[ 8508]     0  8508     1016        1   0       0             0 mingetty
      18:34:04:[ 8511]     0  8511     1016        1   0       0             0 mingetty
      18:34:04:[ 9627]     0  9627    16557        0   1     -17         -1000 sshd
      18:34:04:[10443]    38 10443     8207        1   0       0             0 ntpd
      18:34:04:[19865]    89 19865    20761        1   0       0             0 pickup
      18:34:04:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      18:34:04:
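
      In the backtrace above, ptlrpc_grow_req_bufs() is the routine that allocates additional incoming-request buffers for a service thread; if requests keep arriving but are never processed (here, because recovery never completes), that pool keeps growing until the node runs out of memory and panic_on_oom fires. A rough user-space caricature of that pattern (illustrative only, not Lustre code):

      /* Caricature of unbounded request-buffer growth: the pool is grown
       * whenever idle buffers run low, but the stuck consumer never returns
       * any, so allocations accumulate until the allocator fails. */
      #include <stdio.h>
      #include <stdlib.h>

      #define REQ_BUF_SIZE  (16 * 1024)
      #define LOW_WATERMARK 8

      int main(void)
      {
              long allocated = 0;
              long idle = 0;

              for (int tick = 0; tick < 1000; tick++) {
                      if (idle > 0)
                              idle--;       /* an incoming RPC takes a buffer */
                      while (idle < LOW_WATERMARK) {
                              if (malloc(REQ_BUF_SIZE) == NULL)
                                      goto out; /* where the real MDS hit OOM */
                              allocated++;
                              idle++;
                      }
                      /* nothing ever frees the consumed buffers here:
                       * recovery is stuck, so they are never released */
              }
      out:
              printf("grew the pool %ld times (~%ld kB never released)\n",
                     allocated, allocated * REQ_BUF_SIZE / 1024);
              return 0;
      }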
      

      Some recent failures with logs are at:
      https://testing.hpdd.intel.com/test_sets/d08cd624-02b3-11e6-9e5d-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/f4316e66-0284-11e6-9e5d-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/7ea41ce2-023f-11e6-83bc-5254006e85c2

          Activity

            [LU-8070] sanity-scrub test_5 oom-killer and times out

            yong.fan nasf (Inactive) added a comment -

            This is a duplicate of LU-7836.

            yong.fan nasf (Inactive) added a comment -

            According to the original bug description, there were many "ping" RPC timeouts on the peer MDT:

            18:17:02:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460596315/real 1460596315]  req@ffff88007ac75380 x1531531004437356/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460596317 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
            

            That matches the scenario described in LU-7836: too many duplicate final ping RPCs accumulated on the MDT during recovery, eventually exhausting RAM. The patch http://review.whamcloud.com/#/c/19693/ should already fix this.

            I will wait for Frank's feedback on LU-7836 before closing this ticket.

            yong.fan nasf (Inactive) added a comment -

            I checked a few failures; it seems the MDS recovery is stuck while fetching the update log, which causes the timeout. The OOM is a separate issue and is a consequence of this hang.

            Wangdi, you have set ptlrpc_request::rq_allow_replay unconditionally for osp_remote_sync() (although I do not think that is right in some cases). So even while in recovery, the OUT RPC for the update logs should still be sent out. Since it was not, is it possible something is wrong with the OSP import status? Or do we need more work to adjust the ptlrpc_import_delay_req() behaviour?
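
            For reference, the decision being questioned is roughly the following, shown here only as a simplified sketch rather than the real ptlrpc_import_delay_req() code: while an import is recovering, ordinary requests are held back, but a request flagged as allowed for replay (e.g. via rq_allow_replay) is supposed to be sent anyway.

            /* Simplified, self-contained sketch of the delay decision under
             * discussion; struct and function names are stand-ins, not the
             * actual ptlrpc structures. */
            #include <stdio.h>

            struct fake_import  { int in_recovery; };
            struct fake_request { int allow_replay; };

            static int import_should_delay(const struct fake_import *imp,
                                           const struct fake_request *req)
            {
                    if (!imp->in_recovery)
                            return 0;       /* import FULL: send normally */
                    if (req->allow_replay)
                            return 0;       /* e.g. OUT update-log fetch  */
                    return 1;               /* everything else waits      */
            }

            int main(void)
            {
                    struct fake_import  imp = { .in_recovery = 1 };
                    struct fake_request req = { .allow_replay = 1 };

                    printf("delay=%d (expected 0 for an allowed request)\n",
                           import_should_delay(&imp, &req));
                    return 0;
            }
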
            yong.fan nasf (Inactive) added a comment - edited

            The OI scrub is not related to connection operations; it only affects FID-based operations (when no name lookup happens first): if the OI mapping for the target FID is not yet ready, it returns -115 (-EINPROGRESS) to the caller. Generally, the caller or the related RPC sponsor needs to retry the request or RPC some time later in that case. In this test, triggering the OI scrub is expected. On the other hand, I wonder why the osp_remote_sync() RPC was not sent out or replied to? Since the OI scrub is a purely local operation, it should not affect ptlrpc-layer behaviour.
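
            As a side note, the -EINPROGRESS (-115) convention described above implies a retry loop on the caller's side; a generic sketch of that pattern (not the actual Lustre caller code):

            /* Generic retry-on-EINPROGRESS sketch: keep retrying the FID
             * lookup until the (hypothetical) OI mapping has been rebuilt. */
            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Stand-in for an operation that needs the OI mapping: it keeps
             * failing with -EINPROGRESS until the scrub has rebuilt it. */
            static int lookup_by_fid(int attempt)
            {
                    return attempt < 3 ? -EINPROGRESS : 0;
            }

            int main(void)
            {
                    int rc, attempt = 0;

                    do {
                            rc = lookup_by_fid(attempt++);
                            if (rc == -EINPROGRESS)
                                    sleep(1);   /* retry a bit later */
                    } while (rc == -EINPROGRESS);

                    printf("rc=%d after %d attempts\n", rc, attempt);
                    return rc ? 1 : 0;
            }
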
            di.wang Di Wang added a comment - edited

            Hmm, it looks like MDT0 is doing a scrub:

            [16482.903022] OI_scrub        S 000000000000d918     0 15926      2 0x00000080
            [16482.903022]  ffff88004f7b7bb0 0000000000000046 ffff880053a45080 ffff88004f7b7fd8
            [16482.903022]  ffff88004f7b7fd8 ffff88004f7b7fd8 ffff880053a45080 ffffffff81daba00
            [16482.903022]  ffff88004f7b7be0 0000000100f6f3a9 ffffffff81daba00 000000000000d918
            [16482.903022] Call Trace:
            [16482.903022]  [<ffffffff8163ba29>] schedule+0x29/0x70
            [16482.903022]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
            [16482.903022]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
            [16482.903022]  [<ffffffffa0c45852>] osd_scrub_next+0x2a2/0x3d0 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c497c5>] ? osd_scrub_exec+0x65/0x600 [osd_ldiskfs]
            [16482.903022]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
            [16482.903022]  [<ffffffffa0c4ae20>] osd_inode_iteration+0x310/0xd70 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c49760>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c455b0>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c4c1e0>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c4b880>] ? osd_inode_iteration+0xd70/0xd70 [osd_ldiskfs]
            [16482.903022]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [16482.903022]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [16482.903022]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
            [16482.903022]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            

            FanYong, will this stop MDT0 from accepting any connections?

            di.wang Di Wang added a comment -

            I checked a few failures; it seems the MDS recovery is stuck while fetching the update log, which causes the timeout. The OOM is a separate issue and is a consequence of this hang.

            [ 4947.935836] lod0000_rec0001 S 000000000000d918     0 19136      2 0x00000080
            [ 4947.935836]  ffff88004caff750 0000000000000046 ffff880035ee6780 ffff88004cafffd8
            [ 4947.935836]  ffff88004cafffd8 ffff88004cafffd8 ffff880035ee6780 ffff88007c0ac000
            [ 4947.935836]  ffff88004caff780 000000010046ebf8 ffff88007c0ac000 000000000000d918
            [ 4947.935836] Call Trace:
            [ 4947.935836]  [<ffffffff8163ba29>] schedule+0x29/0x70
            [ 4947.935836]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
            [ 4947.935836]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
            [ 4947.935836]  [<ffffffffa09ff8e0>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
            [ 4947.935836]  [<ffffffffa0a09670>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
            [ 4947.935836]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
            [ 4947.935836]  [<ffffffffa0a09b2d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [ 4947.935836]  [<ffffffffa0f77e99>] osp_remote_sync+0xd9/0x1c0 [osp]
            [ 4947.935836]  [<ffffffffa0f5ffdc>] osp_attr_get+0x44c/0x6e0 [osp]
            [ 4947.935836]  [<ffffffffa0f5b3b2>] osp_object_init+0x152/0x2c0 [osp]
            [ 4947.935836]  [<ffffffffa07f13ef>] lu_object_alloc+0xdf/0x310 [obdclass]
            [ 4947.935836]  [<ffffffffa07f17ec>] lu_object_find_try+0x16c/0x2b0 [obdclass]
            [ 4947.935836]  [<ffffffffa07f19dc>] lu_object_find_at+0xac/0xe0 [obdclass]
            [ 4947.935836]  [<ffffffffa06be498>] ? libcfs_debug_vmsg2+0x6f8/0xb60 [libcfs]
            [ 4947.935836]  [<ffffffffa07f2d68>] dt_locate_at+0x18/0xb0 [obdclass]
            [ 4947.935836]  [<ffffffffa07b3368>] llog_osd_open+0xf8/0xdf0 [obdclass]
            [ 4947.935836]  [<ffffffffa07a283a>] llog_open+0x13a/0x3b0 [obdclass]
            [ 4947.935836]  [<ffffffffa07aa1dd>] llog_cat_id2handle+0x28d/0x660 [obdclass]
            [ 4947.935836]  [<ffffffffa06be957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [ 4947.935836]  [<ffffffffa07ab989>] llog_cat_process_cb+0x139/0x600 [obdclass]
            [ 4947.935836]  [<ffffffffa07a59f5>] llog_process_thread+0x5a5/0x10c0 [obdclass]
            [ 4947.935836]  [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
            [ 4947.935836]  [<ffffffffa07a65cc>] llog_process_or_fork+0xbc/0x430 [obdclass]
            [ 4947.935836]  [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
            [ 4947.935836]  [<ffffffffa07aacd1>] llog_cat_process_or_fork+0x1b1/0x2f0 [obdclass]
            [ 4947.935836]  [<ffffffffa0ec50ca>] ? lod_sub_prep_llog+0x24a/0x7ec [lod]
            [ 4947.935836]  [<ffffffffa0e9c670>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
            [ 4947.935836]  [<ffffffffa07aae3e>] llog_cat_process+0x2e/0x30 [obdclass]
            [ 4947.935836]  [<ffffffffa0e992b6>] lod_sub_recovery_thread+0x3a6/0xaf0 [lod]
            [ 4947.935836]  [<ffffffffa0e98f10>] ? lod_obd_set_info_async+0xa90/0xa90 [lod]
            [ 4947.935836]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [ 4947.935836]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [ 4947.935836] tgt_recover_0   D 000000000000d918     0 19139      2 0x00000080
            [ 4947.935836]  ffff88004d4a7cc0 0000000000000046 ffff880035ee7300 ffff88004d4a7fd8
            [ 4947.935836]  ffff88004d4a7fd8 ffff88004d4a7fd8 ffff880035ee7300 ffff88007c0ac000
            [ 4947.935836]  ffff88004d4a7cf0 0000000100471ac7 ffff88007c0ac000 000000000000d918
            [ 4947.935836] Call Trace:
            [ 4947.935836]  [<ffffffff8163ba29>] schedule+0x29/0x70
            [ 4947.935836]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
            [ 4947.935836]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
            [ 4947.935836]  [<ffffffffa09d07e0>] ? handle_recovery_req+0x270/0x270 [ptlrpc]
            [ 4947.935836]  [<ffffffffa09d2198>] target_recovery_overseer+0xd8/0x650 [ptlrpc]
            [ 4947.935836]  [<ffffffffa09cecc0>] ? ldlm_dump_ns_write+0x90/0x90 [ptlrpc]
            [ 4947.935836]  [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
            [ 4947.935836]  [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
            [ 4947.935836]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [ 4947.935836]  [<ffffffffa09daf71>] target_recovery_thread+0x5d1/0x1360 [ptlrpc]
            [ 4947.935836]  [<ffffffff8163b3d8>] ? __schedule+0x2d8/0x900
            [ 4947.935836]  [<ffffffffa09da9a0>] ? replay_request_or_update.isra.20+0x8c0/0x8c0 [ptlrpc]
            [ 4947.935836]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [ 4947.935836]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            

            yong.fan nasf (Inactive) added a comment -

            This looks like another instance of the LU-7836 failure. Di has made a patch for that ticket: http://review.whamcloud.com/#/c/18915/. Let's see what happens after that patch is applied.

            yong.fan nasf (Inactive) added a comment -

            Before the OOM, there are a lot of warning messages about DNE (update) recovery failures/timeouts.

            01:16:18:Lustre: DEBUG MARKER: replay-single test_70b: @@@@@@ FAIL: import is not in FULL state
            01:16:18:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-04/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__38643__-70130437439340-230829/replay-single.test_70b.debug_log.$(hostname -s).1462435944.log;
            01:16:18:         dmesg > /logdir/test_logs/2016-05-04/lustre-revi
            01:16:18:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
            01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880046cb99c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:232/0 lens 808/0 e 0 to 0 dl 1462435987 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
            01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436102/real 1462436102]  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 151 to 1 dl 1462436104 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
            01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
            01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
            01:16:18:  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 153 to 1 dl 1462436123 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
            01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
            01:16:18:Lustre: lustre-MDT0000: Recovery already passed deadline 27:16, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
            01:16:18:Lustre: Skipped 471 previous similar messages
            01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
            01:38:29:Lustre: Skipped 469 previous similar messages
            01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
            01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880077a450c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:77/0 lens 808/0 e 0 to 0 dl 1462436587 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
            01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436707/real 1462436707]  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 206 to 1 dl 1462436709 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
            01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
            01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
            01:38:29:  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 208 to 1 dl 1462436728 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
            01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
            01:38:29:Lustre: lustre-MDT0000: Recovery already passed deadline 37:17, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
            01:38:29:Lustre: Skipped 470 previous similar messages
            01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
            01:38:29:Lustre: Skipped 470 previous similar messages
            01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
            ...
            

            The suspicion is that, because the DNE (update) recovery was blocked for a long time, and because the DNE (update) replay is not aborted unless that is done manually, the related (update) replay requests were resent/re-enqueued repeatedly (due to some improper handling) until they occupied or leaked too much RAM, which then caused the system OOM. I am not sure about that, but we could add some debug checks to the recovery failure/timeout messages. This may be related to the DNE (update) recovery logic introduced by the patches http://review.whamcloud.com/#/c/15793/ and http://review.whamcloud.com/#/c/17885/.

            Di, would you please check this, especially considering the case where DNE (update) recovery is blocked for hours? Thanks!

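            As a concrete form of the debug check suggested above, the deadline warning could also report how many replay requests are still queued and roughly how much memory their buffers pin, to confirm or rule out the resend/leak theory. A hypothetical sketch (names and values below are illustrative only, not existing Lustre code):

            /* Hypothetical extension of the "Recovery already passed
             * deadline" warning; struct fields and values are made up for
             * illustration. */
            #include <stdio.h>

            struct recovery_state {
                    long queued_replay_reqs;    /* requests still queued     */
                    long req_buf_kb;            /* memory pinned by buffers  */
                    long overrun_sec;           /* seconds past the deadline */
            };

            static void recovery_deadline_warning(const char *target,
                                                  const struct recovery_state *rs)
            {
                    printf("%s: recovery passed deadline by %lds: "
                           "%ld replay reqs queued, %ld kB in request buffers\n",
                           target, rs->overrun_sec,
                           rs->queued_replay_reqs, rs->req_buf_kb);
            }

            int main(void)
            {
                    /* made-up numbers, chosen only to exercise the message */
                    struct recovery_state rs = { 300, 1316892, 2775 };

                    recovery_deadline_warning("lustre-MDT0002", &rs);
                    return 0;
            }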

            jamesanunez James Nunez (Inactive) added a comment -

            It looks like the same issue is hitting replay-single test 70b. Logs are at
            2016-05-05 - https://testing.hpdd.intel.com/test_sets/289f1502-12e0-11e6-9e5d-5254006e85c2

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Fan Yong,
            Can you please investigate this issue?
            Thanks.
            Joe

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 5
