LU-8070: sanity-scrub test_5 oom-killer and times out

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.9.0
    • None
    • autotest review-dne-part-2
    • 3

    Description

      sanity-scrub test 5 times out in review-dne-part-2.

      The last thing seen in the test_log is

      CMD: onyx-34vm3 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0003.oi_scrub |
      			awk '/^status/ { print \$2 }'
      Starting client: onyx-34vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
      CMD: onyx-34vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
      CMD: onyx-34vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
      CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
      CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
      fail_val=3
      fail_loc=0x190
      fail_val=3
      fail_loc=0x190
        File: `/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 144115205322834802  Links: 1
      Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2016-04-14 07:21:59.000000000 -0700
      Modify: 2016-04-14 07:21:20.000000000 -0700
      Change: 2016-04-14 07:22:00.000000000 -0700
        File: `/mnt/lustre/d5.sanity-scrub/mds2/f5.sanity-scrub800'
        Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
      Device: 2c54f966h/743766374d	Inode: 162129603832316783  Links: 1
      Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2016-04-14 07:22:08.000000000 -0700
      Modify: 2016-04-14 07:21:25.000000000 -0700
      Change: 2016-04-14 07:22:08.000000000 -0700
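
      For context, the fail_val=3 / fail_loc=0x190 setting seen above is the fault-injection hook this test arms to slow down the OI scrub (0x190 appears to correspond to OBD_FAIL_OSD_SCRUB_DELAY in Lustre trees of this era), so each scrub iteration sleeps fail_val seconds. The following is only a minimal user-space sketch of that style of fault injection, not the real obd_support.h/cfs_fail machinery:

      /* Illustrative sketch only: mimics how an armed fail_loc site injects
       * an N-second delay per iteration, as the test does with
       *   lctl set_param fail_val=3 fail_loc=0x190
       */
      #include <stdio.h>
      #include <unistd.h>

      static unsigned int cfs_fail_loc;   /* would be set via lctl set_param */
      static unsigned int cfs_fail_val;

      #define FAIL_OSD_SCRUB_DELAY 0x190  /* hypothetical stand-in constant */

      static void fail_timeout(unsigned int site)
      {
              if (cfs_fail_loc != site)
                      return;
              printf("*** cfs_fail_loc=%x, sleeping %us ***\n",
                     site, cfs_fail_val);
              sleep(cfs_fail_val);        /* the injected per-iteration delay */
      }

      int main(void)
      {
              cfs_fail_loc = FAIL_OSD_SCRUB_DELAY;    /* what the test sets */
              cfs_fail_val = 3;

              for (int i = 0; i < 2; i++)
                      fail_timeout(FAIL_OSD_SCRUB_DELAY); /* one scrub step */
              return 0;
      }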
      

      Most of the recent failures show the following in the MDS console log:

      18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:12
      18:17:02:Lustre: Skipped 1 previous similar message
      18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
      18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
      18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
      18:17:02:Lustre: lustre-MDT0000: trigger OI scrub by RPC for [0x200000404:0x372:0x0], rc = 0 [2]
      18:17:02:Lustre: lustre-MDT0002-o: trigger OI scrub by RPC for [0x280000404:0x1:0x0], rc = 0 [1]
      18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
      18:17:02:Lustre: Skipped 11 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 2:13
      18:17:02:Lustre: Skipped 6 previous similar messages
      18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
      18:17:02:Lustre: Skipped 21 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
      18:17:02:Lustre: Skipped 197 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:08
      18:17:02:Lustre: Skipped 12 previous similar messages
      18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
      18:17:02:Lustre: Skipped 43 previous similar messages
      18:17:02:Lustre: lustre-MDT0002: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      …
      18:34:04:Lustre: lustre-MDT0002: Recovery already passed deadline 46:15, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
      18:34:04:Lustre: Skipped 299 previous similar messages
      18:34:04:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
      18:34:04:Lustre: Skipped 299 previous similar messages
      18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460597515/real 1460597515]  req@ffff88005ca7a080 x1531531004454316/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460597517 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
      18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 99 previous similar messages
      18:34:04:mdt_out00_000 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
      18:34:04:mdt_out00_000 cpuset=/ mems_allowed=0
      18:34:04:Pid: 24628, comm: mdt_out00_000 Not tainted 2.6.32-573.22.1.el6_lustre.x86_64 #1
      18:34:04:Call Trace:
      18:34:04: [<ffffffff810d7581>] ? cpuset_print_task_mems_allowed+0x91/0xb0
      18:34:04: [<ffffffff8112ad80>] ? dump_header+0x90/0x1b0
      18:34:04: [<ffffffff8112aeee>] ? check_panic_on_oom+0x4e/0x80
      18:34:04: [<ffffffff8112b5db>] ? out_of_memory+0x1bb/0x3c0
      18:34:04: [<ffffffff8113801c>] ? __alloc_pages_nodemask+0x93c/0x950
      18:34:04: [<ffffffff81161a69>] ? __vmalloc_area_node+0xb9/0x190
      18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      18:34:04: [<ffffffff8116193d>] ? __vmalloc_node+0xad/0x120
      18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      18:34:04: [<ffffffff81161d59>] ? vzalloc_node+0x29/0x30
      18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
      18:34:04: [<ffffffffa07c0c9d>] ? ptlrpc_main+0x114d/0x1800 [ptlrpc]
      18:34:04: [<ffffffffa07bfb50>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
      18:34:04: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
      18:34:04: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      18:34:04: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
      18:34:04: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      18:34:04:Mem-Info:
      18:34:04:Node 0 DMA per-cpu:
      18:34:04:CPU    0: hi:    0, btch:   1 usd:   0
      18:34:04:CPU    1: hi:    0, btch:   1 usd:   0
      18:34:04:Node 0 DMA32 per-cpu:
      18:34:04:CPU    0: hi:  186, btch:  31 usd:   0
      18:34:04:CPU    1: hi:  186, btch:  31 usd:   0
      18:34:04:active_anon:0 inactive_anon:10 isolated_anon:0
      18:34:04: active_file:137 inactive_file:100 isolated_file:0
      18:34:04: unevictable:0 dirty:7 writeback:0 unstable:0
      18:34:04: free:13252 slab_reclaimable:1920 slab_unreclaimable:331018
      18:34:04: mapped:0 shmem:0 pagetables:1420 bounce:0
      18:34:04:Node 0 DMA free:8340kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7180kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
      18:34:04:lowmem_reserve[]: 0 2004 2004 2004
      18:34:04:Node 0 DMA32 free:44668kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:40kB active_file:548kB inactive_file:396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:28kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7680kB slab_unreclaimable:1316892kB kernel_stack:3920kB pagetables:5680kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:250 all_unreclaimable? no
      18:34:04:lowmem_reserve[]: 0 0 0 0
      18:34:04:Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8344kB
      18:34:04:Node 0 DMA32: 1491*4kB 364*8kB 143*16kB 81*32kB 59*64kB 40*128kB 18*256kB 12*512kB 7*1024kB 2*2048kB 0*4096kB = 44668kB
      18:34:04:237 total pagecache pages
      18:34:04:8 pages in swap cache
      18:34:04:Swap cache stats: add 21677, delete 21669, find 6474/8836
      18:34:04:Free swap  = 4105924kB
      18:34:04:Total swap = 4128764kB
      18:34:04:524284 pages RAM
      18:34:04:43737 pages reserved
      18:34:04:257 pages shared
      18:34:04:142747 pages non-shared
      18:34:04:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      18:34:04:[  456]     0   456     2722        0   0     -17         -1000 udevd
      18:34:04:[  763]     0   763     2702        0   0     -17         -1000 udevd
      18:34:04:[  915]     0   915     2701        0   1     -17         -1000 udevd
      18:34:04:[ 1175]     0  1175     2280        1   1       0             0 dhclient
      18:34:04:[ 1238]     0  1238    23289        1   0     -17         -1000 auditd
      18:34:04:[ 1272]     0  1272    63903        3   0       0             0 rsyslogd
      18:34:04:[ 1306]     0  1306     4561        1   0       0             0 irqbalance
      18:34:04:[ 1324]    32  1324     4744        1   0       0             0 rpcbind
      18:34:04:[ 1340]     0  1340    52834        1   1       0             0 sssd
      18:34:04:[ 1341]     0  1341    76810        1   0       0             0 sssd_be
      18:34:04:[ 1342]     0  1342    54194        1   0       0             0 sssd_nss
      18:34:04:[ 1343]     0  1343    50551        1   1       0             0 sssd_pam
      18:34:04:[ 1344]     0  1344    51904        1   0       0             0 sssd_ssh
      18:34:04:[ 1345]     0  1345    56336        1   1       0             0 sssd_pac
      18:34:04:[ 1367]    29  1367     6357        1   1       0             0 rpc.statd
      18:34:04:[ 1401]    81  1401     5878        1   0       0             0 dbus-daemon
      18:34:04:[ 1422]     0  1422    47238        1   1       0             0 cupsd
      18:34:04:[ 1466]     0  1466     1020        0   1       0             0 acpid
      18:34:04:[ 1478]    68  1478    10495        1   0       0             0 hald
      18:34:04:[ 1479]     0  1479     5100        1   1       0             0 hald-runner
      18:34:04:[ 1511]     0  1511     5630        1   1       0             0 hald-addon-inpu
      18:34:04:[ 1518]    68  1518     4502        1   1       0             0 hald-addon-acpi
      18:34:04:[ 1545]     0  1545   169286        1   1       0             0 automount
      18:34:04:[ 1599]     0  1599    26827        0   0       0             0 rpc.rquotad
      18:34:04:[ 1604]     0  1604     5417        0   0       0             0 rpc.mountd
      18:34:04:[ 1649]     0  1649     5774        1   0       0             0 rpc.idmapd
      18:34:04:[ 1686]   496  1686    56788        1   1       0             0 munged
      18:34:04:[ 8281]     0  8281     5429        1   1       0             0 xinetd
      18:34:04:[ 8373]     0  8373    20741        1   0       0             0 master
      18:34:04:[ 8384]    89  8384    20804        1   0       0             0 qmgr
      18:34:04:[ 8402]     0  8402    29216        1   0       0             0 crond
      18:34:04:[ 8420]     0  8420     5276        0   1       0             0 atd
      18:34:04:[ 8452]     0  8452    16119        1   0       0             0 certmonger
      18:34:04:[ 8488]     0  8488     1020        1   0       0             0 agetty
      18:34:04:[ 8492]     0  8492     1016        1   1       0             0 mingetty
      18:34:04:[ 8496]     0  8496     1016        1   1       0             0 mingetty
      18:34:04:[ 8502]     0  8502     1016        1   0       0             0 mingetty
      18:34:04:[ 8506]     0  8506     1016        1   0       0             0 mingetty
      18:34:04:[ 8508]     0  8508     1016        1   0       0             0 mingetty
      18:34:04:[ 8511]     0  8511     1016        1   0       0             0 mingetty
      18:34:04:[ 9627]     0  9627    16557        0   1     -17         -1000 sshd
      18:34:04:[10443]    38 10443     8207        1   0       0             0 ntpd
      18:34:04:[19865]    89 19865    20761        1   0       0             0 pickup
      18:34:04:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
      18:34:04:
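
      In the backtrace above, ptlrpc_grow_req_bufs() is the routine that allocates additional incoming-request buffers for a service thread; if requests keep arriving but are never processed (here, because recovery never completes), that pool keeps growing until the node runs out of memory and panic_on_oom fires. A rough user-space caricature of that pattern (illustrative only, not Lustre code):

      /* Caricature of unbounded request-buffer growth: the pool is grown
       * whenever idle buffers run low, but the stuck consumer never returns
       * any, so allocations accumulate until the allocator fails. */
      #include <stdio.h>
      #include <stdlib.h>

      #define REQ_BUF_SIZE  (16 * 1024)
      #define LOW_WATERMARK 8

      int main(void)
      {
              long allocated = 0;
              long idle = 0;

              for (int tick = 0; tick < 1000; tick++) {
                      if (idle > 0)
                              idle--;       /* an incoming RPC takes a buffer */
                      while (idle < LOW_WATERMARK) {
                              if (malloc(REQ_BUF_SIZE) == NULL)
                                      goto out; /* where the real MDS hit OOM */
                              allocated++;
                              idle++;
                      }
                      /* nothing ever frees the consumed buffers here:
                       * recovery is stuck, so they are never released */
              }
      out:
              printf("grew the pool %ld times (~%ld kB never released)\n",
                     allocated, allocated * REQ_BUF_SIZE / 1024);
              return 0;
      }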
      

      Some recent failures with logs are at:
      https://testing.hpdd.intel.com/test_sets/d08cd624-02b3-11e6-9e5d-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/f4316e66-0284-11e6-9e5d-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/7ea41ce2-023f-11e6-83bc-5254006e85c2

          Activity

            [LU-8070] sanity-scrub test_5 oom-killer and times out

            yong.fan nasf (Inactive) added a comment -

            This is a duplicate of LU-7836.

            yong.fan nasf (Inactive) added a comment -

            According to the original bug description, there were many "ping" RPC timeouts on the peer MDT:

            18:17:02:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460596315/real 1460596315]  req@ffff88007ac75380 x1531531004437356/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460596317 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
            

            That matches the scenario described in LU-7836: too many duplicate final ping RPCs accumulated on the MDT during recovery, eventually exhausting RAM. The patch http://review.whamcloud.com/#/c/19693/ should already fix this.

            I will wait for Frank's feedback on LU-7836 before closing this ticket.

            yong.fan nasf (Inactive) added a comment -

            I checked a few failures; it seems the MDS recovery is stuck while fetching the update log, which causes the timeout. The OOM is a separate issue and is a consequence of this hang.

            Wangdi, you have set ptlrpc_request::rq_allow_replay unconditionally for osp_remote_sync() (although I do not think that is right in some cases). So even while in recovery, the OUT RPC for the update logs should still be sent out. Since it was not, is it possible something is wrong with the OSP import status? Or do we need more work to adjust the ptlrpc_import_delay_req() behaviour?
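
            For reference, the decision being questioned is roughly the following, shown here only as a simplified sketch rather than the real ptlrpc_import_delay_req() code: while an import is recovering, ordinary requests are held back, but a request flagged as allowed for replay (e.g. via rq_allow_replay) is supposed to be sent anyway.

            /* Simplified, self-contained sketch of the delay decision under
             * discussion; struct and function names are stand-ins, not the
             * actual ptlrpc structures. */
            #include <stdio.h>

            struct fake_import  { int in_recovery; };
            struct fake_request { int allow_replay; };

            static int import_should_delay(const struct fake_import *imp,
                                           const struct fake_request *req)
            {
                    if (!imp->in_recovery)
                            return 0;       /* import FULL: send normally */
                    if (req->allow_replay)
                            return 0;       /* e.g. OUT update-log fetch  */
                    return 1;               /* everything else waits      */
            }

            int main(void)
            {
                    struct fake_import  imp = { .in_recovery = 1 };
                    struct fake_request req = { .allow_replay = 1 };

                    printf("delay=%d (expected 0 for an allowed request)\n",
                           import_should_delay(&imp, &req));
                    return 0;
            }
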
            yong.fan nasf (Inactive) added a comment - edited

            The OI scrub is not related to connection operations; it only affects FID-based operations (when no name lookup happens first): if the OI mapping for the target FID is not yet ready, it returns -115 (-EINPROGRESS) to the caller. Generally, the caller or the related RPC sponsor needs to retry the request or RPC some time later in that case. In this test, triggering the OI scrub is expected. On the other hand, I wonder why the osp_remote_sync() RPC was not sent out or replied to? Since the OI scrub is a purely local operation, it should not affect ptlrpc-layer behaviour.
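
            As a side note, the -EINPROGRESS (-115) convention described above implies a retry loop on the caller's side; a generic sketch of that pattern (not the actual Lustre caller code):

            /* Generic retry-on-EINPROGRESS sketch: keep retrying the FID
             * lookup until the (hypothetical) OI mapping has been rebuilt. */
            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Stand-in for an operation that needs the OI mapping: it keeps
             * failing with -EINPROGRESS until the scrub has rebuilt it. */
            static int lookup_by_fid(int attempt)
            {
                    return attempt < 3 ? -EINPROGRESS : 0;
            }

            int main(void)
            {
                    int rc, attempt = 0;

                    do {
                            rc = lookup_by_fid(attempt++);
                            if (rc == -EINPROGRESS)
                                    sleep(1);   /* retry a bit later */
                    } while (rc == -EINPROGRESS);

                    printf("rc=%d after %d attempts\n", rc, attempt);
                    return rc ? 1 : 0;
            }
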
            di.wang Di Wang added a comment - edited

            Hmm, it looks like MDT0 is doing a scrub:

            [16482.903022] OI_scrub        S 000000000000d918     0 15926      2 0x00000080
            [16482.903022]  ffff88004f7b7bb0 0000000000000046 ffff880053a45080 ffff88004f7b7fd8
            [16482.903022]  ffff88004f7b7fd8 ffff88004f7b7fd8 ffff880053a45080 ffffffff81daba00
            [16482.903022]  ffff88004f7b7be0 0000000100f6f3a9 ffffffff81daba00 000000000000d918
            [16482.903022] Call Trace:
            [16482.903022]  [<ffffffff8163ba29>] schedule+0x29/0x70
            [16482.903022]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
            [16482.903022]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
            [16482.903022]  [<ffffffffa0c45852>] osd_scrub_next+0x2a2/0x3d0 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c497c5>] ? osd_scrub_exec+0x65/0x600 [osd_ldiskfs]
            [16482.903022]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
            [16482.903022]  [<ffffffffa0c4ae20>] osd_inode_iteration+0x310/0xd70 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c49760>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c455b0>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c4c1e0>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
            [16482.903022]  [<ffffffffa0c4b880>] ? osd_inode_iteration+0xd70/0xd70 [osd_ldiskfs]
            [16482.903022]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [16482.903022]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [16482.903022]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
            [16482.903022]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            

            FanYong, will this stop MDT0 from accepting any connections?

            di.wang Di Wang added a comment -

            I checked a few failures; it seems the MDS recovery is stuck while fetching the update log, which causes the timeout. The OOM is a separate issue and is a consequence of this hang.

            [ 4947.935836] lod0000_rec0001 S 000000000000d918     0 19136      2 0x00000080
            [ 4947.935836]  ffff88004caff750 0000000000000046 ffff880035ee6780 ffff88004cafffd8
            [ 4947.935836]  ffff88004cafffd8 ffff88004cafffd8 ffff880035ee6780 ffff88007c0ac000
            [ 4947.935836]  ffff88004caff780 000000010046ebf8 ffff88007c0ac000 000000000000d918
            [ 4947.935836] Call Trace:
            [ 4947.935836]  [<ffffffff8163ba29>] schedule+0x29/0x70
            [ 4947.935836]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
            [ 4947.935836]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
            [ 4947.935836]  [<ffffffffa09ff8e0>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
            [ 4947.935836]  [<ffffffffa0a09670>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
            [ 4947.935836]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
            [ 4947.935836]  [<ffffffffa0a09b2d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            [ 4947.935836]  [<ffffffffa0f77e99>] osp_remote_sync+0xd9/0x1c0 [osp]
            [ 4947.935836]  [<ffffffffa0f5ffdc>] osp_attr_get+0x44c/0x6e0 [osp]
            [ 4947.935836]  [<ffffffffa0f5b3b2>] osp_object_init+0x152/0x2c0 [osp]
            [ 4947.935836]  [<ffffffffa07f13ef>] lu_object_alloc+0xdf/0x310 [obdclass]
            [ 4947.935836]  [<ffffffffa07f17ec>] lu_object_find_try+0x16c/0x2b0 [obdclass]
            [ 4947.935836]  [<ffffffffa07f19dc>] lu_object_find_at+0xac/0xe0 [obdclass]
            [ 4947.935836]  [<ffffffffa06be498>] ? libcfs_debug_vmsg2+0x6f8/0xb60 [libcfs]
            [ 4947.935836]  [<ffffffffa07f2d68>] dt_locate_at+0x18/0xb0 [obdclass]
            [ 4947.935836]  [<ffffffffa07b3368>] llog_osd_open+0xf8/0xdf0 [obdclass]
            [ 4947.935836]  [<ffffffffa07a283a>] llog_open+0x13a/0x3b0 [obdclass]
            [ 4947.935836]  [<ffffffffa07aa1dd>] llog_cat_id2handle+0x28d/0x660 [obdclass]
            [ 4947.935836]  [<ffffffffa06be957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
            [ 4947.935836]  [<ffffffffa07ab989>] llog_cat_process_cb+0x139/0x600 [obdclass]
            [ 4947.935836]  [<ffffffffa07a59f5>] llog_process_thread+0x5a5/0x10c0 [obdclass]
            [ 4947.935836]  [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
            [ 4947.935836]  [<ffffffffa07a65cc>] llog_process_or_fork+0xbc/0x430 [obdclass]
            [ 4947.935836]  [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
            [ 4947.935836]  [<ffffffffa07aacd1>] llog_cat_process_or_fork+0x1b1/0x2f0 [obdclass]
            [ 4947.935836]  [<ffffffffa0ec50ca>] ? lod_sub_prep_llog+0x24a/0x7ec [lod]
            [ 4947.935836]  [<ffffffffa0e9c670>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
            [ 4947.935836]  [<ffffffffa07aae3e>] llog_cat_process+0x2e/0x30 [obdclass]
            [ 4947.935836]  [<ffffffffa0e992b6>] lod_sub_recovery_thread+0x3a6/0xaf0 [lod]
            [ 4947.935836]  [<ffffffffa0e98f10>] ? lod_obd_set_info_async+0xa90/0xa90 [lod]
            [ 4947.935836]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [ 4947.935836]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [ 4947.935836] tgt_recover_0   D 000000000000d918     0 19139      2 0x00000080
            [ 4947.935836]  ffff88004d4a7cc0 0000000000000046 ffff880035ee7300 ffff88004d4a7fd8
            [ 4947.935836]  ffff88004d4a7fd8 ffff88004d4a7fd8 ffff880035ee7300 ffff88007c0ac000
            [ 4947.935836]  ffff88004d4a7cf0 0000000100471ac7 ffff88007c0ac000 000000000000d918
            [ 4947.935836] Call Trace:
            [ 4947.935836]  [<ffffffff8163ba29>] schedule+0x29/0x70
            [ 4947.935836]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
            [ 4947.935836]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
            [ 4947.935836]  [<ffffffffa09d07e0>] ? handle_recovery_req+0x270/0x270 [ptlrpc]
            [ 4947.935836]  [<ffffffffa09d2198>] target_recovery_overseer+0xd8/0x650 [ptlrpc]
            [ 4947.935836]  [<ffffffffa09cecc0>] ? ldlm_dump_ns_write+0x90/0x90 [ptlrpc]
            [ 4947.935836]  [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
            [ 4947.935836]  [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
            [ 4947.935836]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
            [ 4947.935836]  [<ffffffffa09daf71>] target_recovery_thread+0x5d1/0x1360 [ptlrpc]
            [ 4947.935836]  [<ffffffff8163b3d8>] ? __schedule+0x2d8/0x900
            [ 4947.935836]  [<ffffffffa09da9a0>] ? replay_request_or_update.isra.20+0x8c0/0x8c0 [ptlrpc]
            [ 4947.935836]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            [ 4947.935836]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
            [ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
            

            yong.fan nasf (Inactive) added a comment -

            This looks like another instance of the LU-7836 failure. Di has made a patch for that ticket: http://review.whamcloud.com/#/c/18915/. Let's see what happens after that patch is applied.

            yong.fan nasf (Inactive) added a comment -

            Before the OOM, there are a lot of warning messages about DNE (update) recovery failures/timeouts.

            01:16:18:Lustre: DEBUG MARKER: replay-single test_70b: @@@@@@ FAIL: import is not in FULL state
            01:16:18:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-04/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__38643__-70130437439340-230829/replay-single.test_70b.debug_log.$(hostname -s).1462435944.log;
            01:16:18:         dmesg > /logdir/test_logs/2016-05-04/lustre-revi
            01:16:18:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
            01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880046cb99c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:232/0 lens 808/0 e 0 to 0 dl 1462435987 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
            01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436102/real 1462436102]  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 151 to 1 dl 1462436104 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
            01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
            01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
            01:16:18:  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 153 to 1 dl 1462436123 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
            01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
            01:16:18:Lustre: lustre-MDT0000: Recovery already passed deadline 27:16, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
            01:16:18:Lustre: Skipped 471 previous similar messages
            01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
            01:38:29:Lustre: Skipped 469 previous similar messages
            01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
            01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880077a450c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:77/0 lens 808/0 e 0 to 0 dl 1462436587 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
            01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436707/real 1462436707]  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 206 to 1 dl 1462436709 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
            01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
            01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
            01:38:29:  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 208 to 1 dl 1462436728 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
            01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
            01:38:29:Lustre: lustre-MDT0000: Recovery already passed deadline 37:17, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
            01:38:29:Lustre: Skipped 470 previous similar messages
            01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
            01:38:29:Lustre: Skipped 470 previous similar messages
            01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
            ...
            

            The suspicion is that, because the DNE (update) recovery was blocked for a long time, and because the DNE (update) replay is not aborted unless that is done manually, the related (update) replay requests were resent/re-enqueued repeatedly (due to some improper handling) until they occupied or leaked too much RAM, which then caused the system OOM. I am not sure about that, but we could add some debug checks to the recovery failure/timeout messages. This may be related to the DNE (update) recovery logic introduced by the patches http://review.whamcloud.com/#/c/15793/ and http://review.whamcloud.com/#/c/17885/.

            Di, would you please check this, especially considering the case where DNE (update) recovery is blocked for hours? Thanks!

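            As a concrete form of the debug check suggested above, the deadline warning could also report how many replay requests are still queued and roughly how much memory their buffers pin, to confirm or rule out the resend/leak theory. A hypothetical sketch (names and values below are illustrative only, not existing Lustre code):

            /* Hypothetical extension of the "Recovery already passed
             * deadline" warning; struct fields and values are made up for
             * illustration. */
            #include <stdio.h>

            struct recovery_state {
                    long queued_replay_reqs;    /* requests still queued     */
                    long req_buf_kb;            /* memory pinned by buffers  */
                    long overrun_sec;           /* seconds past the deadline */
            };

            static void recovery_deadline_warning(const char *target,
                                                  const struct recovery_state *rs)
            {
                    printf("%s: recovery passed deadline by %lds: "
                           "%ld replay reqs queued, %ld kB in request buffers\n",
                           target, rs->overrun_sec,
                           rs->queued_replay_reqs, rs->req_buf_kb);
            }

            int main(void)
            {
                    /* made-up numbers, chosen only to exercise the message */
                    struct recovery_state rs = { 300, 1316892, 2775 };

                    recovery_deadline_warning("lustre-MDT0002", &rs);
                    return 0;
            }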

            jamesanunez James Nunez (Inactive) added a comment -

            It looks like the same issue is hitting replay-single test 70b. Logs are at
            2016-05-05 - https://testing.hpdd.intel.com/test_sets/289f1502-12e0-11e6-9e5d-5254006e85c2

            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Fan Yong,
            Can you please investigate this issue?
            Thanks.
            Joe

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 5
