[LU-8070] sanity-scrub test_5 oom-killer and times out Created: 26/Apr/16  Updated: 06/Aug/16  Resolved: 06/Aug/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

autotest review-dne-part-2


Issue Links:
Related
is related to LU-7836 MDSes crashed with oom-killer Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-scrub test 5 times out in review-dne-part-2.

The last thing seen in the test_log is

CMD: onyx-34vm3 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0003.oi_scrub |
			awk '/^status/ { print \$2 }'
Starting client: onyx-34vm1.onyx.hpdd.intel.com:  -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
CMD: onyx-34vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-34vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
fail_val=3
fail_loc=0x190
fail_val=3
fail_loc=0x190
  File: `/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800'
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d	Inode: 144115205322834802  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-04-14 07:21:59.000000000 -0700
Modify: 2016-04-14 07:21:20.000000000 -0700
Change: 2016-04-14 07:22:00.000000000 -0700
  File: `/mnt/lustre/d5.sanity-scrub/mds2/f5.sanity-scrub800'
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d	Inode: 162129603832316783  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-04-14 07:22:08.000000000 -0700
Modify: 2016-04-14 07:21:25.000000000 -0700
Change: 2016-04-14 07:22:08.000000000 -0700

Most of the recent failures have the following in the MDS console log

18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:12
18:17:02:Lustre: Skipped 1 previous similar message
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
18:17:02:Lustre: lustre-MDT0000: trigger OI scrub by RPC for [0x200000404:0x372:0x0], rc = 0 [2]
18:17:02:Lustre: lustre-MDT0002-o: trigger OI scrub by RPC for [0x280000404:0x1:0x0], rc = 0 [1]
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 11 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 2:13
18:17:02:Lustre: Skipped 6 previous similar messages
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 21 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
18:17:02:Lustre: Skipped 197 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:08
18:17:02:Lustre: Skipped 12 previous similar messages
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 43 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
…
18:34:04:Lustre: lustre-MDT0002: Recovery already passed deadline 46:15, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
18:34:04:Lustre: Skipped 299 previous similar messages
18:34:04:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
18:34:04:Lustre: Skipped 299 previous similar messages
18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460597515/real 1460597515]  req@ffff88005ca7a080 x1531531004454316/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460597517 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 99 previous similar messages
18:34:04:mdt_out00_000 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
18:34:04:mdt_out00_000 cpuset=/ mems_allowed=0
18:34:04:Pid: 24628, comm: mdt_out00_000 Not tainted 2.6.32-573.22.1.el6_lustre.x86_64 #1
18:34:04:Call Trace:
18:34:04: [<ffffffff810d7581>] ? cpuset_print_task_mems_allowed+0x91/0xb0
18:34:04: [<ffffffff8112ad80>] ? dump_header+0x90/0x1b0
18:34:04: [<ffffffff8112aeee>] ? check_panic_on_oom+0x4e/0x80
18:34:04: [<ffffffff8112b5db>] ? out_of_memory+0x1bb/0x3c0
18:34:04: [<ffffffff8113801c>] ? __alloc_pages_nodemask+0x93c/0x950
18:34:04: [<ffffffff81161a69>] ? __vmalloc_area_node+0xb9/0x190
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffff8116193d>] ? __vmalloc_node+0xad/0x120
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffff81161d59>] ? vzalloc_node+0x29/0x30
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffffa07c0c9d>] ? ptlrpc_main+0x114d/0x1800 [ptlrpc]
18:34:04: [<ffffffffa07bfb50>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
18:34:04: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
18:34:04: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
18:34:04: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
18:34:04: [<ffffffff8100c280>] ? child_rip+0x0/0x20
18:34:04:Mem-Info:
18:34:04:Node 0 DMA per-cpu:
18:34:04:CPU    0: hi:    0, btch:   1 usd:   0
18:34:04:CPU    1: hi:    0, btch:   1 usd:   0
18:34:04:Node 0 DMA32 per-cpu:
18:34:04:CPU    0: hi:  186, btch:  31 usd:   0
18:34:04:CPU    1: hi:  186, btch:  31 usd:   0
18:34:04:active_anon:0 inactive_anon:10 isolated_anon:0
18:34:04: active_file:137 inactive_file:100 isolated_file:0
18:34:04: unevictable:0 dirty:7 writeback:0 unstable:0
18:34:04: free:13252 slab_reclaimable:1920 slab_unreclaimable:331018
18:34:04: mapped:0 shmem:0 pagetables:1420 bounce:0
18:34:04:Node 0 DMA free:8340kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7180kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
18:34:04:lowmem_reserve[]: 0 2004 2004 2004
18:34:04:Node 0 DMA32 free:44668kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:40kB active_file:548kB inactive_file:396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:28kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7680kB slab_unreclaimable:1316892kB kernel_stack:3920kB pagetables:5680kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:250 all_unreclaimable? no
18:34:04:lowmem_reserve[]: 0 0 0 0
18:34:04:Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8344kB
18:34:04:Node 0 DMA32: 1491*4kB 364*8kB 143*16kB 81*32kB 59*64kB 40*128kB 18*256kB 12*512kB 7*1024kB 2*2048kB 0*4096kB = 44668kB
18:34:04:237 total pagecache pages
18:34:04:8 pages in swap cache
18:34:04:Swap cache stats: add 21677, delete 21669, find 6474/8836
18:34:04:Free swap  = 4105924kB
18:34:04:Total swap = 4128764kB
18:34:04:524284 pages RAM
18:34:04:43737 pages reserved
18:34:04:257 pages shared
18:34:04:142747 pages non-shared
18:34:04:[ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
18:34:04:[  456]     0   456     2722        0   0     -17         -1000 udevd
18:34:04:[  763]     0   763     2702        0   0     -17         -1000 udevd
18:34:04:[  915]     0   915     2701        0   1     -17         -1000 udevd
18:34:04:[ 1175]     0  1175     2280        1   1       0             0 dhclient
18:34:04:[ 1238]     0  1238    23289        1   0     -17         -1000 auditd
18:34:04:[ 1272]     0  1272    63903        3   0       0             0 rsyslogd
18:34:04:[ 1306]     0  1306     4561        1   0       0             0 irqbalance
18:34:04:[ 1324]    32  1324     4744        1   0       0             0 rpcbind
18:34:04:[ 1340]     0  1340    52834        1   1       0             0 sssd
18:34:04:[ 1341]     0  1341    76810        1   0       0             0 sssd_be
18:34:04:[ 1342]     0  1342    54194        1   0       0             0 sssd_nss
18:34:04:[ 1343]     0  1343    50551        1   1       0             0 sssd_pam
18:34:04:[ 1344]     0  1344    51904        1   0       0             0 sssd_ssh
18:34:04:[ 1345]     0  1345    56336        1   1       0             0 sssd_pac
18:34:04:[ 1367]    29  1367     6357        1   1       0             0 rpc.statd
18:34:04:[ 1401]    81  1401     5878        1   0       0             0 dbus-daemon
18:34:04:[ 1422]     0  1422    47238        1   1       0             0 cupsd
18:34:04:[ 1466]     0  1466     1020        0   1       0             0 acpid
18:34:04:[ 1478]    68  1478    10495        1   0       0             0 hald
18:34:04:[ 1479]     0  1479     5100        1   1       0             0 hald-runner
18:34:04:[ 1511]     0  1511     5630        1   1       0             0 hald-addon-inpu
18:34:04:[ 1518]    68  1518     4502        1   1       0             0 hald-addon-acpi
18:34:04:[ 1545]     0  1545   169286        1   1       0             0 automount
18:34:04:[ 1599]     0  1599    26827        0   0       0             0 rpc.rquotad
18:34:04:[ 1604]     0  1604     5417        0   0       0             0 rpc.mountd
18:34:04:[ 1649]     0  1649     5774        1   0       0             0 rpc.idmapd
18:34:04:[ 1686]   496  1686    56788        1   1       0             0 munged
18:34:04:[ 8281]     0  8281     5429        1   1       0             0 xinetd
18:34:04:[ 8373]     0  8373    20741        1   0       0             0 master
18:34:04:[ 8384]    89  8384    20804        1   0       0             0 qmgr
18:34:04:[ 8402]     0  8402    29216        1   0       0             0 crond
18:34:04:[ 8420]     0  8420     5276        0   1       0             0 atd
18:34:04:[ 8452]     0  8452    16119        1   0       0             0 certmonger
18:34:04:[ 8488]     0  8488     1020        1   0       0             0 agetty
18:34:04:[ 8492]     0  8492     1016        1   1       0             0 mingetty
18:34:04:[ 8496]     0  8496     1016        1   1       0             0 mingetty
18:34:04:[ 8502]     0  8502     1016        1   0       0             0 mingetty
18:34:04:[ 8506]     0  8506     1016        1   0       0             0 mingetty
18:34:04:[ 8508]     0  8508     1016        1   0       0             0 mingetty
18:34:04:[ 8511]     0  8511     1016        1   0       0             0 mingetty
18:34:04:[ 9627]     0  9627    16557        0   1     -17         -1000 sshd
18:34:04:[10443]    38 10443     8207        1   0       0             0 ntpd
18:34:04:[19865]    89 19865    20761        1   0       0             0 pickup
18:34:04:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
18:34:04:

Some recent failures with logs are at:
https://testing.hpdd.intel.com/test_sets/d08cd624-02b3-11e6-9e5d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f4316e66-0284-11e6-9e5d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/7ea41ce2-023f-11e6-83bc-5254006e85c2



 Comments   
Comment by Joseph Gmitter (Inactive) [ 27/Apr/16 ]

Hi Fan Yong,
Can you please investigate this issue?
Thanks.
Joe

Comment by James Nunez (Inactive) [ 06/May/16 ]

It looks like the same issue is hitting replay-single test 70b. Logs are at:
2016-05-05 - https://testing.hpdd.intel.com/test_sets/289f1502-12e0-11e6-9e5d-5254006e85c2

Comment by nasf (Inactive) [ 10/May/16 ]

Before the OOM, there are a lot of warning messages about DNE (update) recovery failures/timeouts.

01:16:18:Lustre: DEBUG MARKER: replay-single test_70b: @@@@@@ FAIL: import is not in FULL state
01:16:18:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-04/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__38643__-70130437439340-230829/replay-single.test_70b.debug_log.$(hostname -s).1462435944.log;
01:16:18:         dmesg > /logdir/test_logs/2016-05-04/lustre-revi
01:16:18:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null || true
01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880046cb99c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:232/0 lens 808/0 e 0 to 0 dl 1462435987 ref 2 fl Interpret:/6/ffffffff rc 0/-1
01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436102/real 1462436102]  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 151 to 1 dl 1462436104 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
01:16:18:  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 153 to 1 dl 1462436123 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
01:16:18:Lustre: lustre-MDT0000: Recovery already passed deadline 27:16, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
01:16:18:Lustre: Skipped 471 previous similar messages
01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
01:38:29:Lustre: Skipped 469 previous similar messages
01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880077a450c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:77/0 lens 808/0 e 0 to 0 dl 1462436587 ref 2 fl Interpret:/6/ffffffff rc 0/-1
01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436707/real 1462436707]  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 206 to 1 dl 1462436709 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
01:38:29:  req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 208 to 1 dl 1462436728 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
01:38:29:Lustre: lustre-MDT0000: Recovery already passed deadline 37:17, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
01:38:29:Lustre: Skipped 470 previous similar messages
01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
01:38:29:Lustre: Skipped 470 previous similar messages
01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
...

It is suspected that, because the DNE (update) recovery was blocked for a long time (the DNE update replay is not aborted unless that is done manually), the related update replay requests were resent/re-enqueued repeatedly (due to some improper handling), occupying or leaking so much RAM that the system eventually hit the OOM. I am not sure about that, but we can add some debug checks around the recovery failure/timeout messages. This may be related to the DNE (update) recovery logic introduced by the patches http://review.whamcloud.com/#/c/15793/ and http://review.whamcloud.com/#/c/17885/.

Di, would you please take a look at this, especially considering the case where the DNE (update) recovery is blocked for hours? Thanks!
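
To illustrate the kind of debug check meant above, here is a minimal userspace C sketch (not Lustre code; recovery_stats, recovery_track_req() and recovery_deadline_warn() are hypothetical names). It only models counting how many replay requests are queued, how many of them are resends, and how much memory they pin, so that the numbers could be reported alongside the "Recovery already passed deadline" warning:

/*
 * Minimal userspace sketch (not Lustre code) of the suggested debug
 * accounting: track how many replay requests are queued and how much
 * memory they pin, and report that next to the recovery-deadline warning.
 * All names here (recovery_stats, recovery_track_req, ...) are hypothetical.
 */
#include <stdio.h>
#include <stddef.h>

struct recovery_stats {
    unsigned long rs_queued_reqs;   /* requests currently queued for replay */
    unsigned long rs_queued_bytes;  /* memory pinned by those requests */
    unsigned long rs_resent_reqs;   /* requests that are resends of earlier ones */
};

static void recovery_track_req(struct recovery_stats *rs, size_t req_size, int is_resent)
{
    rs->rs_queued_reqs++;
    rs->rs_queued_bytes += req_size;
    if (is_resent)
        rs->rs_resent_reqs++;
}

/* Extra detail to print together with "Recovery already passed deadline". */
static void recovery_deadline_warn(const struct recovery_stats *rs)
{
    fprintf(stderr, "recovery past deadline: %lu reqs queued (%lu resent), ~%lu KiB pinned\n",
            rs->rs_queued_reqs, rs->rs_resent_reqs, rs->rs_queued_bytes >> 10);
}

int main(void)
{
    struct recovery_stats rs = { 0, 0, 0 };
    int i;

    /* Simulate one client resending the same ~4 KiB replay request while
     * recovery stays stuck; unbounded growth here is the suspected leak. */
    for (i = 0; i < 1000; i++)
        recovery_track_req(&rs, 4096, i > 0);

    recovery_deadline_warn(&rs);
    return 0;
}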

Comment by nasf (Inactive) [ 11/May/16 ]

This looks like another failure instance of LU-7836. Di has made a patch for that ticket: http://review.whamcloud.com/#/c/18915/.
So let's see what happens once that patch is applied.

Comment by Di Wang [ 20/Jul/16 ]

I checked a few failures; it seems the MDS recovery gets stuck while fetching the update log, which causes the timeout. The OOM is a separate issue, a consequence of this stall.

[ 4947.935836] lod0000_rec0001 S 000000000000d918     0 19136      2 0x00000080
[ 4947.935836]  ffff88004caff750 0000000000000046 ffff880035ee6780 ffff88004cafffd8
[ 4947.935836]  ffff88004cafffd8 ffff88004cafffd8 ffff880035ee6780 ffff88007c0ac000
[ 4947.935836]  ffff88004caff780 000000010046ebf8 ffff88007c0ac000 000000000000d918
[ 4947.935836] Call Trace:
[ 4947.935836]  [<ffffffff8163ba29>] schedule+0x29/0x70
[ 4947.935836]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
[ 4947.935836]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
[ 4947.935836]  [<ffffffffa09ff8e0>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
[ 4947.935836]  [<ffffffffa0a09670>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
[ 4947.935836]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[ 4947.935836]  [<ffffffffa0a09b2d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[ 4947.935836]  [<ffffffffa0f77e99>] osp_remote_sync+0xd9/0x1c0 [osp]
[ 4947.935836]  [<ffffffffa0f5ffdc>] osp_attr_get+0x44c/0x6e0 [osp]
[ 4947.935836]  [<ffffffffa0f5b3b2>] osp_object_init+0x152/0x2c0 [osp]
[ 4947.935836]  [<ffffffffa07f13ef>] lu_object_alloc+0xdf/0x310 [obdclass]
[ 4947.935836]  [<ffffffffa07f17ec>] lu_object_find_try+0x16c/0x2b0 [obdclass]
[ 4947.935836]  [<ffffffffa07f19dc>] lu_object_find_at+0xac/0xe0 [obdclass]
[ 4947.935836]  [<ffffffffa06be498>] ? libcfs_debug_vmsg2+0x6f8/0xb60 [libcfs]
[ 4947.935836]  [<ffffffffa07f2d68>] dt_locate_at+0x18/0xb0 [obdclass]
[ 4947.935836]  [<ffffffffa07b3368>] llog_osd_open+0xf8/0xdf0 [obdclass]
[ 4947.935836]  [<ffffffffa07a283a>] llog_open+0x13a/0x3b0 [obdclass]
[ 4947.935836]  [<ffffffffa07aa1dd>] llog_cat_id2handle+0x28d/0x660 [obdclass]
[ 4947.935836]  [<ffffffffa06be957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4947.935836]  [<ffffffffa07ab989>] llog_cat_process_cb+0x139/0x600 [obdclass]
[ 4947.935836]  [<ffffffffa07a59f5>] llog_process_thread+0x5a5/0x10c0 [obdclass]
[ 4947.935836]  [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
[ 4947.935836]  [<ffffffffa07a65cc>] llog_process_or_fork+0xbc/0x430 [obdclass]
[ 4947.935836]  [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
[ 4947.935836]  [<ffffffffa07aacd1>] llog_cat_process_or_fork+0x1b1/0x2f0 [obdclass]
[ 4947.935836]  [<ffffffffa0ec50ca>] ? lod_sub_prep_llog+0x24a/0x7ec [lod]
[ 4947.935836]  [<ffffffffa0e9c670>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
[ 4947.935836]  [<ffffffffa07aae3e>] llog_cat_process+0x2e/0x30 [obdclass]
[ 4947.935836]  [<ffffffffa0e992b6>] lod_sub_recovery_thread+0x3a6/0xaf0 [lod]
[ 4947.935836]  [<ffffffffa0e98f10>] ? lod_obd_set_info_async+0xa90/0xa90 [lod]
[ 4947.935836]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 4947.935836]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 4947.935836] tgt_recover_0   D 000000000000d918     0 19139      2 0x00000080
[ 4947.935836]  ffff88004d4a7cc0 0000000000000046 ffff880035ee7300 ffff88004d4a7fd8
[ 4947.935836]  ffff88004d4a7fd8 ffff88004d4a7fd8 ffff880035ee7300 ffff88007c0ac000
[ 4947.935836]  ffff88004d4a7cf0 0000000100471ac7 ffff88007c0ac000 000000000000d918
[ 4947.935836] Call Trace:
[ 4947.935836]  [<ffffffff8163ba29>] schedule+0x29/0x70
[ 4947.935836]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
[ 4947.935836]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
[ 4947.935836]  [<ffffffffa09d07e0>] ? handle_recovery_req+0x270/0x270 [ptlrpc]
[ 4947.935836]  [<ffffffffa09d2198>] target_recovery_overseer+0xd8/0x650 [ptlrpc]
[ 4947.935836]  [<ffffffffa09cecc0>] ? ldlm_dump_ns_write+0x90/0x90 [ptlrpc]
[ 4947.935836]  [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
[ 4947.935836]  [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
[ 4947.935836]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[ 4947.935836]  [<ffffffffa09daf71>] target_recovery_thread+0x5d1/0x1360 [ptlrpc]
[ 4947.935836]  [<ffffffff8163b3d8>] ? __schedule+0x2d8/0x900
[ 4947.935836]  [<ffffffffa09da9a0>] ? replay_request_or_update.isra.20+0x8c0/0x8c0 [ptlrpc]
[ 4947.935836]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 4947.935836]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[ 4947.935836]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Comment by Di Wang [ 20/Jul/16 ]

Hmm, it looks like MDT0 is running the OI scrub:

[16482.903022] OI_scrub        S 000000000000d918     0 15926      2 0x00000080
[16482.903022]  ffff88004f7b7bb0 0000000000000046 ffff880053a45080 ffff88004f7b7fd8
[16482.903022]  ffff88004f7b7fd8 ffff88004f7b7fd8 ffff880053a45080 ffffffff81daba00
[16482.903022]  ffff88004f7b7be0 0000000100f6f3a9 ffffffff81daba00 000000000000d918
[16482.903022] Call Trace:
[16482.903022]  [<ffffffff8163ba29>] schedule+0x29/0x70
[16482.903022]  [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
[16482.903022]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
[16482.903022]  [<ffffffffa0c45852>] osd_scrub_next+0x2a2/0x3d0 [osd_ldiskfs]
[16482.903022]  [<ffffffffa0c497c5>] ? osd_scrub_exec+0x65/0x600 [osd_ldiskfs]
[16482.903022]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[16482.903022]  [<ffffffffa0c4ae20>] osd_inode_iteration+0x310/0xd70 [osd_ldiskfs]
[16482.903022]  [<ffffffffa0c49760>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
[16482.903022]  [<ffffffffa0c455b0>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]
[16482.903022]  [<ffffffffa0c4c1e0>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
[16482.903022]  [<ffffffffa0c4b880>] ? osd_inode_iteration+0xd70/0xd70 [osd_ldiskfs]
[16482.903022]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[16482.903022]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[16482.903022]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[16482.903022]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140

FanYong, will this stop MDT0 from accepting any connections?

Comment by nasf (Inactive) [ 25/Jul/16 ]

The OI scrub has nothing to do with connection handling; it only affects FID-based operations (when there is no name lookup beforehand): if the OI mapping for the target FID is not ready, it returns -115 (-EINPROGRESS) to the caller. Generally, the caller or the related RPC sponsor needs to retry the request or RPC some time later in that case. In this test, the OI scrub is expected. On the other hand, I wonder why the osp_remote_sync() RPC was never sent out or replied to? Since the OI scrub is a purely local operation, it should not affect the behaviour of the ptlrpc layer.
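
As a minimal illustration of the retry behaviour described above, the sketch below (plain C, not Lustre code; fid_lookup() is a made-up stand-in) shows a caller retrying after -EINPROGRESS (-115 on Linux) until the OI mapping becomes available:

/*
 * Caller-side retry on -EINPROGRESS, as described above.  Plain C, not
 * Lustre code: fid_lookup() is a made-up stand-in that keeps failing until
 * the OI mapping is "ready".  EINPROGRESS is 115 on Linux, hence the -115.
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

static int fid_lookup(int attempt)
{
    return attempt < 3 ? -EINPROGRESS : 0;   /* fails until scrub catches up */
}

int main(void)
{
    int attempt, rc;

    for (attempt = 0; ; attempt++) {
        rc = fid_lookup(attempt);
        if (rc != -EINPROGRESS)
            break;                            /* success or a real error */
        fprintf(stderr, "OI mapping not ready (rc = %d), retrying later\n", rc);
        sleep(1);                             /* back off before resending */
    }
    printf("lookup finished with rc = %d after %d attempt(s)\n", rc, attempt + 1);
    return 0;
}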

Comment by nasf (Inactive) [ 26/Jul/16 ]

I checked a few failures; it seems the MDS recovery gets stuck while fetching the update log, which causes the timeout. The OOM is a separate issue, a consequence of this stall.

Wangdi, you have made ptlrpc_request::rq_allow_replay unconditional for osp_remote_sync() (although I do not think that is right in some cases). So even during recovery, the OUT RPC for the update logs should still be able to go out. Since it did not, is it possible that something is wrong with the OSP import status? Or do we need more work to adjust the ptlrpc_import_delay_req() behaviour?
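
For reference, here is a simplified userspace model of the decision being questioned; it is not the real ptlrpc_import_delay_req(), and the enum values and struct fields are illustrative only. It just shows the expectation that a request flagged rq_allow_replay is not delayed merely because its import is still recovering:

/*
 * Simplified model of the decision in question; this is NOT the real
 * ptlrpc_import_delay_req(), and the states/fields below are illustrative.
 * The point: a request marked rq_allow_replay should not be delayed just
 * because its import is still recovering.
 */
#include <stdbool.h>
#include <stdio.h>

enum imp_state { IMP_FULL, IMP_RECOVER, IMP_DISCONNECTED };

struct fake_request {
    enum imp_state rq_import_state;  /* state of the import this req uses */
    bool           rq_allow_replay;  /* may be sent while still recovering */
};

/* Return true if the request must wait, false if it can be sent now. */
static bool import_delay_req(const struct fake_request *req)
{
    if (req->rq_import_state == IMP_FULL)
        return false;                /* healthy import: send immediately */
    if (req->rq_import_state == IMP_RECOVER && req->rq_allow_replay)
        return false;                /* explicitly allowed during recovery */
    return true;                     /* otherwise queue it until recovery ends */
}

int main(void)
{
    struct fake_request out_rpc = { IMP_RECOVER, true };   /* update-log fetch */
    struct fake_request normal  = { IMP_RECOVER, false };  /* ordinary request */

    printf("OUT update-log RPC delayed: %s\n", import_delay_req(&out_rpc) ? "yes" : "no");
    printf("ordinary RPC delayed:       %s\n", import_delay_req(&normal)  ? "yes" : "no");
    return 0;
}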

Comment by nasf (Inactive) [ 29/Jul/16 ]

According to the original bug description, there were many "ping" RPC timeouts on the peer MDT:

18:17:02:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460596315/real 1460596315]  req@ffff88007ac75380 x1531531004437356/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460596317 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1

That matches the scenario described in LU-7836: too many duplicate final ping RPCs accumulated on the MDT during recovery, exhausting RAM. So the patch http://review.whamcloud.com/#/c/19693/ should already address this problem.
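
For illustration only, the sketch below shows one generic way to keep duplicate ping RPCs from piling up during recovery (queue at most one ping per client UUID). It makes no claim about the actual approach taken by the patch above:

/*
 * Generic illustration only: suppress duplicate ping RPCs during recovery by
 * queueing at most one ping per client UUID, so resends cannot keep eating
 * server memory.  No claim is made that the patch above works this way.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_CLIENTS 8

struct ping_table {
    char clients[MAX_CLIENTS][40];   /* clients that already have a ping queued */
    int  nr;
};

/* Return true if this ping should be queued, false if it is a duplicate. */
static bool ping_should_queue(struct ping_table *pt, const char *client_uuid)
{
    int i;

    for (i = 0; i < pt->nr; i++)
        if (strcmp(pt->clients[i], client_uuid) == 0)
            return false;            /* a ping from this client is already queued */
    if (pt->nr < MAX_CLIENTS)
        snprintf(pt->clients[pt->nr++], sizeof(pt->clients[0]), "%s", client_uuid);
    return true;
}

int main(void)
{
    struct ping_table pt = { .nr = 0 };
    const char *uuid = "4a5a50c5-f60a-3956-6fde-f0ee11bce605";
    int i, queued = 0;

    /* The same client resends its ping many times while recovery is stuck. */
    for (i = 0; i < 100; i++)
        if (ping_should_queue(&pt, uuid))
            queued++;

    printf("pings queued: %d of 100 resends\n", queued);
    return 0;
}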

I will wait for Frank's feedback on LU-7836 before closing this ticket.

Comment by nasf (Inactive) [ 06/Aug/16 ]

This is a duplicate of LU-7836.
