[LU-8070] sanity-scrub test_5 oom-killer and times out Created: 26/Apr/16 Updated: 06/Aug/16 Resolved: 06/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Environment: |
autotest review-dne-part-2 |
||
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-scrub test 5 times out in review-dne-part-2. The last thing seen in the test_log is:
CMD: onyx-34vm3 /usr/sbin/lctl get_param -n osd-ldiskfs.lustre-MDT0003.oi_scrub | awk '/^status/ { print \$2 }'
Starting client: onyx-34vm1.onyx.hpdd.intel.com: -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
CMD: onyx-34vm1.onyx.hpdd.intel.com mkdir -p /mnt/lustre
CMD: onyx-34vm1.onyx.hpdd.intel.com mount -t lustre -o user_xattr,flock onyx-34vm7@tcp:/lustre /mnt/lustre
CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
CMD: onyx-34vm3,onyx-34vm7 /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
fail_val=3
fail_loc=0x190
fail_val=3
fail_loc=0x190
File: `/mnt/lustre/d5.sanity-scrub/mds1/f5.sanity-scrub800'
Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d Inode: 144115205322834802 Links: 1
Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-04-14 07:21:59.000000000 -0700
Modify: 2016-04-14 07:21:20.000000000 -0700
Change: 2016-04-14 07:22:00.000000000 -0700
File: `/mnt/lustre/d5.sanity-scrub/mds2/f5.sanity-scrub800'
Size: 0 Blocks: 0 IO Block: 4194304 regular empty file
Device: 2c54f966h/743766374d Inode: 162129603832316783 Links: 1
Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-04-14 07:22:08.000000000 -0700
Modify: 2016-04-14 07:21:25.000000000 -0700
Change: 2016-04-14 07:22:08.000000000 -0700
Most of the recent failures have the following in the MDS console log:
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:12
18:17:02:Lustre: Skipped 1 previous similar message
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd-ldiskfs.*.full_scrub_ratio=0
18:17:02:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_val=3 fail_loc=0x190
18:17:02:Lustre: lustre-MDT0000: trigger OI scrub by RPC for [0x200000404:0x372:0x0], rc = 0 [2]
18:17:02:Lustre: lustre-MDT0002-o: trigger OI scrub by RPC for [0x280000404:0x1:0x0], rc = 0 [1]
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 11 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 2:13
18:17:02:Lustre: Skipped 6 previous similar messages
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 21 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
18:17:02:Lustre: Skipped 197 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Denying connection for new client 4a5a50c5-f60a-3956-6fde-f0ee11bce605(at 10.2.4.183@tcp), waiting for 3 known clients (3 recovered, 0 in progress, and 0 evicted) to recover in 1:08
18:17:02:Lustre: Skipped 12 previous similar messages
18:17:02:Lustre: *** cfs_fail_loc=190, val=0***
18:17:02:Lustre: Skipped 43 previous similar messages
18:17:02:Lustre: lustre-MDT0002: Recovery already passed deadline 0:03, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
…
18:34:04:Lustre: lustre-MDT0002: Recovery already passed deadline 46:15, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
18:34:04:Lustre: Skipped 299 previous similar messages
18:34:04:Lustre: lustre-MDT0002: Connection restored to 10.2.4.189@tcp (at 0@lo)
18:34:04:Lustre: Skipped 299 previous similar messages
18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460597515/real 1460597515] req@ffff88005ca7a080 x1531531004454316/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460597517 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
18:34:04:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 99 previous similar messages
18:34:04:mdt_out00_000 invoked oom-killer: gfp_mask=0x80d2, order=0, oom_adj=0, oom_score_adj=0
18:34:04:mdt_out00_000 cpuset=/ mems_allowed=0
18:34:04:Pid: 24628, comm: mdt_out00_000 Not tainted 2.6.32-573.22.1.el6_lustre.x86_64 #1
18:34:04:Call Trace:
18:34:04: [<ffffffff810d7581>] ? cpuset_print_task_mems_allowed+0x91/0xb0
18:34:04: [<ffffffff8112ad80>] ? dump_header+0x90/0x1b0
18:34:04: [<ffffffff8112aeee>] ? check_panic_on_oom+0x4e/0x80
18:34:04: [<ffffffff8112b5db>] ? out_of_memory+0x1bb/0x3c0
18:34:04: [<ffffffff8113801c>] ? __alloc_pages_nodemask+0x93c/0x950
18:34:04: [<ffffffff81161a69>] ? __vmalloc_area_node+0xb9/0x190
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffff8116193d>] ? __vmalloc_node+0xad/0x120
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffff81161d59>] ? vzalloc_node+0x29/0x30
18:34:04: [<ffffffffa07be3be>] ? ptlrpc_grow_req_bufs+0x33e/0x710 [ptlrpc]
18:34:04: [<ffffffffa07c0c9d>] ? ptlrpc_main+0x114d/0x1800 [ptlrpc]
18:34:04: [<ffffffffa07bfb50>] ? ptlrpc_main+0x0/0x1800 [ptlrpc]
18:34:04: [<ffffffff810a138e>] ? kthread+0x9e/0xc0
18:34:04: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
18:34:04: [<ffffffff810a12f0>] ? kthread+0x0/0xc0
18:34:04: [<ffffffff8100c280>] ? child_rip+0x0/0x20
18:34:04:Mem-Info:
18:34:04:Node 0 DMA per-cpu:
18:34:04:CPU 0: hi: 0, btch: 1 usd: 0
18:34:04:CPU 1: hi: 0, btch: 1 usd: 0
18:34:04:Node 0 DMA32 per-cpu:
18:34:04:CPU 0: hi: 186, btch: 31 usd: 0
18:34:04:CPU 1: hi: 186, btch: 31 usd: 0
18:34:04:active_anon:0 inactive_anon:10 isolated_anon:0
18:34:04: active_file:137 inactive_file:100 isolated_file:0
18:34:04: unevictable:0 dirty:7 writeback:0 unstable:0
18:34:04: free:13252 slab_reclaimable:1920 slab_unreclaimable:331018
18:34:04: mapped:0 shmem:0 pagetables:1420 bounce:0
18:34:04:Node 0 DMA free:8340kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15348kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:7180kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
18:34:04:lowmem_reserve[]: 0 2004 2004 2004
18:34:04:Node 0 DMA32 free:44668kB min:44720kB low:55900kB high:67080kB active_anon:0kB inactive_anon:40kB active_file:548kB inactive_file:396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052308kB mlocked:0kB dirty:28kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:7680kB slab_unreclaimable:1316892kB kernel_stack:3920kB pagetables:5680kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:250 all_unreclaimable? no
18:34:04:lowmem_reserve[]: 0 0 0 0
18:34:04:Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 2*2048kB 1*4096kB = 8344kB
18:34:04:Node 0 DMA32: 1491*4kB 364*8kB 143*16kB 81*32kB 59*64kB 40*128kB 18*256kB 12*512kB 7*1024kB 2*2048kB 0*4096kB = 44668kB
18:34:04:237 total pagecache pages
18:34:04:8 pages in swap cache
18:34:04:Swap cache stats: add 21677, delete 21669, find 6474/8836
18:34:04:Free swap = 4105924kB
18:34:04:Total swap = 4128764kB
18:34:04:524284 pages RAM
18:34:04:43737 pages reserved
18:34:04:257 pages shared
18:34:04:142747 pages non-shared
18:34:04:[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
18:34:04:[ 456] 0 456 2722 0 0 -17 -1000 udevd
18:34:04:[ 763] 0 763 2702 0 0 -17 -1000 udevd
18:34:04:[ 915] 0 915 2701 0 1 -17 -1000 udevd
18:34:04:[ 1175] 0 1175 2280 1 1 0 0 dhclient
18:34:04:[ 1238] 0 1238 23289 1 0 -17 -1000 auditd
18:34:04:[ 1272] 0 1272 63903 3 0 0 0 rsyslogd
18:34:04:[ 1306] 0 1306 4561 1 0 0 0 irqbalance
18:34:04:[ 1324] 32 1324 4744 1 0 0 0 rpcbind
18:34:04:[ 1340] 0 1340 52834 1 1 0 0 sssd
18:34:04:[ 1341] 0 1341 76810 1 0 0 0 sssd_be
18:34:04:[ 1342] 0 1342 54194 1 0 0 0 sssd_nss
18:34:04:[ 1343] 0 1343 50551 1 1 0 0 sssd_pam
18:34:04:[ 1344] 0 1344 51904 1 0 0 0 sssd_ssh
18:34:04:[ 1345] 0 1345 56336 1 1 0 0 sssd_pac
18:34:04:[ 1367] 29 1367 6357 1 1 0 0 rpc.statd
18:34:04:[ 1401] 81 1401 5878 1 0 0 0 dbus-daemon
18:34:04:[ 1422] 0 1422 47238 1 1 0 0 cupsd
18:34:04:[ 1466] 0 1466 1020 0 1 0 0 acpid
18:34:04:[ 1478] 68 1478 10495 1 0 0 0 hald
18:34:04:[ 1479] 0 1479 5100 1 1 0 0 hald-runner
18:34:04:[ 1511] 0 1511 5630 1 1 0 0 hald-addon-inpu
18:34:04:[ 1518] 68 1518 4502 1 1 0 0 hald-addon-acpi
18:34:04:[ 1545] 0 1545 169286 1 1 0 0 automount
18:34:04:[ 1599] 0 1599 26827 0 0 0 0 rpc.rquotad
18:34:04:[ 1604] 0 1604 5417 0 0 0 0 rpc.mountd
18:34:04:[ 1649] 0 1649 5774 1 0 0 0 rpc.idmapd
18:34:04:[ 1686] 496 1686 56788 1 1 0 0 munged
18:34:04:[ 8281] 0 8281 5429 1 1 0 0 xinetd
18:34:04:[ 8373] 0 8373 20741 1 0 0 0 master
18:34:04:[ 8384] 89 8384 20804 1 0 0 0 qmgr
18:34:04:[ 8402] 0 8402 29216 1 0 0 0 crond
18:34:04:[ 8420] 0 8420 5276 0 1 0 0 atd
18:34:04:[ 8452] 0 8452 16119 1 0 0 0 certmonger
18:34:04:[ 8488] 0 8488 1020 1 0 0 0 agetty
18:34:04:[ 8492] 0 8492 1016 1 1 0 0 mingetty
18:34:04:[ 8496] 0 8496 1016 1 1 0 0 mingetty
18:34:04:[ 8502] 0 8502 1016 1 0 0 0 mingetty
18:34:04:[ 8506] 0 8506 1016 1 0 0 0 mingetty
18:34:04:[ 8508] 0 8508 1016 1 0 0 0 mingetty
18:34:04:[ 8511] 0 8511 1016 1 0 0 0 mingetty
18:34:04:[ 9627] 0 9627 16557 0 1 -17 -1000 sshd
18:34:04:[10443] 38 10443 8207 1 0 0 0 ntpd
18:34:04:[19865] 89 19865 20761 1 0 0 0 pickup
18:34:04:Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
18:34:04:
Some recent failures with logs are at |
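The console messages above repeatedly suggest either waiting or aborting the stuck recovery. As a minimal diagnostic sketch (not part of the original report; the target name lustre-MDT0002 is taken from this run), the recovery and scrub state can be checked, and the recovery aborted manually, with standard lctl commands on the MDS:
# Show recovery state for all MDTs on this server (status, connected/recovered clients, time remaining)
lctl get_param mdt.*.recovery_status
# Show whether OI scrub is still running on the ldiskfs OSDs
lctl get_param -n osd-ldiskfs.*.oi_scrub | grep -E '^(name|status)'
# If recovery is confirmed stuck, abort it on the affected target
lctl --device lustre-MDT0002 abort_recovery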
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 27/Apr/16 ] |
|
Hi Fan Yong, |
| Comment by James Nunez (Inactive) [ 06/May/16 ] |
|
It looks like the same issue is hitting replay-single test 70b. Logs are at |
| Comment by nasf (Inactive) [ 10/May/16 ] |
|
Before the OOM there are a lot of warning messages about DNE (update) recovery failures/timeouts:
01:16:18:Lustre: DEBUG MARKER: replay-single test_70b: @@@@@@ FAIL: import is not in FULL state
01:16:18:Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /logdir/test_logs/2016-05-04/lustre-reviews-el6_7-x86_64--review-dne-part-2--1_7_1__38643__-70130437439340-230829/replay-single.test_70b.debug_log.$(hostname -s).1462435944.log;
01:16:18: dmesg > /logdir/test_logs/2016-05-04/lustre-revi
01:16:18:Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 fail_val=0 2>/dev/null || true
01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880046cb99c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:232/0 lens 808/0 e 0 to 0 dl 1462435987 ref 2 fl Interpret:/6/ffffffff rc 0/-1
01:16:18:LustreError: 24228:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436102/real 1462436102] req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 151 to 1 dl 1462436104 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
01:16:18:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
01:16:18: req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 153 to 1 dl 1462436123 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
01:16:18:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
01:16:18:Lustre: lustre-MDT0000: Recovery already passed deadline 27:16, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
01:16:18:Lustre: Skipped 471 previous similar messages
01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
01:38:29:Lustre: Skipped 469 previous similar messages
01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) @@@ dropping resent queued req req@ffff880077a450c0 x1533470840692160/t0(279172874558) o101->5a7cd28e-174d-5b0d-e424-5505b1c45373@10.2.4.138@tcp:77/0 lens 808/0 e 0 to 0 dl 1462436587 ref 2 fl Interpret:/6/ffffffff rc 0/-1
01:38:29:LustreError: 24229:0:(ldlm_lib.c:2751:target_queue_recovery_request()) Skipped 274 previous similar messages
01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1462436707/real 1462436707] req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 206 to 1 dl 1462436709 ref 2 fl Rpc:X/6/ffffffff rc 0/-1
01:38:29:Lustre: 5491:0:(client.c:2067:ptlrpc_expire_one_request()) Skipped 92 previous similar messages
01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) @@@ request replay timed out.
01:38:29: req@ffff88007a7c00c0 x1533470264772256/t279172874557(279172874557) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 1752/4288 e 208 to 1 dl 1462436728 ref 2 fl Interpret:EX/6/ffffffff rc -110/-1
01:38:29:LustreError: 5491:0:(client.c:2878:ptlrpc_replay_interpret()) Skipped 91 previous similar messages
01:38:29:Lustre: lustre-MDT0000: Recovery already passed deadline 37:17, It is most likely due to DNE recovery is failed or stuck, please wait a few more minutes or abort the recovery.
01:38:29:Lustre: Skipped 470 previous similar messages
01:38:29:Lustre: lustre-MDT0000: Connection restored to 10.2.4.140@tcp (at 10.2.4.140@tcp)
01:38:29:Lustre: Skipped 470 previous similar messages
01:38:29:Lustre: lustre-MDT0003-osp-MDT0000: Connection to lustre-MDT0003 (at 10.2.4.140@tcp) was lost; in progress operations using this service will wait for recovery to complete
...
It is suspected that because the DNE (update) recovery was blocked for a long time (DNE update replay is not aborted unless that is done manually), the related update replay requests were resent/re-enqueued repeatedly (by some improper handling) and occupied or leaked too much RAM, which then caused the system OOM. I am not sure about that, but we can add some debug checks around the recovery failure/timeout messages. This may be related to the DNE (update) recovery logic introduced by the patches http://review.whamcloud.com/#/c/15793/ and http://review.whamcloud.com/#/c/17885/. Di, would you please check this, especially the case where DNE (update) recovery is blocked for hours? Thanks! |
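If the suspicion above is right (replay requests being resent and re-enqueued until the node runs out of RAM), the growth should show up as unreclaimable kernel memory well before the OOM killer fires; the OOM report in the description already shows slab_unreclaimable at roughly 1.3 GB on a ~2 GB VM. A rough, illustrative way to watch for that on the MDS while recovery is stuck, assuming the usual "status: RECOVERING" line in recovery_status, using only standard tools:
# Sample unreclaimable slab and vmalloc usage while any MDT on this server is still recovering
while lctl get_param -n mdt.*.recovery_status | grep -q 'status: RECOVERING'; do
    date
    grep -E 'SUnreclaim|VmallocUsed' /proc/meminfo
    sleep 30
done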
| Comment by nasf (Inactive) [ 11/May/16 ] |
|
This looks like another failure instance of |
| Comment by Di Wang [ 20/Jul/16 ] |
|
I checked a few failures; it seems the MDS recovery is stuck while fetching the update log, which causes the timeout. The OOM is another issue, a consequence of this stuck recovery.
[ 4947.935836] lod0000_rec0001 S 000000000000d918 0 19136 2 0x00000080
[ 4947.935836] ffff88004caff750 0000000000000046 ffff880035ee6780 ffff88004cafffd8
[ 4947.935836] ffff88004cafffd8 ffff88004cafffd8 ffff880035ee6780 ffff88007c0ac000
[ 4947.935836] ffff88004caff780 000000010046ebf8 ffff88007c0ac000 000000000000d918
[ 4947.935836] Call Trace:
[ 4947.935836] [<ffffffff8163ba29>] schedule+0x29/0x70
[ 4947.935836] [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
[ 4947.935836] [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
[ 4947.935836] [<ffffffffa09ff8e0>] ? ptlrpc_init_rq_pool+0x110/0x110 [ptlrpc]
[ 4947.935836] [<ffffffffa0a09670>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
[ 4947.935836] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[ 4947.935836] [<ffffffffa0a09b2d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
[ 4947.935836] [<ffffffffa0f77e99>] osp_remote_sync+0xd9/0x1c0 [osp]
[ 4947.935836] [<ffffffffa0f5ffdc>] osp_attr_get+0x44c/0x6e0 [osp]
[ 4947.935836] [<ffffffffa0f5b3b2>] osp_object_init+0x152/0x2c0 [osp]
[ 4947.935836] [<ffffffffa07f13ef>] lu_object_alloc+0xdf/0x310 [obdclass]
[ 4947.935836] [<ffffffffa07f17ec>] lu_object_find_try+0x16c/0x2b0 [obdclass]
[ 4947.935836] [<ffffffffa07f19dc>] lu_object_find_at+0xac/0xe0 [obdclass]
[ 4947.935836] [<ffffffffa06be498>] ? libcfs_debug_vmsg2+0x6f8/0xb60 [libcfs]
[ 4947.935836] [<ffffffffa07f2d68>] dt_locate_at+0x18/0xb0 [obdclass]
[ 4947.935836] [<ffffffffa07b3368>] llog_osd_open+0xf8/0xdf0 [obdclass]
[ 4947.935836] [<ffffffffa07a283a>] llog_open+0x13a/0x3b0 [obdclass]
[ 4947.935836] [<ffffffffa07aa1dd>] llog_cat_id2handle+0x28d/0x660 [obdclass]
[ 4947.935836] [<ffffffffa06be957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 4947.935836] [<ffffffffa07ab989>] llog_cat_process_cb+0x139/0x600 [obdclass]
[ 4947.935836] [<ffffffffa07a59f5>] llog_process_thread+0x5a5/0x10c0 [obdclass]
[ 4947.935836] [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
[ 4947.935836] [<ffffffffa07a65cc>] llog_process_or_fork+0xbc/0x430 [obdclass]
[ 4947.935836] [<ffffffffa07ab850>] ? llog_cat_cancel_records+0x340/0x340 [obdclass]
[ 4947.935836] [<ffffffffa07aacd1>] llog_cat_process_or_fork+0x1b1/0x2f0 [obdclass]
[ 4947.935836] [<ffffffffa0ec50ca>] ? lod_sub_prep_llog+0x24a/0x7ec [lod]
[ 4947.935836] [<ffffffffa0e9c670>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
[ 4947.935836] [<ffffffffa07aae3e>] llog_cat_process+0x2e/0x30 [obdclass]
[ 4947.935836] [<ffffffffa0e992b6>] lod_sub_recovery_thread+0x3a6/0xaf0 [lod]
[ 4947.935836] [<ffffffffa0e98f10>] ? lod_obd_set_info_async+0xa90/0xa90 [lod]
[ 4947.935836] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 4947.935836] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 4947.935836] [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[ 4947.935836] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 4947.935836] tgt_recover_0 D 000000000000d918 0 19139 2 0x00000080
[ 4947.935836] ffff88004d4a7cc0 0000000000000046 ffff880035ee7300 ffff88004d4a7fd8
[ 4947.935836] ffff88004d4a7fd8 ffff88004d4a7fd8 ffff880035ee7300 ffff88007c0ac000
[ 4947.935836] ffff88004d4a7cf0 0000000100471ac7 ffff88007c0ac000 000000000000d918
[ 4947.935836] Call Trace:
[ 4947.935836] [<ffffffff8163ba29>] schedule+0x29/0x70
[ 4947.935836] [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
[ 4947.935836] [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
[ 4947.935836] [<ffffffffa09d07e0>] ? handle_recovery_req+0x270/0x270 [ptlrpc]
[ 4947.935836] [<ffffffffa09d2198>] target_recovery_overseer+0xd8/0x650 [ptlrpc]
[ 4947.935836] [<ffffffffa09cecc0>] ? ldlm_dump_ns_write+0x90/0x90 [ptlrpc]
[ 4947.935836] [<ffffffff810b88d2>] ? default_wake_function+0x12/0x20
[ 4947.935836] [<ffffffff810af038>] ? __wake_up_common+0x58/0x90
[ 4947.935836] [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[ 4947.935836] [<ffffffffa09daf71>] target_recovery_thread+0x5d1/0x1360 [ptlrpc]
[ 4947.935836] [<ffffffff8163b3d8>] ? __schedule+0x2d8/0x900
[ 4947.935836] [<ffffffffa09da9a0>] ? replay_request_or_update.isra.20+0x8c0/0x8c0 [ptlrpc]
[ 4947.935836] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[ 4947.935836] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[ 4947.935836] [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[ 4947.935836] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 |
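For completeness, an illustrative capture recipe (not taken from these logs): stack traces like the two above can be collected from a node that looks hung, before it OOMs, by dumping all task states and the Lustre debug buffer:
# Dump kernel stacks of every task to dmesg (requires sysrq enabled), then save them
echo t > /proc/sysrq-trigger
dmesg > /tmp/stacks.$(hostname -s).txt
# Save the Lustre debug log as well
lctl dk /tmp/lustre-debug.$(hostname -s).log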
| Comment by Di Wang [ 20/Jul/16 ] |
|
Hmm, it looks like MDT0 is doing scrub:
[16482.903022] OI_scrub S 000000000000d918 0 15926 2 0x00000080
[16482.903022] ffff88004f7b7bb0 0000000000000046 ffff880053a45080 ffff88004f7b7fd8
[16482.903022] ffff88004f7b7fd8 ffff88004f7b7fd8 ffff880053a45080 ffffffff81daba00
[16482.903022] ffff88004f7b7be0 0000000100f6f3a9 ffffffff81daba00 000000000000d918
[16482.903022] Call Trace:
[16482.903022] [<ffffffff8163ba29>] schedule+0x29/0x70
[16482.903022] [<ffffffff81639685>] schedule_timeout+0x175/0x2d0
[16482.903022] [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
[16482.903022] [<ffffffffa0c45852>] osd_scrub_next+0x2a2/0x3d0 [osd_ldiskfs]
[16482.903022] [<ffffffffa0c497c5>] ? osd_scrub_exec+0x65/0x600 [osd_ldiskfs]
[16482.903022] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
[16482.903022] [<ffffffffa0c4ae20>] osd_inode_iteration+0x310/0xd70 [osd_ldiskfs]
[16482.903022] [<ffffffffa0c49760>] ? osd_ios_ROOT_scan+0x300/0x300 [osd_ldiskfs]
[16482.903022] [<ffffffffa0c455b0>] ? osd_preload_next+0xc0/0xc0 [osd_ldiskfs]
[16482.903022] [<ffffffffa0c4c1e0>] osd_scrub_main+0x960/0xf30 [osd_ldiskfs]
[16482.903022] [<ffffffffa0c4b880>] ? osd_inode_iteration+0xd70/0xd70 [osd_ldiskfs]
[16482.903022] [<ffffffff810a5aef>] kthread+0xcf/0xe0
[16482.903022] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[16482.903022] [<ffffffff816469d8>] ret_from_fork+0x58/0x90
[16482.903022] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
FanYong, will this stop MDT0 from accepting any connections? |
| Comment by nasf (Inactive) [ 25/Jul/16 ] |
|
The OI scrub is not related to connection operations; it only affects FID-based operations (when there is no name lookup before them): if the OI mapping for the target FID is not ready, -115 (-EINPROGRESS) is returned to the caller. Generally, the caller or the related RPC sponsor needs to retry the request or RPC some time later in that case. In this test, the OI scrub is expected. On the other hand, I wonder why the osp_remote_sync() RPC was not sent out or replied to; since OI scrub is a purely local operation, it does not affect the ptlrpc layer behaviour. |
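To illustrate the "retry some time later" behaviour described above (a sketch only; the parameter path and awk filter are the same ones the test queries in the description, and "completed" is assumed to be the final scrub status), a caller that gets -EINPROGRESS could simply wait for the scrub to finish and then retry:
# Wait until OI scrub on MDT0003 reports completed, then retry the FID-based operation
while [ "$(lctl get_param -n osd-ldiskfs.lustre-MDT0003.oi_scrub | awk '/^status/ { print $2 }')" != "completed" ]; do
    sleep 5
done
# ...retry the request/RPC that returned -115 (-EINPROGRESS) here...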
| Comment by nasf (Inactive) [ 26/Jul/16 ] |
Wangdi, you set ptlrpc_request::rq_allow_replay unconditionally for osp_remote_sync() (although I do not think that is right in some cases). So even while in recovery, the OUT RPC for the update logs should still be sent out. Since it was not, is it possible something is wrong with the OSP import status? Or do we need more work to adjust the ptlrpc_import_delay_req() behaviour? |
| Comment by nasf (Inactive) [ 29/Jul/16 ] |
|
According to the original bug description, there were many "ping" RPC timeouts on the peer MDT:
18:17:02:Lustre: 11672:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1460596315/real 1460596315] req@ffff88007ac75380 x1531531004437356/t0(0) o400->lustre-MDT0002-osp-MDT0000@0@lo:24/4 lens 224/224 e 1 to 1 dl 1460596317 ref 1 fl Rpc:X/c0/ffffffff rc 0/-1
That matches the scenario described in
I would wait for Frank's feedback on the |
| Comment by nasf (Inactive) [ 06/Aug/16 ] |
|
It is a duplicate of |