[LU-9204] replay-vbr test_1b: umount MDS hung in DNE |
| Created: | 10/Mar/17 | Updated: | 07/Jul/17 | Resolved: | 26/Jun/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/5e48e8d2-f92a-11e6-aa39-5254006e85c2.

The sub-test test_1b failed with the following error:

test failed to respond and timed out

server/client: lustre-master, tag-2.9.53, el7, DNE with ZFS

MDS console:

13:06:28:[ 9555.377705] Lustre: DEBUG MARKER: == replay-vbr test 1b: open (O_CREAT) checks version of parent ======================================= 12:59:38 (1487681978)
13:06:28:[ 9556.168476] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdd.lustre-MDT0000.sync_permission=0
13:06:28:[ 9556.511353] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
13:06:28:[ 9556.861773] Lustre: DEBUG MARKER: sync; sync; sync
13:06:28:[ 9557.944765] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
13:06:28:[ 9558.286819] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
13:06:28:[ 9558.541020] LustreError: 22420:0:(osd_handler.c:651:osd_ro()) lustre-MDT0000: *** setting device osd-zfs read-only ***
13:06:28:[ 9558.543353] LustreError: 22420:0:(osd_handler.c:651:osd_ro()) Skipped 7 previous similar messages
13:06:28:[ 9558.715652] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
13:06:28:[ 9558.886272] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
13:06:28:[ 9819.188656] Lustre: MGS: Received new LWP connection from 10.9.6.28@tcp, removing former export from same NID
13:06:28:[ 9819.751607] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts
13:06:28:[ 9820.090977] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
13:06:28:[ 9820.265288] Lustre: Failing over lustre-MDT0000
13:06:28:[ 9820.267192] Lustre: Skipped 6 previous similar messages
13:06:28:[ 9820.634437] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.35@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
13:06:28:[ 9820.638958] LustreError: Skipped 321 previous similar messages
13:06:28:[ 9822.726157] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation obd_ping to node 0@lo failed: rc = -107
13:06:28:[ 9822.728445] LustreError: Skipped 3 previous similar messages
13:06:28:[ 9822.730302] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
13:06:28:[ 9822.734323] Lustre: Skipped 3 previous similar messages
13:06:28:[ 9822.734585] LustreError: 166-1: MGC10.9.6.34@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
13:06:28:[ 9822.734586] LustreError: Skipped 1 previous similar message
13:06:28:[ 9822.734612] LustreError: 9478:0:(ldlm_request.c:148:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1487681946, 300s ago), entering recovery for MGS@10.9.6.34@tcp ns: MGC10.9.6.34@tcp lock: ffff88004b23bc00/0xe3098f4980534cd7 lrc: 4/1,0 mode: --/CR res: [0x65727473756c:0x4:0x0].0x0 rrc: 1 type: PLN flags: 0x1000000000000 nid: local remote: 0xe3098f4980534cde expref: -99 pid: 9478 timeout: 0 lvb_type: 0
13:06:28:[ 9828.751055] Lustre: 9455:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1487682246/real 1487682246] req@ffff880047458300 x1559937866081120/t0(0) o250->MGC10.9.6.34@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1487682252 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
13:06:28:[ 9828.758132] Lustre: 9455:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
13:06:28:[ 9896.207670] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.28@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
13:06:28:[ 9896.212466] LustreError: Skipped 237 previous similar messages
13:06:28:[ 9960.081186] INFO: task umount:22669 blocked for more than 120 seconds.
13:06:28:[ 9960.083476] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
13:06:28:
13:06:28:[ 9960.085839] umount          D ffff880039e8ba00     0 22669  22668 0x00000080
13:06:28:[ 9960.090108]  ffff880039d43a10 0000000000000082 ffff880037519f60 ffff880039d43fd8
13:06:28:[ 9960.092505]  ffff880039d43fd8 ffff880039d43fd8 ffff880037519f60 ffff880039e8bb28
13:12:33:[ 9960.094963]  ffff880039e8bb30 7fffffffffffffff ffff880037519f60 ffff880039e8ba00
13:12:33:[ 9960.097448] Call Trace:
13:12:33:[ 9960.099437]  [<ffffffff8168bac9>] schedule+0x29/0x70
13:12:33:[ 9960.101633]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
13:12:33:[ 9960.103812]  [<ffffffff811daa99>] ? discard_slab+0x39/0x50
13:12:33:[ 9960.105961]  [<ffffffff811dc58a>] ? __slab_free+0x2ca/0x2f0
13:12:33:[ 9960.108126]  [<ffffffffa0ae88e1>] ? llog_free_handle+0x121/0x400 [obdclass]
13:12:33:[ 9960.110386]  [<ffffffff8168bea6>] wait_for_completion+0x116/0x170
13:12:33:[ 9960.112518]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
13:12:33:[ 9960.114676]  [<ffffffffa0fd5aee>] mgs_ir_fini_fs+0x27e/0x2ec [mgs]
13:12:33:[ 9960.116802]  [<ffffffffa0fca329>] mgs_put_fsdb+0x49/0x910 [mgs]
13:12:33:[ 9960.118949]  [<ffffffffa0fcacfc>] mgs_cleanup_fsdb_list+0x6c/0x90 [mgs]
13:12:33:[ 9960.121089]  [<ffffffffa0fb0a47>] mgs_device_fini+0x97/0x5b0 [mgs]
13:12:33:[ 9960.123266]  [<ffffffffa0b23f94>] class_cleanup+0x7f4/0xd90 [obdclass]
13:12:33:[ 9960.125377]  [<ffffffffa0b26f73>] class_process_config+0x2323/0x3340 [obdclass]
13:12:33:[ 9960.127605]  [<ffffffffa0b16679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
13:12:33:[ 9960.129740]  [<ffffffffa0b2807f>] class_manual_cleanup+0xef/0x810 [obdclass]
13:12:33:[ 9960.131945]  [<ffffffffa0b58360>] server_put_super+0xb20/0xcd0 [obdclass]
13:12:33:[ 9960.134048]  [<ffffffff812007f2>] generic_shutdown_super+0x72/0xf0
13:12:33:[ 9960.136150]  [<ffffffff81200bc2>] kill_anon_super+0x12/0x20
13:12:33:[ 9960.138127]  [<ffffffffa0b2bb92>] lustre_kill_super+0x32/0x50 [obdclass]
13:12:33:[ 9960.140217]  [<ffffffff81200f79>] deactivate_locked_super+0x49/0x60
13:12:33:[ 9960.142205]  [<ffffffff81201576>] deactivate_super+0x46/0x60
13:12:33:[ 9960.144173]  [<ffffffff8121e9b5>] mntput_no_expire+0xc5/0x120
13:12:33:[ 9960.146059]  [<ffffffff8121faf0>] SyS_umount+0xa0/0x3b0
13:12:33:[ 9960.147941]  [<ffffffff81696a09>] system_call_fastpath+0x16/0x1b
13:12:33:[10046.552431] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.6.29@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
13:12:33:[10046.556737] LustreError: Skipped 495 previous similar messages
13:12:33:[10080.149184] INFO: task umount:22669 blocked for more than 120 seconds.
13:12:33:[10080.151185] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
13:12:33:[10080.153204] umount          D ffff880039e8ba00     0 22669  22668 0x00000080
13:12:33:[10080.155238]  ffff880039d43a10 0000000000000082 ffff880037519f60 ffff880039d43fd8
13:12:33:[10080.157558]  ffff880039d43fd8 ffff880039d43fd8 ffff880037519f60 ffff880039e8bb28
13:12:33:[10080.159579]  ffff880039e8bb30 7fffffffffffffff ffff880037519f60 ffff880039e8ba00
13:12:33:[10080.161883] Call Trace:
13:12:33:[10080.163346]  [<ffffffff8168bac9>] schedule+0x29/0x70
13:12:33:[10080.165282]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
13:12:33:[10080.166946]  [<ffffffff811daa99>] ? discard_slab+0x39/0x50
13:12:33:[10080.168727]  [<ffffffff811dc58a>] ? __slab_free+0x2ca/0x2f0
13:12:33:[10080.170347]  [<ffffffffa0ae88e1>] ? llog_free_handle+0x121/0x400 [obdclass]
13:12:33:[10080.172222]  [<ffffffff8168bea6>] wait_for_completion+0x116/0x170
13:12:33:[10080.173817]  [<ffffffff810c4fd0>] ? wake_up_state+0x20/0x20
13:12:33:[10080.175549]  [<ffffffffa0fd5aee>] mgs_ir_fini_fs+0x27e/0x2ec [mgs]
13:12:33:[10080.177123]  [<ffffffffa0fca329>] mgs_put_fsdb+0x49/0x910 [mgs]
13:12:33:[10080.178850]  [<ffffffffa0fcacfc>] mgs_cleanup_fsdb_list+0x6c/0x90 [mgs]
13:12:33:[10080.180440]  [<ffffffffa0fb0a47>] mgs_device_fini+0x97/0x5b0 [mgs]
13:12:33:[10080.182902]  [<ffffffffa0b23f94>] class_cleanup+0x7f4/0xd90 [obdclass]
13:12:33:[10080.184978]  [<ffffffffa0b26f73>] class_process_config+0x2323/0x3340 [obdclass]
13:12:33:[10080.186644]  [<ffffffffa0b16679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
13:12:33:[10080.188473]  [<ffffffffa0b2807f>] class_manual_cleanup+0xef/0x810 [obdclass]
13:12:33:[10080.190134]  [<ffffffffa0b58360>] server_put_super+0xb20/0xcd0 [obdclass]
13:12:33:[10080.191985]  [<ffffffff812007f2>] generic_shutdown_super+0x72/0xf0
13:12:33:[10080.193641]  [<ffffffff81200bc2>] kill_anon_super+0x12/0x20
13:12:33:[10080.195308]  [<ffffffffa0b2bb92>] lustre_kill_super+0x32/0x50 [obdclass]
13:12:33:[10080.197089]  [<ffffffff81200f79>] deactivate_locked_super+0x49/0x60
13:12:33:[10080.198671]  [<ffffffff81201576>] deactivate_super+0x46/0x60
13:12:33:[10080.200426]  [<ffffffff8121e9b5>] mntput_no_expire+0xc5/0x120
13:12:33:[10080.201976]  [<ffffffff8121faf0>] SyS_umount+0xa0/0x3b0
13:12:33:[10080.203649]  [<ffffffff81696a09>] system_call_fastpath+0x16/0x1b
13:12:33:[10200.205111] INFO: task umount:22669 blocked for more than 120 seconds.
13:12:33:[10200.206778] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this

Info required for matching: replay-vbr 1b |
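For reference, the sequence the test runs before the hang can be read off the DEBUG MARKER lines above. A minimal sketch of that replay-barrier and failover sequence, run as root on the MDS node (device and mount-point names are the ones from this run; on this configuration the MGS appears to be co-located with MDT0000, which is why the umount descends into MGS cleanup):

    # Freeze the MDT so that operations after this point must be replayed:
    /usr/sbin/lctl set_param mdd.lustre-MDT0000.sync_permission=0
    /usr/sbin/lctl set_param mdt.lustre-MDT0000.commit_on_sharing=0
    sync; sync; sync
    /usr/sbin/lctl --device lustre-MDT0000 notransno   # stop assigning transaction numbers
    /usr/sbin/lctl --device lustre-MDT0000 readonly    # set the osd-zfs device read-only
    /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000

    # Fail the MDS over; this is the umount that never returns:
    umount -d /mnt/lustre-mds1

Per the stack trace, the barrier steps complete normally; the umount then blocks in wait_for_completion() under mgs_ir_fini_fs() during MGS device teardown, so the hang is in MGS imperative-recovery cleanup rather than in the MDT itself. |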
| Comments |
| Comment by James Casper [ 28/Apr/17 ] |
|
This looks like a dupe of |
| Comment by Sarah Liu [ 26/Jun/17 ] |
|
dup of |