[LU-12834] MDT hung during failover
| Created: | 07/Oct/19 | Updated: | 19/Sep/22 |
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Hongchao Zhang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | soak |
| Environment: | lustre-master-ib #328 EL7.7 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
1 MDT got stuck after it came back from a reboot. This issue looks like

[2019-10-04T18:13:18+00:00] INFO: template[/etc/ssh/sshd_config] sending restart action to service[sshd] (delayed)
[2019-10-04T18:13:18+00:00] INFO: Processing service[sshd] action restart (ssh::server line 19)
[2019-10-04T18:13:18+00:00] INFO: service[sshd] restarted
[2019-10-04T18:13:18+00:00] INFO: Chef Run complete in 40.090920258 seconds
[2019-10-04T18:13:18+00:00] INFO: Running report handlers
[2019-10-04T18:13:18+00:00] INFO: Creating JSON run report
[2019-10-04T18:13:18+00:00] INFO: Report handlers complete
[ 238.881257] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
[ 238.891233] alg: No test for adler32 (adler32-zlib)
[ 239.698265] Lustre: Lustre: Build Version: 2.12.58_104_g279c264
[ 239.880390] LNet: Using FMR for registration
[ 239.882506] LNetError: 215:0:(o2iblnd_cb.c:2496:kiblnd_passive_connect()) Can't accept conn from 192.168.1.118@o2ib on NA (ib0:0:192.168.1.109): bad dst nid 192.168.1.109@o2ib
[ 239.915188] LNet: Added LNI 192.168.1.109@o2ib [8/256/0/180]
[ 239.969648] LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:321: MMP interval 42 higher than expected, please wait.
[ 239.969648]
[ 292.347499] LDISKFS-fs (dm-1): recovery complete
[ 292.353139] LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,user_xattr,no_mbcache,nodelalloc
[ 293.137144] Lustre: osd-ldiskfs create tunables for soaked-MDT0001
[ 293.641326] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 299.332695] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 28 clients reconnect
[ 300.344031] Lustre: soaked-MDT0001: Connection restored to 431caaf2-d303-4 (at 192.168.1.136@o2ib)
[ 301.115589] Lustre: soaked-MDT0001: Connection restored to 81b942e1-4e72-4 (at 192.168.1.125@o2ib)
[ 302.209118] Lustre: soaked-MDT0001: Connection restored to d81c2840-9c51-4 (at 192.168.1.128@o2ib)
[ 302.219237] Lustre: Skipped 2 previous similar messages
[ 305.751655] Lustre: soaked-MDT0001: Connection restored to 6780d2ed-c1f6-4 (at 192.168.1.119@o2ib)
[ 305.761769] Lustre: Skipped 5 previous similar messages
[ 312.381584] Lustre: soaked-MDT0001: Connection restored to 156ff267-111c-4 (at 192.168.1.122@o2ib)
[ 312.391687] Lustre: Skipped 2 previous similar messages
[ 320.759167] Lustre: soaked-MDT0001: Connection restored to 925c0ae9-c415-4 (at 192.168.1.126@o2ib)
[ 320.769289] Lustre: Skipped 4 previous similar messages
[ 338.178116] Lustre: soaked-MDT0001: Connection restored to soaked-MDT0001-lwp-OST0008_UUID (at 192.168.1.104@o2ib)
[ 338.189826] Lustre: Skipped 21 previous similar messages
[ 348.048244] Lustre: soaked-MDT0001: Recovery over after 0:49, of 28 clients 28 recovered and 0 were evicted.
[ 548.286678] Lustre: mdt00_005: service thread pid 5290 was inactive for 200.215 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 548.286697] Lustre: mdt01_004: service thread pid 5269 was inactive for 200.229 seconds. The thread might be hung, or it might only be slow and will resume later.
Dumping the stack trace for debugging purposes:
[ 548.286701] Pid: 5271, comm: mdt01_006 3.10.0-1062.el7_lustre.x86_64 #1 SMP Mon Sep 30 22:06:44 UTC 2019
[ 548.286702] Lustre: Skipped 5 previous similar messages
[ 548.286703] Call Trace:
[ 548.286811] [<ffffffffc1035b10>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 548.286864] [<ffffffffc1037caf>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
[ 548.286917] [<ffffffffc103a561>] ldlm_cli_enqueue+0x421/0x930 [ptlrpc]
[ 548.286935] [<ffffffffc1655d62>] osp_md_object_lock+0x162/0x2d0 [osp]
[ 548.286959] [<ffffffffc1566974>] lod_object_lock+0xf4/0x780 [lod]
[ 548.286980] [<ffffffffc15ebbfe>] mdd_object_lock+0x3e/0xe0 [mdd]
[ 548.287009] [<ffffffffc14847d1>] mdt_remote_object_lock_try+0x1e1/0x520 [mdt]
[ 548.287028] [<ffffffffc1484b3a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[ 548.287050] [<ffffffffc149947e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
[ 548.287071] [<ffffffffc149ad75>] mdt_reint_rename+0x2c5/0x2b60 [mdt]
[ 548.287092] [<ffffffffc14a6883>] mdt_reint_rec+0x83/0x210 [mdt]
[ 548.287110] [<ffffffffc1480930>] mdt_reint_internal+0x7b0/0xba0 [mdt]
[ 548.287129] [<ffffffffc148be37>] mdt_reint+0x67/0x140 [mdt]
[ 548.287216] [<ffffffffc10d772a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 548.287278] [<ffffffffc1079976>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 548.287338] [<ffffffffc107d4ac>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 548.287345] [<ffffffff838c50d1>] kthread+0xd1/0xe0
[ 548.287350] [<ffffffff83f8bd37>] ret_from_fork_nospec_end+0x0/0x39
[ 548.287383] [<ffffffffffffffff>] 0xffffffffffffffff
[ 548.287386] Pid: 5270, comm: mdt01_005 3.10.0-1062.el7_lustre.x86_64 #1 SMP Mon Sep 30 22:06:44 UTC 2019
[ 548.287387] Call Trace:
[ 548.287463] [<ffffffffc1035b10>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
[ 548.287516] [<ffffffffc1037caf>] ldlm_cli_enqueue_fini+0x96f/0xdf0 [ptlrpc]
[ 548.287568] [<ffffffffc103a561>] ldlm_cli_enqueue+0x421/0x930 [ptlrpc]
[ 548.287582] [<ffffffffc1655d62>] osp_md_object_lock+0x162/0x2d0 [osp]
[ 548.287599] [<ffffffffc1566974>] lod_object_lock+0xf4/0x780 [lod]
[ 548.287614] [<ffffffffc15ebbfe>] mdd_object_lock+0x3e/0xe0 [mdd]
[ 548.287634] [<ffffffffc14847d1>] mdt_remote_object_lock_try+0x1e1/0x520 [mdt]
[ 548.287678] [<ffffffffc1484b3a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[ 548.287701] [<ffffffffc149947e>] mdt_rename_lock+0xbe/0x4b0 [mdt]
[ 548.287722] [<ffffffffc149ad75>] mdt_reint_rename+0x2c5/0x2b60 [mdt]
[ 548.287744] [<ffffffffc14a6883>] mdt_reint_rec+0x83/0x210 [mdt]
[ 548.287764] [<ffffffffc1480930>] mdt_reint_internal+0x7b0/0xba0 [mdt]
[ 548.287784] [<ffffffffc148be37>] mdt_reint+0x67/0x140 [mdt]
[ 548.287863] [<ffffffffc10d772a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[ 548.287930] [<ffffffffc1079976>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[ 548.287996] [<ffffffffc107d4ac>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[ 548.288001] [<ffffffff838c50d1>] kthread+0xd1/0xe0
[ 548.288005] [<ffffffff83f8bd37>] ret_from_fork_nospec_end+0x0/0x39
[ 548.288017] [<ffffffffffffffff>] 0xffffffffffffffff |
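The stack traces show mdt service threads blocked in ldlm_completion_ast() underneath mdt_rename_lock() -> mdt_remote_object_lock() -> osp_md_object_lock(), i.e. the rename path is waiting for the cross-MDT rename lock to be granted by the other MDT and never gets a reply. A minimal diagnostic sketch for the stuck MDS is below; the target name comes from the console log above, the pid from the watchdog message, and the output path is only an example:

    # confirm that local recovery finished on the restarted target
    lctl get_param mdt.soaked-MDT0001.recovery_status

    # dump the kernel stack of one of the inactive mdt service threads reported by the watchdog
    cat /proc/5290/stack

    # capture the Lustre debug log for later analysis
    lctl dk /tmp/lustre-debug-$(hostname).log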
| Comments |
| Comment by Peter Jones [ 08/Oct/19 ] |
|
Hongchao, can you please investigate? Thanks, Peter |
| Comment by Hongchao Zhang [ 09/Oct/19 ] |
|
It could be related to |
| Comment by Peter Jones [ 09/Oct/19 ] |
|
hongchao.zhang you mean that the fix is not 100% effective? |
| Comment by Hongchao Zhang [ 09/Oct/19 ] |
|
the patch https://review.whamcloud.com/#/c/34410/ in |
| Comment by Peter Jones [ 09/Oct/19 ] |
|
Ah I see. So you are suggesting that we utilize this option on soak? |
| Comment by Hongchao Zhang [ 09/Oct/19 ] |
|
Yes, it would be better to disable rename between MDTs until it is fixed thoroughly. |
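For reference, a sketch of how the workaround might be applied on the MDS nodes. This assumes the option added by the patch above is the per-MDT enable_remote_rename tunable; if the patch uses a different parameter name, substitute it accordingly:

    # temporarily disable cross-MDT (remote) rename on all local MDTs
    lctl set_param mdt.*.enable_remote_rename=0

    # or, on the MGS, make the setting persistent across MDS restarts
    lctl set_param -P mdt.*.enable_remote_rename=0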
| Comment by Sarah Liu [ 09/Oct/19 ] |
|
Got it, I will disable it. |
| Comment by Andreas Dilger [ 05/Nov/21 ] |
|
According to comments in I would recommend re-enabling remote rename on soak, so that we can see whether this issue has been fixed properly, since disabling remote rename is at best a temporary workaround. |
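A short sketch of the corresponding re-enable, under the same assumption that the workaround was applied through the enable_remote_rename tunable:

    # on the MGS, restore the persistent default so cross-MDT renames are exercised again
    lctl set_param -P mdt.*.enable_remote_rename=1

    # verify on each MDS
    lctl get_param mdt.*.enable_remote_rename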
| Comment by Andreas Dilger [ 05/Nov/21 ] |
|
Hmm, it looks like the |