  Lustre / LU-2104

conf-sanity test 47 never completes, negative time to recovery

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.4.0
    • Fix Version/s: Lustre 2.4.0
    • 4
    • 4392

    Description

      Running conf-sanity test 47, I hit this:

      Oct  7 20:52:02 centos6-0 kernel: [ 3555.765931] Lustre: DEBUG MARKER: == conf-sanity test 47: server restart does not make client loss lru_resize settings == 20:52:02 (1349657522)
      Oct  7 20:52:05 centos6-0 kernel: [ 3558.898945] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:05 centos6-0 kernel: [ 3558.974524] Lustre: lustre-OST0000: new disk, initializing
      Oct  7 20:52:17 centos6-0 kernel: [ 3570.933331] Lustre: Failing over lustre-OST0000
      Oct  7 20:52:17 centos6-0 kernel: [ 3570.980875] Lustre: server umount lustre-OST0000 complete
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.333586] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.368642] Lustre: Found index 0 for lustre-OST0000, updating log
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.387882] Lustre: 1986:0:(ofd_fs.c:271:ofd_groups_init()) lustre-OST0000: 1 groups initialized
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.399862] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.400829] LustreError: Skipped 2 previous similar messages
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.401374] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403079] Lustre: 1943:0:(ldlm_lib.c:2163:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 4294967296
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403522] LustreError: 167-0: lustre-OST0000-osc-MDT0000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403717] Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403980] Lustre: 1961:0:(ofd_obd.c:1067:ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 4 to 34
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.571352] Lustre: Failing over lustre-MDT0000
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.810938] Lustre: server umount lustre-MDT0000 complete
      Oct  7 20:52:33 centos6-0 kernel: [ 3586.396833] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.152958] LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.181514] LustreError: 166-1: MGC192.168.10.210@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.182155] Lustre: MGC192.168.10.210@tcp: Reactivating import
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.184590] Lustre: Found index 0 for lustre-MDT0000, updating log
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.190451] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.190908] Lustre: Skipped 4 previous similar messages
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.221858] Lustre: lustre-MDT0000: used disk, loading
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.222241] LustreError: 2169:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
      Oct  7 20:52:39 centos6-0 kernel: [ 3592.423088] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:53
      ...
      Oct  7 20:53:15 centos6-0 kernel: [ 3628.401357] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
      Oct  7 20:53:15 centos6-0 kernel: [ 3628.402837] Lustre: Skipped 1 previous similar message
      Oct  7 20:53:33 centos6-0 kernel: [ 3646.396098] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:53:35 centos6-0 kernel: [ 3648.401655] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:27
      Oct  7 20:55:03 centos6-0 kernel: [ 3736.396084] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:55:15 centos6-0 kernel: [ 3748.401558] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
      Oct  7 20:55:15 centos6-0 kernel: [ 3748.403016] Lustre: Skipped 12 previous similar messages
      Oct  7 20:55:33 centos6-0 kernel: [ 3766.396141] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:57:00 centos6-0 kernel: [ 3853.401873] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16
      Oct  7 20:57:00 centos6-0 kernel: [ 3853.402813] LustreError: Skipped 51 previous similar messages
      Oct  7 20:57:25 centos6-0 kernel: [ 3878.401914] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -1:-52
      ...
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.464110] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.464691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.465493] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.466011]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.466835]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.467658]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.468478] Call Trace:
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.468913]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.469451]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.470270]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.470765]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.471287]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.472065]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.472594]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473336]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473550]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473889]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.474239]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.472105] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.472691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473353] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473577]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473931]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474288]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474642] Call Trace:
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474820]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475049]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475403]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475620]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475845]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476186]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476399]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476759]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477017]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477446]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477801]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Oct  7 21:01:45 centos6-0 kernel: [ 4138.401631] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -6:-12
      ...
      Oct  7 22:31:00 centos6-0 kernel: [ 9493.401642] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -95:-27
      

      So the recovery timer is expiring multiple times and happily counting into negative territory.
      Also I find it strange that the hangcheck timer hits multiple times for the same task so quickly before bigger pauses.
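
      The "-1:-52" and "-95:-27" strings above are what you get when a remaining-time value is formatted as minutes and seconds without being clamped at zero once the recovery deadline has passed: with C's truncating division, both the quotient and the remainder go negative. A minimal userspace sketch of that formatting (assumed for illustration; this is not the actual Lustre code):

      #include <stdio.h>
      #include <time.h>

      /* Format a recovery countdown the way the console messages above do,
       * deliberately without an "if (left < 0) left = 0;" clamp, to show how
       * the negative "-1:-52" style strings come about. */
      static void print_remaining(time_t deadline, time_t now)
      {
              long left = (long)(deadline - now);

              printf("to recover in %ld:%.02ld\n", left / 60, left % 60);
      }

      int main(void)
      {
              time_t deadline = 1000;

              print_remaining(deadline, deadline - 53);           /* "0:53"    */
              print_remaining(deadline, deadline + 112);          /* "-1:-52"  */
              print_remaining(deadline, deadline + 95 * 60 + 27); /* "-95:-27" */
              return 0;
      }

      Clamping the remaining time at zero would only hide the symptom, though; the real problem reported here is that recovery never completes in the first place.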


          Activity

            [LU-2104] conf-sanity test 47 never completes, negative time to recovery

            tappro Mikhail Pershin added a comment - patch was landed

            prakash Prakash Surya (Inactive) added a comment - I'm fine resolving this since the patch landed; we don't really have a reproducer and haven't seen it in the wild since applying the fix. We can reopen if needed.

            tappro Mikhail Pershin added a comment - patch landed

            tappro Mikhail Pershin added a comment - A reboot is such a case too, since it changes the MDS connection as well; the case we need is an MDS restart during OST recovery.

            morrone Christopher Morrone (Inactive) added a comment - We are planning to use failover eventually on this filesystem. This will be the first filesystem at LLNL where we will use MDS failover. So it may be configured with a failover nid already. But I too am skeptical that anyone really did MDS failover on this system. We're using the other MDS node for another purpose at the moment.

            prakash Prakash Surya (Inactive) added a comment - "Also I suppose you've used MDS failover and that is the key." What do you mean by "MDS failover"? Unless I'm mistaken, we do not use MDS failover. When upgrading or during testing we may reboot the MDS, but it never fails over to a partner. I would not expect the NID to change since the MDS comes back up on the same node in our configuration.

            tappro Mikhail Pershin added a comment - Yes, I expect the fix to help. The bug report mentions conf-sanity test 47, which I believe can reproduce it. Also, I suppose you've used MDS failover and that is the key. The problem was an MDS reconnection from a different NID during OST recovery, so you can try to simulate that.

            prakash Prakash Surya (Inactive) added a comment - Mike, sure, I'll apply the updated patch. Do you expect it to fix the issue completely? I ask because we don't have a solid reproducer; basically we've just rebooted the OSTs many times and "eventually" we'll see the negative time.

            tappro Mikhail Pershin added a comment - Prakash, can you check the latest patch, if the issue is easy to reproduce?

            tappro Mikhail Pershin added a comment - Another change in the patch is related to the class_fail_export() function: if it is called during recovery, it should update the obd_stale_clients counter. The patch is updated.
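
            As a rough illustration of why that counter matters (a simplified userspace sketch of the bookkeeping described in this comment, not the actual patch; the struct and function names below are made up and only mimic the Lustre counters mentioned above): recovery on a target can only finish once every known client has either reconnected or been declared stale, so failing an export during recovery without bumping the stale counter leaves the target stuck in the "0 recovered, 1 in progress, and 1 unseen" state seen in the log.

            #include <stdbool.h>
            #include <stdio.h>

            /* Simplified model of target recovery bookkeeping (illustrative only). */
            struct target {
                    int max_recoverable_clients;   /* the "2 recoverable clients" in the log */
                    int recovered_clients;
                    int stale_clients;
            };

            /* Recovery may complete once all known clients are accounted for. */
            static bool recovery_can_finish(const struct target *t)
            {
                    return t->recovered_clients + t->stale_clients >=
                           t->max_recoverable_clients;
            }

            /* The change described above: an export failed during recovery must be
             * counted as stale, otherwise the check above can never become true. */
            static void fail_export_during_recovery(struct target *t)
            {
                    t->stale_clients++;
            }

            int main(void)
            {
                    struct target ost = { .max_recoverable_clients = 2 };

                    ost.recovered_clients = 1;                  /* one client reconnected */
                    printf("%d\n", recovery_can_finish(&ost));  /* 0: stuck waiting       */

                    fail_export_during_recovery(&ost);          /* evicted client counted */
                    printf("%d\n", recovery_can_finish(&ost));  /* 1: can finish          */
                    return 0;
            }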

            People

              tappro Mikhail Pershin
              green Oleg Drokin
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: