Lustre / LU-2104

conf-sanity test 47 never completes, negative time to recovery


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.4.0
    • Fix Version: Lustre 2.4.0
    • 4
    • 4392

    Description

      While running conf-sanity test 47 I hit the following:

      Oct  7 20:52:02 centos6-0 kernel: [ 3555.765931] Lustre: DEBUG MARKER: == conf-sanity test 47: server restart does not make client loss lru_resize settings == 20:52:02 (1349657522)
      Oct  7 20:52:05 centos6-0 kernel: [ 3558.898945] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:05 centos6-0 kernel: [ 3558.974524] Lustre: lustre-OST0000: new disk, initializing
      Oct  7 20:52:17 centos6-0 kernel: [ 3570.933331] Lustre: Failing over lustre-OST0000
      Oct  7 20:52:17 centos6-0 kernel: [ 3570.980875] Lustre: server umount lustre-OST0000 complete
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.333586] LDISKFS-fs (loop3): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.368642] Lustre: Found index 0 for lustre-OST0000, updating log
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.387882] Lustre: 1986:0:(ofd_fs.c:271:ofd_groups_init()) lustre-OST0000: 1 groups initialized
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.399862] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.400829] LustreError: Skipped 2 previous similar messages
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.401374] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403079] Lustre: 1943:0:(ldlm_lib.c:2163:target_recovery_init()) RECOVERY: service lustre-OST0000, 2 recoverable clients, last_transno 4294967296
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403522] LustreError: 167-0: lustre-OST0000-osc-MDT0000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403717] Lustre: lustre-OST0000-osc-MDT0000: Connection restored to lustre-OST0000 (at 0@lo)
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.403980] Lustre: 1961:0:(ofd_obd.c:1067:ofd_orphans_destroy()) lustre-OST0000: deleting orphan objects from 4 to 34
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.571352] Lustre: Failing over lustre-MDT0000
      Oct  7 20:52:28 centos6-0 kernel: [ 3581.810938] Lustre: server umount lustre-MDT0000 complete
      Oct  7 20:52:33 centos6-0 kernel: [ 3586.396833] LustreError: 11-0: an error occurred while communicating with 0@lo. The obd_ping operation failed with -107
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.152958] LDISKFS-fs (loop2): mounted filesystem with ordered data mode. quota=on. Opts: 
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.181514] LustreError: 166-1: MGC192.168.10.210@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.182155] Lustre: MGC192.168.10.210@tcp: Reactivating import
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.184590] Lustre: Found index 0 for lustre-MDT0000, updating log
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.190451] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.190908] Lustre: Skipped 4 previous similar messages
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.221858] Lustre: lustre-MDT0000: used disk, loading
      Oct  7 20:52:38 centos6-0 kernel: [ 3592.222241] LustreError: 2169:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
      Oct  7 20:52:39 centos6-0 kernel: [ 3592.423088] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:53
      ...
      Oct  7 20:53:15 centos6-0 kernel: [ 3628.401357] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
      Oct  7 20:53:15 centos6-0 kernel: [ 3628.402837] Lustre: Skipped 1 previous similar message
      Oct  7 20:53:33 centos6-0 kernel: [ 3646.396098] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:53:35 centos6-0 kernel: [ 3648.401655] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:27
      Oct  7 20:55:03 centos6-0 kernel: [ 3736.396084] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:55:15 centos6-0 kernel: [ 3748.401558] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in 0:17
      Oct  7 20:55:15 centos6-0 kernel: [ 3748.403016] Lustre: Skipped 12 previous similar messages
      Oct  7 20:55:33 centos6-0 kernel: [ 3766.396141] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      Oct  7 20:57:00 centos6-0 kernel: [ 3853.401873] LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -16
      Oct  7 20:57:00 centos6-0 kernel: [ 3853.402813] LustreError: Skipped 51 previous similar messages
      Oct  7 20:57:25 centos6-0 kernel: [ 3878.401914] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -1:-52
      ...
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.464110] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.464691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.465493] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.466011]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.466835]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.467658]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.468478] Call Trace:
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.468913]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.469451]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.470270]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.470765]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.471287]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.472065]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.472594]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473336]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473550]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.473889]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 20:58:47 centos6-0 kernel: [ 3960.474239]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.472105] INFO: task tgt_recov:1991 blocked for more than 120 seconds.
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.472691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473353] tgt_recov     D 0000000000000002  6256  1991      2 0x00000080
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473577]  ffff880031ddde00 0000000000000046 0000000000000000 000000000000001e
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.473931]  000000000000001e 0000000000000005 ffff880031ddddd0 0000000000000282
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474288]  ffff8800728da778 ffff880031dddfd8 000000000000fba8 ffff8800728da778
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474642] Call Trace:
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.474820]  [<ffffffffa1185180>] ? check_for_clients+0x0/0x90 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475049]  [<ffffffffa1186b6d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475403]  [<ffffffffa1184f80>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475620]  [<ffffffff8108fd60>] ? autoremove_wake_function+0x0/0x40
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.475845]  [<ffffffffa118dc3d>] target_recovery_thread+0x45d/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476186]  [<ffffffff814faeee>] ? _spin_unlock_irq+0xe/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476399]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.476759]  [<ffffffff8100c14a>] child_rip+0xa/0x20
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477017]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477446]  [<ffffffffa118d7e0>] ? target_recovery_thread+0x0/0x1660 [ptlrpc]
      Oct  7 21:00:47 centos6-0 kernel: [ 4080.477801]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Oct  7 21:01:45 centos6-0 kernel: [ 4138.401631] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -6:-12
      ...
      Oct  7 22:31:00 centos6-0 kernel: [ 9493.401642] Lustre: lustre-OST0000: Denying connection for new client lustre-MDT0000-mdtlov_UUID (at 0@lo), waiting for all 2 known clients (0 recovered, 1 in progress, and 1 unseen) to recover in -95:-27
      

      So the recovery timer is expiring multiple times and happily counting down into negative territory.
      I also find it strange that the hangcheck timer hits the same task multiple times in quick succession before the bigger pauses.
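
      As an aside, the "-1:-52" / "-6:-12" strings in the log are exactly what you get in C when a remaining-seconds value goes negative and is split into minutes:seconds with truncating division and remainder. The sketch below is purely illustrative (it is not the Lustre source; `format_remaining` is a hypothetical helper), assuming the countdown is computed as deadline minus now without clamping at zero:

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <string.h>

      /* Hypothetical helper: format a remaining-seconds value as "m:s".
       * In C, division and % truncate toward zero, so a negative input
       * makes BOTH fields negative: -112 -> "-1:-52". */
      static void format_remaining(int remaining, char *buf, size_t len)
      {
          snprintf(buf, len, "%d:%d", remaining / 60, remaining % 60);
      }

      int main(void)
      {
          char buf[32];

          /* Deadline still in the future: normal countdown. */
          format_remaining(53, buf, sizeof(buf));
          assert(strcmp(buf, "0:53") == 0);      /* matches "recover in 0:53" */

          /* Deadline passed 112 seconds ago: no clamp, so it goes negative. */
          format_remaining(-112, buf, sizeof(buf));
          assert(strcmp(buf, "-1:-52") == 0);    /* matches "recover in -1:-52" */

          printf("recover in %s\n", buf);
          return 0;
      }
      ```

      Which is consistent with the timer being restarted after "recovery is timed out, evict stale exports" yet the remaining-time calculation never being clamped once the deadline is behind us.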


            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Oleg Drokin (green)