failed peer discovery still taking too long

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.13.0
    • Lustre 2.13.0
    • None
    • 3

    Description

      On master, when running conf-sanity I often see mount stuck in the following stack
      trace:

      n:lustre-release# stack1 llog
      29833 llog_process_th
      [<ffffffffc06be64b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
      [<ffffffffc06be930>] LNetPrimaryNID+0x70/0x1a0 [lnet]
      [<ffffffffc0990ade>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
      [<ffffffffc098518c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
      [<ffffffffc09580c2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
      [<ffffffffc09587c3>] client_import_add_conn+0x13/0x20 [ptlrpc]
      [<ffffffffc074efa9>] class_add_conn+0x419/0x680 [obdclass]
      [<ffffffffc0750bc6>] class_process_config+0x19b6/0x27e0 [obdclass]
      [<ffffffffc0753644>] class_config_llog_handler+0x934/0x14d0 [obdclass]
      [<ffffffffc0717904>] llog_process_thread+0x834/0x1550 [obdclass]
      [<ffffffffc071902f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
      [<ffffffff810b252f>] kthread+0xcf/0xe0
      [<ffffffff816b8798>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      conf-sanity has some tests that use bogus NIDs like 1.2.3.4 and 4.3.2.1. These are actually real IPv4 addresses, but AFAICT they just discard all packets. I can see that the discovery thread cancels discovery on these peers, but the llog_process_thread seems to stay in lnet_discover_peer_locked() for up to 60 seconds afterwards. Looking at the code, I can't see how it would get woken up in this case. Why doesn't lnet_peer_cancel_discovery() wake up the waiters on lp_dc_waitq? Or why don't we use schedule_timeout() with the discovery/transaction timeout in lnet_discover_peer_locked()?
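
The two questions above can be illustrated with a small userspace sketch. This is not LNet code: it uses POSIX threads in place of the kernel wait queue, and every name in it (demo_peer, demo_cancel_discovery, demo_wait_for_discovery) is hypothetical. It only shows the general pattern being asked about, namely that cancelling should wake the waiters and that the wait itself should be bounded by a timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a peer under discovery. */
struct demo_peer {
	pthread_mutex_t lock;
	pthread_cond_t  waitq;       /* plays the role of lp_dc_waitq */
	bool            discovering; /* discovery still in flight */
	int             error;       /* result recorded on completion or cancel */
};

static void demo_peer_init(struct demo_peer *p)
{
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->waitq, NULL);
	p->discovering = true;
	p->error = 0;
}

/* Cancel discovery and wake every waiter, so nobody sleeps on a dead peer. */
static void demo_cancel_discovery(struct demo_peer *p, int error)
{
	pthread_mutex_lock(&p->lock);
	p->discovering = false;
	p->error = error;
	pthread_cond_broadcast(&p->waitq);
	pthread_mutex_unlock(&p->lock);
}

/*
 * Wait for discovery to end, but never longer than timeout_ms.
 * Returns the recorded result if discovery ended, -ETIMEDOUT otherwise
 * (any other wait failure is also reported as a timeout in this sketch).
 */
static int demo_wait_for_discovery(struct demo_peer *p, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0;

	/* Build an absolute deadline, as pthread_cond_timedwait() expects. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&p->lock);
	/* Re-check the predicate after each wake-up (spurious wake-ups). */
	while (p->discovering && rc == 0)
		rc = pthread_cond_timedwait(&p->waitq, &p->lock, &deadline);
	rc = p->discovering ? -ETIMEDOUT : p->error;
	pthread_mutex_unlock(&p->lock);
	return rc;
}
```

With this shape, a waiter returns either as soon as demo_cancel_discovery() runs or when its own deadline expires, whichever comes first, instead of depending on an event that may never arrive.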

          Activity

            [LU-10931] failed peer discovery still taking too long

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45898/
            Subject: LU-10931 lnet: handle unlink before send completes
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: cd3038b769ba1b7c5a4888ad84bdf03ecf51c709

            gerrit Gerrit Updater added a comment -

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45898
            Subject: LU-10931 lnet: handle unlink before send completes
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 3fa5fc5e6df44968f7503e4fa7cd555d5d24d32e

            jamesanunez James Nunez (Inactive) added a comment -

            recovery-small test 136 is running again for the master branch.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35949/
            Subject: LU-10931 tests: resume testing of recovery-small 136
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ce47ba3d0983e341f7c9a62da5e851933ff4f307

            gerrit Gerrit Updater added a comment -

            James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35949
            Subject: LU-10931 tests: resume testing of recovery-small 136
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 1555963b1ce16f99460c153d7cadf133864587d0

            jamesanunez James Nunez (Inactive) added a comment -

            Reopening this ticket because recovery-small test 136 is still on the ALWAYS_EXCEPT list. We need to submit a patch to remove this test from the list and confirm that the patch fixes the issues revealed by this test.

            pjones Peter Jones added a comment -

            Landed for 2.13

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35444/
            Subject: LU-10931 lnet: handle unlink before send completes
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d8fc5c23fe541e0ff6ce5bec6302957714c3f69f

            gerrit Gerrit Updater added a comment -

            Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35444
            Subject: LU-10931 lnet: handle LNetMDUnlink case
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 97594d111f7bee96e056025dcad7f0d9bac91369

            ashehata Amir Shehata (Inactive) added a comment -

            If LNetMDUnlink() is called on an MD with md->md_refcount > 0, then the EQ callback isn't called. What's happening here is that the response times out before the send completes, so we still have a refcount on the MD and the unlink gets dropped on the floor. The send then completes, but because we've already timed out, the REPLY for the GET is dropped. Now we're left with a peer that is in:
            LNET_PEER_MULTI_RAIL
            LNET_PEER_DISCOVERING
            LNET_PEER_PING_SENT
            But no more events are coming to it, and the discovery never completes.
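
The sequence described in this comment can be modelled with a toy state machine. The sketch below is an illustration only, not the LNet implementation and not the merged patch: the struct, the function names, and the notify_on_zombie switch are all hypothetical. It shows how an unlink that arrives while the MD is still referenced can leave the peer waiting forever, and how delivering an error event when the last reference drops would let it make progress.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the peer's discovery progress. */
enum demo_state { DEMO_PING_SENT, DEMO_DONE, DEMO_FAILED };

/* Hypothetical stand-in for a memory descriptor with one send in flight. */
struct demo_md {
	int             refcount;       /* in-flight operations using the MD */
	bool            unlink_pending; /* unlink requested while referenced */
	bool            unlinked;
	enum demo_state peer_state;
};

static struct demo_md demo_md_start(void)
{
	/* One reference held by the ping that has just been sent. */
	struct demo_md md = { 1, false, false, DEMO_PING_SENT };
	return md;
}

/* Event callback: the only way the peer state ever moves forward. */
static void demo_event(struct demo_md *md, int status)
{
	md->peer_state = (status == 0) ? DEMO_DONE : DEMO_FAILED;
}

/* Unlink on reply timeout: with references held, no event is delivered. */
static void demo_md_unlink(struct demo_md *md)
{
	if (md->refcount > 0) {
		md->unlink_pending = true; /* deferred until the last ref drops */
		return;
	}
	md->unlinked = true;
	demo_event(md, -ECANCELED);
}

/*
 * Send completion drops the reference. If an unlink was deferred, the MD
 * goes away here. With notify_on_zombie == false nobody is told, which
 * models the stall; with true, the handler receives -ETIMEDOUT.
 */
static void demo_send_done(struct demo_md *md, bool notify_on_zombie)
{
	md->refcount--;
	if (md->refcount == 0 && md->unlink_pending) {
		md->unlinked = true;
		if (notify_on_zombie)
			demo_event(md, -ETIMEDOUT);
	}
}
```

Calling demo_md_unlink() and then demo_send_done(&md, false) leaves peer_state at DEMO_PING_SENT even though the MD is gone, which corresponds to the stuck discovery above. The same sequence with notify_on_zombie set to true ends in DEMO_FAILED, so a waiter would have something to react to.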

            People

              ashehata Amir Shehata (Inactive)
              jhammond John Hammond