failed peer discovery still taking too long

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.13.0
    • Lustre 2.13.0
    • None
    • 3
    • 9223372036854775807

    Description

      On master, when running conf-sanity I often see mount stuck in the following stack
      trace:

      n:lustre-release# stack1 llog
      29833 llog_process_th
      [<ffffffffc06be64b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
      [<ffffffffc06be930>] LNetPrimaryNID+0x70/0x1a0 [lnet]
      [<ffffffffc0990ade>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
      [<ffffffffc098518c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
      [<ffffffffc09580c2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
      [<ffffffffc09587c3>] client_import_add_conn+0x13/0x20 [ptlrpc]
      [<ffffffffc074efa9>] class_add_conn+0x419/0x680 [obdclass]
      [<ffffffffc0750bc6>] class_process_config+0x19b6/0x27e0 [obdclass]
      [<ffffffffc0753644>] class_config_llog_handler+0x934/0x14d0 [obdclass]
      [<ffffffffc0717904>] llog_process_thread+0x834/0x1550 [obdclass]
      [<ffffffffc071902f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
      [<ffffffff810b252f>] kthread+0xcf/0xe0
      [<ffffffff816b8798>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      conf-sanity has some tests that use bogus NIDs like 1.2.3.4 and 4.3.2.1. These are actually real IPv4 addresses, but AFAICT they just discard all packets. I can see that the discovery thread cancels discovery on these peers, but the llog_process_thread seems to stay in lnet_discover_peer_locked() for up to 60 seconds afterwards. Looking at the code, I can't see how it would get woken up in this case. Why doesn't lnet_peer_cancel_discovery() wake up the waiters on lp_dc_waitq? Or why don't we use schedule_timeout() with the discovery/transaction timeout in lnet_discover_peer_locked()?
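
      To illustrate the two questions above, here is a minimal userspace sketch in Python. It is not LNet code. The names Peer, cancel_discovery, wait_for_discovery and run are hypothetical, and a threading.Condition only stands in for lp_dc_waitq. Assuming the cancel path clears the discovery state without waking the waiters, the waiter stays blocked until its own timeout expires. Waking the waiters on cancel lets it return right away, and a timed wait at least bounds the stall.

```python
import threading
import time


class Peer:
    """Toy stand-in for a peer under discovery (hypothetical, not the LNet struct)."""

    def __init__(self):
        self.cond = threading.Condition()  # plays the role of lp_dc_waitq
        self.discovering = True


def cancel_discovery(peer, wake_waiters):
    """Mark discovery as over; optionally wake anyone blocked on it."""
    with peer.cond:
        peer.discovering = False
        if wake_waiters:
            peer.cond.notify_all()


def wait_for_discovery(peer, timeout):
    """Block until discovery ends or `timeout` seconds pass; return seconds waited."""
    start = time.monotonic()
    with peer.cond:
        # The predicate is re-checked only after a notify or when the timeout expires.
        peer.cond.wait_for(lambda: not peer.discovering, timeout=timeout)
    return time.monotonic() - start


def run(wake_waiters, timeout=1.0, cancel_after=0.1):
    """Cancel discovery after `cancel_after` seconds and report how long the waiter blocked."""
    peer = Peer()
    canceller = threading.Timer(cancel_after, cancel_discovery, args=(peer, wake_waiters))
    canceller.start()
    waited = wait_for_discovery(peer, timeout)
    canceller.join()
    return waited
```

      Calling run(wake_waiters=False) should block for roughly the full timeout even though discovery was cancelled early, while run(wake_waiters=True) should return shortly after the cancel. The timeout argument plays the part of the bounded wait suggested with schedule_timeout().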

          Activity

            [LU-10931] failed peer discovery still taking too long
            adilger Andreas Dilger made changes -
            Link New: This issue is related to DDN-1514 [ DDN-1514 ]
            jamesanunez James Nunez (Inactive) made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Closed [ 6 ]
            jamesanunez James Nunez (Inactive) made changes -
            Labels Original: always_except
            jamesanunez James Nunez (Inactive) made changes -
            Resolution Original: Fixed [ 1 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]
            pjones Peter Jones made changes -
            Resolution New: Fixed [ 1 ]
            Status Original: Reopened [ 4 ] New: Resolved [ 5 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.13.0 [ 14290 ]
            jamesanunez James Nunez (Inactive) made changes -
            Link New: This issue is related to LU-12519 [ LU-12519 ]
            jamesanunez James Nunez (Inactive) made changes -
            Labels New: always_except
            jamesanunez James Nunez (Inactive) made changes -
            Priority Original: Minor [ 4 ] New: Critical [ 2 ]
            jamesanunez James Nunez (Inactive) made changes -
            Resolution Original: Duplicate [ 3 ]
            Status Original: Resolved [ 5 ] New: Reopened [ 4 ]

            People

              ashehata Amir Shehata (Inactive)
              jhammond John Hammond