Lustre / LU-10931

failed peer discovery still taking too long


    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.13.0
    • Lustre 2.13.0

      On master, when running conf-sanity I often see mount stuck in the following stack
      trace:

      n:lustre-release# stack1 llog
      29833 llog_process_th
      [<ffffffffc06be64b>] lnet_discover_peer_locked+0x10b/0x380 [lnet]
      [<ffffffffc06be930>] LNetPrimaryNID+0x70/0x1a0 [lnet]
      [<ffffffffc0990ade>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
      [<ffffffffc098518c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
      [<ffffffffc09580c2>] import_set_conn+0xb2/0x7a0 [ptlrpc]
      [<ffffffffc09587c3>] client_import_add_conn+0x13/0x20 [ptlrpc]
      [<ffffffffc074efa9>] class_add_conn+0x419/0x680 [obdclass]
      [<ffffffffc0750bc6>] class_process_config+0x19b6/0x27e0 [obdclass]
      [<ffffffffc0753644>] class_config_llog_handler+0x934/0x14d0 [obdclass]
      [<ffffffffc0717904>] llog_process_thread+0x834/0x1550 [obdclass]
      [<ffffffffc071902f>] llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
      [<ffffffff810b252f>] kthread+0xcf/0xe0
      [<ffffffff816b8798>] ret_from_fork+0x58/0x90
      [<ffffffffffffffff>] 0xffffffffffffffff
      

      conf-sanity has some tests that use bogus NIDs like 1.2.3.4 and 4.3.2.1. These are actually real IPv4 addresses, but AFAICT they just discard all packets. I can see that the discovery thread cancels discovery on these peers, but the llog_process_thread seems to stay in lnet_discover_peer_locked() for up to 60 seconds afterwards. Looking at the code, I can't see how it would get woken up in this case. Why doesn't lnet_peer_cancel_discovery() wake up the waiters on lp_dc_waitq? Or why don't we use schedule_timeout() with the discovery/transaction timeout in lnet_discover_peer_locked()?
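
      To make the two questions concrete, here is a minimal userspace sketch of both ideas using pthreads. It is only an analogy, not the LNet code: struct demo_peer, demo_cancel_discovery() and demo_wait_for_discovery() are hypothetical names, the condition variable stands in for lp_dc_waitq, and pthread_cond_timedwait() stands in for a schedule_timeout()-style bounded sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a peer under discovery (not struct lnet_peer). */
struct demo_peer {
	pthread_mutex_t lock;
	pthread_cond_t  waitq;       /* plays the role of lp_dc_waitq */
	bool            discovering; /* discovery still in progress */
	bool            cancelled;   /* discovery was cancelled */
};

void demo_peer_init(struct demo_peer *p)
{
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->waitq, NULL);
	p->discovering = true;
	p->cancelled = false;
}

/* Idea 1: cancelling discovery also wakes every thread waiting on it. */
void demo_cancel_discovery(struct demo_peer *p)
{
	pthread_mutex_lock(&p->lock);
	p->discovering = false;
	p->cancelled = true;
	pthread_cond_broadcast(&p->waitq);
	pthread_mutex_unlock(&p->lock);
}

/*
 * Idea 2: the waiter never sleeps longer than timeout_sec.
 * Returns 0 if discovery completed, -ECANCELED if it was cancelled,
 * and -ETIMEDOUT if it was still in progress when the wait ended.
 */
int demo_wait_for_discovery(struct demo_peer *p, int timeout_sec)
{
	struct timespec deadline;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_sec;

	pthread_mutex_lock(&p->lock);
	/* Re-check the state after every wakeup; stop on timeout or error. */
	while (p->discovering && rc == 0)
		rc = pthread_cond_timedwait(&p->waitq, &p->lock, &deadline);

	if (p->cancelled)
		rc = -ECANCELED;
	else if (p->discovering)
		rc = -ETIMEDOUT;
	else
		rc = 0;
	pthread_mutex_unlock(&p->lock);
	return rc;
}
```

      With either change the waiter's worst case is bounded: a cancel returns it immediately, and a peer that silently drops packets costs at most the configured timeout rather than an unrelated 60 seconds.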

            Assignee: Amir Shehata (ashehata, Inactive)
            Reporter: John Hammond (jhammond)