Lustre / LU-15860

ksocknal_add_peer() race results in extra ksock_conn_cb

    Description

      There appears to be a race in ksocknal_add_peer() where two ksock_conn_cb can be created for the same peer_ni:

      Bad conn_cb
      00000800:00000010:19.0:1652717291.344361:0:4361:0:(socklnd.c:170:ksocknal_create_peer()) alloc '(peer_ni)': 240 at ffff9215aa86b700 (tot 41715364).
      00000800:00000010:19.0:1652717291.344362:0:4361:0:(socklnd.c:119:ksocknal_create_conn_cb()) alloc '(conn_cb)': 200 at ffff9215aa86b600 (tot 41715564).
      00000800:00000200:19.1:1652717291.344363:0:4361:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) pre ffff9215aa86b700 1
      00000800:00000200:19.1:1652717291.344364:0:4361:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) post ffff9215aa86b700 2
      00000800:00000200:19.0:1652717291.344365:0:4361:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
      00000800:00000200:19.1:1652717291.344366:0:4361:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
      00000800:00000200:19.1:1652717291.344367:0:4361:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) pre ffff9215aa86b600 1
      00000800:00000200:19.1:1652717291.344368:0:4361:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) post ffff9215aa86b600 2
      
      Good conn_cb
      00000800:00000010:16.0:1652717291.344365:0:4360:0:(socklnd.c:119:ksocknal_create_conn_cb()) alloc '(conn_cb)': 200 at ffff9215aca22600 (tot 41715508).
      ...
      00000800:00000200:16.1:1652717291.344371:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (2)
      00000800:00000200:16.1:1652717291.344375:0:4360:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) pre ffff9215aa86b700 2
      00000800:00000200:16.1:1652717291.344375:0:4360:0:(socklnd.c:556:ksocknal_add_conn_cb_locked()) post ffff9215aa86b700 3
      00000800:00000200:16.0:1652717291.344377:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (3)
      00000800:00000200:16.1:1652717291.344378:0:4360:0:(socklnd.c:239:ksocknal_find_peer_locked()) got peer_ni [ffff9215aa86b700] -> 12345-172.18.2.8@tcp (3)
      00000800:00000200:16.1:1652717291.344378:0:4360:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) pre ffff9215aca22600 1
      00000800:00000200:16.1:1652717291.344379:0:4360:0:(socklnd_cb.c:645:ksocknal_launch_connection_locked()) post ffff9215aca22600 2
      

      The second conn_cb (ffff9215aca22600) overwrites the first (ffff9215aa86b600) in ksocknal_add_peer()->ksocknal_add_conn_cb_locked(). The first one is then no longer reachable from the peer_ni, so it gets stuck and is never freed on shutdown.
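
The overwrite pattern can be sketched in a small user-space model. This is only an illustration under stated assumptions, not the socklnd code: every `toy_*` name is hypothetical, the two racing threads are replayed sequentially in a fixed order, and the guarded variant shows one possible way to avoid the overwrite rather than the fix actually applied for this ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for ksock_conn_cb and the peer_ni. */
struct toy_conn_cb {
	int refcount;
};

struct toy_peer {
	int refcount;
	struct toy_conn_cb *conn_cb;	/* the peer tracks a single conn_cb */
};

static int live_conn_cbs;	/* conn_cbs allocated and not yet freed */

static struct toy_conn_cb *toy_create_conn_cb(void)
{
	struct toy_conn_cb *cb = calloc(1, sizeof(*cb));

	if (cb != NULL) {
		cb->refcount = 1;
		live_conn_cbs++;
	}
	return cb;
}

static void toy_free_conn_cb(struct toy_conn_cb *cb)
{
	free(cb);
	live_conn_cbs--;
}

/* Racy variant: installs cb unconditionally, overwriting any earlier one. */
static void toy_add_conn_cb(struct toy_peer *peer, struct toy_conn_cb *cb)
{
	peer->conn_cb = cb;
	peer->refcount++;	/* each installed conn_cb pins the peer */
}

/* Guarded variant: refuses to install when a conn_cb is already present. */
static bool toy_add_conn_cb_once(struct toy_peer *peer, struct toy_conn_cb *cb)
{
	if (peer->conn_cb != NULL)
		return false;
	toy_add_conn_cb(peer, cb);
	return true;
}

/* Shutdown can only release what is still reachable from the peer. */
static void toy_shutdown(struct toy_peer *peer)
{
	if (peer->conn_cb != NULL) {
		toy_free_conn_cb(peer->conn_cb);
		peer->conn_cb = NULL;
		peer->refcount--;
	}
}

/*
 * Replays the interleaving: both callers allocate a conn_cb before either
 * installs it. Returns how many conn_cbs are still allocated after shutdown,
 * or -1 if allocation failed.
 */
static int toy_replay(bool guarded)
{
	struct toy_peer peer = { .refcount = 1, .conn_cb = NULL };
	struct toy_conn_cb *first = toy_create_conn_cb();
	struct toy_conn_cb *second = toy_create_conn_cb();
	int leaked;

	if (first == NULL || second == NULL) {
		if (first != NULL)
			toy_free_conn_cb(first);
		if (second != NULL)
			toy_free_conn_cb(second);
		return -1;
	}

	if (guarded) {
		toy_add_conn_cb_once(&peer, first);
		if (!toy_add_conn_cb_once(&peer, second))
			toy_free_conn_cb(second);	/* loser discards its copy */
	} else {
		toy_add_conn_cb(&peer, first);
		toy_add_conn_cb(&peer, second);	/* first is now orphaned */
	}

	toy_shutdown(&peer);
	leaked = live_conn_cbs;

	if (!guarded)
		toy_free_conn_cb(first);	/* tidy up so the demo itself does not leak */
	return leaked;
}
```

In this model, toy_replay(false) leaves one conn_cb allocated after shutdown (the overwritten one), while toy_replay(true) leaves none. The unguarded path also bumps the peer refcount twice, matching the 1 -> 2 and 2 -> 3 transitions seen in the logs above.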

          People

            hornc Chris Horn