Lustre: LU-19928

import_select_connection() can sleep inside atomic

Details

    • Bug
    • Resolution: Unresolved
    • Medium

    Description

      As reported in https://testing.whamcloud.com/gerrit-janitor/60685/testresults/conf-sanity-special6-ldiskfs-rocky8.10_x86_64-rocky8.10_x86_64/oleg245-client.syslog.log

      Feb 26 05:30:00 oleg245-client kernel: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:289
      Feb 26 05:30:00 oleg245-client kernel: in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 37, name: kworker/u8:2
      Feb 26 05:30:00 oleg245-client kernel: CPU: 0 PID: 37 Comm: kworker/u8:2 Kdump: loaded Tainted: G           O      -------- -  - 4.18.0rh8.10-debug #2
      Feb 26 05:30:00 oleg245-client kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-8.fc42 06/10/2025
      Feb 26 05:30:00 oleg245-client kernel: Workqueue: ptlrpc_pinger ptlrpc_pinger_main [ptlrpc]
      Feb 26 05:30:00 oleg245-client kernel: Call Trace:
      Feb 26 05:30:00 oleg245-client kernel: ? dump_stack+0xbb/0x10e
      Feb 26 05:30:00 oleg245-client kernel: ? ___might_sleep.cold.92+0xd9/0x107
      Feb 26 05:30:00 oleg245-client kernel: ? __might_sleep+0x59/0xc0
      Feb 26 05:30:00 oleg245-client kernel: ? mutex_lock+0x24/0x70
      Feb 26 05:30:00 oleg245-client kernel: ? lnet_peerni_by_nid_locked+0x7f/0x1c0 [lnet]
      Feb 26 05:30:00 oleg245-client kernel: ? LNetPeerDiscovered+0x78/0x460 [lnet]
      Feb 26 05:30:00 oleg245-client kernel: ? import_select_connection+0x2ad/0xed0 [ptlrpc]
      Feb 26 05:30:00 oleg245-client kernel: ? ptlrpc_connect_import_locked+0x49c/0x1070 [ptlrpc]
      Feb 26 05:30:00 oleg245-client kernel: ? rpc_make_runnable+0xb5/0xd0
      Feb 26 05:30:00 oleg245-client kernel: ? inet_recvmsg+0x81/0x180
      Feb 26 05:30:00 oleg245-client kernel: ? update_load_avg+0x9f/0xa40
      Feb 26 05:30:00 oleg245-client kernel: ? xs_poll_check_readable+0x38/0xb0
      Feb 26 05:30:00 oleg245-client kernel: ? ptlrpc_pinger_main+0x709/0xf20 [ptlrpc]
      Feb 26 05:30:00 oleg245-client kernel: ? process_one_work+0x2c8/0x700
      Feb 26 05:30:00 oleg245-client kernel: ? worker_thread+0x296/0x6e0
      Feb 26 05:30:00 oleg245-client kernel: ? rescuer_thread+0x570/0x570
      Feb 26 05:30:00 oleg245-client kernel: ? kthread+0x1d1/0x200
      Feb 26 05:30:00 oleg245-client kernel: ? set_kthread_struct+0x70/0x70
      Feb 26 05:30:00 oleg245-client kernel: ? ret_from_fork+0x1f/0x30
       

      This happens in import_select_connection(), which calls LNetPeerDiscovered() inside the spinlock-protected loop over the import's connections. LNetPeerDiscovered() was changed in the past to use lnet_peerni_by_nid_locked(), which in turn may sleep when taking a mutex in its slow path. The code needs to be changed so that it does not refresh a connection's up-to-date state inside the loop.

      Possible way to fix this:

      • run the first pass over the loop without entering the slow path,
      • if all up-to-date connections have already been tried, refresh the status of the remaining non-up-to-date connections via the slow path, without imp_lock held
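
The two steps above can be sketched in userspace C. This is only an illustration of the locking pattern, not the actual Lustre code: struct conn, struct import, pick_cached(), refresh_stale(), select_connection() and slow_peer_discovered() are all hypothetical names, the spinlock is a plain atomic_flag standing in for imp_lock, and the assertion on atomic_depth plays the role the kernel's might_sleep() check plays in the trace above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONNS 8

/* Hypothetical stand-in for one import connection entry. */
struct conn {
	int  nid;       /* peer id; doubles as index into peer_discovered[] */
	bool uptodate;  /* cached discovery state */
	bool tried;     /* already attempted */
};

/* Hypothetical stand-in for the import; 'lock' plays the role of imp_lock. */
struct import {
	atomic_flag lock;
	struct conn conns[MAX_CONNS];
	size_t      nr_conns;
};

static _Thread_local int atomic_depth;   /* > 0 while a spinlock is held */
static bool peer_discovered[MAX_CONNS];  /* peer state behind the slow path */

static void imp_lock(struct import *imp)
{
	while (atomic_flag_test_and_set_explicit(&imp->lock, memory_order_acquire))
		;  /* spin */
	atomic_depth++;
}

static void imp_unlock(struct import *imp)
{
	atomic_depth--;
	atomic_flag_clear_explicit(&imp->lock, memory_order_release);
}

/* Slow-path stand-in: may sleep, so it must never run with the lock held. */
static bool slow_peer_discovered(int nid)
{
	assert(atomic_depth == 0);  /* userspace analogue of might_sleep() */
	assert(nid >= 0 && nid < MAX_CONNS);
	return peer_discovered[nid];
}

/* Pass 1: under the lock, consult cached state only. Returns index or -1. */
static int pick_cached(struct import *imp)
{
	int found = -1;

	imp_lock(imp);
	for (size_t i = 0; i < imp->nr_conns; i++) {
		struct conn *c = &imp->conns[i];

		if (c->uptodate && !c->tried) {
			c->tried = true;
			found = (int)i;
			break;
		}
	}
	imp_unlock(imp);
	return found;
}

/* Pass 2: refresh stale entries via the slow path with the lock dropped. */
static void refresh_stale(struct import *imp)
{
	int nids[MAX_CONNS];
	bool state[MAX_CONNS];
	size_t n = 0;

	imp_lock(imp);                       /* snapshot the stale nids */
	for (size_t i = 0; i < imp->nr_conns; i++)
		if (!imp->conns[i].uptodate)
			nids[n++] = imp->conns[i].nid;
	imp_unlock(imp);

	for (size_t i = 0; i < n; i++)       /* slow path, no lock held */
		state[i] = slow_peer_discovered(nids[i]);

	imp_lock(imp);                       /* publish; match by nid, not index */
	for (size_t i = 0; i < n; i++)
		for (size_t j = 0; j < imp->nr_conns; j++)
			if (imp->conns[j].nid == nids[i])
				imp->conns[j].uptodate = state[i];
	imp_unlock(imp);
}

static int select_connection(struct import *imp)
{
	int idx = pick_cached(imp);

	if (idx < 0) {                       /* all up-to-date conns were tried */
		refresh_stale(imp);
		idx = pick_cached(imp);
	}
	return idx;
}
```

The results are published by nid rather than by index because the connection list may change while the lock is dropped; a real fix would also have to decide how to handle entries that were removed in that window.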

            People

              tappro Mikhail Pershin