Lustre / LU-14741

Close RPC might get stuck behind normal RPCs waiting for slot


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.15.0
    • Affects Versions: Lustre 2.12.6, Lustre 2.12.7, Lustre 2.15.0

    Description

      It looks like obd_get_mod_rpc_slot places all RPCs waiting for a slot into a single exclusive waitq:

                      wait_event_idle_exclusive(cli->cl_mod_rpcs_waitq,
                                                obd_mod_rpc_slot_avail(cli,
                                                                       close_req));
              } while (true);
      }
      EXPORT_SYMBOL(obd_get_mod_rpc_slot);

      The problem is that CLOSE RPCs have a higher chance of being sent: a slot can be available to a close RPC but not to a normal one. So if a CLOSE RPC completes and frees a slot, only the one waiter at the head of the waitq is woken up, and if it happens to be a non-close RPC, it will find no slot it can use, go back to sleep, and nothing will wake up the close RPC further down the list.

      Normally this is not too visible a problem, because the hope is that eventually a normal RPC or a few will complete and the close RPC will get its turn. But sometimes the entire available queue is plugged with requests waiting on, say, an open lock that needs the close to finish first, and if the close is stuck down the list we have a deadlock. This seems to be especially common with NFS servers, but it could also manifest on master now that we have opencache enabled by default.

      We should either have separate waitqs for close and non-close RPCs, or do wake_up_all() when a CLOSE RPC completes.

      Attachments

        Issue Links

          Activity

            People

              Assignee: Oleg Drokin (green)
              Reporter: Oleg Drokin (green)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: