Lustre / LU-19976

osc_init_grant ignores cl_lost_grant on reconnect, causing grant inflation


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Medium

      osc_init_grant() does not zero cl_lost_grant when computing available grant on reconnect. This causes the client to report a grant total that exceeds the server-authorised amount.

      *How it happens:*

      During an eviction and reconnect cycle, dirty pages that fail to flush are accounted via osc_free_grant(): the grant moves from cl_dirty_grant into cl_lost_grant. When osc_reconnect() fires, it zeroes cl_lost_grant and reports the current dirty+reserved totals to the server in the CONNECT RPC. However, if more RPCs fail between osc_reconnect() and the subsequent IMP_EVENT_OCD (which calls osc_init_grant()), cl_lost_grant accumulates again.

      osc_init_grant() then sets:
      cl_avail_grant = ocd_grant - cl_dirty_grant - cl_reserved_grant

      But it does not zero cl_lost_grant. The grant lost in that window is therefore double-counted: it has already been subtracted from cl_dirty_grant (so cl_avail_grant is larger by that amount), yet it also remains in cl_lost_grant. The client's view of total grant becomes:
      avail + dirty + reserved + lost > ocd_grant
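
      The sequence above can be replayed with a small stand-alone model. This is only a sketch: the struct, the model_* function names, and the byte counts are hypothetical and chosen for illustration. It mirrors the accounting steps as described in this ticket, not the real struct client_obd or the actual kernel functions, and it assumes ocd_grant is at least dirty + reserved.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the client grant counters (hypothetical, not struct client_obd). */
struct grant_model {
	uint64_t avail;    /* models cl_avail_grant */
	uint64_t dirty;    /* models cl_dirty_grant */
	uint64_t reserved; /* models cl_reserved_grant */
	uint64_t lost;     /* models cl_lost_grant */
};

/* A dirty write fails to flush: its grant moves from dirty to lost. */
void model_lose_dirty(struct grant_model *g, uint64_t bytes)
{
	assert(g->dirty >= bytes);
	g->dirty -= bytes;
	g->lost += bytes;
}

/* Reconnect step as described in the ticket: lost grant is zeroed. */
void model_reconnect(struct grant_model *g)
{
	g->lost = 0;
}

/* Behaviour as described: avail is recomputed, lost is left untouched. */
void model_init_grant(struct grant_model *g, uint64_t ocd_grant)
{
	assert(ocd_grant >= g->dirty + g->reserved);
	g->avail = ocd_grant - g->dirty - g->reserved;
}

/* The client's view of its total grant. */
uint64_t model_total(const struct grant_model *g)
{
	return g->avail + g->dirty + g->reserved + g->lost;
}

/* Replays eviction, reconnect, a late failure, then init; returns the total. */
uint64_t model_replay(uint64_t ocd_grant)
{
	struct grant_model g = { .avail = 0, .dirty = 8, .reserved = 2, .lost = 0 };

	model_lose_dirty(&g, 3);         /* eviction: dirty 8 -> 5, lost 0 -> 3 */
	model_reconnect(&g);             /* lost 3 -> 0 */
	model_lose_dirty(&g, 4);         /* failure in the window: dirty 5 -> 1, lost 0 -> 4 */
	model_init_grant(&g, ocd_grant); /* avail = ocd_grant - 1 - 2 */
	return model_total(&g);
}
```

      With ocd_grant = 20 in this model, the replay ends with avail = 17, dirty = 1, reserved = 2, lost = 4, so the total is 24. The excess of 4 is exactly the grant that was lost between reconnect and init.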

      *Fix:*

      In osc_init_grant(), zero cl_lost_grant after computing cl_avail_grant. The lost grants from the old connection were either reported to the server in osc_reconnect() or discarded; they must not carry over into the new connection's accounting.

      Affected function: osc_init_grant() in lustre/osc/osc_request.c
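
      A minimal sketch of the proposed change, written against a hypothetical stand-alone struct rather than the real code in lustre/osc/osc_request.c. The names grant_counters, sketch_init_grant_fixed, and sketch_total are assumptions for illustration; locking and the rest of the real function are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the grant fields of the client state. */
struct grant_counters {
	uint64_t avail;    /* models cl_avail_grant */
	uint64_t dirty;    /* models cl_dirty_grant */
	uint64_t reserved; /* models cl_reserved_grant */
	uint64_t lost;     /* models cl_lost_grant */
};

/* Proposed behaviour: recompute avail, then discard stale lost grant. */
void sketch_init_grant_fixed(struct grant_counters *g, uint64_t ocd_grant)
{
	/* Assumption of this sketch: the server grant covers dirty + reserved. */
	assert(ocd_grant >= g->dirty + g->reserved);
	g->avail = ocd_grant - g->dirty - g->reserved;
	g->lost = 0; /* lost grant from the old connection must not carry over */
}

/* The client's view of its total grant. */
uint64_t sketch_total(const struct grant_counters *g)
{
	return g->avail + g->dirty + g->reserved + g->lost;
}
```

      In this sketch, whatever value lost held on entry, the total after the call equals ocd_grant, which is the property the fix is meant to restore.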

      *Discovery:*

      Found via a TLA+ formal model of the OSC grant eviction and reconnect protocol. The model checker (TLC) produced a 23-state counterexample demonstrating the inflation path.

            wc-triage WC Triage
            paf0186 Patrick Farrell