Lustre / LU-13986

livelock is possible in distribute_txn_commit_thread()


Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • None
    • None
    • 3

    Description

      A recent patch to update_trans.c changed how distribute_txn_commit_thread() waits for more work to do.

      Previously, it used an explicit "wait_event()" that listed every condition to wait for. After waking, it rechecked each condition and performed the corresponding action where needed.

      The patch changed it to check each condition only once per loop iteration. If a condition is true, the corresponding action is performed and a flag is set. If no condition was true (the flag is still clear), the thread waits; otherwise it loops and rechecks all conditions.

      One of the "if (condition) { do work }" stanzas in the loop tests a condition that should not wake the loop: "batchid" was never tested in the old wait_event(). That stanza nevertheless sets the flag whenever its condition is true.
      Because doing that stanza's work does not make its condition false, the flag is set on every pass while the condition holds, so the loop can spin indefinitely instead of sleeping.

      The task state serves as the flag here. The "__set_current_state(TASK_RUNNING);" call in the batchid stanza should be removed, so that the value
      of batchid cannot stop the loop from sleeping (calling 'schedule()').
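      The userspace model below is a sketch of the livelock and the fix, not the Lustre code. All names (struct loop_model, pending_work, seen_batchid, loop_iteration(), iterations_until_sleep()) are hypothetical, and the batchid comparison is an assumed stand-in for the real condition, which this report does not show. A boolean "running" plays the role of the task state: each stanza that sets it models __set_current_state(TASK_RUNNING), and the thread only really sleeps in schedule() if no stanza set it.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of a kernel thread loop that checks each
 * condition once per iteration. "running" stands in for the task state.
 */
struct loop_model {
	int pending_work;           /* real work: consumed by its stanza */
	unsigned long batchid;      /* assumed: not consumed by the loop */
	unsigned long seen_batchid;
	int sleeps;                 /* times the loop actually slept */
};

/* One loop iteration; returns true if the thread would sleep in schedule(). */
static bool loop_iteration(struct loop_model *m, bool buggy)
{
	/* Models set_current_state(<sleeping state>) at the top of the loop. */
	bool running = false;

	if (m->pending_work > 0) {
		m->pending_work--;  /* doing the work clears this condition */
		running = true;     /* __set_current_state(TASK_RUNNING) */
	}

	if (m->batchid > m->seen_batchid) {
		/* Some action that leaves this condition true. */
		if (buggy)
			running = true; /* the call the fix removes */
	}

	if (!running) {
		m->sleeps++;        /* schedule() really sleeps */
		return true;
	}
	return false;               /* schedule() does not sleep; loop again */
}

/* Iterations until the first real sleep, or -1 if none within max_iters. */
static int iterations_until_sleep(struct loop_model *m, bool buggy,
				  int max_iters)
{
	assert(max_iters > 0);
	for (int i = 0; i < max_iters; i++)
		if (loop_iteration(m, buggy))
			return i + 1;
	return -1;
}
```

      Starting from two pending items and a batchid condition that stays true, the buggy variant never reaches the sleep point once the work drains, while the fixed variant does the two items and then sleeps on the third iteration.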

            People

              Assignee: neilb Neil Brown
              Reporter: neilb Neil Brown
              Votes: 0
              Watchers: 3