top_trans_create() followed by top_trans_stop() get stuck

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sub_thandle_register_stop_cb() is called in top_trans_start(), so when top_trans_start() is skipped (which is a valid case), top_trans_stop() waits indefinitely for stop callbacks that were never registered.
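
      The ordering problem can be sketched with a small stand-alone model. This is not Lustre code: every toy_* name, the struct layout, and the register_at_create flag are hypothetical, chosen only to mirror the functions named in this ticket. Where the real top_trans_stop() would block, the toy returns the number of sub-transactions it would still be waiting for, so the model terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: names echo the ticket, but none of this is Lustre code. */

#define TOY_MAX_SUBS 4

struct toy_sub_thandle {
	bool cb_registered;	/* a stop callback is attached to this sub-transaction */
	bool stopped;		/* set when the stop callback fires */
};

struct toy_top_thandle {
	struct toy_sub_thandle subs[TOY_MAX_SUBS];
	size_t nr_subs;
};

static void toy_register_stop_cb(struct toy_sub_thandle *st)
{
	st->cb_registered = true;
}

/*
 * Create a top transaction with nr sub-transactions.
 * register_at_create == false models the ordering described in the ticket;
 * true models registering the callbacks already at create time.
 */
static void toy_trans_create(struct toy_top_thandle *t, size_t nr,
			     bool register_at_create)
{
	size_t i;

	assert(t != NULL && nr <= TOY_MAX_SUBS);
	t->nr_subs = nr;
	for (i = 0; i < nr; i++) {
		t->subs[i].cb_registered = false;
		t->subs[i].stopped = false;
		if (register_at_create)
			toy_register_stop_cb(&t->subs[i]);
	}
}

/* Start: registers any stop callback that is not registered yet. */
static void toy_trans_start(struct toy_top_thandle *t)
{
	size_t i;

	for (i = 0; i < t->nr_subs; i++)
		if (!t->subs[i].cb_registered)
			toy_register_stop_cb(&t->subs[i]);
}

/*
 * Stop every sub-transaction and return how many of them never reported
 * "stopped". A non-zero result is where a blocking wait would hang.
 */
static size_t toy_trans_stop(struct toy_top_thandle *t)
{
	size_t i, pending = 0;

	for (i = 0; i < t->nr_subs; i++) {
		/* the callback can only fire if it was registered */
		if (t->subs[i].cb_registered)
			t->subs[i].stopped = true;
		if (!t->subs[i].stopped)
			pending++;
	}
	return pending;
}
```

      In this model, create followed directly by stop leaves every sub-transaction pending when registration happens only in start, while registering at create (which is what the subject line of the landed patch suggests) leaves none pending on the same path.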

        Activity

          [LU-7400] top_trans_create() followed by top_trans_stop() get stuck
          jgmitter Joseph Gmitter (Inactive) made changes -
          Resolution New: Fixed [ 1 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]

          jgmitter Joseph Gmitter (Inactive) added a comment -

          Landed for 2.8

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17059/
          Subject: LU-7400 lod: register stop callbacks at create
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 082eabdeaa0c2a0f536accf7028e5ab5061c2c46
          pjones Peter Jones made changes -
          Assignee Original: WC Triage [ wc-triage ] New: Alex Zhuravlev [ bzzz ]
          jgmitter Joseph Gmitter (Inactive) made changes -
          Link New: This issue is related to JFC-14 [ JFC-14 ]
          di.wang Di Wang made changes -
          Priority Original: Minor [ 4 ] New: Blocker [ 1 ]
          di.wang Di Wang added a comment -

          We saw this in the DNE soak test, which has to pass before release, so let's make it a blocker for now.
          di.wang Di Wang added a comment -

          I think we have the same issue with the commit callback, which I found in the failover soak test. I will update the patch to resolve both together. IMHO, this should get into 2.8, since it causes endless recovery.

          bzzz Alex Zhuravlev added a comment -

          For example, if a target fails during recovery and one of the preparation RPCs (e.g. fetching an EA) returns an error, the original migration process gets stuck. Even after the failed target comes back, it stays stuck. I think it makes sense to consider landing.

          adilger Andreas Dilger added a comment -

          Alex, under what kind of workload is this bug hit, and how easily does that happen? Is this a patch that needs to be landed for 2.8.0?

          People

            Assignee: bzzz Alex Zhuravlev
            Reporter: bzzz Alex Zhuravlev
            Votes: 0
            Watchers: 5
