[LU-7400] top_trans_create() followed by top_trans_stop() get stuck Created: 06/Nov/15 Updated: 02/Dec/15 Resolved: 02/Dec/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Alex Zhuravlev | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
Because sub_thandle_register_stop_cb() is called in top_trans_start(), skipping top_trans_start() (which is a valid case) causes top_trans_stop() to wait indefinitely for stop callbacks that were never registered. |
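The hang described above can be sketched with a toy model. This is not the Lustre code: every `toy_*` name and structure below is hypothetical, and the guard in `toy_trans_stop_guarded()` is only one possible way to avoid the wait, not necessarily what the patch does. The model assumes only what the description states: stop callbacks are registered in the start step, and the stop step waits for one callback per sub-transaction.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SUBS 4

/* Hypothetical stand-in for a sub-transaction handle. */
struct toy_sub {
	bool stop_cb_registered;	/* set only by toy_trans_start() */
};

/* Hypothetical stand-in for the top-level transaction. */
struct toy_top {
	struct toy_sub subs[TOY_MAX_SUBS];
	int nsubs;
	bool started;
};

/* Create step: sub-transactions exist, but no stop callback yet. */
static void toy_trans_create(struct toy_top *t, int nsubs)
{
	assert(nsubs >= 0 && nsubs <= TOY_MAX_SUBS);
	t->nsubs = nsubs;
	t->started = false;
	for (int i = 0; i < nsubs; i++)
		t->subs[i].stop_cb_registered = false;
}

/* Start step: the only place where stop callbacks get registered. */
static void toy_trans_start(struct toy_top *t)
{
	for (int i = 0; i < t->nsubs; i++)
		t->subs[i].stop_cb_registered = true;
	t->started = true;
}

/*
 * Stop step without a guard: it expects one stop callback per
 * sub-transaction. Returns how many of those callbacks can never
 * fire; any non-zero value corresponds to an indefinite wait.
 */
static int toy_trans_stop_unguarded(const struct toy_top *t)
{
	int never_fires = 0;

	for (int i = 0; i < t->nsubs; i++)
		if (!t->subs[i].stop_cb_registered)
			never_fires++;
	return never_fires;
}

/*
 * Stop step with a hypothetical guard: if the transaction was never
 * started, no callbacks were registered, so there is nothing to wait for.
 */
static int toy_trans_stop_guarded(const struct toy_top *t)
{
	if (!t->started)
		return 0;
	return toy_trans_stop_unguarded(t);
}
```

In this model, calling `toy_trans_create()` and then going straight to the unguarded stop leaves every sub-transaction without a callback, which corresponds to the create-then-stop sequence in the ticket title.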
| Comments |
| Comment by Gerrit Updater [ 06/Nov/15 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/17059 |
| Comment by Andreas Dilger [ 06/Nov/15 ] |
|
Alex, under what kind of workload is this bug hit, and how easily does that happen? Is this a patch that needs to be landed for 2.8.0? |
| Comment by Alex Zhuravlev [ 06/Nov/15 ] |
|
For example, if a target fails during recovery and one of the preparation RPCs (e.g. fetching an EA) returns an error, then the original migration process gets stuck. Even after the failed target comes back, it stays stuck. I think it makes sense to consider landing this. |
| Comment by Di Wang [ 12/Nov/15 ] |
|
I think we also have this issue for the commit callback, which I found in the failover soak test. I will update the patch to resolve both together. IMHO, this should get into 2.8, since it will cause endless recovery. |
| Comment by Di Wang [ 12/Nov/15 ] |
|
Since we saw this in the DNE soak test, which we have to pass before release, let's make it a blocker for now. |
| Comment by Gerrit Updater [ 02/Dec/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/17059/ |
| Comment by Joseph Gmitter (Inactive) [ 02/Dec/15 ] |
|
Landed for 2.8 |