[LU-1600] lnet_nid2peer_locked() has race with shutting down of LNet Created: 05/Jul/12 Updated: 24/Jul/12 Resolved: 24/Jul/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Liang Zhen (Inactive) | Assignee: | Liang Zhen (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 4542 |
| Description |
|
lnet_nid2peer_locked()->lnet_find_peer_locked() can return NULL while LNet is shutting down, which means lnet_nid2peer_locked() can allocate a new peer and try to insert it into the peer table. If one thread drops the lock to allocate the new peer, another thread may already have finalized everything in LNet, so the first thread will crash the system when it tries to retake the lock and access the peer table after the allocation. The simple solution is to take an extra refcount on the peer table (the number of peers) before allocating the new peer, because the shutdown thread always has to wait for the number of peers to drop to zero before going on to the next step. This bug was not introduced by the new LNet, but the new LNet exposes it easily. |
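The proposed fix can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the landed patch: `struct peer_table`, `nid2peer_locked()`, `shutdown_begin()`, `shutdown_may_finalize()`, the `unlocked_hook` parameter and the boolean "lock" are all hypothetical stand-ins, and only the `pt_number` reservation mirrors the idea described above. The hook runs while the lock is dropped, so a single-threaded caller can replay the racy window deterministically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for an LNet peer and its peer table. */
struct peer {
    uint64_t     nid;
    struct peer *next;
};

struct peer_table {
    bool         locked;        /* models the lock protecting the table */
    bool         shutting_down; /* set once shutdown has begun */
    int          pt_number;     /* peers in the table plus in-flight allocations */
    struct peer *head;
};

void table_lock(struct peer_table *pt)   { assert(!pt->locked); pt->locked = true; }
void table_unlock(struct peer_table *pt) { assert(pt->locked);  pt->locked = false; }

/* Lookup only; the caller must hold the lock. */
struct peer *find_peer_locked(struct peer_table *pt, uint64_t nid)
{
    struct peer *p;

    assert(pt->locked);
    for (p = pt->head; p != NULL; p = p->next)
        if (p->nid == nid)
            return p;
    return NULL;
}

/*
 * Look up a peer and create it if absent. Called and returns with the lock
 * held. unlocked_hook (may be NULL) runs while the lock is dropped and stands
 * in for whatever another thread does in that window.
 */
struct peer *nid2peer_locked(struct peer_table *pt, uint64_t nid,
                             void (*unlocked_hook)(struct peer_table *))
{
    struct peer *p = find_peer_locked(pt, nid);
    struct peer *existing;

    if (p != NULL)
        return p;

    pt->pt_number++;            /* reservation: shutdown cannot finalize now */
    table_unlock(pt);

    p = calloc(1, sizeof(*p));
    if (unlocked_hook != NULL)
        unlocked_hook(pt);

    table_lock(pt);             /* safe: the reservation kept the table alive */
    existing = find_peer_locked(pt, nid);
    if (p == NULL || pt->shutting_down || existing != NULL) {
        pt->pt_number--;        /* give the reservation back */
        free(p);
        return pt->shutting_down ? NULL : existing;
    }

    p->nid = nid;               /* the reservation becomes this peer's count */
    p->next = pt->head;
    pt->head = p;
    return p;
}

/* First phase of shutdown: refuse new peers and drop the existing ones. */
void shutdown_begin(struct peer_table *pt)
{
    table_lock(pt);
    pt->shutting_down = true;
    while (pt->head != NULL) {
        struct peer *p = pt->head;

        pt->head = p->next;
        free(p);
        pt->pt_number--;
    }
    table_unlock(pt);
}

/* Shutdown may destroy the lock and the table only once this returns true. */
bool shutdown_may_finalize(const struct peer_table *pt)
{
    return pt->pt_number == 0;
}

/* Demo hook: shutdown starts while the allocating caller has the lock dropped. */
bool finalize_seen_in_window;

void start_shutdown_in_window(struct peer_table *pt)
{
    shutdown_begin(pt);
    finalize_seen_in_window = shutdown_may_finalize(pt);
}
```

Passing `start_shutdown_in_window` as the hook replays the scenario from the description: shutdown starts during the allocation, sees a non-zero `pt_number`, and so cannot finalize until the allocating caller has retaken the lock and released its reservation.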
| Comments |
| Comment by Jodi Levi (Inactive) [ 05/Jul/12 ] |
|
Are you already looking at this one? |
| Comment by Liang Zhen (Inactive) [ 09/Jul/12 ] |
|
patch landed |
| Comment by Liang Zhen (Inactive) [ 10/Jul/12 ] |
|
I have to reopen it, the patch didn't fix all issues:
|
| Comment by Liang Zhen (Inactive) [ 11/Jul/12 ] |
|
the second patch is here: http://review.whamcloud.com/3369 |
| Comment by Peter Jones [ 23/Jul/12 ] |
|
Liang, can this now be marked as resolved? Peter |
| Comment by Liang Zhen (Inactive) [ 24/Jul/12 ] |
|
patch landed |