[LU-1600] lnet_nid2peer_locked() has race with shutting down of LNet Created: 05/Jul/12  Updated: 24/Jul/12  Resolved: 24/Jul/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: Lustre 2.3.0

Type: Bug Priority: Blocker
Reporter: Liang Zhen (Inactive) Assignee: Liang Zhen (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4542

 Description   

lnet_nid2peer_locked()->lnet_find_peer_locked() returns NULL while LNet is shutting down, which means lnet_nid2peer_locked() can allocate a new peer and try to insert it into the peer table. If one thread drops the lock to allocate the new peer, another thread may already have finalized all of LNet by the time the allocation completes. The first thread then crashes the system, because after allocating the peer it tries to take the lock and access the peer table again.

The simple solution is to take an extra refcount on the peer table (the number of peers) before allocating the new peer, because the shutdown thread always has to wait until the number of peers drops to zero before going to the next step.
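A minimal sketch of that idea, not the landed patch: the peer count is bumped before the lock is dropped for allocation, so a shutdown path that waits for the count to reach zero cannot finalize the table in the meantime. Every name here (sketch_ptable, sketch_nid2peer_locked(), sketch_can_finalize(), and so on) is hypothetical, and the lock is modeled as a plain flag so the sketch stays single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for an LNet peer and peer table. */
struct sketch_peer {
    unsigned long long  nid;
    struct sketch_peer *next;
};

struct sketch_ptable {
    bool                locked;  /* models the LNet lock (single-threaded model) */
    int                 number;  /* peers in the table plus pending reservations */
    struct sketch_peer *peers;
};

static void sketch_lock(struct sketch_ptable *pt)
{
    assert(!pt->locked);
    pt->locked = true;
}

static void sketch_unlock(struct sketch_ptable *pt)
{
    assert(pt->locked);
    pt->locked = false;
}

/* Assumption: shutdown may only tear the table down once this is true. */
static bool sketch_can_finalize(const struct sketch_ptable *pt)
{
    return pt->number == 0;
}

static struct sketch_peer *sketch_find(struct sketch_ptable *pt,
                                       unsigned long long nid)
{
    for (struct sketch_peer *p = pt->peers; p != NULL; p = p->next)
        if (p->nid == nid)
            return p;
    return NULL;
}

/* Called with the lock held; returns with the lock held. */
static struct sketch_peer *sketch_nid2peer_locked(struct sketch_ptable *pt,
                                                  unsigned long long nid)
{
    struct sketch_peer *existing;
    struct sketch_peer *p;

    assert(pt->locked);
    p = sketch_find(pt, nid);
    if (p != NULL)
        return p;

    pt->number++;               /* reserve a slot BEFORE dropping the lock */
    sketch_unlock(pt);
    p = calloc(1, sizeof(*p));  /* allocation happens without the lock */
    sketch_lock(pt);

    /* Re-check: the allocation may have failed, or another thread may
     * have inserted the same NID while the lock was dropped. */
    existing = sketch_find(pt, nid);
    if (p == NULL || existing != NULL) {
        pt->number--;           /* give the reservation back */
        free(p);
        return existing;
    }

    p->nid = nid;
    p->next = pt->peers;
    pt->peers = p;              /* the reservation becomes the peer's count */
    return p;
}

/* Unlink and free a peer, dropping its contribution to the count. */
static void sketch_peer_remove_locked(struct sketch_ptable *pt,
                                      struct sketch_peer *peer)
{
    assert(pt->locked);
    for (struct sketch_peer **pp = &pt->peers; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == peer) {
            *pp = peer->next;
            pt->number--;
            free(peer);
            return;
        }
    }
}
```

The point of the sketch is the ordering: the count goes up while the lock is still held, so there is no window in which the table looks empty while an allocation is in flight.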

This bug was not introduced by the new LNet, but the new LNet exposes it easily.



 Comments   
Comment by Jodi Levi (Inactive) [ 05/Jul/12 ]

Are you already looking at this one?

Comment by Liang Zhen (Inactive) [ 09/Jul/12 ]

patch landed

Comment by Liang Zhen (Inactive) [ 10/Jul/12 ]

I have to reopen it; the patch didn't fix all issues:

  • thread-1: shuts down LNet and has already cleaned up all peers via lnet_peer_tables_cleanup()
  • thread-2: an LND thread that is calling lnet_parse()->lnet_nid2peer_locked()
    • lnet_find_peer_locked() returns NULL because LNet is shutting down
    • it creates a new peer, then finds that LNet is shutting down, so it just attaches the newly created peer to the deathrow list and returns
  • thread-1: lnet_peer_tables_destroy() calls LASSERT(cfs_list_empty(&ptable->pt_deathrow)), which fails and crashes
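
The interleaving above can be modeled in a few lines. This is a single-threaded sketch with hypothetical names (model_ptable, model_nid2peer_late_check(), and so on), and plain counters stand in for the peer list and the deathrow list. The "early check" variant is only one conceivable remedy, shown for contrast; it is not a description of what the second patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: counters stand in for the real lists. */
struct model_ptable {
    bool shutting_down;
    int  npeers;     /* peers currently in the table */
    int  ndeathrow;  /* peers parked on the deathrow list */
};

/* thread-1, first step: stands in for lnet_peer_tables_cleanup(). */
static void model_cleanup(struct model_ptable *pt)
{
    pt->shutting_down = true;
    pt->npeers = 0;
    pt->ndeathrow = 0;
}

/* thread-2 as described above: the peer is created first and the
 * shutdown state is noticed only afterwards, so the new peer ends up
 * on deathrow after cleanup has already run. */
static bool model_nid2peer_late_check(struct model_ptable *pt)
{
    /* lookup failed, new peer allocated here */
    if (pt->shutting_down) {
        pt->ndeathrow++;
        return false;
    }
    pt->npeers++;
    return true;
}

/* Assumed alternative: refuse to create a peer once shutdown has
 * started, so nothing new can reach deathrow. */
static bool model_nid2peer_early_check(struct model_ptable *pt)
{
    if (pt->shutting_down)
        return false;
    pt->npeers++;
    return true;
}

/* thread-1, last step: the condition that lnet_peer_tables_destroy()
 * asserts on the deathrow list. */
static bool model_destroy_ok(const struct model_ptable *pt)
{
    return pt->ndeathrow == 0;
}
```

In real code the shutdown check and the list manipulation would have to happen under the same lock as the cleanup; otherwise the early check has a race window of its own.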
Comment by Liang Zhen (Inactive) [ 11/Jul/12 ]

the second patch is here: http://review.whamcloud.com/3369

Comment by Peter Jones [ 23/Jul/12 ]

Liang

Can this now be marked as resolved?

Peter

Comment by Liang Zhen (Inactive) [ 24/Jul/12 ]

patch landed
