[LU-5568] kernel crash when network initialization failed Created: 01/Sep/14 Updated: 17/Dec/14 Resolved: 17/Dec/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Wang Shilong (Inactive) | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, patch |
| Issue Links: |
| Epic/Theme: | lnet |
| Severity: | 3 |
| Rank (Obsolete): | 15529 |
| Description |
|
[ 105.976884] Lustre: Lustre: Build Version: v2_6_51_0-g16c7568-CHANGED-3.10.1.el7_lustre |
| Comments |
| Comment by Wang Shilong (Inactive) [ 01/Sep/14 ] |
| Comment by Peter Jones [ 01/Sep/14 ] |
|
Amir, could you please review this patch? Thanks, Peter |
| Comment by Isaac Huang (Inactive) [ 03/Sep/14 ] |
|
The bug seems to have been introduced by commit 92c51841c50cc4061c20b277d3f7c4468f2a80cc. While the proposed patch fixes the symptom, it leaves the underlying API inconsistency unfixed: the LNet internal initialization APIs are somewhat transaction-like, in that if an init function fails, it cleans up after itself so the caller doesn't have to call the corresponding fini function, e.g. if lnet_prepare() fails there's no need to call lnet_unprepare(). But with 92c51841c50cc4061c20b277d3f7c4468f2a80cc, lnet_shutdown_lndnis() was removed from lnet_startup_lndnis(), so now the callers of lnet_startup_lndnis() are responsible for cleaning up if lnet_startup_lndnis() fails, which breaks the convention that init() functions clean up after themselves on failure. This kind of inconsistency will cause us trouble in the future. I'd suggest to: |
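The "transaction-like" init convention described in this comment can be illustrated with a minimal, self-contained C sketch. It is not LNet code: `demo_state`, `demo_init()`, `demo_fini()`, and the `fail_step2` switch are hypothetical names invented for illustration. The only point shown is that an init function which fails part-way releases whatever it had already acquired, so the caller never has to call the fini function after a failed init.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for module state; not a real LNet structure. */
struct demo_state {
	int *table;	/* resource acquired in step 1 */
	int *queue;	/* resource acquired in step 2 */
};

/* Release everything; safe to call on a partially initialized state. */
static void demo_fini(struct demo_state *s)
{
	assert(s != NULL);
	free(s->queue);
	free(s->table);
	s->queue = NULL;
	s->table = NULL;
}

/*
 * Returns 0 on success, -1 on failure. On failure the state is fully
 * released before returning, so the caller must NOT call demo_fini().
 * fail_step2 is a hypothetical switch used only to simulate an error.
 */
static int demo_init(struct demo_state *s, bool fail_step2)
{
	s->table = NULL;
	s->queue = NULL;

	s->table = calloc(16, sizeof(*s->table));
	if (s->table == NULL)
		return -1;	/* nothing acquired yet, nothing to undo */

	if (fail_step2)
		goto failed;	/* simulated failure after step 1 */

	s->queue = calloc(16, sizeof(*s->queue));
	if (s->queue == NULL)
		goto failed;

	return 0;

failed:
	demo_fini(s);		/* init undoes its own partial work */
	return -1;
}
```

With this shape, a caller only pairs `demo_fini()` with a successful `demo_init()`; after a failed `demo_init()` both pointers are already `NULL`.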
| Comment by Oleg Drokin [ 07/Oct/14 ] |
|
Wang, do you plan to address Isaac's points in your patch? |
| Comment by Wang Shilong (Inactive) [ 08/Oct/14 ] |
|
Hi Oleg Drokin, Yes, I did address Isaac's comments! |
| Comment by Wang Shilong (Inactive) [ 09/Oct/14 ] |
|
Hi Isaac Huang, Thanks very much for your comments. Could you take a look and give me a response regarding: "When we goto failed here, the ni is no longer on the nilist, then where is the ni going to be freed?" My reply: could lnet_shutdown_lndnis() do that? By the way, I am not sure why, when I reply to comments at redmine, it does not send an email or anything... |
| Comment by Isaac Huang (Inactive) [ 13/Oct/14 ] |
|
lnet_shutdown_lndnis() will not free the ni if the failure happened early, i.e. the ni hasn't been added to any global lists. For example, if it failed in the first few conditional checks in lnet_startup_lndni(), lnet_shutdown_lndnis() will not be called at all. |
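The ownership problem raised here can be sketched in plain C under stated assumptions. None of the names below (`demo_ni`, `demo_ni_list`, `demo_startup_ni()`, `demo_shutdown_nis()`, `demo_live_nis`) exist in LNet; they are hypothetical stand-ins. The sketch shows why a shutdown routine that walks a global list cannot reclaim an object that failed before it was linked, so the early failure path has to free it directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical network-interface object; not the real lnet_ni. */
struct demo_ni {
	struct demo_ni *next;
	int id;
};

static struct demo_ni *demo_ni_list;	/* global list of started NIs */
static int demo_live_nis;		/* allocated but not yet freed */

static struct demo_ni *demo_ni_alloc(int id)
{
	struct demo_ni *ni = calloc(1, sizeof(*ni));

	if (ni != NULL) {
		ni->id = id;
		demo_live_nis++;
	}
	return ni;
}

static void demo_ni_free(struct demo_ni *ni)
{
	free(ni);
	demo_live_nis--;
}

/* Frees only what is reachable through the global list. */
static void demo_shutdown_nis(void)
{
	while (demo_ni_list != NULL) {
		struct demo_ni *ni = demo_ni_list;

		demo_ni_list = ni->next;
		demo_ni_free(ni);
	}
}

/*
 * Takes ownership of ni. The fail_early and fail_late flags are
 * hypothetical switches that simulate errors before and after linking.
 */
static int demo_startup_ni(struct demo_ni *ni, bool fail_early, bool fail_late)
{
	if (ni == NULL)
		return -1;

	if (fail_early) {
		/* Not on any list yet: shutdown could never find it. */
		demo_ni_free(ni);
		return -1;
	}

	ni->next = demo_ni_list;
	demo_ni_list = ni;

	if (fail_late) {
		/* Already linked, so the list walk reclaims it. */
		demo_shutdown_nis();
		return -1;
	}
	return 0;
}
```

In this sketch the late failure path tears down every started interface, which is an assumption made to keep the example short rather than a statement about what the landed patch does.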
| Comment by Isaac Huang (Inactive) [ 13/Oct/14 ] |
|
I think the fix should be considered together with |
| Comment by Wang Shilong (Inactive) [ 20/Oct/14 ] |
|
Hi Isaac Huang, I am sorry to bother you. Could you please help review the new version? Best regards, |
| Comment by Peter Jones [ 30/Oct/14 ] |
|
Landed for 2.7 |
| Comment by Peter Jones [ 30/Oct/14 ] |
|
Patch reverted. Amir, could you please look into this issue? Thanks |
| Comment by Gerrit Updater [ 23/Nov/14 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12512/ |