[LU-13638] lnet: discard the callback Created: 05/Jun/20  Updated: 08/Jun/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Yang Sheng Assignee: Yang Sheng
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File lctl-debug-lnet-es400nvx1-vm3.txt.gz     File lctl-debug-lnet-es400nvx1-vm4.txt.gz    
Issue Links:
Related
is related to LU-14499 o2iblnd: LU-13368 changes cause shutd... Open
is related to LU-13534 Landing an LU-12678 high likely intro... Resolved
is related to LU-13368 lnet may be trying to use deleted rou... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We need to discard the callback in some cases.



 Comments   
Comment by Yang Sheng [ 05/Jun/20 ]

Patch submitted to: https://review.whamcloud.com/#/c/38845/

Comment by Gerrit Updater [ 10/Dec/20 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40937
Subject: LU-13638 ptlrpc: addition change for previous commit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8685dd43ec7fae51455a03433086265b0ccfad50

Comment by Shuichi Ihara [ 09/Mar/21 ]

Patch https://review.whamcloud.com/#/c/38845/ introduced an issue where ko2iblnd_shutdown never completes.
A reproducible test case is below.

  1. Start Lustre with LNet Multi-Rail (LNET-MR) on the InfiniBand network
  2. Turn off two IB ports on one of the OSSs
  3. Unmount the OSTs on that OSS (simulating OSS failover)
  4. Bring the two IB ports back up
  5. Remount the OSTs on that OSS (simulating OSS failback)
  6. Stop all Lustre services and unload all Lustre modules (lustre_rmmod)

When the Lustre modules were unloaded on all OSSs, shutdown never completed on some (or all) of the OSSs because it hung in ko2iblnd_shutdown. I also tried the second patch, https://review.whamcloud.com/40937, but the problem still occurred.

By the way, with the LU-14499 patch applied on the servers (which reverts the LU-13638 patch), this shutdown problem went away.

Comment by Shuichi Ihara [ 09/Mar/21 ]

Attached are debug logs captured with the "net" debug flag enabled. Four servers (10.0.11.22[4-7]@o2ib) were used for the test. I captured debug logs from the two servers that never completed shutdown.

Comment by Gerrit Updater [ 09/Mar/21 ]

Yang Sheng (ys@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41970
Subject: LU-13638 o2ib: test patch
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ca80d77843ebdab6963745fc474be2e3b8985aab

Comment by Yang Sheng [ 09/Mar/21 ]

Hi, Shuichi,

Could you please test with patch https://review.whamcloud.com/#/c/41970/? It should be stacked on top of https://review.whamcloud.com/#/c/40937/. TIA.

Thanks,
YangSheng

Comment by Yang Sheng [ 08/Jun/21 ]

Hi, Shuichi,

Have you had a chance to verify that the patch fixes the rmmod issue?

Thanks,
YangSheng

Generated at Sat Feb 10 03:02:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.