[LU-10124] lnetctl: lnetctl import --add not importing peers correctly Created: 16/Oct/17  Updated: 09/Jun/20  Resolved: 21/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0
Fix Version/s: Lustre 2.12.0, Lustre 2.10.7

Type: Bug Priority: Minor
Reporter: Malcolm Haak - NCI (Inactive) Assignee: Sonia Sharma (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet, lnetctl
Environment:

Centos 7.4


Issue Links:
Related
Epic/Theme: lnet, lustre-2.10.1
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When importing a YAML config file for peers, the import does not correctly set the Multi-Rail property when it is False.

An example:

peer:
    - primary nid: 10.112.1.60@o2ib8
      Multi-Rail: False
      peer ni:
        - nid: 10.112.1.60@o2ib8
          state: up

When imported it results in a running config of:

peer:
    - primary nid: 10.112.1.60@o2ib8
      Multi-Rail: True
      peer ni:
        - nid: 10.112.1.60@o2ib8
          state: up

This isn't a problem for our config yet, but it could become one once we have a mix of Multi-Rail and non-Multi-Rail nodes.
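One way to spot this after an import is to compare each peer's Multi-Rail flag in the YAML file against the `lnetctl peer show` output. The following is a minimal sketch, assuming both texts use exactly the layout shown above. The helper names `parse_multi_rail` and `multi_rail_mismatches` are hypothetical, and the line-based parsing is a stand-in for a real YAML parser.

```python
import re


def parse_multi_rail(text):
    """Map each primary NID to its Multi-Rail flag (True/False).

    Line-based parsing that only understands the layout shown in this
    ticket; it is not a general YAML parser.
    """
    flags = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*-\s*primary nid:\s*(\S+)", line)
        if m:
            current = m.group(1)
            continue
        m = re.match(r"\s*Multi-Rail:\s*(\S+)", line)
        if m and current is not None:
            flags[current] = m.group(1).lower() == "true"
    return flags


def multi_rail_mismatches(wanted_text, running_text):
    """Return (nid, wanted, running) for peers whose flag changed."""
    wanted = parse_multi_rail(wanted_text)
    running = parse_multi_rail(running_text)
    return [(nid, flag, running[nid])
            for nid, flag in sorted(wanted.items())
            if nid in running and running[nid] != flag]


# The file that was imported (from the description above).
WANTED = """\
peer:
    - primary nid: 10.112.1.60@o2ib8
      Multi-Rail: False
      peer ni:
        - nid: 10.112.1.60@o2ib8
          state: up
"""
# The running config reported after import: same peer, flag flipped.
RUNNING = WANTED.replace("Multi-Rail: False", "Multi-Rail: True")

print(multi_rail_mismatches(WANTED, RUNNING))
# → [('10.112.1.60@o2ib8', False, True)]
```

In practice the second argument would be the captured output of `lnetctl peer show` taken right after the import.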



 Comments   
Comment by Malcolm Haak - NCI (Inactive) [ 22/Oct/17 ]

We have encountered a second issue when importing peers.

Two peers, each with a single primary NID, were merged during import into one peer with two peer NIs.

The YAML file had been exported from the same LNet router it was then imported on.

Not sure if it is a race condition when importing 4000+ peers or some other issue.

Comment by James Nunez (Inactive) [ 20/Dec/17 ]

Sonia,
Would you please look into this issue?

Thank you.

Comment by Gerrit Updater [ 02/Feb/18 ]

Sonia Sharma (sonia.sharma@intel.com) uploaded a new patch: https://review.whamcloud.com/31138
Subject: LU-10124 lnet: Correctly add peer MR value while importing
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ddd0aedbe7a26053ae9c3da2c82c9e7a3e5a7c76

Comment by Malcolm Haak - NCI (Inactive) [ 17/Apr/18 ]

This patch should fix the first issue, but the second issue of peer merging does not appear to be solved.

Did you want a second ticket for this issue?

I will get you some log lines from the affected nodes.

Comment by Kim Sebo [ 17/Apr/18 ]

The log line on the LNet router is:

LNetError: 8507:0:(peer.c:806:lnet_add_peer_ni_to_prim_lpni()) Cannot add NID 10.9.60.1@o2ib3 owned by peer 10.9.60.1@o2ib3 to peer 10.9.12.38@o2ib3

The two 10.9.x.x addresses mentioned correspond to adjacent entries in the config file.
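The log line names three things: the NID being added, the peer that already owns it, and the peer it was being added to. Below is a minimal sketch for pulling those fields out of router logs, assuming the message format is exactly as quoted above. The regular expression and the name `parse_add_peer_error` are illustrative and not part of Lustre.

```python
import re

# Matches only the message text quoted in this ticket; other LNet
# versions may word it differently.
PATTERN = re.compile(
    r"Cannot add NID (?P<nid>\S+) owned by peer (?P<owner>\S+) "
    r"to peer (?P<target>\S+)"
)


def parse_add_peer_error(line):
    """Return the NID, its current owner and the target peer, or None."""
    m = PATTERN.search(line)
    return m.groupdict() if m else None


line = ("LNetError: 8507:0:(peer.c:806:lnet_add_peer_ni_to_prim_lpni()) "
        "Cannot add NID 10.9.60.1@o2ib3 owned by peer 10.9.60.1@o2ib3 "
        "to peer 10.9.12.38@o2ib3")
print(parse_add_peer_error(line))
# → {'nid': '10.9.60.1@o2ib3', 'owner': '10.9.60.1@o2ib3', 'target': '10.9.12.38@o2ib3'}
```

Collecting the (owner, target) pairs across a whole log makes it easier to check whether the affected peers are always adjacent entries in the config file.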

Comment by Sonia Sharma (Inactive) [ 17/Apr/18 ]

Is the issue happening even after applying the patch?
When the MR value is correctly imported, it would know that the peer is MR and thus another NID should be merged to the same peer.

Comment by Malcolm Haak - NCI (Inactive) [ 18/Apr/18 ]

I think you misunderstand. It's merging peers that AREN'T supposed to be merged.

Say peer A is in the file with a nid of 10.9.12.38@o2ib3:

peer:
    - primary nid: 10.9.12.38@o2ib3
      Multi-Rail: True
      peer ni:
        - nid: 10.9.12.28@o2ib3
          state: up

and peer B is next in the YAML file with a nid of 10.9.60.1@o2ib3

peer:
    - primary nid: 10.9.60.1@o2ib3
      Multi-Rail: False
      peer ni:
        - nid: 10.9.60.1@o2ib3
          state: up

So the resulting peer config YAML file should look like

peer:
    - primary nid: 10.9.12.38@o2ib3
      Multi-Rail: False
      peer ni:
        - nid: 10.9.12.38@o2ib3
          state: up
    - primary nid: 10.9.60.1@o2ib3
      Multi-Rail: False
      peer ni:
        - nid: 10.9.60.1@o2ib3
          state: up

It's trying to add 10.9.60.1@o2ib3 as an extra peer ni to 10.9.12.38@o2ib3.

This is wrong. They are separate peers. I've checked the YAML file and they are both described in YAML correctly. There is something wrong with the YAML parser that is causing it to not parse correctly.
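To check a running config for this kind of unwanted merge, one could look for primary NIDs from the file that show up as a peer NI under a different primary. The sketch below assumes both configurations have already been loaded into dicts mapping a primary NID to the list of its peer NI NIDs. The function name `find_merged_peers` and the `actual` data are hypothetical.

```python
def find_merged_peers(expected, actual):
    """Report primaries from `expected` that ended up under another peer.

    Both arguments map a primary NID to the list of its peer NI NIDs.
    Returns (expected_primary, owning_primary) pairs.
    """
    # Reverse index: which primary owns each NID in the running config.
    owner = {}
    for primary, nids in actual.items():
        for nid in nids:
            owner[nid] = primary

    merged = []
    for primary in sorted(expected):
        got = owner.get(primary)
        if got is not None and got != primary:
            merged.append((primary, got))
    return merged


# What the YAML file describes: two separate single-NID peers.
expected = {
    "10.9.12.38@o2ib3": ["10.9.12.38@o2ib3"],
    "10.9.60.1@o2ib3": ["10.9.60.1@o2ib3"],
}
# Hypothetical bad result: peer B folded into peer A as an extra peer NI.
actual = {
    "10.9.12.38@o2ib3": ["10.9.12.38@o2ib3", "10.9.60.1@o2ib3"],
}

print(find_merged_peers(expected, actual))
# → [('10.9.60.1@o2ib3', '10.9.12.38@o2ib3')]
```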

Comment by Sonia Sharma (Inactive) [ 18/Apr/18 ]

Oh okay. I noticed that I never updated the patch, which had an issue. I have just done that.

And now with the patch, I just tried it on my system and I could not replicate the issue.

[root@lutfRtr1-linux ~]# lnetctl ping 10.211.55.9@tcp
ping:
    - primary nid: 10.211.55.9@tcp
      Multi-Rail: False
      peer ni:
        - nid: 10.211.55.9@tcp

[root@lutfRtr1-linux lustre-release]# lnetctl peer add --prim_nid 10.9.60.24@tcp

[root@lutfRtr1-linux lustre-release]# lnetctl peer show
peer:
    - primary nid: 10.211.55.9@tcp
      Multi-Rail: False
      peer ni:
        - nid: 10.211.55.9@tcp
          state: NA
    - primary nid: 10.9.60.24@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.9.60.24@tcp
          state: NA

[root@lutfRtr1-linux lustre-release]# lnetctl export > out.yaml

[root@lutfRtr1-linux lustre-release]# lnetctl peer show
peer:

[root@lutfRtr1-linux lustre-release]# lnetctl import < out.yaml

[root@lutfRtr1-linux lustre-release]# lnetctl peer show
peer:
    - primary nid: 10.211.55.9@tcp
      Multi-Rail: False
      peer ni:
        - nid: 10.211.55.9@tcp
          state: NA
    - primary nid: 10.9.60.24@tcp
      Multi-Rail: True
      peer ni:
        - nid: 10.9.60.24@tcp
          state: NA

How are you adding peers? Can you list the commands you are running to add peers?

I tried both ways, using the "lnetctl" command and running traffic, and was able to import peers correctly in each case.

Comment by Malcolm Haak - NCI (Inactive) [ 20/Apr/18 ]

I think it's a race condition. It happens because we are importing ~4000 nodes while in production (so alongside normal discovery).

I doubt you will trigger it with two.
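For anyone trying to reproduce this at scale without the production file, a peer section of similar size can be generated. This is only a sketch: the addresses are synthetic, the default network name is borrowed from the log line above, and `make_peer_yaml` is a hypothetical helper. It mimics the layout of the exported YAML shown in this ticket and nothing more.

```python
def make_peer_yaml(count, net="o2ib3", multi_rail=False):
    """Build an lnetctl-style peer section with `count` single-NID peers.

    Addresses are made up: 10.a.b.c with c in 1..254, counted upward.
    """
    lines = ["peer:"]
    for i in range(count):
        host = i % 254 + 1            # last octet, avoids .0 and .255
        third = (i // 254) % 256
        second = i // (254 * 256)
        nid = "10.{}.{}.{}@{}".format(second, third, host, net)
        lines += [
            "    - primary nid: " + nid,
            "      Multi-Rail: " + str(multi_rail),
            "      peer ni:",
            "        - nid: " + nid,
            "          state: up",
        ]
    return "\n".join(lines) + "\n"


text = make_peer_yaml(4000)
print(text.count("primary nid:"))
# → 4000
```

The resulting text could be written to a file and fed to `lnetctl import` on a test router while discovery traffic is running, which is the condition suspected here.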

Comment by Sonia Sharma (Inactive) [ 02/May/18 ]

Hi Malcolm,

Can you please attach the YAML file you are using for configuration here? We can try reproducing the issue using that YAML file.

Thanks

Comment by Gerrit Updater [ 02/May/18 ]

Sonia Sharma (sonia.sharma@intel.com) uploaded a new patch: https://review.whamcloud.com/32255
Subject: LU-10124 lnet: Correctly add peer MR value while importing
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: c6fcf5a01fa4da0b026498b16927fa6c86cc1918

Comment by Sonia Sharma (Inactive) [ 02/May/18 ]

Just pushed the back-ported patch for b2_10 to make it easy for you to apply the patch and test.

Comment by Gerrit Updater [ 21/Sep/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/31138/
Subject: LU-10124 lnet: Correctly add peer MR value while importing
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 99494a28e6afde7c59e7f03045e63028ece1064d

Comment by Peter Jones [ 21/Sep/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 02/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/32255/
Subject: LU-10124 lnet: Correctly add peer MR value while importing
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 8103e94c1bd3000bc25da0d05f0ef3cafa1f91fd
