[LU-10384] Replace nids doesn't add failover nid and add_conn string to config Created: 14/Dec/17  Updated: 19/Apr/19  Resolved: 04/Jan/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Critical
Reporter: Artem Blagodarenko (Inactive) Assignee: Artem Blagodarenko (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
Related
is related to LUDOC-523 add proper documentation for replace_... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There is an MDT-MDT connection problem after failover.

Here is the config for MDT0000 (replace_nids was not applied to it):

#10 (224)marker  17 (flags=0x01, v2.5.1.0) lustre-MDT0000  ‘add osp’ Wed Aug 30 10:34:36 2017-
#11 (088)add_uuid  nid=192.168.0.112@tcp(0x20000c0a80070)  0:  1:192.168.0.112@tcp  
#12 (144)attach    0:lustre-MDT0000-osp-MDT0001  1:osp  2:lustre-MDT0001-mdtlov_UUID  
#13 (152)setup     0:lustre-MDT0000-osp-MDT0001  1:lustre-MDT0000_UUID  2:192.168.0.112@tcp  
#14 (088)add_uuid  nid=192.168.0.113@tcp(0x20000c0a80071)  0:  1:192.168.0.113@tcp  
#15 (120)add_conn  0:lustre-MDT0000-osp-MDT0001  1:192.168.0.113@tcp  
#16 (136)modify_mdc_tgts add 0:lustre-MDT0001-mdtlov  1:lustre-MDT0000_UUID  2:0  3:1  
#17 (224)END   marker  17 (flags

And here is the MDT0001 config after replace_nids:

#19 (224)marker  20 (flags=0x01, v2.5.1.0) lustre-MDT0001  ‘add osp’ Wed Aug 30 10:34:36 2017-
#20 (088)add_uuid  nid=192.168.0.113@tcp(0x20000c0a80071)  0:  1:192.168.0.113@tcp  
#21 (144)attach    0:lustre-MDT0001-osp-MDT0000  1:osp  2:lustre-MDT0000-mdtlov_UUID  
#22 (152)setup     0:lustre-MDT0001-osp-MDT0000  1:lustre-MDT0001_UUID  2:192.168.0.113@tcp  
#23 (136)modify_mdc_tgts add 0:lustre-MDT0000-mdtlov  1:lustre-MDT0001_UUID  2:1  3:1  
#24 (224)END   marker  20 (flags=0x02, v2.5.1.0) lustre-MDT0001  ‘add osp’ Wed Aug 30 10:34:36 2017-

replace_nids doesn't add the failover nid and the add_conn record to the config. This is the reason the osp connection cannot be established after failover.

The solution is to add an option to replace_nids that adds the failover record.
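
The difference between the two dumps above can be checked mechanically. The following is a minimal Python sketch, not part of the patch. It assumes the record layout shown in the dumps (`#index (size)type args`); the function name, the regular expression, and the sample constants are illustrative only.

```python
import re
from collections import defaultdict

# Matches dump lines such as "#13 (152)setup     0:lustre-MDT0000-osp-MDT0001 ..."
# (layout assumed from the dumps quoted in this ticket).
RECORD_RE = re.compile(r"^#(\d+)\s+\(\d+\)(\w+)\s+(.*)$")
DEVICE_RE = re.compile(r"(?:^|\s)0:(\S+)")

MDT0000_DUMP = """
#11 (088)add_uuid  nid=192.168.0.112@tcp(0x20000c0a80070)  0:  1:192.168.0.112@tcp
#12 (144)attach    0:lustre-MDT0000-osp-MDT0001  1:osp  2:lustre-MDT0001-mdtlov_UUID
#13 (152)setup     0:lustre-MDT0000-osp-MDT0001  1:lustre-MDT0000_UUID  2:192.168.0.112@tcp
#14 (088)add_uuid  nid=192.168.0.113@tcp(0x20000c0a80071)  0:  1:192.168.0.113@tcp
#15 (120)add_conn  0:lustre-MDT0000-osp-MDT0001  1:192.168.0.113@tcp
"""

MDT0001_DUMP = """
#20 (088)add_uuid  nid=192.168.0.113@tcp(0x20000c0a80071)  0:  1:192.168.0.113@tcp
#21 (144)attach    0:lustre-MDT0001-osp-MDT0000  1:osp  2:lustre-MDT0000-mdtlov_UUID
#22 (152)setup     0:lustre-MDT0001-osp-MDT0000  1:lustre-MDT0001_UUID  2:192.168.0.113@tcp
"""


def osp_devices_missing_failover(dump):
    """Return osp device names that have a setup record but no add_conn record."""
    seen = defaultdict(set)  # device name -> record types seen for it
    for line in dump.splitlines():
        match = RECORD_RE.match(line.strip())
        if not match:
            continue
        rec_type, args = match.group(2), match.group(3)
        if rec_type not in ("setup", "add_conn"):
            continue
        dev = DEVICE_RE.search(args)
        if dev and "-osp-" in dev.group(1):
            seen[dev.group(1)].add(rec_type)
    return sorted(name for name, types in seen.items() if "add_conn" not in types)


print(osp_devices_missing_failover(MDT0000_DUMP))  # []
print(osp_devices_missing_failover(MDT0001_DUMP))  # ['lustre-MDT0001-osp-MDT0000']
```

The check only looks at setup and add_conn records, because the missing add_conn (together with its add_uuid) is exactly what distinguishes the MDT0001 config from the MDT0000 one.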



 Comments   
Comment by Gerrit Updater [ 21/Dec/17 ]

Artem Blagodarenko (c17828@cray.com) uploaded a new patch: https://review.whamcloud.com/30624
Subject: LU-10384 mgs: replace_nids large string and failover support
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e5ac0db60e4d7772877f35b7a49b079f7031ccd3

Comment by Artem Blagodarenko (Inactive) [ 12/Feb/18 ]

Hello James_Nunez,
I see the following messages in the logs:

[16136.542599] LustreError: 10477:0:(mgc_request.c:1576:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 10.2.8.156@tcp
[16136.543707] Lustre: 10477:0:(mgc_request.c:1802:mgc_process_recover_nodemap_log()) MGC10.2.8.156@tcp: error processing recovery log lustre-mdtir: rc = -2
[16136.545012] LustreError: 10477:0:(mgc_request.c:2132:mgc_process_log()) MGC10.2.8.156@tcp: recover log lustre-mdtir failed, not fatal: rc = -2

10.2.8.156 is the new address that was applied by lctl replace_nids:

[16118.896695] Lustre: DEBUG MARKER: lctl replace_nids lustre-MDT0000 10.2.8.156@tcp
[16119.429006] Lustre: DEBUG MARKER: lctl replace_nids lustre-MDT0001 10.2.8.156@tcp
[16119.748693] Lustre: DEBUG MARKER: lctl replace_nids lustre-OST0000 10.2.8.156@tcp
[16120.067065] Lustre: DEBUG MARKER: lctl replace_nids lustre-OST0001 10.2.8.156@tcp
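
The correlation described above (the NID in the error is the one set by replace_nids) can be confirmed from a saved kernel log. This is a small Python sketch under stated assumptions: the message formats are taken from the excerpts in this comment, the function and constant names are hypothetical, and splitting the NID argument on commas assumes the comma-separated NID list form of replace_nids.

```python
import re

# Patterns assumed from the log excerpts quoted in this comment.
MARKER_RE = re.compile(r"DEBUG MARKER: lctl replace_nids (\S+) (\S+)")
ERROR_RE = re.compile(r"cannot find uuid by nid (\S+)")

SAMPLE_LOG = """
[16118.896695] Lustre: DEBUG MARKER: lctl replace_nids lustre-MDT0000 10.2.8.156@tcp
[16119.429006] Lustre: DEBUG MARKER: lctl replace_nids lustre-MDT0001 10.2.8.156@tcp
[16119.748693] Lustre: DEBUG MARKER: lctl replace_nids lustre-OST0000 10.2.8.156@tcp
[16120.067065] Lustre: DEBUG MARKER: lctl replace_nids lustre-OST0001 10.2.8.156@tcp
[16136.542599] LustreError: 10477:0:(mgc_request.c:1576:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 10.2.8.156@tcp
"""


def nids_replaced_and_unresolved(log):
    """Return NIDs that were set by replace_nids and later reported as unresolvable."""
    replaced = set()
    for match in MARKER_RE.finditer(log):
        # Assumption: several NIDs may be given as a comma-separated list.
        replaced.update(match.group(2).split(","))
    unresolved = {match.group(1) for match in ERROR_RE.finditer(log)}
    return replaced & unresolved


print(sorted(nids_replaced_and_unresolved(SAMPLE_LOG)))  # ['10.2.8.156@tcp']
```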

I cannot yet be sure that my patch has no influence on the 108a and 108b test failures. I am going to investigate the test hangs in this issue. Because I have no ready ZFS-based installation here, and the testing system can easily reproduce this issue, could I ask you to support me by sharing some extra files?
First, I need the files from the CONFIGS directory on the MGS to check whether replace_nids was processed correctly.

Thanks,

Comment by Artem Blagodarenko (Inactive) [ 16/Apr/18 ]

jamesanunez, Maloo set -1 on my patch. I checked locally: config_sanity test_32d fails with "rmmod: ERROR: Module zfs is in use" both with and without my patch. test_75 always passes on my local box (with and without my patch).

Can you verify the patch? Thanks.

Comment by Artem Blagodarenko (Inactive) [ 16/Apr/18 ]

Same result for test_32a: it fails both with and without my patch.

Comment by Gerrit Updater [ 04/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/30624/
Subject: LU-10384 mgs: replace_nids large string and failover support
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 09da1564d3794ca7b82e1c1791da253bee6178d4

Comment by Peter Jones [ 04/Jan/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 25/Feb/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34296
Subject: LU-10384 mgs: replace_nids large string and failover support
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7bc57e9b3bf9f4d77126133d68e40c8031934385

Comment by Gerrit Updater [ 19/Mar/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34296/
Subject: LU-10384 mgs: replace_nids large string and failover support
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 97654820f8a60b1752b92b79cd6bc254a4e48958
