[LU-14207] Replace_nids left old nids in add_conn field of failnid section of client llog Created: 10/Dec/20  Updated: 26/Feb/21  Resolved: 26/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Major
Reporter: Artem Blagodarenko (Inactive) Assignee: Artem Blagodarenko (Inactive)
Resolution: Fixed Votes: 0
Labels: patch

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

OST doesn't start after replace_nids

00000020:01000004:19.0:1602101845.630802:0:58310:0:(obd_mount.c:193:lustre_start_simple()) Starting obd cslmo7fs-MDT0000-lwp-OST0001 (typ=lwp)
00000020:00000080:19.0:1602101845.630804:0:58310:0:(obd_config.c:1128:class_process_config()) processing cmd: cf001
00000020:00000080:19.0:1602101845.630811:0:58310:0:(genops.c:451:class_newdev()) Allocate new device cslmo7fs-MDT0000-lwp-OST0001 (ffff9a88fc608000)
00000020:00000080:19.0:1602101845.630854:0:58310:0:(obd_config.c:431:class_attach()) OBD: dev 4 attached type lwp with refcount 1
00000020:00000080:19.0:1602101845.630855:0:58310:0:(obd_config.c:1128:class_process_config()) processing cmd: cf003
00010000:00080000:19.0:1602101845.630919:0:58310:0:(ldlm_lib.c:115:import_set_conn()) imp ffff9a88fc610000@cslmo7fs-MDT0000-lwp-OST0001: add connection 172.17.8.53@o2ib at head
00000020:00000080:19.0:1602101845.631562:0:58310:0:(obd_config.c:538:class_setup()) finished setup of obd cslmo7fs-MDT0000-lwp-OST0001 (uuid cslmo7fs-MDT0000-lwp-OST0001_UUID)
00000010:01000000:19.0:1602101845.631571:0:58310:0:(lwp_dev.c:504:lwp_obd_connect()) connect #0
00000020:00000080:19.0:1602101845.631575:0:58310:0:(genops.c:1421:class_connect()) connect: client cslmo7fs-MDT0000-lwp-OST0001, cookie 0xf1eaeee46a1899ae
00000100:00080000:19.0:1602101845.631579:0:58310:0:(import.c:543:import_select_connection()) cslmo7fs-MDT0000-lwp-OST0001: connect to NID 172.17.8.53@o2ib last attempt 0
00000100:00080000:19.0:1602101845.631581:0:58310:0:(import.c:619:import_select_connection()) cslmo7fs-MDT0000-lwp-OST0001: import ffff9a88fc610000 using connection 172.17.8.53@o2ib/172.17.8.53@o2ib
00000100:00080000:19.0:1602101845.631595:0:58310:0:(pinger.c:376:ptlrpc_pinger_add_import()) adding pingable import cslmo7fs-MDT0000-lwp-OST0001_UUID->cslmo7fs-MDT0000_UUID
00010000:00080000:19.0:1602101845.631661:0:58310:0:(ldlm_lib.c:115:import_set_conn()) imp ffff9a88fc610000@cslmo7fs-MDT0000-lwp-OST0001: add connection 172.17.8.52@o2ib at tail
00000100:00000100:19.0:1602101845.631669:0:58310:0:(client.c:97:ptlrpc_uuid_to_connection()) cannot find peer 172.17.7.52@o2ib!
00000100:00080000:11.0F:1602101845.631670:0:58313:0:(client.c:1631:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING)  req@ffff9a8e57656300 x1679925143670144/t0(0) o901->cslmo7fs-MDT0000-lwp-OST0001@172.17.8.53@o2ib:29/10 lens 248/4320 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 job:''
00010000:00080000:19.0:1602101845.641922:0:58310:0:(ldlm_lib.c:77:import_set_conn()) can't find connection 172.17.7.52@o2ib
00000020:00020000:19.0:1602101845.641924:0:58310:0:(obd_mount_server.c:769:lustre_lwp_add_conn()) cslmo7fs-MDT0000-lwp-OST0001: can't add conn: rc = -2
00000040:00080000:19.0:1602101845.655437:0:58310:0:(llog.c:713:llog_process_thread()) stop processing plain 0x4c:10:0 index 42 count 60
00000020:01000000:7.0:1602101845.655451:0:58201:0:(obd_config.c:1876:class_config_parse_llog()) Processed log cslmo7fs-client gen 1-42 (rc=-2)

The cslmo7fs-client llog still contains old NIDs (172.17.7.x). The new ones are in 172.17.8.x.


#36 (224)marker  15 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
#37 (088)add_uuid  nid=172.17.8.54@o2ib(0x50000ac110836)  0:  1:172.17.8.54@o2ib  
#38 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.7.55@o2ib  
#39 (224)END   marker  15 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
#40 (224)marker  18 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:04:23 2020-
#41 (088)add_uuid  nid=172.17.8.53@o2ib(0x50000ac110835)  0:  1:172.17.8.53@o2ib  
#42 (112)add_conn  0:cslmo7fs-MDT0000-mdc  1:172.17.7.52@o2ib  
#43 (224)END   marker  18 (flags=0x02, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:04:23 2020-
#44 (224)marker  19 (flags=0x01, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:04:26 2020-
#45 (088)add_uuid  nid=172.17.8.55@o2ib(0x50000ac110837)  0:  1:172.17.8.55@o2ib  
#46 (112)add_conn  0:cslmo7fs-OST0001-osc  1:172.17.7.54@o2ib  
#47 (224)END   marker  19 (flags=0x02, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:04:26 2020-
#48 (224)marker  20 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:16:54 2020-
#49 (088)add_uuid  nid=172.17.8.55@o2ib(0x50000ac110837)  0:  1:172.17.8.55@o2ib  
#50 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.8.55@o2ib  
#51 (224)END   marker  20 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:16:54 2020-
#53 (224)marker  23 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:16:54 2020-
#54 (088)add_uuid  nid=172.17.8.52@o2ib(0x50000ac110834)  0:  1:172.17.8.52@o2ib  
#55 (112)add_conn  0:cslmo7fs-MDT0000-mdc  1:172.17.8.52@o2ib  
#56 (224)END   marker  23 (flags=0x02, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:16:54 2020-
#57 (224)marker  24 (flags=0x01, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:17:14 2020-
#58 (088)add_uuid  nid=172.17.8.54@o2ib(0x50000ac110836)  0:  1:172.17.8.54@o2ib  
#59 (112)add_conn  0:cslmo7fs-OST0001-osc  1:172.17.8.54@o2ib  
#60 (224)END   marker  24 (flags=0x02, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:17:14 2020-

Commands such as "--erase-param failover.node --param failover.node=172.17.5.52@o2ib" add sections to the llog files.

#36 (224)marker  15 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
#37 (088)add_uuid  nid=172.17.8.54@o2ib(0x50000ac110836)  0:  1:172.17.8.54@o2ib  
#38 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.7.55@o2ib  
#39 (224)END   marker  15 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
#40 (224)marker  18 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:04:23 2020-
lctl replace_nids processes the "add failnid" section incorrectly: it changes "add_uuid" but leaves add_conn as is.

This should be fixed in replace_nids code.
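
The mismatch described above can be spotted mechanically in a text dump of the llog. The following is a minimal sketch, not part of the fix: it assumes the dump has exactly the record layout shown in this ticket, and the function name find_dangling_add_conn and the SAMPLE text are hypothetical. It reports every add_conn record whose UUID was never registered by an earlier add_uuid record, which is the condition that produced "can't find connection" in the debug log.

```python
import re

# Record layouts assumed from the llog dump quoted in this ticket.
ADD_UUID = re.compile(r"^#(\d+)\s+\(\d+\)add_uuid\s+nid=\S+\s+0:\s+1:(\S+)")
ADD_CONN = re.compile(r"^#(\d+)\s+\(\d+\)add_conn\s+0:(\S+)\s+1:(\S+)")


def find_dangling_add_conn(dump_text):
    """Return (index, device, uuid) for add_conn records whose UUID
    was not registered by any preceding add_uuid record."""
    known = set()
    dangling = []
    for raw in dump_text.splitlines():
        line = raw.strip()
        m = ADD_UUID.match(line)
        if m:
            known.add(m.group(2))
            continue
        m = ADD_CONN.match(line)
        if m:
            index, device, uuid = m.groups()
            if uuid not in known:
                dangling.append((int(index), device, uuid))
    return dangling


# Two sections copied from the dump in this ticket: one stale, one consistent.
SAMPLE = """
#37 (088)add_uuid  nid=172.17.8.54@o2ib(0x50000ac110836)  0:  1:172.17.8.54@o2ib
#38 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.7.55@o2ib
#49 (088)add_uuid  nid=172.17.8.55@o2ib(0x50000ac110837)  0:  1:172.17.8.55@o2ib
#50 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.8.55@o2ib
"""

if __name__ == "__main__":
    for index, device, uuid in find_dangling_add_conn(SAMPLE):
        print(f"record #{index}: {device} references unregistered {uuid}")
```

Run against the full dump, a check like this would flag records #38, #42 and #46, the three add_conn entries that still carry 172.17.7.x NIDs.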

As a workaround, I suggest excluding the "--erase-param failover.node --param failover.node=172.17.5.52@o2ib" parameters from scripts. They are duplicated by a replace_nids command like:

lctl replace_nids cslmo7fs-OST0001 172.17.4.56@o2ib:172.17.4.54@o2ib

The failover NID is placed after ":" here.
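
To make the argument layout concrete, here is a small illustrative sketch of how the NID argument above could be split into primary and failover parts. The helper split_nid_argument is hypothetical and is not how lctl parses it. Treating "," as a separator for several NIDs of the same node is an assumption not stated in this ticket.

```python
def split_nid_argument(arg):
    """Split a replace_nids NID argument into primary and failover groups.

    ':' separates nodes (the first is the primary, the rest are failover).
    ',' separating several NIDs of one node is an assumption.
    """
    nodes = [group.split(",") for group in arg.split(":")]
    if any(not nid for group in nodes for nid in group):
        raise ValueError(f"empty NID in argument: {arg!r}")
    return {"primary": nodes[0], "failover": nodes[1:]}


if __name__ == "__main__":
    print(split_nid_argument("172.17.4.56@o2ib:172.17.4.54@o2ib"))
```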



 Comments   
Comment by Gerrit Updater [ 10/Dec/20 ]

Artem Blagodarenko (artem.blagodarenko@hpe.com) uploaded a new patch: https://review.whamcloud.com/40930
Subject: LU-14207 mgs: delete "add failnid" sections on replace_nids
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: fa782a0ad3c32a6b3b3ab717255a508eab9fd84f

Comment by Gerrit Updater [ 26/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40930/
Subject: LU-14207 mgs: delete "add failnid" sections on replace_nids
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8910291fc5ca71588e865ac2ec3a7fbb881a7082

Comment by Peter Jones [ 26/Feb/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:07:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.