Details
Bug - Resolution: Fixed - Major - None - 3
Description
OST doesn't start after replace_nids
00000020:01000004:19.0:1602101845.630802:0:58310:0:(obd_mount.c:193:lustre_start_simple()) Starting obd cslmo7fs-MDT0000-lwp-OST0001 (typ=lwp)
00000020:00000080:19.0:1602101845.630804:0:58310:0:(obd_config.c:1128:class_process_config()) processing cmd: cf001
00000020:00000080:19.0:1602101845.630811:0:58310:0:(genops.c:451:class_newdev()) Allocate new device cslmo7fs-MDT0000-lwp-OST0001 (ffff9a88fc608000)
00000020:00000080:19.0:1602101845.630854:0:58310:0:(obd_config.c:431:class_attach()) OBD: dev 4 attached type lwp with refcount 1
00000020:00000080:19.0:1602101845.630855:0:58310:0:(obd_config.c:1128:class_process_config()) processing cmd: cf003
00010000:00080000:19.0:1602101845.630919:0:58310:0:(ldlm_lib.c:115:import_set_conn()) imp ffff9a88fc610000@cslmo7fs-MDT0000-lwp-OST0001: add connection 172.17.8.53@o2ib at head
00000020:00000080:19.0:1602101845.631562:0:58310:0:(obd_config.c:538:class_setup()) finished setup of obd cslmo7fs-MDT0000-lwp-OST0001 (uuid cslmo7fs-MDT0000-lwp-OST0001_UUID)
00000010:01000000:19.0:1602101845.631571:0:58310:0:(lwp_dev.c:504:lwp_obd_connect()) connect #0
00000020:00000080:19.0:1602101845.631575:0:58310:0:(genops.c:1421:class_connect()) connect: client cslmo7fs-MDT0000-lwp-OST0001, cookie 0xf1eaeee46a1899ae
00000100:00080000:19.0:1602101845.631579:0:58310:0:(import.c:543:import_select_connection()) cslmo7fs-MDT0000-lwp-OST0001: connect to NID 172.17.8.53@o2ib last attempt 0
00000100:00080000:19.0:1602101845.631581:0:58310:0:(import.c:619:import_select_connection()) cslmo7fs-MDT0000-lwp-OST0001: import ffff9a88fc610000 using connection 172.17.8.53@o2ib/172.17.8.53@o2ib
00000100:00080000:19.0:1602101845.631595:0:58310:0:(pinger.c:376:ptlrpc_pinger_add_import()) adding pingable import cslmo7fs-MDT0000-lwp-OST0001_UUID->cslmo7fs-MDT0000_UUID
00010000:00080000:19.0:1602101845.631661:0:58310:0:(ldlm_lib.c:115:import_set_conn()) imp ffff9a88fc610000@cslmo7fs-MDT0000-lwp-OST0001: add connection 172.17.8.52@o2ib at tail
00000100:00000100:19.0:1602101845.631669:0:58310:0:(client.c:97:ptlrpc_uuid_to_connection()) cannot find peer 172.17.7.52@o2ib!
00000100:00080000:11.0F:1602101845.631670:0:58313:0:(client.c:1631:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING) req@ffff9a8e57656300 x1679925143670144/t0(0) o901->cslmo7fs-MDT0000-lwp-OST0001@172.17.8.53@o2ib:29/10 lens 248/4320 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 job:''
00010000:00080000:19.0:1602101845.641922:0:58310:0:(ldlm_lib.c:77:import_set_conn()) can't find connection 172.17.7.52@o2ib
00000020:00020000:19.0:1602101845.641924:0:58310:0:(obd_mount_server.c:769:lustre_lwp_add_conn()) cslmo7fs-MDT0000-lwp-OST0001: can't add conn: rc = -2
00000040:00080000:19.0:1602101845.655437:0:58310:0:(llog.c:713:llog_process_thread()) stop processing plain 0x4c:10:0 index 42 count 60
00000020:01000000:7.0:1602101845.655451:0:58201:0:(obd_config.c:1876:class_config_parse_llog()) Processed log cslmo7fs-client gen 1-42 (rc=-2)
The cslmo7fs-client llog still contains old NIDs (172.17.7.x). The new NIDs are 172.17.8.x.
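To see quickly which NIDs the mount failed on, the debug log can be scanned for the two lookup-failure messages it contains ("cannot find peer" and "can't find connection"). The following is a minimal sketch; `failed_nids` is a hypothetical helper written for this ticket, not part of any Lustre tool, and it assumes the message wording shown in the log above.

```python
import re

# Matches the NID that follows either lookup-failure message seen in the log.
# The NID is taken as "<address>@<network>", e.g. 172.17.7.52@o2ib.
FAILED_NID_RE = re.compile(
    r"(?:cannot find peer|can't find connection) (\S+?@\w+)"
)


def failed_nids(log_text):
    """Return the sorted, de-duplicated NIDs that could not be resolved."""
    return sorted(set(FAILED_NID_RE.findall(log_text)))
```

Run against the log in this ticket, the helper should report only the stale 172.17.7.x address, which points at the llog rather than at the network.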
#36 (224)marker 15 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct 7 20:04:20 2020-
#37 (088)add_uuid nid=172.17.8.54@o2ib(0x50000ac110836) 0: 1:172.17.8.54@o2ib
#38 (112)add_conn 0:cslmo7fs-OST0000-osc 1:172.17.7.55@o2ib
#39 (224)END marker 15 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct 7 20:04:20 2020-
#40 (224)marker 18 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct 7 20:04:23 2020-
#41 (088)add_uuid nid=172.17.8.53@o2ib(0x50000ac110835) 0: 1:172.17.8.53@o2ib
#42 (112)add_conn 0:cslmo7fs-MDT0000-mdc 1:172.17.7.52@o2ib
#43 (224)END marker 18 (flags=0x02, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct 7 20:04:23 2020-
#44 (224)marker 19 (flags=0x01, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct 7 20:04:26 2020-
#45 (088)add_uuid nid=172.17.8.55@o2ib(0x50000ac110837) 0: 1:172.17.8.55@o2ib
#46 (112)add_conn 0:cslmo7fs-OST0001-osc 1:172.17.7.54@o2ib
#47 (224)END marker 19 (flags=0x02, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct 7 20:04:26 2020-
#48 (224)marker 20 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct 7 20:16:54 2020-
#49 (088)add_uuid nid=172.17.8.55@o2ib(0x50000ac110837) 0: 1:172.17.8.55@o2ib
#50 (112)add_conn 0:cslmo7fs-OST0000-osc 1:172.17.8.55@o2ib
#51 (224)END marker 20 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct 7 20:16:54 2020-
#53 (224)marker 23 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct 7 20:16:54 2020-
#54 (088)add_uuid nid=172.17.8.52@o2ib(0x50000ac110834) 0: 1:172.17.8.52@o2ib
#55 (112)add_conn 0:cslmo7fs-MDT0000-mdc 1:172.17.8.52@o2ib
#56 (224)END marker 23 (flags=0x02, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct 7 20:16:54 2020-
#57 (224)marker 24 (flags=0x01, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct 7 20:17:14 2020-
#58 (088)add_uuid nid=172.17.8.54@o2ib(0x50000ac110836) 0: 1:172.17.8.54@o2ib
#59 (112)add_conn 0:cslmo7fs-OST0001-osc 1:172.17.8.54@o2ib
#60 (224)END marker 24 (flags=0x02, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct 7 20:17:14 2020-
Commands with options like "--erase-param failover.node --param failover.node=172.17.5.52@o2ib" add such "add failnid" sections to the llog files.
#36 (224)marker 15 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct 7 20:04:20 2020-
#37 (088)add_uuid nid=172.17.8.54@o2ib(0x50000ac110836) 0: 1:172.17.8.54@o2ib
#38 (112)add_conn 0:cslmo7fs-OST0000-osc 1:172.17.7.55@o2ib
#39 (224)END marker 15 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct 7 20:04:20 2020-
#40 (224)marker 18 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct 7 20:04:23 2020-
lctl replace_nids processes the "add failnid" section incorrectly: it changes the add_uuid record but leaves the add_conn record as is, so add_conn still points at the old NID.
This should be fixed in the replace_nids code.
As a workaround, I suggest removing the "--erase-param failover.node --param failover.node=172.17.5.52@o2ib" parameters from the scripts. They duplicate what the replace_nids command already does, for example:
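The inconsistency described above (add_uuid updated, add_conn left on the old NID) can be checked mechanically on a llog dump. The following is a minimal sketch, assuming the record layout shown in this ticket ("#index (length)body"); `find_stale_add_conn` is a hypothetical helper for illustration, not part of lctl or any Lustre tool, and it does not reproduce the actual replace_nids logic.

```python
import re

# One llog record: "#<index> (<length>)<body>", running up to the next record.
RECORD_RE = re.compile(r"#(\d+) \((\d+)\)(.*?)(?=#\d+ \(\d+\)|\Z)", re.DOTALL)
ADD_UUID_RE = re.compile(r"add_uuid\s+nid=([^\s(]+)")
ADD_CONN_RE = re.compile(r"add_conn\s+0:(\S+)\s+1:(\S+)")


def find_stale_add_conn(dump):
    """Return (index, device, add_conn_nid, add_uuid_nid) for every add_conn
    record whose NID differs from the add_uuid record that precedes it inside
    the same marker section."""
    stale = []
    current_uuid_nid = None
    for match in RECORD_RE.finditer(dump):
        index = int(match.group(1))
        body = match.group(3).strip()
        # A marker (start or END) opens or closes a section: forget the NID.
        if body.startswith("marker") or body.startswith("END marker"):
            current_uuid_nid = None
            continue
        uuid = ADD_UUID_RE.match(body)
        if uuid:
            current_uuid_nid = uuid.group(1)
            continue
        conn = ADD_CONN_RE.match(body)
        if conn and current_uuid_nid and conn.group(2) != current_uuid_nid:
            stale.append((index, conn.group(1), conn.group(2), current_uuid_nid))
    return stale
```

Because records are located by their "#index (length)" prefix rather than by line breaks, the sketch works on the dump whether it is pasted one record per line or as a single run of text.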
lctl replace_nids cslmo7fs-OST0001 172.17.4.56@o2ib:172.17.4.54@o2ib
The failover NID is placed after the ":" here.
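The layout of that NID argument can be illustrated with a small parser. This is only a sketch under stated assumptions: the ":" separator between primary and failover comes from the note above, while treating "," as a separator for several NIDs of one node is an assumption based on the usual Lustre NID-list syntax. `split_nid_argument` is a hypothetical name and does not handle addresses that themselves contain ":".

```python
def split_nid_argument(arg):
    """Split a replace_nids NID argument into primary and failover NIDs.

    ":" separates nodes (the first node is the primary, the rest are
    failover); "," is assumed to separate multiple NIDs of one node.
    """
    nodes = [part.split(",") for part in arg.split(":")]
    return {
        "primary": nodes[0],
        "failover": [nid for node in nodes[1:] for nid in node],
    }
```

For the command above, the primary list would hold 172.17.4.56@o2ib and the failover list 172.17.4.54@o2ib.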