Lustre / LU-14207

replace_nids leaves old NIDs in the add_conn field of the "add failnid" section of the client llog


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0
    • None
    • 3

    Description

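      The OST does not start after replace_nids. The debug log shows the LWP device failing to add a connection to the old NID 172.17.7.52@o2ib (rc = -2):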

      00000020:01000004:19.0:1602101845.630802:0:58310:0:(obd_mount.c:193:lustre_start_simple()) Starting obd cslmo7fs-MDT0000-lwp-OST0001 (typ=lwp)
      00000020:00000080:19.0:1602101845.630804:0:58310:0:(obd_config.c:1128:class_process_config()) processing cmd: cf001
      00000020:00000080:19.0:1602101845.630811:0:58310:0:(genops.c:451:class_newdev()) Allocate new device cslmo7fs-MDT0000-lwp-OST0001 (ffff9a88fc608000)
      00000020:00000080:19.0:1602101845.630854:0:58310:0:(obd_config.c:431:class_attach()) OBD: dev 4 attached type lwp with refcount 1
      00000020:00000080:19.0:1602101845.630855:0:58310:0:(obd_config.c:1128:class_process_config()) processing cmd: cf003
      00010000:00080000:19.0:1602101845.630919:0:58310:0:(ldlm_lib.c:115:import_set_conn()) imp ffff9a88fc610000@cslmo7fs-MDT0000-lwp-OST0001: add connection 172.17.8.53@o2ib at head
      00000020:00000080:19.0:1602101845.631562:0:58310:0:(obd_config.c:538:class_setup()) finished setup of obd cslmo7fs-MDT0000-lwp-OST0001 (uuid cslmo7fs-MDT0000-lwp-OST0001_UUID)
      00000010:01000000:19.0:1602101845.631571:0:58310:0:(lwp_dev.c:504:lwp_obd_connect()) connect #0
      00000020:00000080:19.0:1602101845.631575:0:58310:0:(genops.c:1421:class_connect()) connect: client cslmo7fs-MDT0000-lwp-OST0001, cookie 0xf1eaeee46a1899ae
      00000100:00080000:19.0:1602101845.631579:0:58310:0:(import.c:543:import_select_connection()) cslmo7fs-MDT0000-lwp-OST0001: connect to NID 172.17.8.53@o2ib last attempt 0
      00000100:00080000:19.0:1602101845.631581:0:58310:0:(import.c:619:import_select_connection()) cslmo7fs-MDT0000-lwp-OST0001: import ffff9a88fc610000 using connection 172.17.8.53@o2ib/172.17.8.53@o2ib
      00000100:00080000:19.0:1602101845.631595:0:58310:0:(pinger.c:376:ptlrpc_pinger_add_import()) adding pingable import cslmo7fs-MDT0000-lwp-OST0001_UUID->cslmo7fs-MDT0000_UUID
      00010000:00080000:19.0:1602101845.631661:0:58310:0:(ldlm_lib.c:115:import_set_conn()) imp ffff9a88fc610000@cslmo7fs-MDT0000-lwp-OST0001: add connection 172.17.8.52@o2ib at tail
      00000100:00000100:19.0:1602101845.631669:0:58310:0:(client.c:97:ptlrpc_uuid_to_connection()) cannot find peer 172.17.7.52@o2ib!
      00000100:00080000:11.0F:1602101845.631670:0:58313:0:(client.c:1631:ptlrpc_send_new_req()) @@@ req waiting for recovery: (FULL != CONNECTING)  req@ffff9a8e57656300 x1679925143670144/t0(0) o901->cslmo7fs-MDT0000-lwp-OST0001@172.17.8.53@o2ib:29/10 lens 248/4320 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 job:''
      00010000:00080000:19.0:1602101845.641922:0:58310:0:(ldlm_lib.c:77:import_set_conn()) can't find connection 172.17.7.52@o2ib
      00000020:00020000:19.0:1602101845.641924:0:58310:0:(obd_mount_server.c:769:lustre_lwp_add_conn()) cslmo7fs-MDT0000-lwp-OST0001: can't add conn: rc = -2
      00000040:00080000:19.0:1602101845.655437:0:58310:0:(llog.c:713:llog_process_thread()) stop processing plain 0x4c:10:0 index 42 count 60
      00000020:01000000:7.0:1602101845.655451:0:58201:0:(obd_config.c:1876:class_config_parse_llog()) Processed log cslmo7fs-client gen 1-42 (rc=-2)
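The NIDs involved in the failure can be pulled out of a debug log like the one above with a small script. This is a minimal sketch: the function name `summarize_connections` is hypothetical, and the regular expressions assume the exact message wording printed by `import_set_conn()` in this log.

```python
import re

# Patterns assume the message text seen in the debug log above.
ADD_CONN_RE = re.compile(
    r"import_set_conn\(\)\) imp \S+: add connection (\S+) at (?:head|tail)"
)
CANT_FIND_RE = re.compile(r"import_set_conn\(\)\) can't find connection (\S+)")


def summarize_connections(log_lines):
    """Return (added, missing): NIDs that were added and NIDs that were not found."""
    added, missing = [], []
    for line in log_lines:
        match = ADD_CONN_RE.search(line)
        if match:
            added.append(match.group(1))
            continue
        match = CANT_FIND_RE.search(line)
        if match:
            missing.append(match.group(1))
    return added, missing
```

Any NID in `missing` that belongs to the old subnet points at a record in the configuration llog that was not rewritten.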
      

      The cslmo7fs-client llog still contains old NIDs (172.17.7.x). The new ones are on 172.17.8.x:

      
      #36 (224)marker  15 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
      #37 (088)add_uuid  nid=172.17.8.54@o2ib(0x50000ac110836)  0:  1:172.17.8.54@o2ib  
      #38 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.7.55@o2ib  
      #39 (224)END   marker  15 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
      #40 (224)marker  18 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:04:23 2020-
      #41 (088)add_uuid  nid=172.17.8.53@o2ib(0x50000ac110835)  0:  1:172.17.8.53@o2ib  
      #42 (112)add_conn  0:cslmo7fs-MDT0000-mdc  1:172.17.7.52@o2ib  
      #43 (224)END   marker  18 (flags=0x02, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:04:23 2020-
      #44 (224)marker  19 (flags=0x01, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:04:26 2020-
      #45 (088)add_uuid  nid=172.17.8.55@o2ib(0x50000ac110837)  0:  1:172.17.8.55@o2ib  
      #46 (112)add_conn  0:cslmo7fs-OST0001-osc  1:172.17.7.54@o2ib  
      #47 (224)END   marker  19 (flags=0x02, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:04:26 2020-
      #48 (224)marker  20 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:16:54 2020-
      #49 (088)add_uuid  nid=172.17.8.55@o2ib(0x50000ac110837)  0:  1:172.17.8.55@o2ib  
      #50 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.8.55@o2ib  
      #51 (224)END   marker  20 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:16:54 2020-
      #53 (224)marker  23 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:16:54 2020-
      #54 (088)add_uuid  nid=172.17.8.52@o2ib(0x50000ac110834)  0:  1:172.17.8.52@o2ib  
      #55 (112)add_conn  0:cslmo7fs-MDT0000-mdc  1:172.17.8.52@o2ib  
      #56 (224)END   marker  23 (flags=0x02, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:16:54 2020-
      #57 (224)marker  24 (flags=0x01, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:17:14 2020-
      #58 (088)add_uuid  nid=172.17.8.54@o2ib(0x50000ac110836)  0:  1:172.17.8.54@o2ib  
      #59 (112)add_conn  0:cslmo7fs-OST0001-osc  1:172.17.8.54@o2ib  
      #60 (224)END   marker  24 (flags=0x02, v2.12.4.2) cslmo7fs-OST0001 'add failnid' Wed Oct  7 20:17:14 2020-
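The mismatch in the dump above can be found mechanically by comparing each add_conn record with the add_uuid record that precedes it in the same "add failnid" section. The following is a minimal sketch, not part of lctl: the function name `find_stale_add_conn` is hypothetical, and the regular expressions assume the record layout shown in this dump.

```python
import re

# Patterns assume the record layout of the llog dump in this report.
UUID_RE = re.compile(r"\)add_uuid\s+nid=\S+\s+0:\s+1:(\S+)")
CONN_RE = re.compile(r"#(\d+)\s+\(\d+\)add_conn\s+0:(\S+)\s+1:(\S+)")
END_RE = re.compile(r"\)END\s+marker")


def find_stale_add_conn(lines):
    """Return (record, device, add_uuid value, add_conn value) for each mismatch."""
    stale = []
    current_uuid = None  # UUID registered by the last add_uuid in this section
    for line in lines:
        match = UUID_RE.search(line)
        if match:
            current_uuid = match.group(1)
            continue
        match = CONN_RE.search(line)
        if match and current_uuid is not None:
            record, device, conn_uuid = match.groups()
            if conn_uuid != current_uuid:
                stale.append((int(record), device, current_uuid, conn_uuid))
            continue
        if END_RE.search(line):
            current_uuid = None  # section closed, forget its add_uuid
    return stale
```

Run against the dump above, this flags records #38, #42 and #46, whose add_conn values are still on 172.17.7.x while the matching add_uuid records are on 172.17.8.x.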
      

      Options such as "--erase-param failover.node --param failover.node=172.17.5.52@o2ib" add new "add failnid" sections to the llog files:

      #36 (224)marker  15 (flags=0x01, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
      #37 (088)add_uuid  nid=172.17.8.54@o2ib(0x50000ac110836)  0:  1:172.17.8.54@o2ib  
      #38 (112)add_conn  0:cslmo7fs-OST0000-osc  1:172.17.7.55@o2ib  
      #39 (224)END   marker  15 (flags=0x02, v2.12.4.2) cslmo7fs-OST0000 'add failnid' Wed Oct  7 20:04:20 2020-
      #40 (224)marker  18 (flags=0x01, v2.12.4.2) cslmo7fs-MDT0000 'add failnid' Wed Oct  7 20:04:23 2020-
      lctl replace_nids processes the "add failnid" section incorrectly: it updates the add_uuid record but leaves the add_conn record unchanged.
      

      This should be fixed in the replace_nids code.
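The expected behaviour can be illustrated with a toy model of one "add failnid" section. This is only a sketch of the intent described in this report, not the actual Lustre code or patch: the dictionary layout and the function name `rewrite_failnid_section` are assumptions made for illustration.

```python
def rewrite_failnid_section(records, new_nid):
    """Return a copy of one 'add failnid' section with every NID set to new_nid.

    Each record is a dict with a "cmd" key. This layout is hypothetical and
    only mirrors the fields visible in the llog dump.
    """
    rewritten = []
    for record in records:
        record = dict(record)  # copy, so the input section is left untouched
        if record["cmd"] == "add_uuid":
            record["nid"] = new_nid
            record["uuid"] = new_nid
        elif record["cmd"] == "add_conn":
            # The step this report says is skipped: add_conn must follow add_uuid.
            record["uuid"] = new_nid
        rewritten.append(record)
    return rewritten


section = [
    {"cmd": "marker", "target": "cslmo7fs-OST0000", "comment": "add failnid"},
    {"cmd": "add_uuid", "nid": "172.17.7.55@o2ib", "uuid": "172.17.7.55@o2ib"},
    {"cmd": "add_conn", "device": "cslmo7fs-OST0000-osc", "uuid": "172.17.7.55@o2ib"},
    {"cmd": "end_marker", "target": "cslmo7fs-OST0000", "comment": "add failnid"},
]
fixed = rewrite_failnid_section(section, "172.17.8.54@o2ib")
```

After the rewrite, the add_uuid and add_conn records in `fixed` both name 172.17.8.54@o2ib, so processing the section no longer refers to a UUID that was never registered.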

      As a workaround, I suggest excluding the "--erase-param failover.node --param failover.node=172.17.5.52@o2ib" parameters from scripts. They duplicate what the replace_nids command already does, for example:

      lctl replace_nids cslmo7fs-OST0001 172.17.4.56@o2ib:172.17.4.54@o2ib
      

      The failover NID is placed after the ":" here.
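A script that builds the replace_nids argument can split it the same way to check what it is about to pass. This is a minimal sketch: the helper `split_nid_spec` is hypothetical, and it assumes IPv4-style NIDs of the form address@network that contain no ":" themselves.

```python
def split_nid_spec(spec):
    """Split 'primary:failover[:failover...]' into (primary, [failovers])."""
    nodes = spec.split(":")
    for nid in nodes:
        # Every NID in this report has the form address@network.
        if "@" not in nid:
            raise ValueError(f"not a NID: {nid!r}")
    return nodes[0], nodes[1:]


primary, failovers = split_nid_spec("172.17.4.56@o2ib:172.17.4.54@o2ib")
print(primary)    # 172.17.4.56@o2ib
print(failovers)  # ['172.17.4.54@o2ib']
```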

          People

            artem_blagodarenko Artem Blagodarenko (Inactive)