  1. Lustre
  2. LU-20439

Lustre server mount problem on MR kfilnd system when primary lnet nid interface is down


    • Bug
    • Resolution: Unresolved
    • Critical
    • None
    • Lustre 2.16.1
    • None
    • 3

      A simulated switch failure test was run on a Lustre file system with a Slingshot fabric and multi-rail kfilnd servers by pulling the cable from one interface on each server. Each server has two links, cxi0 and cxi1.
      The test used two MDS servers (n02, n03) and two OSS servers (n04, n05). The cxi1 cable was pulled on n02 and n05, and the cxi0 cable was pulled on n03 and n04.
      The Lustre file system was started and a client mount was attempted, but the client mount hung. On closer examination, the problem was between the two MDS nodes: mdt0 could not connect to mdt1. The mdt recovery_status on n02 showed:

      [root@kjlmo702 ~]# lctl get_param mdt.*.recovery_status
      mdt.kjlmo7fs-MDT0000.recovery_status=
      status: WAITING
      non-ready MDTs:  0001
      recovery_start: 1782420558
      time_waited: 68395
      [root@kjlmo702 ~]#
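
      To check the same condition across several servers, the recovery_status output above can be parsed with a short script. This is a minimal sketch, not part of any Lustre tooling: the function names and the 600-second threshold (max_wait_s) are hypothetical, and the input is assumed to have exactly the layout printed above.

```python
import re

# Sample text copied from the recovery_status output in this report.
SAMPLE = """\
mdt.kjlmo7fs-MDT0000.recovery_status=
status: WAITING
non-ready MDTs:  0001
recovery_start: 1782420558
time_waited: 68395
"""

HEADER_RE = re.compile(r"^mdt\.(?P<target>[^.]+)\.recovery_status=$")


def parse_recovery_status(text):
    """Return {target: {field: value}} for each target in the output."""
    targets = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            current = targets.setdefault(header.group("target"), {})
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return targets


def stuck_targets(parsed, max_wait_s=600):
    """List targets still WAITING after max_wait_s (hypothetical threshold)."""
    stuck = []
    for target, fields in parsed.items():
        waited = int(fields.get("time_waited", 0))
        if fields.get("status") == "WAITING" and waited > max_wait_s:
            non_ready = fields.get("non-ready MDTs", "").split()
            stuck.append((target, non_ready, waited))
    return stuck


if __name__ == "__main__":
    for target, non_ready, waited in stuck_targets(parse_recovery_status(SAMPLE)):
        print(f"{target}: waiting {waited}s for MDTs {non_ready}")
```

      For the sample above, time_waited is 68395 seconds, roughly 19 hours, so MDT0000 is reported as stuck waiting on MDT 0001.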
      

      Sample osp import and state output from n02 for mdt1:

      [root@kjlmo702 ~]# lctl get_param osp.*.import
      osp.kjlmo7fs-MDT0001-osp-MDT0000.import=
      import:
          name: kjlmo7fs-MDT0001-osp-MDT0000
          target: kjlmo7fs-MDT0001_UUID
          state: DISCONN
          connect_flags: [ lov_index, version, acl, inode_bit_locks, adaptive_timeouts, mds_mds_connection, fid_is_enabled, full20, lfsck, multi_mod_rpcs, bulk_mbits, second_flags ]
          connect_data:
             flags: 0xa2400010450010a2
             instance: 0
             target_version: 2.16.1.1
             target_index: 1
             ibits_known: 0x2
             max_mod_rpcs: 0
          import_flags: [ replayable, connect_tried ]
          connection:
             failover_nids: [ "82@kfi", "0@lo" ]
             nids_stats:
                "82@kfi": { connects: 101, replied: 0, uptodate: false, sec_ago: 2 }
                "0@lo": { connects: 99, replied: 0, uptodate: true, sec_ago: 1 }
             current_connection: "0@lo"
             connection_attempts: 200
             generation: 2
             in-progress_invalidations: 0
             idle: 0 sec
          rpcs:
             inflight: 1
             unregistering: 0
             timeouts: 101
             avg_waittime: 275 usecs
          service_estimates:
             services: 66 sec
             network: 66 sec
          transactions:
             last_replay: 0
             peer_committed: 0
             last_checked: 0
      
      osp.kjlmo7fs-MDT0001-osp-MDT0000.state=
      current_state: DISCONN
      state_history:
       - [ 1782421874, CONNECTING ]
       - [ 1782421875, DISCONN ]
       - [ 1782421875, CONNECTING ]
       - [ 1782421875, DISCONN ]
       - [ 1782421880, CONNECTING ]
       - [ 1782421881, DISCONN ]
       - [ 1782421881, CONNECTING ]
       - [ 1782421881, DISCONN ]
       - [ 1782421899, CONNECTING ]
       - [ 1782421900, DISCONN ]
       - [ 1782421900, CONNECTING ]
       - [ 1782421900, DISCONN ]
       - [ 1782421905, CONNECTING ]
       - [ 1782421907, DISCONN ]
       - [ 1782421907, CONNECTING ]
       - [ 1782421907, DISCONN ]
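
      The nids_stats lines in the import output can be summarized with a small script to show which NIDs have been tried without any reply. This is a sketch under the assumption that the nids_stats lines look exactly like the two shown above; the function name is hypothetical.

```python
import re

# nids_stats lines copied from the osp import output in this report.
NIDS_STATS = '''\
"82@kfi": { connects: 101, replied: 0, uptodate: false, sec_ago: 2 }
"0@lo": { connects: 99, replied: 0, uptodate: true, sec_ago: 1 }
'''

STAT_RE = re.compile(
    r'"(?P<nid>[^"]+)":\s*\{\s*connects:\s*(?P<connects>\d+),'
    r'\s*replied:\s*(?P<replied>\d+)'
)


def unanswered_nids(text):
    """Return {nid: connects} for NIDs that were tried but never replied."""
    result = {}
    for m in STAT_RE.finditer(text):
        connects = int(m["connects"])
        replied = int(m["replied"])
        if connects > 0 and replied == 0:
            result[m["nid"]] = connects
    return result


if __name__ == "__main__":
    stats = unanswered_nids(NIDS_STATS)
    for nid, connects in stats.items():
        print(f"{nid}: {connects} connects, 0 replies")
    # 19@kfi is the NID on n03's remaining link (see the description).
    print("19@kfi tried:", "19@kfi" in stats)
```

      In this sample the two per-NID connect counts (101 and 99) add up to the 200 connection_attempts, all without a reply, and 19@kfi does not appear in the list at all.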
      

      The file system uses a monitoring service that pings server peers (IP and LNet) to assess the health of the host fabric links and, if a server is determined to have lost connectivity, may fail over its targets. The service starts when the nodes boot, and it loads lnet. For this test, the nodes were rebooted after the cables were pulled, so each server had only a single LNet NID configured.

      n02 and n03 are an HA pair. n02 has a single target, the combined mgt/mdt0, and n03 has a single target, mdt1. After the monitoring service runs its ping checks, every server shows peer entries for the other servers, with a single NID for each (the one on the 'UP' interface). On n02, the n03 peer entry shows 19@kfi (the cxi1 link); n03's primary LNet NID, 82@kfi, is down. After the server targets are mounted, n02 shows separate peer entries for 19@kfi and 82@kfi. Both entries show multi-rail "True", but each lists only a single peer NID.

      Full debug logs were captured. They show what look like problems with peer addition for mdt1: the primary NID 82@kfi is created, and later the log shows:

      00000100:00000040:26.0:1782752887.365754:0:2013903:0:(lustre_peer.c:119:class_add_uuid()) found uuid 82@kfi 19@kfi cnt=2
      00000400:00001000:26.0:1782752887.365755:0:2013903:0:(api-ni.c:3222:LNetNIInit()) refs 2
      00000400:00000200:26.0:1782752887.365756:0:2013903:0:(api-ni.c:1680:lnet_nid4_cpt_hash()) Match nid 82@kfi to cpt 4
      00000400:00000200:26.0:1782752887.365757:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 82@kfi NID flags 0x100001: -17
      00000400:00000200:26.0:1782752887.365758:0:2013903:0:(api-ni.c:1680:lnet_nid4_cpt_hash()) Match nid 19@kfi to cpt 4
      00000400:00000200:26.0:1782752887.365759:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 19@kfi NID flags 0x1: -17
      00000100:00000040:26.0:1782752887.365760:0:2013903:0:(lustre_peer.c:122:class_add_uuid()) Add peer 19@kfi rc = 0
      

      The log reports -17 (EEXIST) from lnet_peer_add() for both NIDs. mdt0 never establishes a connection to mdt1.
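
      When scanning the full debug log, the lnet_peer_add() results can be pulled out and the return codes decoded with a short script. This is a sketch only: the regular expression assumes the message layout shown in the excerpt above, the function name is hypothetical, and the errno names come from Python's errno table, which is assumed to match the kernel's numbering (it does on Linux).

```python
import errno
import re

# lnet_peer_add() lines copied from the debug log excerpt in this report.
LOG = """\
00000400:00000200:26.0:1782752887.365757:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 82@kfi NID flags 0x100001: -17
00000400:00000200:26.0:1782752887.365759:0:2013903:0:(peer.c:1735:lnet_peer_add()) peer 19@kfi NID flags 0x1: -17
"""

PEER_ADD_RE = re.compile(
    r"\(peer\.c:\d+:lnet_peer_add\(\)\) peer (?P<nid>\S+) "
    r"NID flags (?P<flags>0x[0-9a-fA-F]+): (?P<rc>-?\d+)"
)


def peer_add_results(text):
    """Return (nid, flags, rc, errno_name) for each lnet_peer_add() line."""
    results = []
    for m in PEER_ADD_RE.finditer(text):
        rc = int(m["rc"])
        # Negative return codes are negated errno values.
        name = errno.errorcode.get(-rc, "UNKNOWN") if rc < 0 else "OK"
        results.append((m["nid"], m["flags"], rc, name))
    return results


if __name__ == "__main__":
    for nid, flags, rc, name in peer_add_results(LOG):
        print(f"{nid} flags={flags} rc={rc} ({name})")
```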

      I'll attach the n02 debug log.

            hornc Chris Horn
            peggy Peggy Gazzola