LU-4214: Hyperion - OST never recovers on failover node

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Lustre 2.6.0
    • Lustre 2.5.0
    • None
    • 3
    • 11463

      On Hyperion, doing manual failover. OSTs are formatted thusly:

      mkfs.lustre --reformat --ost --fsname lustre --mgsnode=$MGSNODE --index=$stinx --servicenode=${PRI[$i]} --servicenode=${SEC[$i]} --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' /dev/sd${DISK[$i]} &
      

      Result on disk:

      Permanent disk data:
      Target:     lustre-OST0013
      Index:      19
      Lustre FS:  lustre
      Mount type: ldiskfs
      Flags:      0x1002
                    (OST no_primnode )
      Persistent mount opts: errors=remount-ro
      Parameters: mgsnode=192.168.120.5@o2ib failover.node=192.168.127.62@o2ib failover.node=192.168.127.66@o2ib
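
For reference while reading the on-disk data above, here is a minimal sketch that decodes the Flags value and lists the failover NIDs from the Parameters line. The bit-to-name mapping (0x0002 for OST, 0x1000 for no_primnode) is an assumption inferred from the output itself (0x1002 printed as "OST no_primnode"), and the function names are hypothetical helpers, not part of any Lustre tool.

```shell
#!/bin/sh
# Decode a "Flags:" value such as 0x1002.
# ASSUMPTION: only the two bits named in the output above are mapped;
# any other bit is reported as unknown rather than guessed.
decode_ldd_flags() {
    flags=$(( $1 ))                      # accepts hex, e.g. 0x1002
    names=""
    if [ $(( flags & 0x0002 )) -ne 0 ]; then names="$names OST"; fi
    if [ $(( flags & 0x1000 )) -ne 0 ]; then names="$names no_primnode"; fi
    rest=$(( flags & ~0x1002 ))          # bits this sketch does not know
    if [ "$rest" -ne 0 ]; then
        names="$names $(printf 'unknown:0x%x' "$rest")"
    fi
    echo "${names# }"                    # drop the leading space
}

# Print one failover NID per line from a "Parameters:" line on stdin.
list_failover_nids() {
    tr ' ' '\n' | sed -n 's/^failover\.node=//p'
}

decode_ldd_flags 0x1002
echo 'Parameters: mgsnode=192.168.120.5@o2ib failover.node=192.168.127.62@o2ib failover.node=192.168.127.66@o2ib' \
    | list_failover_nids
```

With the values from this report, the first call prints the two flag names and the second prints both service nodes (.62 and .66), which shows that neither is recorded as a distinguished primary.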
      

      Procedure:

      • power off dit31
      • run script which mounts OSTs on dit35
        Result:
        The MGS logs this message (.62 is the primary, .66 is the failover; .62 has been powered off via STONITH at this point):
        h-agb5: Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0013_UUID seen on new nid 192.168.127.66@o2ib1 when existing nid 192.168.127.62@o2ib1 is already connected
        

        The MGS/MDS thereafter ignores these OSTs and keeps logging error messages that point at the primary NID:

        Nov  5 16:18:16 hyperion-agb5 kernel: Lustre: 6143:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1383697096/real 1383697096]  req@ffff8807bf317c00 x1450911076896768/t0(0) o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4 lens 400/544 e 0 to 1 dl 1383697151 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
        
        

        This condition persists despite a power cycle/restart of the MGS/MDS.
        The OSS node reports one error:

        LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=on. Opts: 
        LustreError: 13a-8: Failed to get MGS log params and no local copy.
        

        The OST never enters recovery, despite printing the Imperative Recovery message.
        This condition persists despite repeated remounts, remounting with abort_recov, etc.
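
To make it easier to see which NID the MDS keeps retrying, the following sketch pulls the import name and target NID out of a ptlrpc_expire_one_request line like the one quoted above. The function name is a hypothetical helper, and the pattern assumes the `o<opcode>->target@nid:portal/...` layout seen in that one message; other Lustre versions may format the request differently.

```shell
#!/bin/sh
# Extract "<import> <nid>" from a ptlrpc_expire_one_request log line.
# ASSUMPTION: the request field looks like
#   o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4
# as in the message quoted in this report.
extract_rpc_target() {
    sed -n 's/.* o[0-9][0-9]*->\([^@ ]*\)@\([^: ]*\):[0-9].*/\1 \2/p'
}

# The MDS message from this report, verbatim.
line='Nov  5 16:18:16 hyperion-agb5 kernel: Lustre: 6143:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1383697096/real 1383697096]  req@ffff8807bf317c00 x1450911076896768/t0(0) o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4 lens 400/544 e 0 to 1 dl 1383697151 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1'

printf '%s\n' "$line" | extract_rpc_target
# prints: lustre-OST0013-osc-MDT0000 192.168.127.62@o2ib
```

Running this over the MDS syslog and piping through `sort | uniq -c` would show whether any request is ever addressed to .66; in the excerpt here, only the powered-off .62 appears.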

      Clients continue to time out on the primary NID.

      The system remains in this state for further data gathering; suggestions appreciated.

            tappro Mikhail Pershin
            cliffw Cliff White (Inactive)