LU-4214: Hyperion - OST never recovers on failover node

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.5.0
    • Component/s: None
    • Severity: 3

    Description

      On Hyperion, doing manual failover testing. The OSTs are formatted as follows:

      mkfs.lustre --reformat --ost --fsname lustre --mgsnode=$MGSNODE --index=$stinx --servicenode=${PRI[$i]} --servicenode=${SEC[$i]} --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' /dev/sd${DISK[$i]} &
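
      For reference, the permanent disk data shown below can be read back from the device with, for example:

      tunefs.lustre --dryrun /dev/sd${DISK[$i]}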
      

      Result on disk:

         Permanent disk data:
      Target:     lustre-OST0013
      Index:      19
      Lustre FS:  lustre
      Mount type: ldiskfs
      Flags:      0x1002
                    (OST no_primnode )
      Persistent mount opts: errors=remount-ro
      Parameters: mgsnode=192.168.120.5@o2ib failover.node=192.168.127.62@o2ib failover.node=192.168.127.66@o2ib
      

      Procedure:

      • power off dit31
      • run the script which mounts the OSTs on dit35 (a sketch of the mount command appears below)
        Result:
        The MGS gives this message (.62 is the primary NID, .66 is the failover NID; .62 is STONITH-dead at this time):
        h-agb5: Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0013_UUID seen on new nid 192.168.127.66@o2ib1 when existing nid 192.168.127.62@o2ib1 is already connected
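
        A rough sketch of what the mount script does on dit35 for each OST; the device and mount point names are illustrative, not taken from the actual script:

        # on dit35, the secondary servicenode
        mount -t lustre /dev/sdb /mnt/lustre/ost0013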
        

        The MGS/MDS thereafter ignores these OSTs and keeps emitting error messages that point at the primary NID:

        Nov  5 16:18:16 hyperion-agb5 kernel: Lustre: 6143:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1383697096/real 1383697096]  req@ffff8807bf317c00 x1450911076896768/t0(0) o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4 lens 400/544 e 0 to 1 dl 1383697151 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
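
        Which NID the MDS is actually using for OST0013 can be confirmed from the import state on the MDS; the exact subsystem prefix (osc vs. osp) varies by Lustre version, so a wildcard is used here:

        lctl get_param "*.lustre-OST0013-osc-MDT0000.import"

        The import output lists the configured failover_nids and the current_connection in use.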
        
        

        This condition persists despite a power cycle/restart of the MGS/MDS.
        OSS node:
        The OSS reports one error:

        LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=on. Opts: 
        LustreError: 13a-8: Failed to get MGS log params and no local copy.
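
        Basic LNET connectivity from the OSS back to the MGS can be sanity-checked as follows (192.168.120.5@o2ib is the mgsnode from the on-disk parameters above):

        lctl list_nids
        lctl ping 192.168.120.5@o2ib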
        

        The OST never enters recovery, despite printing an Imperative Recovery message.
        This condition persists despite repeated remounts, including remounting with abort_recov (see the example below).
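
        For reference, the abort_recov remount attempts are of the form below; the device and mount point are illustrative:

        umount /mnt/lustre/ost0013
        mount -t lustre -o abort_recov /dev/sdb /mnt/lustre/ost0013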

      Clients continue to time out on the primary NID.

      The system remains in this state for further data gathering; suggestions appreciated.


            People

              Assignee: Mikhail Pershin (tappro)
              Reporter: Cliff White (cliffw) (Inactive)
              Votes: 0
              Watchers: 7
