[LU-4214] Hyperion - OST never recovers on failover node Created: 06/Nov/13  Updated: 03/Nov/17  Resolved: 11/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-6273 Hard Failover replay-dual test_17: Fa... Resolved
is related to LU-8089 MGT/MDT mount fails on secondary HA node Resolved
Severity: 3
Rank (Obsolete): 11463

 Description   

On Hyperion, performing a manual failover. The OSTs are formatted as follows:

mkfs.lustre --reformat --ost --fsname lustre --mgsnode=$MGSNODE --index=$stinx --servicenode=${PRI[$i]} --servicenode=${SEC[$i]} --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' /dev/sd${DISK[$i]} &
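For context, a minimal sketch of how the variables referenced in that command might be populated (the loop structure, array contents, and device letter are illustrative assumptions; the actual Hyperion script is not attached to this ticket, though the NIDs and index below are taken from the disk data shown further down):

    # Hypothetical setup -- for illustration only, not the Hyperion configuration
    MGSNODE=192.168.120.5@o2ib
    PRI=(192.168.127.62@o2ib)      # primary OSS NID for each OST
    SEC=(192.168.127.66@o2ib)      # failover OSS NID for each OST
    DISK=(b)                       # backing device letter, i.e. /dev/sdb
    for i in 0; do
        stinx=$((19 + i))          # OST index, e.g. 19 == lustre-OST0013
        mkfs.lustre --reformat --ost --fsname lustre --mgsnode=$MGSNODE \
            --index=$stinx --servicenode=${PRI[$i]} --servicenode=${SEC[$i]} \
            --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' \
            /dev/sd${DISK[$i]} &
    done
    wait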

Result on disk:

   Permanent disk data:
Target:     lustre-OST0013
Index:      19
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.120.5@o2ib failover.node=192.168.127.62@o2ib failover.node=192.168.127.66@o2ib
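For reference, this permanent disk data can be re-read on the OSS at any time, without modifying anything, via a tunefs.lustre dry run; a minimal sketch, assuming /dev/sdb is the device backing lustre-OST0013:

    # Print the permanent disk data (Target, Index, Flags, Parameters) without changing it
    tunefs.lustre --dryrun /dev/sdb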

Procedure:

  • power off dit31
  • run a script which mounts the OSTs on dit35
    Result:
    The MGS gives this message (.62 is the primary, .66 is the failover; .62 is STONITH-dead at this time):
    h-agb5: Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0013_UUID seen on new nid 192.168.127.66@o2ib1 when existing nid 192.168.127.62@o2ib1 is already connected
    

    The MGS/MDS thereafter ignores these OSTs, continuing to give error messages pointing at the primary NID:

    Nov  5 16:18:16 hyperion-agb5 kernel: Lustre: 6143:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1383697096/real 1383697096]  req@ffff8807bf317c00 x1450911076896768/t0(0) o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4 lens 400/544 e 0 to 1 dl 1383697151 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
    
    

    This condition persists despite a power cycle/restart of the MGS/MDS.
    The OSS node reports one error:

    LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=on. Opts: 
    LustreError: 13a-8: Failed to get MGS log params and no local copy.
    

    The OST never enters recovery, despite printing an Imperative Recovery message.
    This condition persists despite repeated remounts, remounting with abort_recov, etc. (a sketch of these mount commands follows below).
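    A sketch of the failover mount and abort_recov remount referred to above (the device path and mount point are assumptions for illustration; abort_recov is the standard Lustre mount option that skips recovery):

    # On the failover OSS (dit35), after dit31 has been STONITHed:
    mount -t lustre /dev/sdb /mnt/ost0013

    # One of the retried variants: remount with recovery aborted
    umount /mnt/ost0013
    mount -t lustre -o abort_recov /dev/sdb /mnt/ost0013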

Clients continue to time out on the primary NID.

System remains in this state for further data gathering, suggestions appreciated.



 Comments   
Comment by Mikhail Pershin [ 28/Nov/13 ]

Cliff, the bug is set as 'related' to LU-2059; does it really happen due to LU-2059, or is that just a suspicion? Another question: are there logs from the MDT?

Comment by Cliff White (Inactive) [ 02/Dec/13 ]

I have no idea why that is marked as related; it was not done by me. There was very little information in the logs; I posted what there was into the bug. The lack of any error messages in this situation is rather frustrating.

Comment by Mikhail Pershin [ 03/Dec/13 ]

OK, I see. I don't yet have a good idea of what is wrong there, but I do have one about this message:

Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0013_UUID seen on new nid 192.168.127.66@o2ib1 when existing nid 192.168.127.62@o2ib1 is already connected

It looks like we need to fix target_handle_connect() to establish a new connection for the LWP client when the NID changes, as we already do for the MDS connection. The patch is here: http://review.whamcloud.com/#/c/8465/ and I am waiting for Johann's reply on it.

Comment by Andreas Dilger [ 25/Apr/14 ]

Mike, Johann commented on the patch http://review.whamcloud.com/8465, so it needs to be refreshed.

Comment by Jodi Levi (Inactive) [ 11/Jun/14 ]

Patch landed to Master.

Comment by Gerrit Updater [ 11/Feb/15 ]

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13726
Subject: LU-4214 lwp: fix LWP client connect logic
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8abaa93afd61f6c28e15e035a1a06ecf7f6d748e
