Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 2.5.0
- None
- 3
- 11463
Description
On Hyperion, doing manual failover testing. OSTs are formatted as follows:
mkfs.lustre --reformat --ost --fsname lustre --mgsnode=$MGSNODE --index=$stinx --servicenode=${PRI[$i]} --servicenode=${SEC[$i]} --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' /dev/sd${DISK[$i]} &
Result on disk:
Permanent disk data:
Target:            lustre-OST0013
Index:             19
Lustre FS:         lustre
Mount type:        ldiskfs
Flags:             0x1002 (OST no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.120.5@o2ib failover.node=192.168.127.62@o2ib failover.node=192.168.127.66@o2ib
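For reference, the same permanent disk data can be re-read non-destructively with tunefs.lustre in dry-run mode (device name here is illustrative):

tunefs.lustre --dryrun /dev/sdb    # prints Target, Index, Flags and Parameters without modifying the device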
Procedure:
- power off dit31
- run script which mounts the OSTs on dit35 (mount step sketched below)
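The mount step on dit35 is essentially of this form (device name, OST index, and mount point are illustrative, not taken from the actual script):

mount -t lustre /dev/sdb /mnt/lustre/ost0013    # start the failed-over OST on the secondary service node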
Result:
MGS gives this message (.62 is primary, .66 is failover; .62 is STONITH-dead at this time):
h-agb5: Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0013_UUID seen on new nid 192.168.127.66@o2ib1 when existing nid 192.168.127.62@o2ib1 is already connected
The MGS/MDS thereafter ignores these OSTs, continuing to emit error messages pointing at the primary NID:
Nov 5 16:18:16 hyperion-agb5 kernel: Lustre: 6143:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1383697096/real 1383697096] req@ffff8807bf317c00 x1450911076896768/t0(0) o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4 lens 400/544 e 0 to 1 dl 1383697151 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
This condition persists despite a power cycle/restart of the MGS/MDS.
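As a sanity check (not part of the original report), LNET reachability of both service NIDs can be verified from the MGS/MDS node with lctl ping:

lctl ping 192.168.127.62@o2ib    # primary OSS, STONITH-dead, expected to fail
lctl ping 192.168.127.66@o2ib    # failover OSS that actually has the OSTs mounted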
OSS node:
Reports one error:
LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=on. Opts:
LustreError: 13a-8: Failed to get MGS log params and no local copy.
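The "no local copy" part refers to the configuration logs normally cached under CONFIGS/ on the target; if useful, that copy can be inspected by mounting the backing device as ldiskfs (device and mount point here are assumptions):

mount -t ldiskfs /dev/sdb /mnt/ost_ldiskfs
ls /mnt/ost_ldiskfs/CONFIGS    # look for the lustre-OST0013 config log and mountdata
umount /mnt/ost_ldiskfs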
Never enters recovery, despite printing an Imperative Recovery message.
This condition persists despite repeated remount, remount with abort_recov, etc.
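For completeness, the abort_recov remount referred to above would be of the form (device and mount point illustrative):

umount /mnt/lustre/ost0013
mount -t lustre -o abort_recov /dev/sdb /mnt/lustre/ost0013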
Clients continue to time out on the primary NID.
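On a client, which NID the OSC is actually trying can be confirmed from the import state (device wildcard assumed, standard osc parameter):

lctl get_param osc.lustre-OST0013-osc-*.import    # current_connection would be expected to still show 192.168.127.62@o2ib, matching the timeouts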
The system remains in this state for further data gathering; suggestions appreciated.