Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.6.0
Affects Version/s: Lustre 2.5.0
Labels: None
Severity: 3
Rank (Obsolete): 11463

Description

On Hyperion, doing manual failover. OSTs are formatted thusly:

mkfs.lustre --reformat --ost --fsname lustre --mgsnode=$MGSNODE --index=$stinx --servicenode=${PRI[$i]} --servicenode=${SEC[$i]} --mkfsoptions='-t ext4 -J size=2048 -O extents -G 256 -i 69905' /dev/sd${DISK[$i]}

Result on disk:

   Permanent disk data:
Target:     lustre-OST0013
Index:      19
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x1002
              (OST no_primnode )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.120.5@o2ib failover.node=192.168.127.62@o2ib failover.node=192.168.127.66@o2ib
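
The Parameters line above can be checked mechanically. The following is a minimal sketch, assuming the space-separated key=value layout shown in this report; the helper name parse_params is hypothetical and not part of any Lustre tool.

```python
from collections import defaultdict

def parse_params(line):
    """Group the key=value pairs of a 'Parameters:' line by key."""
    params = defaultdict(list)
    body = line.split(":", 1)[1]  # drop the leading "Parameters" label
    for token in body.split():
        key, _, value = token.partition("=")
        params[key].append(value)
    return dict(params)

# Line copied from the on-disk data shown above.
line = ("Parameters: mgsnode=192.168.120.5@o2ib "
        "failover.node=192.168.127.62@o2ib failover.node=192.168.127.66@o2ib")
params = parse_params(line)
print("MGS NIDs:     ", params["mgsnode"])
print("failover NIDs:", params["failover.node"])
```

Both --servicenode addresses show up as failover.node entries, which appears consistent with the no_primnode flag in the same output.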

Procedure:

Power off dit31.

Run the script which mounts the OSTs on dit35.

Result:

The MGS gives this message (.62 is the primary, .66 is the failover; .62 is STONITH dead at this time):

h-agb5: Lustre: lustre-MDT0000: Client lustre-MDT0000-lwp-OST0013_UUID seen on new nid 192.168.127.66@o2ib1 when existing nid 192.168.127.62@o2ib1 is already connected

MGS/MDS thereafter ignores these OSTs, continuing to give error messages pointing at primary NID:

Nov  5 16:18:16 hyperion-agb5 kernel: Lustre: 6143:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1383697096/real 1383697096]  req@ffff8807bf317c00 x1450911076896768/t0(0) o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4 lens 400/544 e 0 to 1 dl 1383697151 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
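
To see which NID the MDS keeps retrying, the target and peer can be pulled out of the o8->... field of the message above. This is a rough sketch, assuming the o<opcode>-><target>@<nid>:<portal> layout seen in this one line; the regular expression and the name parse_peer are illustrative, not an official definition of the log format.

```python
import re

# Matches e.g. "o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4"
PEER_RE = re.compile(r"\bo(?P<opc>\d+)->(?P<target>[^@\s]+)@(?P<nid>[^:\s]+):")

def parse_peer(msg):
    """Return (opcode, target, peer NID) from a request dump, or None."""
    m = PEER_RE.search(msg)
    if m is None:
        return None
    return int(m["opc"]), m["target"], m["nid"]

# Fragment copied from the MDS log line above.
msg = ("req@ffff8807bf317c00 x1450911076896768/t0(0) "
       "o8->lustre-OST0013-osc-MDT0000@192.168.127.62@o2ib:28/4 lens 400/544")
failover_nid = "192.168.127.66@o2ib"  # the node the OSTs were mounted on

opc, target, nid = parse_peer(msg)
print(target, "is being contacted at", nid)
print("request goes to the failover node:", nid == failover_nid)
```

For the line in this report the peer is still 192.168.127.62@o2ib, the powered-off primary, rather than the failover node.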

This condition persists despite a power cycle/restart of the MGS/MDS.
The OSS node reports one error:

LDISKFS-fs (sdb): mounted filesystem with ordered data mode. quota=on. Opts: 
LustreError: 13a-8: Failed to get MGS log params and no local copy.

The OSS never enters recovery, despite printing the Imperative Recovery message.
This condition persists despite repeated remounts, remounts with abort_recov, etc.

Clients continue to time out on the primary NID.

The system remains in this state for further data gathering; suggestions appreciated.

Issue Links

is related to:

LU-6273 Hard Failover replay-dual test_17: Failover OST mount hang (Resolved)

LU-8089 MGT/MDT mount fails on secondary HA node (Resolved)

People

Assignee: Mikhail Pershin

Reporter: Cliff White (Inactive)

Votes: 0

Watchers: 7

Dates

Created: 06/Nov/13 12:25 AM

Updated: 03/Nov/17 3:41 PM

Resolved: 11/Jun/14 1:44 PM