[LU-4842] MDS crash and didn't failover (24022:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid) Created: 31/Mar/14  Updated: 30/Jan/16  Resolved: 30/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.1
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Rajeshwaran Ganesan Assignee: Hongchao Zhang
Resolution: Incomplete Votes: 0
Labels: None
Environment:

Lustre 2.4.1


Severity: 3
Rank (Obsolete): 13339

 Description   

We have seen multiple times that the Lustre MDT did not start on the
failover server. This happened for different file systems. Either the
configuration is wrong or something causes the failover server to hang
during recovery.
This happened on the HC3WORK file system on Dec 20, 2013, but also on the
pfs2dat1 file system on Mar 10, and on multiple file systems
(pfs2wor1, pfs2dat2, pfs2dat1, HC3WORK) while I tried to deactivate ACLs
on Mar 17.



 Comments   
Comment by Peter Jones [ 01/Apr/14 ]

Hongchao

Could you please advise on this one?

Thanks

Peter

Comment by Rajeshwaran Ganesan [ 03/Apr/14 ]

Hello

Hong Chao - Could you please give us an update? If you need any other logs or information, please let me know.

Thanks,
Rajesh

Comment by Hongchao Zhang [ 04/Apr/14 ]

Hi Rajesh,

The log message "(24022:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid)" means that no object was found on the MDT
for an open request without the "MDS_OPEN_CREAT" flag; the object may have just been created by a previous client that could not recover with the MDT correctly.

Could you please attach the syslogs of the MDT, the OSS and the clients?
It would also be very helpful if you collected the debug logs of these nodes (using "lctl dk > XXX" to get them).
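For example, something like the following on each node (a sketch only; the output path is just a placeholder):

    lctl dk > /tmp/lustre-debug-$(hostname).log    # dump and clear the kernel debug buffer to a file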

Thanks
Hongchao

Comment by Rajeshwaran Ganesan [ 11/Apr/14 ]

Hi

I have uploaded the requested logs to ftp.whamcloud.com under Lu-4842.

Thanks,
Rajesh

Comment by Rajeshwaran Ganesan [ 16/Apr/14 ]

Hello Hongchao,

Please give us an update, and let me know if you need any other logs, etc.

Thanks,
Rajesh

Comment by Hongchao Zhang [ 17/Apr/14 ]

Hi Rajesh,

The logs (Lu-4842/LU_logs.txt) don't contain any error messages. When did you collect the debug log? It only covers about 50 seconds (1397232533 ~ 1397232605),
and new logs overwrite the old ones. The value in "/proc/sys/lnet/debug_mb" is the size of the debug log buffer in megabytes; could you please enlarge it by running
"echo 'new_size' > /proc/sys/lnet/debug_mb" and collect the logs right after the recovery fails?

Could you please also upload the syslogs of the nodes (MDT, OSS, clients) involved in this recovery failure?
Thanks

Comment by Rajeshwaran Ganesan [ 29/Apr/14 ]

Attaching additional logs. The same issue occurred on 10/03. Attaching those logs too.

2014-03-10-es_lustre_showall_2014-03-10_155943.tar.bz2
2014-03-10-es_lustre_showall_2014-03-10_152317.tar.bz2

Hope it helps, Please let us know, if you need any other logs

Comment by Hongchao Zhang [ 30/Apr/14 ]

Did you fail over the MDS at pfs2n2 (IB: 172.26.8.2, FS: pfs2dat1) to the MDS at pfs2n3 (IB: 172.26.8.3, FS: pfs2dat1) on 2014-03-10?

The node pfs2n6 (IB: 172.26.8.6, FS: pfs2wor1) detected the failover and switched its MGS connection to pfs2n3:

Mar 10 14:16:09 pfs2n6 kernel: : Lustre: 11986:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457362/real 0]  req@ffff880745bc5c00 x1454141536457600/t0(0) o400->MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 224/224 e 0 to 1 dl 1394457369 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:16:09 pfs2n6 kernel: : LustreError: 166-1: MGC172.26.8.2@o2ib: Connection to MGS (at 172.26.8.2@o2ib) was lost; in progress operations using this service will fail
Mar 10 14:16:40 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394457394/real 1394457394]  req@ffff8807bd6c4c00 x1454141536458640/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457400 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:16:40 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 10 14:17:10 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457419/real 0]  req@ffff88081d026000 x1454141536459448/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: Evicted from MGS (at MGC172.26.8.2@o2ib_0) after server handle changed from 0x3a6efc7d7b9853c2 to 0x80cf882fb86516e3
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: MGC172.26.8.2@o2ib: Connection restored to MGS (at 172.26.8.3@o2ib)
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: Skipped 9 previous similar messages
Mar 10 14:18:26 pfs2n6 kernel: : Lustre: 11992:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394457494/real 1394457494]  req@ffff880740871800 x1454141536461908/t0(0) o400->MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 224/224 e 0 to 1 dl 1394457506 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:18:26 pfs2n6 kernel: : LustreError: 166-1: MGC172.26.8.2@o2ib: Connection to MGS (at 172.26.8.3@o2ib) was lost; in progress operations using this service will fail
Mar 10 14:19:57 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457581/real 0]  req@ffff88060f659400 x1454141536464732/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457597 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:19:57 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Mar 10 14:22:42 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457731/real 0]  req@ffff880802be6400 x1454141536469596/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457762 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:22:42 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Mar 10 14:26:21 pfs2n6 kernel: : Lustre: Evicted from MGS (at MGC172.26.8.2@o2ib_0) after server handle changed from 0x80cf882fb86516e3 to 0xe62e03dfa42a5fd9
Mar 10 14:26:21 pfs2n6 kernel: : Lustre: MGC172.26.8.2@o2ib: Connection restored to MGS (at 172.26.8.3@o2ib)
Mar 10 14:54:15 pfs2n6 kernel: : Lustre: pfs2wor1-MDT0000: haven't heard from client f7256a7b-c8ff-e78f-0a3d-a50b8fb47015 (at 172.26.5.106@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8810047b5c00, cur 1394459655 expire 1394459505 last 1394459428

However, no clients seem to have connected to pfs2n3 except pfs2n4 (OSS) and pfs2n5 (OSS):

Mar 10 14:26:09 pfs2n3 kernel: : LustreError: 137-5: pfs2dat1-MDT0000_UUID: not available for connect from 172.26.8.5@o2ib (no target)
Mar 10 14:26:09 pfs2n3 kernel: : LustreError: 137-5: pfs2dat1-MDT0000_UUID: not available for connect from 172.26.8.4@o2ib (no target)
INFO: task tgt_recov:5224 blocked for more than 120 seconds.
Mar 10 14:47:59 pfs2n3 kernel: : "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 10 14:47:59 pfs2n3 kernel: : tgt_recov     D 000000000000000f     0  5224      2 0x00000080
Mar 10 14:47:59 pfs2n3 kernel: : ffff880800ea3e00 0000000000000046 0000000000000000 0000000000000003
Mar 10 14:47:59 pfs2n3 kernel: : ffff880800ea3d90 ffffffff81055f96 ffff880800ea3da0 ffff880830298080
Mar 10 14:47:59 pfs2n3 kernel: : ffff880832d73ab8 ffff880800ea3fd8 000000000000fb88 ffff880832d73ab8
Mar 10 14:47:59 pfs2n3 kernel: : Call Trace:
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07d8070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07d972d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07d7d60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0ede>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0950>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0950>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0950>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The problem looks a bit like LU-4722, in which the clients don't fail over to the right server node.
Could you please also dump the config logs "pfs2dat1-client" and "pfs2wor1-client" at pfs2n3?
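For reference, one way to dump them (a sketch only, assuming the MGS is currently running on pfs2n3; the output paths are placeholders):

    lctl --device MGS llog_print pfs2dat1-client > /tmp/pfs2dat1-client.cfg    # dump the client config log
    lctl --device MGS llog_print pfs2wor1-client > /tmp/pfs2wor1-client.cfg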
Thanks

Comment by Hongchao Zhang [ 22/Jan/16 ]

Hi Rajesh,
Do you need any more work on this ticket, or can we close it?
Thanks

Comment by Rajeshwaran Ganesan [ 28/Jan/16 ]

Hello Hongchao,

Please close this case. The customer has upgraded to a newer version of Lustre and we are good now.

Thanks,
Rajesh

Comment by John Fuchs-Chesney (Inactive) [ 30/Jan/16 ]

Thanks Rajesh.

~ jfc.
