[LU-4842] MDS crashed and did not fail over (24022:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid) Created: 31/Mar/14 Updated: 30/Jan/16 Resolved: 30/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Rajeshwaran Ganesan | Assignee: | Hongchao Zhang |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: | Lustre 2.4.1 |
| Severity: | 3 |
| Rank (Obsolete): | 13339 |
| Description |
|
We have seen multiple times that the Lustre MDT did not start on the |
| Comments |
| Comment by Peter Jones [ 01/Apr/14 ] |
|
Hongchao, could you please advise on this one? Thanks, Peter |
| Comment by Rajeshwaran Ganesan [ 03/Apr/14 ] |
|
Hello Hongchao, could you please give us an update? If you need any other logs or information, please let me know. Thanks, |
| Comment by Hongchao Zhang [ 04/Apr/14 ] |
|
Hi Rajesh, the log message "(24022:0:(mdt_open.c:1497:mdt_reint_open()) @@@ OPEN & CREAT not in open replay/by_fid)" means that no object was found on the MDT for the replayed open. Could you please attach the syslogs of the MDT, the OSS, and the clients? Thanks |
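Reading the message that way, a toy model of the check may help. Everything below is hypothetical and simplified for illustration; it is not the Lustre source, only the observable decision described in the comment above: a replayed or by-FID open carrying O_CREAT refers to an object the MDT cannot find, so the open is refused.

/* Toy model of the check behind "OPEN & CREAT not in open
 * replay/by_fid".  All names are hypothetical; this is NOT the
 * Lustre source, only the logic as described above. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct open_request {
        bool is_replay;     /* open replayed after a failover */
        bool open_by_fid;   /* open addressed by FID, not by name */
        bool create;        /* O_CREAT set on the original open */
        bool object_exists; /* FID resolves to an object on the MDT? */
};

/* Refuse a replayed/by-FID open|creat whose object the MDT cannot
 * find; why the object must already exist in this path is a Lustre
 * recovery detail, so only the observable check is modeled here. */
static int check_open_replay(const struct open_request *req)
{
        if (req->create && !req->object_exists &&
            (req->is_replay || req->open_by_fid)) {
                fprintf(stderr,
                        "@@@ OPEN & CREAT not in open replay/by_fid\n");
                return -ENOENT; /* error code illustrative only */
        }
        return 0;
}

int main(void)
{
        struct open_request req = {
                .is_replay = true, .create = true, .object_exists = false,
        };
        printf("rc = %d\n", check_open_replay(&req)); /* rc = -2 */
        return 0;
}

Seen this way, the request for syslogs from all nodes is a way to establish where the object went missing: on the MDT's disk, or in the replay stream between client and server.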
| Comment by Rajeshwaran Ganesan [ 11/Apr/14 ] |
|
Hi, I have uploaded the requested logs to ftp.whamcloud.com under Lu-4842. Thanks, |
| Comment by Rajeshwaran Ganesan [ 16/Apr/14 ] |
|
Hello Hongchao, please give us an update, and let me know if you need any other logs. Thanks, |
| Comment by Hongchao Zhang [ 17/Apr/14 ] |
|
Hi Rajesh, the logs (Lu-4842/LU_logs.txt) don't contain the error message. When did you capture the debug log? It only covers about 50 seconds (1397232533 ~ 1397232605). Could you please upload the syslogs of the nodes (MDT, OSS, clients) from this recovery failure? |
| Comment by Rajeshwaran Ganesan [ 29/Apr/14 ] |
|
Attaching additional logs. The same issue occurred on 10/03; attaching those logs too: 2014-03-10-es_lustre_showall_2014-03-10_155943.tar.bz2. Hope it helps. Please let us know if you need any other logs. |
| Comment by Hongchao Zhang [ 30/Apr/14 ] |
|
Did you fail the MDS over from pfs2n2 (IB: 172.26.8.2, FS: pfs2dat1) to pfs2n3 (IB: 172.26.8.3, FS: pfs2dat1) on 2014-03-10? The MGS at pfs2n6 (IB: 172.26.8.6, FS: pfs2wor1) detected the failover and switched to pfs2n3:
Mar 10 14:16:09 pfs2n6 kernel: : Lustre: 11986:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457362/real 0] req@ffff880745bc5c00 x1454141536457600/t0(0) o400->MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 224/224 e 0 to 1 dl 1394457369 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:16:09 pfs2n6 kernel: : LustreError: 166-1: MGC172.26.8.2@o2ib: Connection to MGS (at 172.26.8.2@o2ib) was lost; in progress operations using this service will fail
Mar 10 14:16:40 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394457394/real 1394457394] req@ffff8807bd6c4c00 x1454141536458640/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457400 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:16:40 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Mar 10 14:17:10 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457419/real 0] req@ffff88081d026000 x1454141536459448/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457430 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: Evicted from MGS (at MGC172.26.8.2@o2ib_0) after server handle changed from 0x3a6efc7d7b9853c2 to 0x80cf882fb86516e3
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: MGC172.26.8.2@o2ib: Connection restored to MGS (at 172.26.8.3@o2ib)
Mar 10 14:17:24 pfs2n6 kernel: : Lustre: Skipped 9 previous similar messages
Mar 10 14:18:26 pfs2n6 kernel: : Lustre: 11992:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1394457494/real 1394457494] req@ffff880740871800 x1454141536461908/t0(0) o400->MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 224/224 e 0 to 1 dl 1394457506 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:18:26 pfs2n6 kernel: : LustreError: 166-1: MGC172.26.8.2@o2ib: Connection to MGS (at 172.26.8.3@o2ib) was lost; in progress operations using this service will fail
Mar 10 14:19:57 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457581/real 0] req@ffff88060f659400 x1454141536464732/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.2@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457597 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:19:57 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Mar 10 14:22:42 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1394457731/real 0] req@ffff880802be6400 x1454141536469596/t0(0) o250->MGC172.26.8.2@o2ib@172.26.8.3@o2ib:26/25 lens 400/544 e 0 to 1 dl 1394457762 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Mar 10 14:22:42 pfs2n6 kernel: : Lustre: 11975:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
Mar 10 14:26:21 pfs2n6 kernel: : Lustre: Evicted from MGS (at MGC172.26.8.2@o2ib_0) after server handle changed from 0x80cf882fb86516e3 to 0xe62e03dfa42a5fd9
Mar 10 14:26:21 pfs2n6 kernel: : Lustre: MGC172.26.8.2@o2ib: Connection restored to MGS (at 172.26.8.3@o2ib)
Mar 10 14:54:15 pfs2n6 kernel: : Lustre: pfs2wor1-MDT0000: haven't heard from client f7256a7b-c8ff-e78f-0a3d-a50b8fb47015 (at 172.26.5.106@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8810047b5c00, cur 1394459655 expire 1394459505 last 1394459428
But there seems to be no client connected to pfs2n3 except pfs2n4 (OSS) and pfs2n5 (OSS):
Mar 10 14:26:09 pfs2n3 kernel: : LustreError: 137-5: pfs2dat1-MDT0000_UUID: not available for connect from 172.26.8.5@o2ib (no target)
Mar 10 14:26:09 pfs2n3 kernel: : LustreError: 137-5: pfs2dat1-MDT0000_UUID: not available for connect from 172.26.8.4@o2ib (no target)
INFO: task tgt_recov:5224 blocked for more than 120 seconds.
Mar 10 14:47:59 pfs2n3 kernel: : "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 10 14:47:59 pfs2n3 kernel: : tgt_recov D 000000000000000f 0 5224 2 0x00000080
Mar 10 14:47:59 pfs2n3 kernel: : ffff880800ea3e00 0000000000000046 0000000000000000 0000000000000003
Mar 10 14:47:59 pfs2n3 kernel: : ffff880800ea3d90 ffffffff81055f96 ffff880800ea3da0 ffff880830298080
Mar 10 14:47:59 pfs2n3 kernel: : ffff880832d73ab8 ffff880800ea3fd8 000000000000fb88 ffff880832d73ab8
Mar 10 14:47:59 pfs2n3 kernel: : Call Trace:
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff81055f96>] ? enqueue_task+0x66/0x80
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07d8070>] ? check_for_clients+0x0/0x70 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07d972d>] target_recovery_overseer+0x9d/0x230 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07d7d60>] ? exp_connect_healthy+0x0/0x20 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0ede>] target_recovery_thread+0x58e/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0950>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff8100c0ca>] child_rip+0xa/0x20
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0950>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffffa07e0950>] ? target_recovery_thread+0x0/0x1970 [ptlrpc]
Mar 10 14:47:59 pfs2n3 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Then the problem looks a bit like |
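The tgt_recov backtrace above shows the recovery thread parked in target_recovery_overseer() (note check_for_clients in the trace), waiting for the clients known before the crash to reconnect. If the expected clients never reach pfs2n3, that wait does not finish, and after 120 seconds the kernel flags the thread as hung. A toy model of such a wait, with hypothetical names (the real ptlrpc recovery code also handles timeouts, eviction, and abort; this sketch omits all of that):

/* Toy model of a recovery thread waiting for clients to reconnect.
 * Hypothetical names; NOT the Lustre source. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct recovery_state {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int connected; /* clients reconnected so far */
        int expected;  /* clients the target knew before the crash */
};

/* Blocks until every expected client reconnects.  If no client can
 * reach the failover node, this wait never completes and the thread
 * is reported as hung, as with tgt_recov above. */
static void *recovery_overseer(void *arg)
{
        struct recovery_state *rs = arg;

        pthread_mutex_lock(&rs->lock);
        while (rs->connected < rs->expected)
                pthread_cond_wait(&rs->cond, &rs->lock);
        pthread_mutex_unlock(&rs->lock);
        printf("recovery complete: %d/%d clients\n",
               rs->connected, rs->expected);
        return NULL;
}

int main(void)
{
        struct recovery_state rs = {
                .lock     = PTHREAD_MUTEX_INITIALIZER,
                .cond     = PTHREAD_COND_INITIALIZER,
                .expected = 2,
        };
        pthread_t tid;

        pthread_create(&tid, NULL, recovery_overseer, &rs);
        for (int i = 0; i < rs.expected; i++) {
                sleep(1); /* simulate a client reconnecting */
                pthread_mutex_lock(&rs.lock);
                rs.connected++;
                pthread_cond_signal(&rs.cond);
                pthread_mutex_unlock(&rs.lock);
        }
        pthread_join(tid, NULL);
        return 0;
}

The "not available for connect ... (no target)" replies on pfs2n3 also suggest the MDT device was not yet set up there when the OSSes tried to connect; checking that, together with lctl get_param mdt.*.recovery_status on the failover node, would be a useful first step in this situation.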
| Comment by Hongchao Zhang [ 22/Jan/16 ] |
|
Hi Rajesh, |
| Comment by Rajeshwaran Ganesan [ 28/Jan/16 ] |
|
Hello Hongchao, please close this case. The customer has upgraded to a newer version of Lustre and we are good now. Thanks, |
| Comment by John Fuchs-Chesney (Inactive) [ 30/Jan/16 ] |
|
Thanks Rajesh. ~ jfc. |