[LU-6953] LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG Created: 04/Aug/15  Updated: 10/Oct/15  Resolved: 05/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Mikhail Pershin
Resolution: Cannot Reproduce Votes: 0
Labels: llnl
Environment:

lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

grove-mds1 crashed 2015-07-29 with the following LBUG:

2015-07-29 03:05:17 LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG
2015-07-29 03:05:17 Call Trace:
2015-07-29 03:05:17 [<ffffffffa07b28f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2015-07-29 03:05:17 [<ffffffffa07b2ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
2015-07-29 03:05:17 [<ffffffffa0fcf9d8>] mdt_handle_common+0x13d8/0x1470 [mdt]
2015-07-29 03:05:17 [<ffffffffa100b625>] mds_regular_handle+0x15/0x20 [mdt]
2015-07-29 03:05:17 [<ffffffffa0b05095>] ptlrpc_server_handle_request+0x305/0xc00 [ptlrpc]
2015-07-29 03:05:17 [<ffffffffa07b352e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2015-07-29 03:05:17 [<ffffffffa07c4845>] ? lc_watchdog_touch+0x65/0x170 [libcfs]

It was preceded by a ptlrpc debug message

2015-07-29 03:05:17 Lustre:50126:0:(mdt_handler.c:4508:mdt_recovery()) @@@ rq_xid 15027...0684 matches last_xid, expected REPLAY or RESENT flag (0) req@ffff...d1400 x15027...0684/t0(0) o101->28e0...cc83@172.20.15.14@o2ib500:0/0 lens 4616/0 e 0 to 0 dl 1438165072 ref 1 fl Interpret:/0/ffffffff rc 0/-1

For this system, I cannot extract bulk logs and add them to the ticket. However, we do have a crash dump and console logs, so I can obtain specific information that would help.

The MDS was under severe memory pressure at the time of the LBUG.

The MDS was also responding very slowly at the time. At 3:05:03 it appears to have dropped 84,316 timed-out requests (output from one DEBUG_REQ() call from within ptlrpc_server_handle_request() appears in the console log, followed by "Skipped 84315 previous similar messages").
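For reference, below is a minimal stand-alone C model of the consistency check implied by the mdt_recovery() message quoted above. This is only a sketch under assumptions, not the actual Lustre code: the struct, helper names, and flag values are defined locally here and merely mirror the roles of rq_xid, last_xid, and the MSG_RESENT/MSG_REPLAY flags mentioned in the log.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* Modeled flag bits; illustrative only, not taken from lustre_idl.h. */
#define MSG_RESENT  0x0002
#define MSG_REPLAY  0x0004

/* Minimal stand-in for the pieces of a request this check looks at. */
struct model_request {
    uint64_t rq_xid;     /* transfer id of the incoming request */
    uint32_t rq_flags;   /* RESENT/REPLAY bits carried in the message */
};

/*
 * Model of the check suggested by the log: if the incoming XID equals the
 * last XID already seen from this client, the request must be marked
 * RESENT or REPLAY; otherwise the server treats the state as impossible
 * (the real code prints the DEBUG_REQ message and then hits LBUG()).
 */
static void model_mdt_recovery_check(const struct model_request *req,
                                     uint64_t client_last_xid)
{
    if (req->rq_xid == client_last_xid &&
        !(req->rq_flags & (MSG_RESENT | MSG_REPLAY))) {
        fprintf(stderr,
                "rq_xid %llu matches last_xid, expected REPLAY or RESENT "
                "flag (0)\n", (unsigned long long)req->rq_xid);
        abort();   /* stands in for LBUG() in this model */
    }
}

int main(void)
{
    struct model_request ok  = { .rq_xid = 101, .rq_flags = MSG_RESENT };
    struct model_request bad = { .rq_xid = 100, .rq_flags = 0 };

    model_mdt_recovery_check(&ok, 101);   /* accepted: flagged as resent */
    model_mdt_recovery_check(&bad, 100);  /* aborts, as the MDS did here */
    return 0;
}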



 Comments   
Comment by Peter Jones [ 04/Aug/15 ]

Mike

Could you please advise here?

Thanks

Peter

Comment by Oleg Drokin [ 04/Aug/15 ]

So this is another case of "client sent us something that we don't understand, let's panic", I guess.
We need to drop this LBUG at the very least as the first step and avoid this crash.
Though I am not sure how the condition might arise at all.

Also, what do you mean you cannot extract bulk logs? Is crash + the module to extract the logs not working?

Comment by Christopher Morrone [ 04/Aug/15 ]

He mostly meant that those things cannot be shared outside of the lab.

Comment by Mikhail Pershin [ 06/Aug/15 ]

Olaf, is it possible to find more information about the request with that XID in the logs? Especially in the client log, where the request was sent from. We have two possibilities here: the request flag (RESENT or REPLAY) was dropped somehow, or the XID was assigned improperly. The client log may help to find out whether this was a resent case or a normal request.
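To make the first possibility concrete, here is a small hypothetical sketch (a local model, not the actual ptlrpc client code) of the resend path it assumes: on resend the original XID is reused and the message is tagged RESENT, which is what lets the server accept a request whose XID matches last_xid. If that tagging were lost, or an XID were reused without it, the server-side check would see exactly the state reported in this ticket: a matching XID with flags == 0.

#include <stdint.h>
#include <stdio.h>

#define MSG_RESENT 0x0002   /* modeled flag bit, illustrative only */

struct model_request {
    uint64_t rq_xid;    /* kept unchanged when the request is resent */
    uint32_t rq_flags;
};

/* Hypothetical resend helper: reuse the XID, but mark the message RESENT. */
static void model_resend(struct model_request *req)
{
    req->rq_flags |= MSG_RESENT;
}

int main(void)
{
    struct model_request req = { .rq_xid = 100, .rq_flags = 0 };
    model_resend(&req);
    printf("xid=%llu flags=%#x\n",
           (unsigned long long)req.rq_xid, req.rq_flags);
    return 0;
}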

Comment by Olaf Faaland [ 06/Aug/15 ]

Mikhail, that XID doesn't appear in the client's console log. The client with NID 172.20.15.14@o2ib500 logged nothing but the expected "lost connection" and "connection restored" messages during the hour leading up to the lbug. Unfortunately we have no additional information from the client.

I'm extracting the lustre debug logs from the crash dump and I'll check for that XID and post anything I find.

Comment by Olaf Faaland [ 07/Aug/15 ]

Debug logs from the MDS crash dump had nothing obviously of interest. The XID in question did not occur in any message other than the one given above. The rest of the log messages are variations on "request timed out", "request took too long to process", etc.

Comment by Mikhail Pershin [ 05/Oct/15 ]

Were there other occurrences of this issue? There is not enough information to solve it; if it is happening regularly, then it is possible to add more debugging.

Comment by Olaf Faaland [ 05/Oct/15 ]

Mikhail,

This has not occurred again, so go ahead and close it. If it happens again we can reopen it.

thanks,
Olaf

Comment by Peter Jones [ 05/Oct/15 ]

ok thanks
