[LU-6953] LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG Created: 04/Aug/15 Updated: 10/Oct/15 Resolved: 05/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Mikhail Pershin |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64 |
||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
grove-mds1 crashed 2015-07-29 with the following LBUG:

2015-07-29 03:05:17 LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG
2015-07-29 03:05:17 Call Trace:
  [<ffffffffa07b28f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
  [<ffffffffa07b2ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
  [<ffffffffa0fcf9d8>] mdt_handle_common+0x13d8/0x1470 [mdt]
  [<ffffffffa100b625>] mds_regular_handle+0x15/0x20 [mdt]
  [<ffffffffa0b05095>] ptlrpc_server_handle_request+0x305/0xc00 [ptlrpc]
  [<ffffffffa07b352e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
  [<ffffffffa07c4845>] ? lc_watchdog_touch+0x65/0x170 [libcfs]

It was preceded by a ptlrpc debug message:

2015-07-29 03:05:17 Lustre: 50126:0:(mdt_handler.c:4508:mdt_recovery()) @@@ rq_xid 15027...0684 matches last_xid, expected REPLAY or RESENT flag (0) req@ffff...d1400 x15027...0684/t0(0) o101->28e0...cc83@172.20.15.14@o2ib500:0/0 lens 4616/0 e 0 to 0 dl 1438165072 ref 1 fl Interpret:/0/ffffffff rc 0/-1

For this system I cannot extract bulk logs and add them to the ticket. We do have a crash dump and console logs, so I can obtain specific information that would help.

The MDS was under severe memory pressure at the time of the LBUG and was responding very slowly. At 03:05:03 it appears to have dropped 84,316 timed-out requests (output from one DEBUG_REQ() call from within ptlrpc_server_handle_request() appears in the console log, followed by "Skipped 84315 previous similar messages"). |
| Comments |
| Comment by Peter Jones [ 04/Aug/15 ] |
|
Mike Could you please advise here? Thanks Peter |
| Comment by Oleg Drokin [ 04/Aug/15 ] |
|
So this is another case of "client sent us something that we don't understand, let's panic", I guess. Also, what do you mean you cannot extract bulk logs? Is crash plus the module for extracting the logs not working? |
| Comment by Christopher Morrone [ 04/Aug/15 ] |
|
He mostly meant that those things cannot be shared outside of the lab. |
| Comment by Mikhail Pershin [ 06/Aug/15 ] |
|
Olaf, is it possible to find more information about the request with that XID in the logs, especially in the client log, where the request was sent from? We have two possibilities here: either the request flag (RESENT or REPLAY) was somehow dropped, or the XID was assigned improperly. The client log may help determine whether this was a resend or a normal request. |
| Comment by Olaf Faaland [ 06/Aug/15 ] |
|
Mikhail, that XID doesn't appear in the client's console log. The client with NID 172.20.15.14@o2ib500 logged nothing but the expected "lost connection" and "connection restored" messages during the hour leading up to the LBUG. Unfortunately we have no additional information from the client. I'm extracting the Lustre debug logs from the crash dump and I'll check for that XID and post anything I find. |
| Comment by Olaf Faaland [ 07/Aug/15 ] |
|
Debug logs from the MDS crash dump had nothing obviously of interest. The XID in question did not occur in any message other than the one given above. The rest of the log messages are variations on "request timed out", "request took too long to process", etc. |
| Comment by Mikhail Pershin [ 05/Oct/15 ] |
|
Were there any other occurrences of this issue? There is not enough information to solve it; if it is happening regularly, then it is possible to add more debugging. |
| Comment by Olaf Faaland [ 05/Oct/15 ] |
|
Mikhail, this has not occurred again, so go ahead and close it. If it happens again we can reopen it. Thanks, |
| Comment by Peter Jones [ 05/Oct/15 ] |
|
ok thanks |