Details
Bug
Resolution: Cannot Reproduce
Major
None
None
lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
3
9223372036854775807
Description
grove-mds1 crashed 2015-07-29 with the following LBUG:
2015-07-29 03:05:17 LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG
2015-07-29 03:05:17 Call Trace:
2015-07-29 03:05:17 [<ffffffffa07b28f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2015-07-29 03:05:17 Jul 29 03:05:17 [<ffffffffa07b2ef7>] lbug_with_loc+0x47/0xb0 [libcfs]
2015-07-29 03:05:17 grove-mds1 kerne [<ffffffffa0fcf9d8>] mdt_handle_common+0x13d8/0x1470 [mdt]
2015-07-29 03:05:17 l: LustreError: [<ffffffffa100b625>] mds_regular_handle+0x15/0x20 [mdt]
2015-07-29 03:05:17 50126:0:(mdt_han [<ffffffffa0b05095>] ptlrpc_server_handle_request+0x305/0xc00 [ptlrpc]
2015-07-29 03:05:17 dler.c:3409:mdt_ [<ffffffffa07b352e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2015-07-29 03:05:17 recovery()) LBUG [<ffffffffa07c4845>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
It was preceded by a ptlrpc debug message:
2015-07-29 03:05:17 Lustre:50126:0:(mdt_handler.c:4508:mdt_recovery()) @@@ rq_xid 15027...0684 matches last_xid, expected REPLAY or RESENT flag (0) req@ffff...d1400 x15027...0684/t0(0) o101->28e0...cc83@172.20.15.14@o2ib500:0/0 lens 4616/0 e 0 to 0 dl 1438165072 ref 1 fl Interpret:/0/ffffffff rc 0/-1
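The debug message describes a sanity check: a request whose xid equals the last xid recorded for that client is expected to carry a RESENT or REPLAY flag, and here the flags field was 0. A minimal Python model of that condition is sketched below. The function name and the flag bit values are assumptions made for illustration and are not taken from the Lustre source.

```python
# Illustrative flag bits (assumed); the values in the Lustre headers may differ.
MSG_RESENT = 0x2
MSG_REPLAY = 0x4


def xid_check_passes(rq_xid: int, last_xid: int, flags: int) -> bool:
    """Return False for the condition reported in the debug message.

    A request that reuses the last recorded xid is expected to be a resend
    or a replay, so at least one of the two flags should be set.
    """
    if rq_xid != last_xid:
        return True  # a different xid is not covered by this check
    return bool(flags & (MSG_RESENT | MSG_REPLAY))
```

With matching xids and flags equal to 0, as in the message above, this model returns False, which corresponds to the case that was logged just before the LBUG.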
For this system, I cannot extract bulk logs and attach them to the ticket. We do have a crash dump and console logs, so I can obtain specific information that would help.
The MDS was under severe memory pressure at the time of the LBUG.
The MDS was responding very slowly at the time. At 03:05:03 it appears to have dropped 84,316 timed-out requests: output from one DEBUG_REQ() call within ptlrpc_server_handle_request() appears in the console log, followed by "Skipped 84315 previous similar messages".
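The figure of 84,316 is the one printed message plus the 84,315 suppressed repeats. A small Python sketch of that tally follows, assuming the caller has already filtered the console log down to the lines for this one message. The function name and the sample lines are hypothetical and are not excerpts from the actual log.

```python
import re

# Matches the rate-limiting summary line that follows a repeated message.
SKIPPED_RE = re.compile(r"Skipped (\d+) previous similar messages")


def total_messages(console_lines):
    """Count each printed line once, plus every suppressed repeat."""
    total = 0
    for line in console_lines:
        match = SKIPPED_RE.search(line)
        if match:
            total += int(match.group(1))  # repeats that were not printed
        else:
            total += 1  # a message that was actually printed
    return total


# Hypothetical stand-ins for the two relevant console lines.
sample = [
    "<the one printed DEBUG_REQ line>",
    "Lustre: Skipped 84315 previous similar messages",
]
print(total_messages(sample))  # 84316
```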
Were there other occurrences of this issue? There is not enough information to resolve it. If it is happening regularly, we can add more debugging output.