Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
None
-
lustre-2.5.4-4chaos_2.6.32_504.16.2.1chaos.ch5.3.x86_64.x86_64
-
3
-
9223372036854775807
Description
grove-mds1 crashed 2015-07-29 with the following LBUG:
2015-07-29 03:05:17 LustreError: 50126:0:(mdt_handler.c:3409:mdt_recovery()) LBUG 2015-07-29 03:05:17 Call Trace: 2015-07-29 03:05:17 [<ffffffffa07b28f5>] libcfs_debug dumpstack+0x55/0x80 [libcfs] 2015-07-29 03:05:17 Jul 29 03:05:17 [<ffffffffa07b2ef7>] lbug_with_loc+0x47/0xb0 [libcfs] 2015-07-29 03:05:17 grove-mds1 kerne [<ffffffffa0fcf9d8>] mdt_handle_common+0x13d8/0x1470 [mdt] 2015-07-29 03:05:17 l: LustreError: [<ffffffffa100b625>] mds_regular_handle+0x15/0x20 [mdt] 2015-07-29 03:05:17 50126:0:(mdt_han [<ffffffffa0b05095>] ptlrpc_server_handle_request+0x305/0xc00 [ptlrpc] 2015-07-29 03:05:17 dler.c:3409:mdt_ [<ffffffffa07b352e>] ? cfs_timer_arm+0xe/0x10 [libcfs] 2015-07-29 03:05:17 recovery()) LBUG [<ffffffffa07c4845>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
It was preceded by a ptlrpc debug message
2015-07-29 03:05:17 Lustre:50126:0:(mdt_handler.c:4508:mdt_recovery()) @@@ rq_xid 15027...0684 matches last_xid, expected REPLAY or RESENT flag (0) req@ffff...d1400 x15027...0684/t0(0) o101->28e0...cc83@172.20.15.14@o2ib500:0/0 lens 4616/0 e 0 to 0 dl 1438165072 ref 1 fl Interpret:/0/ffffffff rc 0/-1
For this system, I cannot extract bulk logs and add them to the ticket. We do we have a crash dump and console logs, I can obtain specific information that would help.
The mds was under severe memory pressure at the time of the lbug.
The MDS was responding very slowly at the time. At 3:05:03 it appears to have dropped 84,316 timed out requests (output from one DEBUG_REQ() call from within ptlrpc_server_handle_request() appears in the console log, followed by Skipped 84315 previous similar messages).
ok thanks