Details
- Bug
- Resolution: Duplicate
- Critical
- None
- Lustre 2.5.3
- None
- 2
- 17312
Description
This morning we hit this LBUG twice within a few hours, and each time it crashed the MDS. After the first reboot we had to abort recovery to bring Lustre back. We have a crash dump from the MDS.
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.805235] LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 375s: evicting client at 4966@gni100 ns: mdt-atlas1-MDT0000_UUID lock: ffff881ec6e16c80/0xfc6e8aed747d1af2 lrc: 4/0,0 mode: CR/CR res: [0x2001a597a:0x85:0x0].0 bits 0x2 rrc: 4 type: IBT flags: 0x60200000000020 nid: 4966@gni100 remote: 0x20ee476ee499c158 expref: 132 pid: 16827 timeout: 4301930544 lvb_type: 0
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.858358] LustreError: 16827:0:(mdt_handler.c:4078:mdt_intent_reint()) ASSERTION( rc == 0 ) failed: Error occurred but lock handle is still in use, rc = -116
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.874660] LustreError: 16827:0:(mdt_handler.c:4078:mdt_intent_reint()) LBUG
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.882757] Pid: 16827, comm: mdt00_224
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.887151]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.887152] Call Trace:
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.891770] [<ffffffffa0407895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.899670] [<ffffffffa0407e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.906710] [<ffffffffa0d4379a>] mdt_intent_reint+0x51a/0x520 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.913933] [<ffffffffa0d40c4e>] mdt_intent_policy+0x3ae/0x770 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.921281] [<ffffffffa06de2e5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.928910] [<ffffffffa0707d0b>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.936903] [<ffffffff81069f75>] ? enqueue_entity+0x125/0x450
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.943544] [<ffffffffa0d41116>] mdt_enqueue+0x46/0xe0 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.950094] [<ffffffffa0d4602a>] mdt_handle_common+0x52a/0x1470 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.957515] [<ffffffffa0d833e5>] mds_regular_handle+0x15/0x20 [mdt]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.964770] [<ffffffffa0737fe5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.973547] [<ffffffffa04084ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.980677] [<ffffffffa04193cf>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.988407] [<ffffffffa072f6c9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7271.996116] [<ffffffff810546b9>] ? __wake_up_common+0x59/0x90
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.002774] [<ffffffffa073934d>] ptlrpc_main+0xaed/0x1760 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.009920] [<ffffffffa0738860>] ? ptlrpc_main+0x0/0x1760 [ptlrpc]
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.017040] [<ffffffff8109ab56>] kthread+0x96/0xa0
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.022607] [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.028267] [<ffffffff8109aac0>] ? kthread+0x0/0xa0
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.033930] [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 1 10:03:15 atlas-mds1.ccs.ornl.gov kernel: [ 7272.039782]
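The rc value in an assertion message like the one above is a negated Linux errno code. A minimal Python sketch for decoding it follows, assuming it is run on Linux (errno numbering is platform-specific). `decode_rc` is a hypothetical helper name used for illustration; it is not part of Lustre.

```python
import errno
import os
import re

# Matches the trailing "rc = -N" in a Lustre assertion message.
RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(message):
    """Return (rc, errno_name, description) for the rc in a message, or None."""
    match = RC_PATTERN.search(message)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)  # kernel code returns errno values negated
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


# Sample input: the assertion text from this report (illustrative copy).
line = ("mdt_intent_reint()) ASSERTION( rc == 0 ) failed: "
        "Error occurred but lock handle is still in use, rc = -116")
print(decode_rc(line))
```

On Linux, errno 116 is ESTALE and errno 2 (the rc reported in LU-5934) is ENOENT, so the two reports hit the same assertion with different underlying errors.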
Issue Links
- is related to: LU-5934 mdt_intent_reint()) ASSERTION( rc == 0 ) failed: Error occurred but lock handle is still in use, rc = -2 (Resolved)