Description
I hit this on an idle system with Lustre mounted using llmount.sh, with 2 MDTs and 3 client mount points. I had used it for some HSM testing and ran racer, then left it idle. The last test marker in the log was:
Lustre: DEBUG MARKER: == racer test complete, duration 309 sec == 11:33:46 (1375979626)
About three hours after racer finished, MDT1 was spontaneously evicted from MDT0 and I saw the following:
Lustre: 24786:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375990283/real 1375990283] req@ffff88017464a800 x1442809754504504/t0(0) o400->lustre-MDT0000-osp-MDT0001@0@lo:24/10 lens 224/224 e 0 to 1 dl 1375990290 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-MDT0000: Client lustre-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
Lustre: lustre-MDT0000-osp-MDT0001: Connection restored to lustre-MDT0000 (at 0@lo)
Lustre: 24787:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375990288/real 1375990288] req@ffff880171622800 x1442809754504596/t0(0) o400->lustre-MDT0000-osp-MDT0001@0@lo:24/10 lens 224/224 e 0 to 1 dl 1375990295 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 24788:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375990295/real 1375990295] req@ffff880169776000 x1442809754504692/t0(0) o400->lustre-MDT0000-osp-MDT0001@0@lo:24/10 lens 224/224 e 0 to 1 dl 1375990302 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: lustre-MDT0000-osp-MDT0001: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
Lustre: lustre-MDT0000: Client lustre-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
LustreError: 26413:0:(mdt_handler.c:3176:mdt_tgt_connect()) ASSERTION( mti != ((void *)0) ) failed:
LustreError: 26413:0:(mdt_handler.c:3176:mdt_tgt_connect()) LBUG
Pid: 26413, comm: ll_ost_out01_00
Call Trace:
[<ffffffffa0ca1895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0ca1e97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0ad5955>] mdt_tgt_connect+0x515/0x550 [mdt]
[<ffffffffa06379fd>] tgt_request_handle+0x57d/0xe30 [ptlrpc]
[<ffffffffa05f4638>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
[<ffffffffa0ca254e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[<ffffffffa0cb3a6f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
[<ffffffffa05eba49>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
[<ffffffff81055ab3>] ? __wake_up+0x53/0x70
[<ffffffffa05f59bd>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
[<ffffffffa05f4f00>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
[<ffffffff81096936>] kthread+0x96/0xa0
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffff810968a0>] ? kthread+0x0/0xa0
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
I was ahead of master by two xattr patches and two small HSM patches, but I suspect that they are not the issue.
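As a sanity check on the timeline, the epoch timestamps in the log excerpts above can be compared directly. The sketch below is illustrative only: the helper name `parse_timeouts` and the regular expression are assumptions made for this example, not part of any Lustre tooling, and the log lines are copied from this report.

```python
import re

# Epoch seconds from the "racer test complete" DEBUG MARKER line in this report.
RACER_DONE = 1375979626

# The three ptlrpc_expire_one_request() lines, copied from the log above.
LOG_LINES = [
    "Lustre: 24786:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375990283/real 1375990283] req@ffff88017464a800 x1442809754504504/t0(0) o400->lustre-MDT0000-osp-MDT0001@0@lo:24/10 lens 224/224 e 0 to 1 dl 1375990290 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1",
    "Lustre: 24787:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375990288/real 1375990288] req@ffff880171622800 x1442809754504596/t0(0) o400->lustre-MDT0000-osp-MDT0001@0@lo:24/10 lens 224/224 e 0 to 1 dl 1375990295 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1",
    "Lustre: 24788:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1375990295/real 1375990295] req@ffff880169776000 x1442809754504692/t0(0) o400->lustre-MDT0000-osp-MDT0001@0@lo:24/10 lens 224/224 e 0 to 1 dl 1375990302 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1",
]

# Hypothetical pattern for this example: capture the "sent" time and the "dl" (deadline) field.
PATTERN = re.compile(r"\[sent (\d+)/real \d+\].*?\bdl (\d+)\b")


def parse_timeouts(lines):
    """Return a list of (sent, deadline) epoch pairs, one per matching line."""
    pairs = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


timeouts = parse_timeouts(LOG_LINES)

# Idle time between the end of racer and the first timed-out request.
idle_seconds = timeouts[0][0] - RACER_DONE

# Gap between send time and deadline for each timed-out request.
windows = [deadline - sent for sent, deadline in timeouts]

print(f"idle before first timeout: {idle_seconds} s ({idle_seconds / 3600:.2f} h)")
print(f"send-to-deadline windows (s): {windows}")
```

With the values from this report, the first timed-out request was sent 10657 seconds (just under three hours) after racer completed, and each of the three requests had a 7-second gap between its send time and its deadline.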
Issue Links
- duplicates LU-3751 disable OUT_PORTAL on OST for now (Resolved)