Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.8.0
-
3
-
9223372036854775807
Description
During a 24-hour failover test, I found that one MDT was evicted due to a blocking AST timeout.
Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1221 DURATION=86400 PERIOD=600
Lustre: DEBUG MARKER: Wait mds7 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: Starting failover on mds7
Lustre: Failing over lustre-MDT0006
Lustre: 11310:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0] req@ffff880806d34080 x1506706376470312/t0(0) o104->lustre-MDT0006@192.168.2.126@o2ib:15/16 lens 296/224 e 0 to 1 dl 0 ref 1 fl Rpc:EX/0/ffffffff rc -19/-1
Lustre: 11310:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LustreError: 11310:0:(ldlm_lockd.c:668:ldlm_handle_ast_error()) ### client (nid 192.168.2.126@o2ib) failed to reply to blocking AST (req status 0 rc -19), evict it ns: mdt-lustre-MDT0006_UUID lock: ffff88080794e700/0xfe1fd5ccdec21d1a lrc: 4/0,0 mode: PW/PW res: [0x380000405:0x15e0:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x60000000000020 nid: 192.168.2.126@o2ib remote: 0x95ff49805a265b90 expref: 1607 pid: 11310 timeout: 4297597422 lvb_type: 0
LustreError: 138-a: lustre-MDT0006: A client on nid 192.168.2.126@o2ib was evicted due to a lock blocking callback time out: rc -19
Lustre: lustre-MDT0006: Not available for connect from 192.168.2.126@o2ib (stopping)
LustreError: 3551:0:(client.c:1142:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880fe21853c0 x1506706376470568/t0(0) o1000->lustre-MDT0000-osp-MDT0006@192.168.2.125@o2ib:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
The cause is that MDT6 was restarting, so it could not respond to the blocking AST request in time and was evicted. This again raises the question of whether we should evict an MDT during recovery. In any case, I will prepare a patch and merge it into http://review.whamcloud.com/#/c/13224/
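
For anyone triaging similar failover logs, the eviction can be picked out mechanically. The following is a minimal sketch in Python, not part of Lustre or its test suite: the function name `find_evictions` is hypothetical, and the regular expression assumes the `LustreError: 138-a` message has exactly the wording shown in the log above, which may differ in other Lustre versions. It also maps the return code to its errno name; rc -19 corresponds to ENODEV on Linux.

```python
import errno
import re

# Assumption: the 138-a eviction message matches the wording seen in this ticket.
EVICTION_RE = re.compile(
    r"LustreError: 138-a: (?P<target>\S+): A client on nid (?P<nid>\S+) "
    r"was evicted due to a lock blocking callback time out: rc (?P<rc>-?\d+)"
)


def find_evictions(log_text):
    """Return one dict per blocking-callback eviction found in log_text."""
    # Collapse all whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    events = []
    for match in EVICTION_RE.finditer(flat):
        rc = int(match.group("rc"))
        events.append({
            "target": match.group("target"),
            "nid": match.group("nid"),
            "rc": rc,
            # Kernel return codes are negated errno values.
            "errno_name": errno.errorcode.get(-rc, "unknown"),
        })
    return events


if __name__ == "__main__":
    sample = (
        "LustreError: 138-a: lustre-MDT0006: A client on nid "
        "192.168.2.126@o2ib was evicted due\n"
        "to a lock blocking callback time out: rc -19"
    )
    for event in find_evictions(sample):
        print(event)
```

Whitespace is collapsed before matching because console logs are often wrapped mid-message, as the eviction line was in the original paste of this log.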