[LU-6852] MDS is evicted during 24-hour failover. Created: 14/Jul/15  Updated: 15/Oct/15  Resolved: 15/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Di Wang Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: dne2

Issue Links:
Blocker
is blocking LU-6773 DNE2 Failover and recovery soak testing Closed
Related
is related to LU-6831 The ticket for tracking all DNE2 bugs Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During a 24-hour failover test, I found that one MDT was evicted because a blocking AST timed out.

Lustre: DEBUG MARKER: ==== Checking the clients loads BEFORE failover -- failure NOT OK ELAPSED=1221 DURATION=86400 PERIOD=600
Lustre: DEBUG MARKER: Wait mds7 recovery complete before doing next failover...
Lustre: DEBUG MARKER: Checking clients are in FULL state before doing next failover...
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: mdc.lustre-MDT0006-mdc-*.mds_server_uuid in FULL state after 0 sec
Lustre: DEBUG MARKER: Starting failover on mds7
Lustre: Failing over lustre-MDT0006
Lustre: 11310:0:(client.c:2018:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 0/real 0]  req@ffff880806d34080 x1506706376470312/t0(0) o104->lustre-MDT0006@192.168.2.126@o2ib:15/16 lens 296/224 e 0 to 1 dl 0 ref 1 fl Rpc:EX/0/ffffffff rc -19/-1
Lustre: 11310:0:(client.c:2018:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
LustreError: 11310:0:(ldlm_lockd.c:668:ldlm_handle_ast_error()) ### client (nid 192.168.2.126@o2ib) failed to reply to blocking AST (req status 0 rc -19), evict it ns: mdt-lustre-MDT0006_UUID lock: ffff88080794e700/0xfe1fd5ccdec21d1a lrc: 4/0,0 mode: PW/PW res: [0x380000405:0x15e0:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x60000000000020 nid: 192.168.2.126@o2ib remote: 0x95ff49805a265b90 expref: 1607 pid: 11310 timeout: 4297597422 lvb_type: 0
LustreError: 138-a: lustre-MDT0006: A client on nid 192.168.2.126@o2ib was evicted due to a lock blocking callback time out: rc -19
Lustre: lustre-MDT0006: Not available for connect from 192.168.2.126@o2ib (stopping)
LustreError: 3551:0:(client.c:1142:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880fe21853c0 x1506706376470568/t0(0) o1000->lustre-MDT0000-osp-MDT0006@192.168.2.125@o2ib:24/4 lens 248/16608 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1

The cause is that MDT6 was restarting, so it could not respond to the blocking AST request in time and was evicted. This again raises the question of whether we should evict an MDT during recovery. In any case, I will prepare a patch and merge it into http://review.whamcloud.com/#/c/13224/
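
To make the question concrete, here is a minimal sketch of the policy being discussed: when a blocking AST gets no reply, an ordinary client is evicted, but a peer that is another MDT (an MDS-MDS connection) is kept, since it may simply be restarting or in recovery. This is not the code from the patch. The names PeerKind, Export and handle_blocking_ast_error are hypothetical and exist only for illustration, and the real logic lives in the C ldlm code.

```python
from dataclasses import dataclass
from enum import Enum


class PeerKind(Enum):
    """Hypothetical classification of the peer holding the lock."""
    CLIENT = "client"  # ordinary Lustre client
    MDT = "mdt"        # another metadata target (MDS-MDS connection)


@dataclass
class Export:
    """Hypothetical stand-in for the server-side record of a connected peer."""
    nid: str
    kind: PeerKind
    evicted: bool = False


def handle_blocking_ast_error(export: Export, rc: int) -> str:
    """Decide what to do when a blocking AST fails with error code rc.

    Assumed policy (illustration only): never evict an MDS-MDS
    connection; evict any other peer that failed to reply.
    """
    if export.kind is PeerKind.MDT:
        # The peer MDT may be restarting or recovering; keep the connection.
        return "keep"
    export.evicted = True
    return "evict"


# rc -19 matches the value reported in the log above.
peer_mdt = Export(nid="192.168.2.126@o2ib", kind=PeerKind.MDT)
plain_client = Export(nid="192.168.2.200@o2ib", kind=PeerKind.CLIENT)  # made-up NID
print(handle_blocking_ast_error(peer_mdt, -19))
print(handle_blocking_ast_error(plain_client, -19))
```

In this sketch the decision depends only on the kind of peer, not on the error code; whether the real fix also needs to consider the recovery state is left open here.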



 Comments   
Comment by Di Wang [ 14/Jul/15 ]

http://review.whamcloud.com/#/c/13224/

Comment by Gerrit Updater [ 15/Oct/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13224/
Subject: LU-6852 ldlm: Do not evict MDS-MDS connection
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bee9c1897677473f12c0b807edd3e8fec452bc32

Comment by Joseph Gmitter (Inactive) [ 15/Oct/15 ]

Landed for 2.8
