Details
Type: Bug
Resolution: Unresolved
Priority: Minor
Affects Versions: Lustre 2.12.1, Lustre 2.15.3
Severity: 3
Description
recovery-mds-scale test_failover_mds fails frequently with errors like the one seen here: 'test_failover_mds returned 1'. Comparing the logs for a recent failure, at https://testing.whamcloud.com/test_sets/a2fd473a-6632-11e9-bd0e-52540065bddc, to those of a successful failover_mds run, we see some differences, such as the client being evicted by the OSS.
Looking at the client 2 (vm3) console log for a recent failure of recovery-mds-scale test_failover_mds, we fail over an MDS (mds1), get evicted by the MGS, and check client loads after the failover. So far, this is all expected. Normally, we would then wait for the MDS to recover, check client loads again, and fail over another MDS. In this test session, however, the client is also evicted by an OSS:
[ 812.471791] Lustre: DEBUG MARKER: mds1 has failed over 1 times, and counting...
[ 815.102166] Lustre: lustre-MDT0000-mdc-ffff9239b97b8000: Connection restored to 10.9.5.180@tcp (at 10.9.5.180@tcp)
[ 1171.815128] Lustre: 15182:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1555978912/real 0] req@ffff92398360ed00 x1631561684739616/t0(0) o400->lustre-OST0000-osc-ffff9239b97b8000@10.9.5.177@tcp:28/4 lens 224/224 e 0 to 1 dl 1555978968 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1171.822871] Lustre: lustre-OST0000-osc-ffff9239b97b8000: Connection to lustre-OST0000 (at 10.9.5.177@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1181.727091] Lustre: 15182:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1555978922/real 0] req@ffff92397c8a7600 x1631561684739904/t0(0) o400->lustre-OST0000-osc-ffff9239b97b8000@10.9.5.177@tcp:28/4 lens 224/224 e 0 to 1 dl 1555978978 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1181.732590] Lustre: 15182:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 28 previous similar messages
[ 1191.529092] Lustre: 15182:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1555978932/real 0] req@ffff9239a4104d80 x1631561684740192/t0(0) o400->lustre-OST0000-osc-ffff9239b97b8000@10.9.5.177@tcp:28/4 lens 224/224 e 0 to 1 dl 1555978988 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
[ 1191.536273] Lustre: 15182:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 22 previous similar messages
[ 1199.992506] LustreError: 15176:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff9239969d9e00
…
[ 1200.009185] LustreError: 15176:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff9239a6f4a600
[ 1201.210952] LustreError: 11-0: lustre-OST0006-osc-ffff9239b97b8000: operation ost_write to node 10.9.5.177@tcp failed: rc = -107
[ 1206.885860] Lustre: Evicted from lustre-OST0000_UUID (at 10.9.5.177@tcp) after server handle changed from 0x930d901ede4acf37 to 0x930d901ede4b4780
[ 1206.887997] LustreError: 167-0: lustre-OST0000-osc-ffff9239b97b8000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[ 1206.898027] Lustre: 15182:0:(llite_lib.c:2738:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.9.5.179@tcp:10.9.5.180@tcp:/lustre/fid: [0x200000401:0x4:0x0]// may get corrupted (rc -5)
[ 1207.024933] Lustre: lustre-OST0001-osc-ffff9239b97b8000: Connection restored to 10.9.5.177@tcp (at 10.9.5.177@tcp)
[ 1207.029835] LustreError: 24254:0:(ldlm_resource.c:1146:ldlm_resource_complain()) lustre-OST0005-osc-ffff9239b97b8000: namespace resource [0x85:0x0:0x0].0x0 (ffff92399f2df900) refcount nonzero (1) after lock cleanup; forcing cleanup.
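For context, the expected per-iteration behavior described above looks roughly like the sketch below. This is hedged pseudocode, not the actual recovery-mds-scale.sh loop; the helper names facet_failover, wait_recovery_complete, and check_client_loads are modeled on lustre/tests/test-framework.sh and are assumptions here:

#!/bin/bash
# Sketch of one expected failover cycle in test_failover_mds (assumption,
# not the real test code; see lustre/tests/recovery-mds-scale.sh).
DURATION=${DURATION:-86400}        # how long to keep cycling failovers
start=$SECONDS
while (( SECONDS - start < DURATION )); do
    check_client_loads $NODES_TO_USE || break   # dd/tar loads still healthy?
    facet_failover mds1                         # fail the MDS over to its partner
    wait_recovery_complete mds1                 # wait for MDT0000 recovery to finish
    check_client_loads $NODES_TO_USE || break   # loads must survive the failover
done

Nothing in this sequence should disturb the OSS connections, which is why the OST eviction above stands out.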
This OST eviction might explain why client 2 has an error in the run_dd_debug log, which leads to the test failure:
+ df /mnt/lustre/d0.dd-trevis-34vm3.trevis.whamcloud.com
+ dd bs=4k count=730037 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-34vm3.trevis.whamcloud.com/dd-file
dd: error writing ‘/mnt/lustre/d0.dd-trevis-34vm3.trevis.whamcloud.com/dd-file’: Cannot send after transport endpoint shutdown
449245+0 records in
449244+0 records out
+ '[' 1 -eq 0 ']'
++ date '+%F %H:%M:%S'
+ echoerr '2019-04-23 00:23:23: dd failed'
+ echo '2019-04-23 00:23:23: dd failed'
2019-04-23 00:23:23: dd failed
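For reference, dd's "Cannot send after transport endpoint shutdown" is the strerror text for ESHUTDOWN (errno 108), which is consistent with writes issued after the eviction shown above:

$ grep -w ESHUTDOWN /usr/include/asm-generic/errno.h
#define ESHUTDOWN       108     /* Cannot send after transport endpoint shutdown */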
In the OSS console log, we see:
[ 447.770074] LustreError: 167-0: lustre-MDT0000-lwp-OST0000: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
I haven’t found another instance of this type of failure, but there are many failover_mds failures to review.
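If it helps triage, a quick scan like the one below could flag other sessions with the same signature. The directory layout here is hypothetical; it assumes the client console logs from the failed sessions have been downloaded locally:

# Look for OST evictions in downloaded client console logs (paths are hypothetical).
grep -rl 'This client was evicted by lustre-OST' sessions/*/client*console*.log |
while read -r log; do
    echo "== $log"
    grep -m 3 'evicted' "$log"   # show the first few eviction messages per log
done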