Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version/s: Lustre 2.12.3
- Severity: 3
Description
recovery-mds-scale test_failover_ost fails with 'test_failover_ost returned 1' due to a client dd failure
+ echoerr 'Total free disk space is 9057280, 4k blocks to dd is 1018944'
+ echo 'Total free disk space is 9057280, 4k blocks to dd is 1018944'
Total free disk space is 9057280, 4k blocks to dd is 1018944
+ df /mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com
+ dd bs=4k count=1018944 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com/dd-file
dd: closing output file ‘/mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com/dd-file’: Input/output error
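For reference, the dd size is derived from the free space df reports for the test directory: 1018944 4k blocks is exactly 45% of the 9057280 KB reported free (9057280 * 9 / 80 = 1018944). A minimal bash sketch of that sizing calculation, assuming a df-based computation; the variable names and the 9/80 factor here are inferred from the logged numbers, not copied from the actual run_dd.sh:

#!/bin/bash
# Hypothetical sketch of the client-load dd sizing (not the real run_dd.sh).
TESTDIR=/mnt/lustre/d0.dd-$(hostname)
FREE_KB=$(df -P "$TESTDIR" | awk 'NR==2 {print $4}')   # free space in 1K blocks
BLOCKS=$((FREE_KB * 9 / 80))                           # 45% of free space, expressed in 4K blocks
echo "Total free disk space is $FREE_KB, 4k blocks to dd is $BLOCKS"
dd bs=4k count=$BLOCKS status=noxfer if=/dev/zero of="$TESTDIR/dd-file"

With FREE_KB=9057280 this reproduces the logged count of 1018944, i.e. the dd should comfortably fit in the available space, so the EIO is not a simple out-of-space miscalculation on the client side.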
Looking at the suite_log for the failure at https://testing.whamcloud.com/test_sets/cec5665e-ebfc-11e9-add9-52540065bddc, the last messages we see are:
Client load failed on node trevis-41vm3.trevis.whamcloud.com:
/autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.run__stdout.trevis-41vm3.trevis.whamcloud.com.log
/autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.run__debug.trevis-41vm3.trevis.whamcloud.com.log
2019-10-09 22:38:23 Terminating clients loads ...
Duration: 86400
Server failover period: 1200 seconds
Exited after: 20536 seconds
Number of failovers before exit:
mds1: 0 times
ost1: 3 times
ost2: 1 times
ost3: 2 times
ost4: 2 times
ost5: 4 times
ost6: 4 times
ost7: 2 times
Status: FAIL: rc=1
CMD: trevis-41vm3,trevis-41vm4 test -f /tmp/client-load.pid &&
        { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
trevis-41vm3: sh: line 1: kill: (21671) - No such process
trevis-41vm4: sh: line 1: kill: (1981) - No such process
Dumping lctl log to /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.*.1570660705.log
CMD: trevis-41vm3.trevis.whamcloud.com,trevis-41vm5,trevis-41vm6,trevis-41vm7,trevis-41vm8
        /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.debug_log.\$(hostname -s).1570660705.log;
        dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.dmesg.\$(hostname -s).1570660705.log
trevis-41vm6: invalid parameter 'dump_kernel'
trevis-41vm6: open(dump_kernel) failed: No such file or directory
trevis-41vm8: invalid parameter 'dump_kernel'
trevis-41vm8: open(dump_kernel) failed: No such file or directory
test_failover_ost returned 1
 FAIL failover_ost (21626s)
In dmesg for one of the OSSs (vm5), we see errors around this time:
[ 1239.800591] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
[ 1239.844533] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1241.423507] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost6 has failed over 4 times, and counting...
[ 1241.642352] Lustre: DEBUG MARKER: ost6 has failed over 4 times, and counting...
[ 1244.272791] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1244.790911] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1244.795648] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 7 previous similar messages
[ 1245.884378] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1245.887879] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 24 previous similar messages
[ 1247.888052] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
[ 1247.891749] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 62 previous similar messages
[ 1249.852546] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1251.609811] Lustre: lustre-OST0006: Recovery over after 0:14, of 3 clients 3 recovered and 0 were evicted.
[ 1251.638584] Lustre: lustre-OST0006: deleting orphan objects from 0x0:30580 to 0x0:30657
[ 1251.933830] Lustre: 9281:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0002: extended recovery timer reaching hard limit: 180, extend: 1
[ 1251.936545] Lustre: 9281:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 84 previous similar messages
[ 1252.799744] Lustre: lustre-OST0004: Recovery over after 0:32, of 3 clients 3 recovered and 0 were evicted.
[ 1252.811613] Lustre: lustre-OST0004: deleting orphan objects from 0x0:30578 to 0x0:30657
[ 1255.963626] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
[ 1255.966033] Lustre: lustre-OST0000: disconnecting 1 stale clients
[ 1255.968200] LustreError: 6223:0:(tgt_grant.c:248:tgt_grant_sanity_check()) ofd_obd_disconnect: tot_granted 15728640 != fo_tot_granted 69206016
[ 1255.982021] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (0 recovered, 2 in progress, and 1 evicted) already passed deadline 0:00
[ 1258.899246] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (0 recovered, 2 in progress, and 1 evicted) already passed deadline 0:03
[ 1263.907198] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (2 recovered, 0 in progress, and 1 evicted) already passed deadline 0:08
[ 1265.865287] Lustre: lustre-OST0002: Recovery over after 0:55, of 3 clients 3 recovered and 0 were evicted.
[ 1265.874909] Lustre: lustre-OST0002: deleting orphan objects from 0x0:30610 to 0x0:30689
[ 1266.197943] Lustre: lustre-OST0001: deleting orphan objects from 0x0:30126 to 0x0:30209
[ 1268.576630] Lustre: lustre-OST0000: Recovery over after 1:13, of 3 clients 2 recovered and 1 was evicted.
[ 1268.579517] Lustre: Skipped 1 previous similar message
[ 1268.580940] Lustre: lustre-OST0000: deleting orphan objects from 0x0:30740 to 0x0:30817
[ 1268.750115] Lustre: lustre-OST0003: deleting orphan objects from 0x0:30511 to 0x0:30593
[ 1268.752368] Lustre: lustre-OST0005: deleting orphan objects from 0x0:30479 to 0x0:30561
[ 1268.916992] Lustre: lustre-OST0000: Connection restored to 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp)
[ 1268.920054] Lustre: Skipped 10 previous similar messages
[ 1269.886545] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1271.213350] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1272.178278] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1272.182794] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 1 previous similar message
[ 1273.571547] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1273.575165] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 2 previous similar messages
[ 1275.741885] LustreError: 16684:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1275.744315] LustreError: 16684:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 4 previous similar messages
[ 1279.847093] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1279.850332] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 13 previous similar messages
[ 1288.005172] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1288.009998] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 20 previous similar messages
[ 1304.272729] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1304.275722] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 29 previous similar messages
[ 1309.950454] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1309.953964] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
[ 1336.378100] LustreError: 16822:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1336.382030] LustreError: 16822:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 62 previous similar messages
[ 1395.028643] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1395.030801] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 5 previous similar messages
[ 1400.527325] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
[ 1400.530467] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 102 previous similar messages
[ 1525.162524] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1525.165481] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 10 previous similar messages
[ 1790.173516] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
[ 1790.175909] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
[ 2315.566821] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration: 86400 Server failover period: 1200 seconds Exited after: 20536 seconds Number of failovers before exit: mds1: 0 times ost1: 3 times ost2: 1 times ost3: 2 times ost4: 2 times ost5: 4 times ost6: 4 times
[ 2316.767711] Lustre: DEBUG MARKER: Duration: 86400
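The repeated tgt_grant_check() errors above show a client claiming 1703936 bytes of grant that lustre-OST0000 no longer has a record of after the failover, and the tgt_grant_sanity_check() message (tot_granted 15728640 != fo_tot_granted 69206016) when the stale export is disconnected points the same way. Whether this grant-accounting mismatch is what turned the client dd into an I/O error is not confirmed here, but it is the obvious thing to correlate with the client-side failure time. A quick way to pull just the grant and LNet-health messages out of the OSS dmesg dump; the file name below simply follows the naming pattern from the lctl dump command in the suite_log, with vm5's short hostname filled in:

# Extract grant-accounting and LNet health-recovery errors from the vm5 dmesg dump
grep -E 'tgt_grant_(check|sanity_check)|lnet_peer_ni_add_to_recoveryq_locked' \
    recovery-mds-scale.test_failover_ost.dmesg.trevis-41vm5.1570660705.log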