Lustre / LU-12858

recovery-mds-scale test_failover_ost fails due to dd failure “dd: closing output file ‘/mnt/lustre/*/dd-file’: Input/output error”

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version: Lustre 2.12.3

    Description

      recovery-mds-scale test_failover_ost fails with 'test_failover_ost returned 1' due to a client dd failure

      + echoerr 'Total free disk space is 9057280, 4k blocks to dd is 1018944'
      + echo 'Total free disk space is 9057280, 4k blocks to dd is 1018944'
      Total free disk space is 9057280, 4k blocks to dd is 1018944
      + df /mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com
      + dd bs=4k count=1018944 status=noxfer if=/dev/zero of=/mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com/dd-file
      dd: closing output file ‘/mnt/lustre/d0.dd-trevis-41vm3.trevis.whamcloud.com/dd-file’: Input/output error
      

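      Note that the failed write is well within the reported free space, so the EIO at close does not look like a plain out-of-space condition. A quick arithmetic check, using the figures from the log above (df reports 1K blocks, so the free-space figure is assumed to be in KiB):

      ```python
      # Figures taken verbatim from the log above; this is just arithmetic,
      # not part of the test suite.
      free_kb = 9057280        # "Total free disk space is 9057280" (assumed 1K blocks)
      blocks = 1018944         # dd count, with bs=4k
      written_kb = blocks * 4  # 4 KiB per block

      assert written_kb < free_kb        # the write fits comfortably
      print(written_kb, free_kb, round(written_kb / free_kb, 2))
      ```

      The workload writes 4075776 KiB, exactly 45% of the reported free space, so something other than ENOSPC is returning the I/O error.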
      Looking at the suite_log for the failure at https://testing.whamcloud.com/test_sets/cec5665e-ebfc-11e9-add9-52540065bddc, the last things we see are

      Client load  failed on node trevis-41vm3.trevis.whamcloud.com:
      /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.run__stdout.trevis-41vm3.trevis.whamcloud.com.log
      /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.run__debug.trevis-41vm3.trevis.whamcloud.com.log
      2019-10-09 22:38:23 Terminating clients loads ...
      Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           20536 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 3 times
      ost2: 1 times
      ost3: 2 times
      ost4: 2 times
      ost5: 4 times
      ost6: 4 times
      ost7: 2 times
      Status: FAIL: rc=1
      CMD: trevis-41vm3,trevis-41vm4 test -f /tmp/client-load.pid &&
              { kill -s TERM \$(cat /tmp/client-load.pid); rm -f /tmp/client-load.pid; }
      trevis-41vm3: sh: line 1: kill: (21671) - No such process
      trevis-41vm4: sh: line 1: kill: (1981) - No such process
      Dumping lctl log to /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.*.1570660705.log
      CMD: trevis-41vm3.trevis.whamcloud.com,trevis-41vm5,trevis-41vm6,trevis-41vm7,trevis-41vm8 /usr/sbin/lctl dk > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.debug_log.\$(hostname -s).1570660705.log;
               dmesg > /autotest/autotest2/2019-10-08/lustre-b2_12-el7_6-x86_64--failover--1_15__52___49326b19-b6be-4267-949a-9f21e10c9052/recovery-mds-scale.test_failover_ost.dmesg.\$(hostname -s).1570660705.log
      trevis-41vm6: invalid parameter 'dump_kernel'
      trevis-41vm6: open(dump_kernel) failed: No such file or directory
      trevis-41vm8: invalid parameter 'dump_kernel'
      trevis-41vm8: open(dump_kernel) failed: No such file or directory
      test_failover_ost returned 1
      FAIL failover_ost (21626s)
      

      In dmesg for one of the OSSs (vm5), we see errors around this time:

      [ 1239.800591] Lustre: DEBUG MARKER: ==== Checking the clients loads AFTER failover -- failure NOT OK
      [ 1239.844533] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
      [ 1241.423507] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost6 has failed over 4 times, and counting...
      [ 1241.642352] Lustre: DEBUG MARKER: ost6 has failed over 4 times, and counting...
      [ 1244.272791] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
      [ 1244.790911] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
      [ 1244.795648] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 7 previous similar messages
      [ 1245.884378] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
      [ 1245.887879] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 24 previous similar messages
      [ 1247.888052] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0001: extended recovery timer reaching hard limit: 180, extend: 1
      [ 1247.891749] Lustre: 7752:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 62 previous similar messages
      [ 1249.852546] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
      [ 1251.609811] Lustre: lustre-OST0006: Recovery over after 0:14, of 3 clients 3 recovered and 0 were evicted.
      [ 1251.638584] Lustre: lustre-OST0006: deleting orphan objects from 0x0:30580 to 0x0:30657
      [ 1251.933830] Lustre: 9281:0:(ldlm_lib.c:1765:extend_recovery_timer()) lustre-OST0002: extended recovery timer reaching hard limit: 180, extend: 1
      [ 1251.936545] Lustre: 9281:0:(ldlm_lib.c:1765:extend_recovery_timer()) Skipped 84 previous similar messages
      [ 1252.799744] Lustre: lustre-OST0004: Recovery over after 0:32, of 3 clients 3 recovered and 0 were evicted.
      [ 1252.811613] Lustre: lustre-OST0004: deleting orphan objects from 0x0:30578 to 0x0:30657
      [ 1255.963626] Lustre: lustre-OST0000: recovery is timed out, evict stale exports
      [ 1255.966033] Lustre: lustre-OST0000: disconnecting 1 stale clients
      [ 1255.968200] LustreError: 6223:0:(tgt_grant.c:248:tgt_grant_sanity_check()) ofd_obd_disconnect: tot_granted 15728640 != fo_tot_granted 69206016
      [ 1255.982021] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (0 recovered, 2 in progress, and 1 evicted) already passed deadline 0:00
      [ 1258.899246] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (0 recovered, 2 in progress, and 1 evicted) already passed deadline 0:03
      [ 1263.907198] Lustre: lustre-OST0000: Denying connection for new client 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp), waiting for 3 known clients (2 recovered, 0 in progress, and 1 evicted) already passed deadline 0:08
      [ 1265.865287] Lustre: lustre-OST0002: Recovery over after 0:55, of 3 clients 3 recovered and 0 were evicted.
      [ 1265.874909] Lustre: lustre-OST0002: deleting orphan objects from 0x0:30610 to 0x0:30689
      [ 1266.197943] Lustre: lustre-OST0001: deleting orphan objects from 0x0:30126 to 0x0:30209
      [ 1268.576630] Lustre: lustre-OST0000: Recovery over after 1:13, of 3 clients 2 recovered and 1 was evicted.
      [ 1268.579517] Lustre: Skipped 1 previous similar message
      [ 1268.580940] Lustre: lustre-OST0000: deleting orphan objects from 0x0:30740 to 0x0:30817
      [ 1268.750115] Lustre: lustre-OST0003: deleting orphan objects from 0x0:30511 to 0x0:30593
      [ 1268.752368] Lustre: lustre-OST0005: deleting orphan objects from 0x0:30479 to 0x0:30561
      [ 1268.916992] Lustre: lustre-OST0000: Connection restored to 539fb43a-6e36-de0f-64a5-8efe899e98d9 (at 10.9.5.239@tcp)
      [ 1268.920054] Lustre: Skipped 10 previous similar messages
      [ 1269.886545] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
      [ 1271.213350] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1272.178278] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1272.182794] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 1 previous similar message
      [ 1273.571547] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1273.575165] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 2 previous similar messages
      [ 1275.741885] LustreError: 16684:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1275.744315] LustreError: 16684:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 4 previous similar messages
      [ 1279.847093] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1279.850332] LustreError: 16811:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 13 previous similar messages
      [ 1288.005172] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1288.009998] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 20 previous similar messages
      [ 1304.272729] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1304.275722] LustreError: 16821:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 29 previous similar messages
      [ 1309.950454] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
      [ 1309.953964] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 3 previous similar messages
      [ 1336.378100] LustreError: 16822:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1336.382030] LustreError: 16822:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 62 previous similar messages
      [ 1395.028643] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
      [ 1395.030801] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 5 previous similar messages
      [ 1400.527325] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) lustre-OST0000: cli 539fb43a-6e36-de0f-64a5-8efe899e98d9 claims 1703936 GRANT, real grant 0
      [ 1400.530467] LustreError: 16853:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 102 previous similar messages
      [ 1525.162524] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
      [ 1525.165481] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 10 previous similar messages
      [ 1790.173516] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) lpni 10.9.5.244@tcp added to recovery queue. Health = 0
      [ 1790.175909] LNetError: 6140:0:(peer.c:3451:lnet_peer_ni_add_to_recoveryq_locked()) Skipped 14 previous similar messages
      [ 2315.566821] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration:               86400
      Server failover period: 1200 seconds
      Exited after:           20536 seconds
      Number of failovers before exit:
      mds1: 0 times
      ost1: 3 times
      ost2: 1 times
      ost3: 2 times
      ost4: 2 times
      ost5: 4 times
      ost6: 4 times
      [ 2316.767711] Lustre: DEBUG MARKER: Duration: 86400
      
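      The tgt_grant_sanity_check() and tgt_grant_check() messages above suggest the space-grant accounting went inconsistent across the repeated failovers: the server's aggregate grant counter no longer matches the sum of per-client grants (15728640 vs 69206016), and a reconnecting client's cached grant of 1703936 bytes is unknown to the server ("real grant 0"), so its cached writes can be rejected and dd sees the EIO at close. A toy model of that consistency check (this is not Lustre code; the class and method names are made up for illustration, and the byte counts are chosen to match the log):

      ```python
      # Toy model of server-side grant accounting, for illustration only.
      # The server grants each client space it may consume with cached writes;
      # the aggregate counter must stay equal to the sum of per-client grants,
      # which is roughly what the sanity check in the log verifies.

      class GrantTable:
          def __init__(self):
              self.per_client = {}   # client uuid -> granted bytes
              self.tot_granted = 0   # running aggregate

          def grant(self, uuid, nbytes):
              self.per_client[uuid] = self.per_client.get(uuid, 0) + nbytes
              self.tot_granted += nbytes

          def evict(self, uuid, release_grant=True):
              nbytes = self.per_client.pop(uuid, 0)
              if release_grant:
                  self.tot_granted -= nbytes
              # If the grant is NOT released (e.g. state lost across a
              # failover), tot_granted silently diverges from the sum.

          def sanity_check(self):
              return self.tot_granted == sum(self.per_client.values())

      t = GrantTable()
      t.grant("client-a", 15728640)
      t.grant("client-b", 53477376)
      t.evict("client-b", release_grant=False)  # stale state after failover
      print(t.sanity_check())                   # False: aggregate != per-client sum
      ```

      With the evicted client's grant never returned, the aggregate stays at 69206016 while the per-client sum drops to 15728640, the same mismatch reported by ofd_obd_disconnect in the log.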

      People

        Assignee: tappro Mikhail Pershin
        Reporter: jamesanunez James Nunez (Inactive)