Lustre / LU-12810

replay-single test 20b fails with 'after 180548 > before N + 50'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.4, Lustre 2.12.5
    • Components: None
    • Labels: DNE
    • Severity: 3

    Description

      replay-single test_20b fails for ldiskfs with errors similar to 'after 180548 > before 25792 + 50'.

      replay-single test_20b fails for ZFS with errors similar to 'after 21504 > before 3072 + 2048'.

      We have found examples of this error on master and b2_12 since at least June 2019, in the failover test group with DNE configured.

      In all of these cases, recovery completes and the test then syncs and checks whether space has been freed, up to three times; if space is still not freed, it exits with an error. Looking at https://testing.whamcloud.com/test_sets/a5f46eb0-d40e-11e9-97d5-52540065bddc, we see the sync/check attempts in the client test_log:

      trevis-40vm8: *.lustre-MDT0000.recovery_status status: COMPLETE
      Waiting for local destroys to complete
      CMD: trevis-40vm8 lctl set_param -n os[cd]*.*MDT*.force_sync=1
      CMD: trevis-40vm6 lctl set_param -n osd*.*OS*.force_sync=1
      before 25800, after 1784144
      CMD: trevis-40vm8 lctl set_param -n os[cd]*.*MDT*.force_sync=1
      CMD: trevis-40vm6 lctl set_param -n osd*.*OS*.force_sync=1
      before 25800, after 1784144
      CMD: trevis-40vm8 lctl set_param -n os[cd]*.*MDT*.force_sync=1
      CMD: trevis-40vm6 lctl set_param -n osd*.*OS*.force_sync=1
      before 25800, after 1784144
       replay-single test_20b: @@@@@@ FAIL: after 1784144 > before 25800 + 50 
        Trace dump:
        = /usr/lib64/lustre/tests/test-framework.sh:5829:error()
        = /usr/lib64/lustre/tests/replay-single.sh:513:test_20b()
      
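The sync-and-check loop above can be sketched roughly as follows. This is a self-contained approximation, not the actual replay-single.sh / test-framework.sh code: `MARGIN`, `before`, `force_sync`, and `used_blocks` are illustrative stand-ins (the real test samples used blocks with df and syncs via lctl set_param).

```shell
# Rough sketch of test_20b's space-reclaim check (NOT the real test code).

MARGIN=50          # ldiskfs tolerance; the ZFS variant allows more slack
before=25800       # blocks used before the test's write (example value)

force_sync() {
    # Stands in for the real test's:
    #   lctl set_param -n 'os[cd]*.*MDT*.force_sync=1'   (on the MDS)
    #   lctl set_param -n 'osd*.*OS*.force_sync=1'       (on the OSS)
    :
}

used_blocks() {
    # Stands in for a df-based sample; here it never shrinks, which is
    # exactly the failure mode seen in the logs above.
    echo 1784144
}

i=0
after=$(used_blocks)
while [ "$i" -lt 3 ]; do
    force_sync
    after=$(used_blocks)
    echo "before $before, after $after"
    [ "$after" -le $((before + MARGIN)) ] && break
    i=$((i + 1))
done

if [ "$after" -gt $((before + MARGIN)) ]; then
    echo "FAIL: after $after > before $before + $MARGIN"
fi
```

With a stubbed `used_blocks` that never shrinks, the loop exhausts all three attempts and reports the same FAIL message as the logs.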

      In the client test logs for all of these failures, we see a ‘Transport endpoint is not connected’ error at the beginning of the test when trying to set force_sync. For example, in https://testing.whamcloud.com/test_sets/b73e377a-d349-11e9-9fc9-52540065bddc, we see:

      == replay-single test 20b: write, unlink, eviction, replay (test mds_cleanup_orphans) ================ 19:21:45 (1568056905)
      CMD: trevis-35vm11 lctl set_param -n os[cd]*.*MDT*.force_sync=1
      trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0000-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0001-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0002-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0003-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0004-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-35vm11: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0005-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      CMD: trevis-35vm10 lctl set_param -n osd*.*OS*.force_sync=1
      

      In one case, we see the ‘Transport endpoint’ error during the final syncs before calling error():

      Waiting for local destroys to complete
      CMD: trevis-23vm12 lctl set_param -n os[cd]*.*MDT*.force_sync=1
      trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0000-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0001-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0002-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0003-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0004-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      trevis-23vm12: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0005-osc-MDT0000/force_sync=1: Transport endpoint is not connected
      CMD: trevis-23vm10 lctl set_param -n osd*.*OS*.force_sync=1
      before 25832, after 180548
      CMD: trevis-23vm12 lctl set_param -n os[cd]*.*MDT*.force_sync=1
      CMD: trevis-23vm10 lctl set_param -n osd*.*OS*.force_sync=1
      before 25832, after 180548
      CMD: trevis-23vm12 lctl set_param -n os[cd]*.*MDT*.force_sync=1
      CMD: trevis-23vm10 lctl set_param -n osd*.*OS*.force_sync=1
      before 25832, after 180548
       replay-single test_20b: @@@@@@ FAIL: after 180548 > before 25832 + 50 
      

      It seems that the forced syncs may not be taking place, or that fewer of them than we attempt are actually taking effect.
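If the ignored set_param failures are the culprit, one mitigation would be to check the exit status and retry once the OSC import reconnects. A minimal, self-contained sketch follows; `set_param_stub`, `force_sync_with_retry`, and the retry bound are hypothetical and do not exist in test-framework.sh.

```shell
# Hypothetical retry wrapper (not existing test-framework.sh code):
# re-issue the force_sync set_param while it is still failing.
# set_param_stub stands in for
#   lctl set_param -n 'os[cd]*.*MDT*.force_sync=1'
# and fails twice to simulate 'Transport endpoint is not connected'.

attempts=0
set_param_stub() {
    attempts=$((attempts + 1))
    if [ "$attempts" -lt 3 ]; then
        echo "error: set_param: Transport endpoint is not connected" >&2
        return 1
    fi
    return 0
}

force_sync_with_retry() {
    n=0
    while [ "$n" -lt 5 ]; do
        set_param_stub && return 0
        n=$((n + 1))
        # A real fix would wait here for the import state to recover,
        # e.g. by polling 'lctl get_param osc.*.import'.
    done
    return 1
}

if force_sync_with_retry; then
    echo "force_sync succeeded after $attempts attempts"
else
    echo "force_sync still failing after $attempts attempts"
fi
```

With the stub failing twice, the wrapper succeeds on the third attempt instead of silently proceeding with the space check.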

      Logs for more recent failures are at
      https://testing.whamcloud.com/test_sets/56982af2-dfec-11e9-a0ba-52540065bddc


            People

              • Assignee: wc-triage (WC Triage)
              • Reporter: jamesanunez (James Nunez, Inactive)