  Lustre / LU-12572

sanity-pfl test_20b: Delete is not completed in 29 seconds

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.17.0
    • Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.16.0, Lustre 2.15.6
    • Labels: None
    • Severity: 3

    Description

      This issue was created by maloo for jianyu <yujian@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/d66394cc-abc3-11e9-a0be-52540065bddc

      test_20b failed with the following error:

      CMD: trevis-70vm4 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      CMD: trevis-70vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
      Delete is not completed in 29 seconds
      CMD: trevis-70vm4 /usr/sbin/lctl get_param osc.*MDT*.sync_*
      osc.lustre-OST0000-osc-MDT0000.sync_changes=0
      osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
      osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1
      osc.lustre-OST0000-osc-MDT0002.sync_changes=0
      osc.lustre-OST0000-osc-MDT0002.sync_in_flight=0
      osc.lustre-OST0000-osc-MDT0002.sync_in_progress=0
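
      For context: sync_changes, sync_in_flight and sync_in_progress are counters on the MDS-side OSC/OSP
      devices that the test framework polls after forcing a sync; cleanup is only considered complete once
      they all drain to zero. A minimal diagnostic sketch (not part of the test suite; the node name and the
      120-second limit are assumptions) that reproduces the same check by hand:

          #!/bin/bash
          # Hedged diagnostic sketch: force a sync on the MDS and poll until all
          # osc.*MDT*.sync_* counters reach zero, mirroring what the test does.
          MDS_NODE=${MDS_NODE:-trevis-70vm4}   # assumption: the MDS node seen in the log above
          MAX_WAIT=${MAX_WAIT:-120}            # assumption: longer than the test's ~30s limit

          ssh "$MDS_NODE" "lctl set_param -n os[cd]*.*MD*.force_sync 1"
          for ((i = 0; i < MAX_WAIT; i++)); do
                  # Sum every sync_changes/sync_in_flight/sync_in_progress value.
                  total=$(ssh "$MDS_NODE" "lctl get_param -n osc.*MDT*.sync_*" |
                          awk '{sum += $1} END {print sum + 0}')
                  [[ $total -eq 0 ]] && { echo "sync counters drained after ${i}s"; exit 0; }
                  sleep 1
          done
          echo "sync counters still non-zero after ${MAX_WAIT}s" >&2
          exit 1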
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity-pfl test_20b - test_20b returned 1

      Attachments

      Issue Links

      Activity

            pjones Peter Jones added a comment -

            Merged for 2.17


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57704/
            Subject: LU-12572 tests: wait longer to clean up files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 88aa0f59a95f41a54d03ec4bb7edfe4d528c4f50

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/57704/ Subject: LU-12572 tests: wait longer to clean up files Project: fs/lustre-release Branch: master Current Patch Set: Commit: 88aa0f59a95f41a54d03ec4bb7edfe4d528c4f50

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57704
            Subject: LU-12572 tests: wait longer to clean up files
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5f9c688a0ba8f986774503125203c2aadfbda887

            gerrit Gerrit Updater added a comment - "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/57704 Subject: LU-12572 tests: wait longer to clean up files Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 5f9c688a0ba8f986774503125203c2aadfbda887
            yujian Jian Yu added a comment -

            sanity-quota test 8 failed with a similar issue on the Lustre b2_15 branch:
            https://testing.whamcloud.com/test_sets/8102d3fc-f790-4053-9e6b-51923ad0e251

            paf0186 Patrick Farrell added a comment - edited

            Note that in contrast to the earlier failures James highlighted, the most recent failure does not have any errors in the force_sync set_param.  That seems highly relevant because such an error would explain getting stuck, but it's not present in this most recent failure.  I suspect in the earlier case we had an issue with the connection coming back up after the failover (so we failed to set sync, so we failed this check, etc), whereas here it looks more like a ZFS/storage issue.


            paf0186 Patrick Farrell added a comment -

            Brief notes from a minimal look:

            This is failing in the cleanup phase related to setting up ENOSPC/low space as part of the test. Specifically, the trap registered by:

            stack_trap "ost_watermarks_clear_enospc $tfile $ost_idx1 $wms" EXIT

            runs ost_watermarks_clear_enospc at exit, and the failure is specifically in wait_delete_completed_mds, this part:

                local WAIT=0
                while [[ $WAIT -ne $max_wait ]]; do
                        changes=$(do_nodes $mds2sync \
                                "$LCTL get_param -n osc.*MDT*.sync_*" | calc_sum)
                        #echo "$node: $changes changes on all"
                        if [[ $changes -eq 0 ]]; then
                                wait_zfs_commit $SINGLEMDS
                                # the occupied disk space will be released
                                # only after TXGs are committed
                                wait_zfs_commit ost1
                                return 0
                        fi
                        sleep 1
                        WAIT=$((WAIT + 1))
                done

            On failure, we dump the sync parameters we're summing there, and they all show as zero.

            Most likely, this means we were waiting for ZFS commit (unless there is a race between the "while [[ $WAIT .. ]]" timer and this check).

            Since this is just failing (I think?  I am relying on past comments & the most recent failure) in ZFS testing, I strongly suspect it's waiting for ZFS commit.

            So basically it looks to me like sometimes after failover we're waiting more than 30 seconds for ZFS commit.  I have not dug in further.
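
            If that ZFS-commit theory is right, one way to check it outside the test framework is to time a
            forced transaction group commit on the backing pool directly. A hedged sketch, assuming ZFS 0.8+
            (where "zpool sync" exists); the pool name below is only a placeholder, not the name autotest
            actually uses:

                #!/bin/bash
                # Hedged diagnostic sketch: measure how long a forced TXG commit takes on the
                # MDS (or OST) backing pool. If this regularly exceeds ~30 seconds under load,
                # the wait in wait_delete_completed_mds is simply too short for ZFS.
                POOL=${POOL:-lustre-mdt1}   # placeholder pool name -- substitute the real one
                start=$SECONDS
                zpool sync "$POOL"          # should return once the pending dirty data is committed
                echo "zpool sync $POOL took $((SECONDS - start))s"
                # Optional, if zfs_txg_history is enabled: recent TXG open/quiesce/sync times.
                tail -n 5 "/proc/spl/kstat/zfs/$POOL/txgs" 2>/dev/null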

            emoly.liu Emoly Liu added a comment - edited

            More on master:
            https://testing.whamcloud.com/test_sets/53a2b7d2-5056-4261-9ae8-c9cf52f4c47c
            https://testing.whamcloud.com/test_sets/65adf9fc-24c8-40f2-b969-4a2a4ffde057
            yujian Jian Yu added a comment -

            Still failed on master branch: https://testing.whamcloud.com/test_sets/3c217dac-406f-11ea-9543-52540065bddc

            jamesanunez James Nunez (Inactive) added a comment -

            sanity-pfl test 20b is failing for review-dne-zfs-part-4 at a high rate. We’ve seen this test fail as early as 12 July 2019, with two failures in June 2019: https://testing.whamcloud.com/test_sets/f5ff99e2-9a1c-11e9-b26a-52540065bddc and https://testing.whamcloud.com/test_sets/19cb1252-90b8-11e9-abe3-52540065bddc .

            In some of these failures, for example at https://testing.whamcloud.com/test_sets/b04ecaf6-c953-11e9-97d5-52540065bddc, we see an error in the client test_log:

            CMD: trevis-44vm4 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
            trevis-44vm4: error: set_param: setting /sys/fs/lustre/osc/lustre-OST0000-osc-MDT0000/force_sync=1: Connection timed out
            CMD: trevis-44vm4 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
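
            When force_sync fails with "Connection timed out" like this, it would help to know whether the
            request eventually succeeds once recovery completes, or keeps failing. A hedged sketch using only
            the commands already shown in this ticket (the node name and retry/sleep values are placeholders):

                #!/bin/bash
                # Hedged diagnostic sketch: retry force_sync to distinguish "the OSC connection
                # never recovered after failover" from "sync was set but ZFS commit is slow".
                MDS_NODE=${MDS_NODE:-trevis-44vm4}   # placeholder: MDS node from the log excerpt
                for ((i = 1; i <= 10; i++)); do
                        if ssh "$MDS_NODE" "lctl set_param -n os[cd]*.*MD*.force_sync 1"; then
                                echo "force_sync accepted on attempt $i"
                                break
                        fi
                        sleep 5   # give client/target recovery a chance to complete
                done
                ssh "$MDS_NODE" "lctl get_param osc.*MDT*.sync_*"
                ssh "$MDS_NODE" "dmesg | tail -n 50"   # look for reconnect/eviction messages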
            

            In some failures, for example https://testing.whamcloud.com/test_sets/2894009c-b56f-11e9-b023-52540065bddc, we see that some of the osc parameters are not changed to 0:

            Delete is not completed in 29 seconds
            CMD: trevis-70vm4 /usr/sbin/lctl get_param osc.*MDT*.sync_*
            osc.lustre-OST0000-osc-MDT0000.sync_changes=0
            osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
            osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1
            

            In this case, we also see the following on the OSS console:

             [14628.488316] Lustre: DEBUG MARKER: dmesg
            [14629.089241] LustreError: 5730:0:(pack_generic.c:425:lustre_msg_buf_v2()) msg ffff9e3e5937f600 buffer[0] size 152 too small (required 184, opc=-1)
            [14629.091601] LustreError: 5730:0:(pack_generic.c:425:lustre_msg_buf_v2()) Skipped 815 previous similar messages
            

            Not sure if this causes errors, but for some failures, for example https://testing.whamcloud.com/test_sets/1996cd30-b2f5-11e9-b753-52540065bddc, we see the following on the OSS console:

            [13183.416656] Lustre: ll_ost00_035: service thread pid 6546 was inactive for 40.145 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
            
            yujian Jian Yu added a comment -

            +1 on master branch: https://testing.whamcloud.com/test_sets/5e5172b4-c8e8-11e9-90ad-52540065bddc

            People

              Assignee: adilger Andreas Dilger
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 8
