Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.16.0

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/aa93e973-8c6b-4568-a9ce-a2747abd4f7b

      test_16j failed with the following error:

      Timeout occurred after 418 minutes, last suite running was sanityn
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/99043 - 4.18.0-477.21.1.el8_8.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/99043 - 4.18.0-477.21.1.el8_lustre.x86_64

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanityn test_16j - Timeout occurred after 418 minutes, last suite running was sanityn

          Activity

            [LU-17156] sanityn test_16j: timeout
            pjones Peter Jones added a comment -

            Landed for 2.16


            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/52973/
            Subject: LU-17156 tests: Improve zfs_or_rotational()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: b0262a00714f2d77b6d6ba745169c8fa18d38b32

            arshad512 Arshad Hussain added a comment -

            <snip>
            I've updated the patch with the fix...
            <snip>

            Andreas, yes, I saw that. Thanks again!

            ... It just takes too long.  So an error in zfs_or_rotational is enough to explain timeouts here.

            Correct. Understood.

            paf0186 Patrick Farrell added a comment -

            To be clear, if these tests run the full version on ZFS (or maybe even just on a rotational disk), they will time out.  It just takes too long.  So an error in zfs_or_rotational is enough to explain timeouts here.

            adilger Andreas Dilger added a comment -

            The _UUID string is just a result of the "echo" command only removing the first instance of _UUID in the output. When there is only a single OST name printed this will not be a problem.

            I've updated the patch with the fix. It looks like this has been an issue with the zfs_or_rotational() function since it landed on 2023-08-24, and it meant that tests intended not to run on ZFS (because DIO on ZFS is very slow) were being run anyway, sometimes passing and sometimes timing out.
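
The first-instance-only behavior described above can be reproduced in plain bash. This is a minimal sketch, assuming the suffix was stripped with a single-slash parameter expansion; the ticket does not show the actual line in test-framework.sh. The input string is a hypothetical value chosen to match the broken name seen in the test log.

```shell
#!/bin/bash
# Hypothetical input: two OST UUIDs on one line, as if the name lookup
# returned more than one entry. Not taken from test-framework.sh.
names="lustre-OST0000_UUID lustre-OST0000_UUID"

# ${var/pattern/} removes only the FIRST match of the pattern.
first_only="${names/_UUID/}"

# ${var//pattern/} removes EVERY match of the pattern.
all_removed="${names//_UUID/}"

echo "first only: $first_only"
echo "all:        $all_removed"
```

With a single name in the input, both forms give the same result, which is why the problem only shows up when more than one OST name is printed.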
            arshad512 Arshad Hussain added a comment - - edited

            Yes, definitely. This looks like a test case issue. I was initially thinking some unusual "lfs osts" output was leading to this. I looked at the parsing, and it all seems correct. So, I put up a patch to dump more logs on error. (I just saw your review on that and will take care of it.)

            There is a space in there, and "lustre-OST0000" is shown twice?

            Yes. Also, we should not have the "_UUID" string there. It should have been removed.

            adilger Andreas Dilger added a comment -

            It looks like there is a typo in the expansion:

            lctl get_param -n osd-*.lustre-OST0000 lustre-OST0000_UUID.nonrotational

            There is a space in there, and "lustre-OST0000" is shown twice?

            gerrit Gerrit Updater added a comment -

            "Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52973
            Subject: LU-17156 tests: Improve zfs_or_rotational()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e4bc1ad986f58e1d29956e274f76a76dc093e0c5

            arshad512 Arshad Hussain added a comment -

            From: https://testing.whamcloud.com/test_logs/6315cba4-8e7e-490d-ba12-d4fad74d3784/show_text

            == sanityn test 16j: race dio with buffered i/o ========== 14:13:36 (1697552016)
            1+0 records in
            1+0 records out
            26214400 bytes (26 MB, 25 MiB) copied, 0.294639 s, 89.0 MB/s
            CMD: trevis-31vm3 /usr/sbin/lctl get_param -n osd-*.lustre-OST0000 lustre-OST0000_UUID.nonrotational
            trevis-31vm3: error: get_param: param_path 'lustre-OST0000_UUID/nonrotational': No such file or directory
            pdsh@trevis-31vm1: trevis-31vm3: ssh exited with exit code 2
            /usr/lib64/lustre/tests/test-framework.sh: line 11492: [: : integer expression expected
            bs: 1024, file_size 26214400
            ...

            It looks like the call to zfs_or_rotational() -> ostname_from_index() returned "lustre-OST0000 lustre-OST0000_UUID" instead of just "lustre-OST0000".
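
The "[: : integer expression expected" line in the log above is what the shell prints when a numeric test such as [ "$x" -eq 1 ] receives an empty string, which is consistent with the failed get_param leaving no value to compare. The following is a minimal sketch of one way to guard such a test. The function names is_nonrotational and check are hypothetical and are not taken from test-framework.sh or from the landed patch.

```shell
#!/bin/bash
# Hypothetical helper: decides whether a value read from a
# "nonrotational" parameter means a non-rotational device.
# Returns 0 for "1", 1 for any other number, 2 for empty or
# non-numeric input (for example when the parameter read failed).
is_nonrotational() {
	local value="$1"

	# Reject empty or non-numeric input before the numeric test,
	# so that [ ... -eq ... ] never sees a bad operand.
	case "$value" in
	''|*[!0-9]*) return 2 ;;
	esac

	[ "$value" -eq 1 ]
}

# Hypothetical wrapper that turns the return code into a label.
check() {
	is_nonrotational "$1"
	case $? in
	0) echo "nonrotational" ;;
	1) echo "rotational" ;;
	*) echo "unknown" ;;
	esac
}

check 1
check 0
check ""
```

Treating "unknown" the same as "rotational" would be the conservative choice here, since, as noted in this ticket, running the full test on ZFS or rotational storage is what leads to the timeout.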

            People

              arshad512 Arshad Hussain
              maloo Maloo
              Votes: 0
              Watchers: 6
