[LU-10570] sanity test_27y: Error: 'Of 2 OSTs, only 1 is available'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version/s: Lustre 2.11.0
    • Fix Version/s: Lustre 2.11.0

    Description

      == sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 06:05:43 (1516946743)
      CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_last_id
      CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_next_id
      CMD: onyx-37vm9 lctl dl
      lustre-OST0001-osc-MDT0000 is Deactivated:
      CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 deactivate
      lustre-OST0000 is degraded:
      CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=1
      CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
      total: 2 open/close in 0.00 seconds: 437.82 ops/second
      lustre-OST0000 is recovered from degraded:
      CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=0
      CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 activate
      CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
      sanity test_27y: @@@@@@ FAIL: Of 2 OSTs, only 1 is available
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
      = /usr/lib64/lustre/tests/sanity.sh:1840:test_27y()
      = /usr/lib64/lustre/tests/test-framework.sh:5612:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5651:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:5498:run_test()
      = /usr/lib64/lustre/tests/sanity.sh:1843:main()

      This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

      This issue relates to the following test suite run:

      <<Please provide additional information about the failure here>>

    Activity

            jay Jinshan Xiong (Inactive) added a comment -

            James - can you explain why the ktime_t change would cause the problem?

            gerrit Gerrit Updater added a comment -

            James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31127
            Subject: LU-10570 obd: use ktime_t for statfs handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e4441d88f18882bf3b1441ddf02ea59f843ca207

            simmonsja James A Simmons added a comment -

            I have a patch cooked up. Will push after I'm done testing.
            jhammond John Hammond added a comment -

            James,

            I was able to reproduce this using an llmount.sh filesystem with FSTYPE=ldiskfs on a single RHEL 7.4 VM. It didn't fail every time, but it would consistently fail within a minute when I ran:

                while ONLY=27y bash lustre/tests/sanity.sh; do true; done

            For the FSTYPE=zfs issue, see LU-10424.

            jay Jinshan Xiong (Inactive) added a comment - edited

            I saw the same issue before. On my side, it worked after I used dd to create bigger Lustre target files for /tmp/lustre-{ost,mdt}X, like:

                dd if=/dev/zero of=/tmp/lustre-mdt1 bs=1M count=1 seek=8191

            simmonsja James A Simmons added a comment -

            I can't reproduce this on ldiskfs locally. I attempted to bring up ZFS for the test suite but ran into some setup issues: I get "device is too small, smaller than 64MB" errors. I don't see any docs on how to set up the test suite with ZFS. Any pointers? Can anyone reproduce this easily?
            simmonsja James A Simmons added a comment - edited

            It should be easy to fix. I have a feeling it's one of those cases where seconds resolution is not good enough. I bet it's only seen on VMs.
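
            A quick way to see the kind of ambiguity James is describing (this is only an illustration of seconds vs. nanosecond granularity on a generic Linux box, not Lustre code): two events that land inside the same wall-clock second look simultaneous to a whole-seconds timestamp, while a nanosecond clock, which is what ktime_t gives the kernel, still tells them apart.

                # two timestamps taken back to back: the nanosecond delta is non-zero,
                # but the whole-second delta is almost always 0, i.e. "no time passed"
                t1=$(date +%s%N); t2=$(date +%s%N)
                echo "nanosecond delta:   $(( t2 - t1 )) ns"
                echo "whole-second delta: $(( t2 / 1000000000 - t1 / 1000000000 )) s"

            On a fast VM the test's deactivate/degrade/activate sequence can complete within a single second, so a check that only has whole seconds to work with cannot distinguish "refreshed after the change" from "cached before it", which fits the guess that this mostly shows up on VMs.
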
            pjones Peter Jones added a comment -

            James

            Should we revert this change or is it possible to easily fix this issue?

            Peter


            jhammond John Hammond added a comment -

            Bisection shows that this was introduced by https://review.whamcloud.com/30867 "LU-9019 libcfs: remove cfs_time_XXX_64 wrappers".
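
            For reference, a rough sketch of this kind of bisection run (the refs, build step, and retry count below are placeholders, not the exact commands used; the retries matter because the failure is intermittent):

                git bisect start
                git bisect bad HEAD
                git bisect good HEAD~100            # assumption: the regression landed recently
                git bisect run sh -c '
                        make -j8 || exit 125        # 125 tells bisect to skip unbuildable commits
                        for i in 1 2 3 4 5; do      # repeat because the failure is intermittent
                                ONLY=27y bash lustre/tests/sanity.sh || exit 1
                        done
                        exit 0'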

            jamesanunez James Nunez (Inactive) added a comment -

            We've seen this test fail four times since January 26, 2017 and all of those failures are in review-zfs. Here are links to some of the logs for these failures:
            https://testing.hpdd.intel.com/test_sets/7e9510c2-050c-11e8-a10a-52540065bddc
            https://testing.hpdd.intel.com/test_sets/4ba8b3e8-0486-11e8-a7cd-52540065bddc
            https://testing.hpdd.intel.com/test_sets/de800ac0-02da-11e8-a6ad-52540065bddc

            jay Jinshan Xiong (Inactive) added a comment -

            This issue must have been introduced by a recent commit, because I have never seen it before.

            People

              Assignee: James A Simmons (simmonsja)
              Reporter: Maloo (maloo)