Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10570

sanity test_27y: Error: 'Of 2 OSTs, only 1 is available'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.11.0
    • Lustre 2.11.0
    • None
    • 3
    • 9223372036854775807

    Description

      == sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 06:05:43 (1516946743)
      CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_last_id
      CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_next_id
      CMD: onyx-37vm9 lctl dl
      lustre-OST0001-osc-MDT0000 is Deactivated:
      CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 deactivate
      lustre-OST0000 is degraded:
      CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=1
      CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
      total: 2 open/close in 0.00 seconds: 437.82 ops/second
      lustre-OST0000 is recovered from degraded:
      CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=0
      CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 activate
      CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
      sanity test_27y: @@@@@@ FAIL: Of 2 OSTs, only 1 is available
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
      = /usr/lib64/lustre/tests/sanity.sh:1840:test_27y()
      = /usr/lib64/lustre/tests/test-framework.sh:5612:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5651:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:5498:run_test()
      = /usr/lib64/lustre/tests/sanity.sh:1843:main()

      This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

      This issue relates to the following test suite run:

      <<Please provide additional information about the failure here>>

      Attachments

        Issue Links

          Activity

            [LU-10570] sanity test_27y: Error: 'Of 2 OSTs, only 1 is available'
            yujian Jian Yu added a comment - +1 on master branch: https://testing.hpdd.intel.com/test_sets/28ef2abe-0ccc-11e8-a7cd-52540065bddc

            So in the original code handling the stat refreshing was using 64 bit jiffies which means we usually have time resolution in the milliseconds. Because it was the 64 bit version of jiffies didn't mean it has better time resolution like ktime_t does. Looking at the code it feels natural to use seconds resolution since OBD_STATFS_CACHE_SECONDS is one second and the qos max_age is also in seconds which is why I moved in that direction. Also the comments above obd_statfs() pointed to a more second resolution approach.  What is causing the pain is that in lod_qos_statfs_updates() we test twice if the stats need to be refreshed due to the millisecond resolution. One before lq_rw_sem is taken and then again after taking the semaphore. For the case of using jiffies level resolution the chances that condition one is false and condition two is true is pretty slim. When the code moved to using time64_t that change greatly increased.

            So I have approach this problem in two ways. First one was to move to ktime_t and maintain the original behavior of the code. The second was to remove the second test for the need to refresh the cache which seems to work. I have both options posted seen their might be other behavior  changes with the removal of the second stale stats test in lod_qos_statfs_updates(). We can ponder which is the better approach.

            simmonsja James A Simmons added a comment - So in the original code handling the stat refreshing was using 64 bit jiffies which means we usually have time resolution in the milliseconds. Because it was the 64 bit version of jiffies didn't mean it has better time resolution like ktime_t does. Looking at the code it feels natural to use seconds resolution since OBD_STATFS_CACHE_SECONDS is one second and the qos max_age is also in seconds which is why I moved in that direction. Also the comments above obd_statfs() pointed to a more second resolution approach.  What is causing the pain is that in lod_qos_statfs_updates() we test twice if the stats need to be refreshed due to the millisecond resolution. One before lq_rw_sem is taken and then again after taking the semaphore. For the case of using jiffies level resolution the chances that condition one is false and condition two is true is pretty slim. When the code moved to using time64_t that change greatly increased. So I have approach this problem in two ways. First one was to move to ktime_t and maintain the original behavior of the code. The second was to remove the second test for the need to refresh the cache which seems to work. I have both options posted seen their might be other behavior  changes with the removal of the second stale stats test in lod_qos_statfs_updates(). We can ponder which is the better approach.

            James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31158
            Subject: LU-10570 obd: fix statfs handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b6cd8307c488cffe2bbd5819b42583ab340610f7

            gerrit Gerrit Updater added a comment - James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31158 Subject: LU-10570 obd: fix statfs handling Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: b6cd8307c488cffe2bbd5819b42583ab340610f7

            James - can you explain why ktime_t change would cause the problem?

            jay Jinshan Xiong (Inactive) added a comment - James - can you explain why ktime_t change would cause the problem?

            James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31127
            Subject: LU-10570 obd: use ktime_t for statfs handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e4441d88f18882bf3b1441ddf02ea59f843ca207

            gerrit Gerrit Updater added a comment - James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31127 Subject: LU-10570 obd: use ktime_t for statfs handling Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e4441d88f18882bf3b1441ddf02ea59f843ca207

            I have a patch cooked up. Will push after I'm done testing.

            simmonsja James A Simmons added a comment - I have a patch cooked up. Will push after I'm done testing.
            jhammond John Hammond added a comment -

            James,

            I was able to reproduce this using a llmount.sh filesystem with FSTYPE=ldiskfs on a single RHEL 7.4 VM. It didn't fail every time but it would consistently fail within a minute when I ran while ONLY=27y bash lustre/tests/sanity.sh; do true; done.

            For the FSTYPE=zfs issue, see LU-10424.

            jhammond John Hammond added a comment - James, I was able to reproduce this using a llmount.sh filesystem with FSTYPE=ldiskfs on a single RHEL 7.4 VM. It didn't fail every time but it would consistently fail within a minute when I ran while ONLY=27y bash lustre/tests/sanity.sh; do true; done . For the FSTYPE=zfs issue, see LU-10424 .
            jay Jinshan Xiong (Inactive) added a comment - - edited

            Saw the same issue before. From my side, it worked after I just used dd to write a bigger Lustre target files of /tmp/lustre-{ost,mdt}X, like:

            dd if=/dev/zero of=/tmp/lustre-mdt1 bs=1M count=1 seek=8191
            
            jay Jinshan Xiong (Inactive) added a comment - - edited Saw the same issue before. From my side, it worked after I just used dd to write a bigger Lustre target files of /tmp/lustre-{ost,mdt}X , like: dd if =/dev/zero of=/tmp/lustre-mdt1 bs=1M count=1 seek=8191

            Can't reproduce on ldiskfs locally. Attempted to bring up ZFS for test suite but I have run into some issues with setup. I get device is to small. smaller than 64MB issues. Looking I don't see any docs on how to setup the test suite with ZFS. Any pointers? Can anyone reproduce this easily?

            simmonsja James A Simmons added a comment - Can't reproduce on ldiskfs locally. Attempted to bring up ZFS for test suite but I have run into some issues with setup. I get device is to small. smaller than 64MB issues. Looking I don't see any docs on how to setup the test suite with ZFS. Any pointers? Can anyone reproduce this easily?
            simmonsja James A Simmons added a comment - - edited

            It should easy to fix. I have a feeling its one of those using seconds resolution is not good enough. I bet only seen on VMs.

            simmonsja James A Simmons added a comment - - edited It should easy to fix. I have a feeling its one of those using seconds resolution is not good enough. I bet only seen on VMs.
            pjones Peter Jones added a comment -

            James

            Should we revert this change or is it possible to easily fix this issue?

            Peter

            pjones Peter Jones added a comment - James Should we revert this change or is it possible to easily fix this issue? Peter

            People

              simmonsja James A Simmons
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: