[LU-10570] sanity test_27y: Error: 'Of 2 OSTs, only 1 is available'

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.11.0
    • Lustre 2.11.0
    • None
    • 3

    Description

      == sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 06:05:43 (1516946743)
      CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_last_id
      CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_next_id
      CMD: onyx-37vm9 lctl dl
      lustre-OST0001-osc-MDT0000 is Deactivated:
      CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 deactivate
      lustre-OST0000 is degraded:
      CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=1
      CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
      total: 2 open/close in 0.00 seconds: 437.82 ops/second
      lustre-OST0000 is recovered from degraded:
      CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=0
      CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 activate
      CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
      sanity test_27y: @@@@@@ FAIL: Of 2 OSTs, only 1 is available
      Trace dump:
      = /usr/lib64/lustre/tests/test-framework.sh:5336:error()
      = /usr/lib64/lustre/tests/sanity.sh:1840:test_27y()
      = /usr/lib64/lustre/tests/test-framework.sh:5612:run_one()
      = /usr/lib64/lustre/tests/test-framework.sh:5651:run_one_logged()
      = /usr/lib64/lustre/tests/test-framework.sh:5498:run_test()
      = /usr/lib64/lustre/tests/sanity.sh:1843:main()

      This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

      This issue relates to the following test suite run:

          Activity

            pjones Peter Jones added a comment - Landed for 2.11

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31158/
            Subject: LU-10570 obd: fix statfs handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 87577f4988c1814dae1a1274880e20f1991e7b94

            simmonsja James A Simmons added a comment - Patch is in master-next so the fix should land soon.
            bogl Bob Glossman (Inactive) added a comment (edited) - more on master:
            https://testing.hpdd.intel.com/test_sets/9ff21932-17f3-11e8-a7cd-52540065bddc
            https://testing.hpdd.intel.com/test_sets/9ea2f9ec-1808-11e8-a10a-52540065bddc
            https://testing.hpdd.intel.com/test_sets/7bcf5f50-1817-11e8-bd00-52540065bddc
            bogl Bob Glossman (Inactive) added a comment (edited) - more on master:
            https://testing.hpdd.intel.com/test_sets/497dd194-0d2c-11e8-bd00-52540065bddc
            https://testing.hpdd.intel.com/test_sets/2c909e06-0dbb-11e8-bd00-52540065bddc
            simmonsja James A Simmons added a comment - I have a fix at  https://review.whamcloud.com/#/c/31158/
            yujian Jian Yu added a comment - +1 on master branch: https://testing.hpdd.intel.com/test_sets/28ef2abe-0ccc-11e8-a7cd-52540065bddc

            simmonsja James A Simmons added a comment - The original stat-refresh code used 64-bit jiffies, which usually gives a time resolution in milliseconds. Being the 64-bit version of jiffies does not give it better time resolution the way ktime_t has. Looking at the code, seconds resolution felt natural, since OBD_STATFS_CACHE_SECONDS is one second and the qos max_age is also in seconds, which is why I moved in that direction. The comments above obd_statfs() also pointed toward a seconds-resolution approach. What is causing the pain is that lod_qos_statfs_updates() tests twice whether the stats need to be refreshed, because of the millisecond resolution: once before lq_rw_sem is taken and again after taking the semaphore. With jiffies-level resolution, the chance that condition one is false and condition two is true is pretty slim. When the code moved to time64_t, that chance greatly increased.

            I have approached this problem in two ways. The first was to move to ktime_t and maintain the original behavior of the code. The second was to remove the second test for whether the cache needs refreshing, which seems to work. I have posted both options, since there might be other behavior changes from removing the second stale-stats test in lod_qos_statfs_updates(). We can ponder which is the better approach.
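The effect of timestamp resolution on a staleness test like the one described above can be illustrated with a small user-space model. This is only a sketch and not Lustre code: the function `is_fresh`, its parameters, and the example timestamps are all hypothetical, and the comparison is an assumed simplification of a "cutoff versus last refresh" check. It shows one general consequence of truncating timestamps to whole seconds: the same cached stats can be judged differently depending on the clock granularity.

```python
# User-space model only; not Lustre code. All names here are hypothetical.

def is_fresh(now_ms, refreshed_ms, max_age_ms, tick_ms):
    """Return True if cached stats count as recent at the given granularity.

    Timestamps are truncated to whole ticks before comparing, which is what
    happens when a clock is read in coarser units (tick_ms=1000 models
    whole seconds, tick_ms=1 models millisecond ticks).
    """
    now = now_ms // tick_ms
    refreshed = refreshed_ms // tick_ms
    max_age = max_age_ms // tick_ms
    cutoff = now - max_age
    # Stats are treated as recent if they were refreshed at or after the cutoff.
    return cutoff <= refreshed


# Stats refreshed at 10.1 s and checked at 11.9 s, with a 1 s cache lifetime.
fine = is_fresh(11900, 10100, 1000, tick_ms=1)
coarse = is_fresh(11900, 10100, 1000, tick_ms=1000)
print("millisecond ticks:", "fresh" if fine else "stale")
print("second ticks:", "fresh" if coarse else "stale")
```

In this model, stats that are 1.8 s old are stale at millisecond granularity but still count as fresh at one-second granularity, because both timestamps are rounded down before the comparison.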

            gerrit Gerrit Updater added a comment - James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31158
            Subject: LU-10570 obd: fix statfs handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b6cd8307c488cffe2bbd5819b42583ab340610f7

            jay Jinshan Xiong (Inactive) added a comment - James - can you explain why ktime_t change would cause the problem?

            People

              simmonsja James A Simmons
              maloo Maloo
              Votes: 0
              Watchers: 11
