[LU-10570] sanity test_27y: Error: 'Of 2 OSTs, only 1 is available' Created: 26/Jan/18  Updated: 27/Feb/18  Resolved: 27/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: James A Simmons
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-9019 Migrate lustre to standard 64 bit tim... Resolved
is related to LU-10576 sanity test_803: before (425) - 8 < a... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

== sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 06:05:43 (1516946743)
CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_last_id
CMD: onyx-37vm9 lctl get_param -n osc.lustre-OST0000-osc-MDT0000.prealloc_next_id
CMD: onyx-37vm9 lctl dl
lustre-OST0001-osc-MDT0000 is Deactivated:
CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 deactivate
lustre-OST0000 is degraded:
CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=1
CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
total: 2 open/close in 0.00 seconds: 437.82 ops/second
lustre-OST0000 is recovered from degraded:
CMD: onyx-37vm8 lctl set_param -n obdfilter.lustre-OST0000.degraded=0
CMD: onyx-37vm9 lctl --device %lustre-OST0001-osc-MDT0000 activate
CMD: onyx-37vm9 lctl get_param -n lov.*.qos_maxage
sanity test_27y: @@@@@@ FAIL: Of 2 OSTs, only 1 is available
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:5336:error()
= /usr/lib64/lustre/tests/sanity.sh:1840:test_27y()
= /usr/lib64/lustre/tests/test-framework.sh:5612:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:5651:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:5498:run_test()
= /usr/lib64/lustre/tests/sanity.sh:1843:main()

This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run:




 Comments   
Comment by Jinshan Xiong (Inactive) [ 26/Jan/18 ]

This issue was likely introduced by a recent commit, because I have never seen it before.

Comment by James Nunez (Inactive) [ 29/Jan/18 ]

We've seen this test fail four times since January 26, 2018, and all of those failures are in review-zfs. Here are links to some of the logs for these failures:
https://testing.hpdd.intel.com/test_sets/7e9510c2-050c-11e8-a10a-52540065bddc
https://testing.hpdd.intel.com/test_sets/4ba8b3e8-0486-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/de800ac0-02da-11e8-a6ad-52540065bddc

Comment by John Hammond [ 29/Jan/18 ]

Bisection shows that this was introduced by https://review.whamcloud.com/30867 "LU-9019 libcfs: remove cfs_time_XXX_64 wrappers".

Comment by Peter Jones [ 30/Jan/18 ]

James

Should we revert this change or is it possible to easily fix this issue?

Peter

Comment by James A Simmons [ 30/Jan/18 ]

It should be easy to fix. I have a feeling it's one of those cases where seconds resolution is not good enough. I bet it's only seen on VMs.

Comment by James A Simmons [ 30/Jan/18 ]

I can't reproduce this on ldiskfs locally. I attempted to bring up ZFS for the test suite but ran into setup issues: I get "device is too small, smaller than 64MB" errors. I don't see any docs on how to set up the test suite with ZFS. Any pointers? Can anyone reproduce this easily?

Comment by Jinshan Xiong (Inactive) [ 30/Jan/18 ]

I saw the same issue before. On my side, it worked after I used dd to write bigger Lustre target files at /tmp/lustre-{ost,mdt}X, like:

dd if=/dev/zero of=/tmp/lustre-mdt1 bs=1M count=1 seek=8191

Comment by John Hammond [ 31/Jan/18 ]

James,

I was able to reproduce this using an llmount.sh filesystem with FSTYPE=ldiskfs on a single RHEL 7.4 VM. It didn't fail every time, but it would consistently fail within a minute when I ran 'while ONLY=27y bash lustre/tests/sanity.sh; do true; done'.

For the FSTYPE=zfs issue, see LU-10424.

Comment by James A Simmons [ 01/Feb/18 ]

I have a patch cooked up. Will push after I'm done testing.

Comment by Gerrit Updater [ 01/Feb/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31127
Subject: LU-10570 obd: use ktime_t for statfs handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4441d88f18882bf3b1441ddf02ea59f843ca207

Comment by Jinshan Xiong (Inactive) [ 01/Feb/18 ]

James - can you explain why the ktime_t change would cause the problem?

Comment by Gerrit Updater [ 02/Feb/18 ]

James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31158
Subject: LU-10570 obd: fix statfs handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b6cd8307c488cffe2bbd5819b42583ab340610f7

Comment by James A Simmons [ 02/Feb/18 ]

So in the original code the statfs refresh handling used 64-bit jiffies, which means we usually had time resolution in milliseconds. Being the 64-bit version of jiffies did not give it better time resolution the way ktime_t does. Looking at the code, it felt natural to use seconds resolution, since OBD_STATFS_CACHE_SECONDS is one second and the qos max_age is also in seconds, which is why I moved in that direction. The comments above obd_statfs() also pointed toward a seconds-resolution approach. What is causing the pain is that lod_qos_statfs_updates() tests twice whether the stats need to be refreshed, once before lq_rw_sem is taken and again after taking the semaphore, and that test is sensitive to the time resolution. With jiffies-level resolution the chance that condition one is false and condition two is true is pretty slim. When the code moved to time64_t, that chance greatly increased.
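
To make the structure concrete, here is a minimal sketch of the double-check pattern being described. This is not the actual lod_qos_statfs_updates() code; struct qos_cache, stats_recent() and qos_statfs_update() are made-up names used only to illustrate the shape of the problem:

/*
 * Simplified sketch only; not the real Lustre code.
 */
#include <linux/ktime.h>
#include <linux/rwsem.h>

struct qos_cache {
        struct rw_semaphore lock;         /* stands in for lq_rw_sem */
        time64_t            last_refresh; /* when statfs was last refreshed, in seconds */
        time64_t            max_age;      /* how long cached data stays valid,
                                           * e.g. OBD_STATFS_CACHE_SECONDS */
};

static bool stats_recent(struct qos_cache *qc, time64_t cutoff)
{
        /* with time64_t this comparison only has one-second resolution */
        return qc->last_refresh >= cutoff;
}

static void qos_statfs_update(struct qos_cache *qc)
{
        /* oldest refresh timestamp we still accept, computed once */
        time64_t cutoff = ktime_get_seconds() - qc->max_age;

        if (stats_recent(qc, cutoff))   /* check #1, before the semaphore */
                return;

        down_write(&qc->lock);
        if (stats_recent(qc, cutoff)) { /* check #2, after the semaphore: it can
                                         * differ from check #1 when another thread
                                         * refreshed while we waited for the lock,
                                         * and then the refresh below is skipped */
                up_write(&qc->lock);
                return;
        }

        /* ... walk the OSTs and refresh their cached statfs data ... */
        qc->last_refresh = ktime_get_seconds();
        up_write(&qc->lock);
}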

So I have approached this problem in two ways. The first was to move to ktime_t and maintain the original behavior of the code. The second was to remove the second test for whether the cache needs to be refreshed, which also seems to work. I have posted both options, since there might be other behavior changes from removing the second stale-stats test in lod_qos_statfs_updates(). We can ponder which is the better approach.
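
As a rough illustration of the first approach (this is not the actual patch at https://review.whamcloud.com/31127; stats_recent_ktime() and its arguments are hypothetical), keeping the timestamp in ktime_t preserves the sub-second granularity that the old 64-bit-jiffies comparison effectively had:

#include <linux/ktime.h>

/*
 * Hypothetical helper, not taken from the real patch: true when the
 * cached statfs data was refreshed within the last max_age seconds,
 * measured with nanosecond rather than one-second granularity.
 */
static bool stats_recent_ktime(ktime_t last_refresh, time64_t max_age)
{
        ktime_t cutoff = ktime_sub(ktime_get(), ktime_set(max_age, 0));

        return ktime_after(last_refresh, cutoff);
}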

Comment by Jian Yu [ 08/Feb/18 ]

+1 on master branch:
https://testing.hpdd.intel.com/test_sets/28ef2abe-0ccc-11e8-a7cd-52540065bddc

Comment by James A Simmons [ 08/Feb/18 ]

I have a fix at https://review.whamcloud.com/#/c/31158/

Comment by Bob Glossman (Inactive) [ 09/Feb/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/497dd194-0d2c-11e8-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/2c909e06-0dbb-11e8-bd00-52540065bddc

Comment by Bob Glossman (Inactive) [ 22/Feb/18 ]

more on master:
https://testing.hpdd.intel.com/test_sets/9ff21932-17f3-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/9ea2f9ec-1808-11e8-a10a-52540065bddc
https://testing.hpdd.intel.com/test_sets/7bcf5f50-1817-11e8-bd00-52540065bddc

Comment by James A Simmons [ 22/Feb/18 ]

Patch is in master-next so the fix should land soon.

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31158/
Subject: LU-10570 obd: fix statfs handling
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 87577f4988c1814dae1a1274880e20f1991e7b94

Comment by Peter Jones [ 27/Feb/18 ]

Landed for 2.11
