[LU-10570] sanity test_27y: Error: 'Of 2 OSTs, only 1 is available' Created: 26/Jan/18 Updated: 27/Feb/18 Resolved: 27/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | James A Simmons |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
== sanity test 27y: create files while OST0 is degraded and the rest inactive ======================== 06:05:43 (1516946743)

This issue was created by maloo for Jinshan Xiong <jinshan.xiong@intel.com>

This issue relates to the following test suite run: |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 26/Jan/18 ] |
|
This issue must have been introduced by a recent commit, because I have never seen it before. |
| Comment by James Nunez (Inactive) [ 29/Jan/18 ] |
|
We've seen this test fail four times since January 26, 2017 and all of those failures are in review-zfs. Here are links to some of the logs for these failures: |
| Comment by John Hammond [ 29/Jan/18 ] |
|
Bisection shows that this was introduced by https://review.whamcloud.com/30867 |
| Comment by Peter Jones [ 30/Jan/18 ] |
|
James, should we revert this change, or is it possible to easily fix this issue? Peter |
| Comment by James A Simmons [ 30/Jan/18 ] |
|
It should be easy to fix. I have a feeling it's one of those cases where seconds resolution is not good enough. I bet it's only seen on VMs. |
| Comment by James A Simmons [ 30/Jan/18 ] |
|
Can't reproduce on ldiskfs locally. I attempted to bring up ZFS for the test suite but ran into some setup issues: I get "device is too small, smaller than 64MB" errors. Looking around, I don't see any docs on how to set up the test suite with ZFS. Any pointers? Can anyone reproduce this easily? |
| Comment by Jinshan Xiong (Inactive) [ 30/Jan/18 ] |
|
Saw the same issue before. On my side, it worked after I used dd to create bigger Lustre target files at /tmp/lustre-{ost,mdt}X, like: dd if=/dev/zero of=/tmp/lustre-mdt1 bs=1M count=1 seek=8191
|
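The dd recipe above, expanded into a short sketch (the /tmp/lustre-{mdt,ost}X paths follow the llmount.sh loopback convention; the large seek= makes the files sparse, so growing them consumes almost no real disk):

```shell
# Grow the loopback backing files to 8 GiB each (8191 MiB seek + 1 MiB
# written = 8192 MiB logical size). Only the final 1 MiB is allocated.
dd if=/dev/zero of=/tmp/lustre-mdt1 bs=1M count=1 seek=8191
dd if=/dev/zero of=/tmp/lustre-ost1 bs=1M count=1 seek=8191
```

Reformatting the targets after resizing the backing files may be required for the extra space to be seen.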
| Comment by John Hammond [ 31/Jan/18 ] |
|
James, I was able to reproduce this using an llmount.sh filesystem with FSTYPE=ldiskfs on a single RHEL 7.4 VM. It didn't fail every time, but it would consistently fail within a minute when I ran: while ONLY=27y bash lustre/tests/sanity.sh; do true; done. For the FSTYPE=zfs issue, see LU-10424. |
| Comment by James A Simmons [ 01/Feb/18 ] |
|
I have a patch cooked up. Will push after I'm done testing. |
| Comment by Gerrit Updater [ 01/Feb/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31127 |
| Comment by Jinshan Xiong (Inactive) [ 01/Feb/18 ] |
|
James - can you explain why ktime_t change would cause the problem? |
| Comment by Gerrit Updater [ 02/Feb/18 ] |
|
James Simmons (uja.ornl@yahoo.com) uploaded a new patch: https://review.whamcloud.com/31158 |
| Comment by James A Simmons [ 02/Feb/18 ] |
|
In the original code, the stat refresh handling used 64-bit jiffies, which typically gives time resolution in the milliseconds. (Being the 64-bit version of jiffies didn't give it better resolution the way ktime_t does.) Looking at the code, it felt natural to use seconds resolution, since OBD_STATFS_CACHE_SECONDS is one second and the qos max_age is also in seconds, which is why I moved in that direction. The comments above obd_statfs() also pointed toward a seconds-resolution approach.

What is causing the pain is that lod_qos_statfs_updates() tests twice whether the stats need to be refreshed: once before lq_rw_sem is taken, and again after taking the semaphore. With jiffies-level resolution, the chance that the first test is false and the second is true is pretty slim; when the code moved to time64_t, that chance greatly increased.

So I have approached this problem in two ways. The first moves to ktime_t and maintains the original behavior of the code. The second removes the second test for whether the cache needs refreshing, which seems to work. I have posted both options, since there might be other behavior changes from removing the second stale-stats test in lod_qos_statfs_updates(). We can ponder which is the better approach. |
| Comment by Jian Yu [ 08/Feb/18 ] |
|
+1 on master branch: |
| Comment by James A Simmons [ 08/Feb/18 ] |
|
I have a fix at https://review.whamcloud.com/#/c/31158/ |
| Comment by Bob Glossman (Inactive) [ 09/Feb/18 ] |
|
more on master: |
| Comment by Bob Glossman (Inactive) [ 22/Feb/18 ] |
|
more on master: |
| Comment by James A Simmons [ 22/Feb/18 ] |
|
Patch is in master-next so the fix should land soon. |
| Comment by Gerrit Updater [ 27/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31158/ |
| Comment by Peter Jones [ 27/Feb/18 ] |
|
Landed for 2.11 |