[LU-2899] OSTs can't be used correctly after running sanity test_27y Created: 04/Mar/13 Updated: 12/Mar/13 Resolved: 12/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Emoly Liu | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 6986 |
| Description |
|
This problem was found during another test run, in which we had 7 OSTs; after running sanity test_27y, only 2 OSTs were available. It is easy to reproduce by adding a few commands to sanity test_27y:
diff --git a/lustre/tests/sanity.sh b/lustre/tests/sanity.sh
index 16a0410..d28a4ee 100644
--- a/lustre/tests/sanity.sh
+++ b/lustre/tests/sanity.sh
@@ -1577,6 +1577,11 @@ test_27x() {
 run_test 27x "create files while OST0 is degraded"
 
 test_27y() {
+ local testfile="/mnt/lustre/testfile"
+ $SETSTRIPE -i 0 -c -1 $testfile
+ $GETSTRIPE $testfile
+ rm -fv $testfile
+
 [ "$OSTCOUNT" -lt "2" ] && skip_env "$OSTCOUNT < 2 OSTs -- skipping" && return
 remote_mds_nodsh && skip "remote MDS with nodsh" && return
 remote_ost_nodsh && skip "remote OST with nodsh" && return
@@ -1638,6 +1643,10 @@ test_27y() {
 do_facet $SINGLEMDS lctl --device %$OSC activate
 fi
 done
+
+ $SETSTRIPE -i 0 -c -1 $testfile
+ $GETSTRIPE $testfile
+ rm -fv $testfile
 }
 
 run_test 27y "create files while OST0 is degraded and the rest inactive"
When OSTCOUNT=4, the output is like:
== sanity test 27y: create files while OST0 is degraded and the rest inactive == 22:21:07 (1362147667)
/mnt/lustre/testfile
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_layout_gen: 0
lmm_stripe_offset: 0
    obdidx    objid    objid    group
         0       12      0xc        0
         1       66     0x42        0
         2       65     0x41        0
         3       65     0x41        0
removed `/mnt/lustre/testfile'
lustre-OST0001-osc-MDT0000 is Deactivated:
lustre-OST0002-osc-MDT0000 is Deactivated:
lustre-OST0003-osc-MDT0000 is Deactivated:
lustre-OST0000 is degraded:
total: 4 creates in 0.01 seconds: 459.66 creates/second
lustre-OST0000 is recovered from degraded:
/mnt/lustre/testfile
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_layout_gen: 0
lmm_stripe_offset: 0
    obdidx    objid    objid    group
         0       17     0x11        0
removed `/mnt/lustre/testfile'
Resetting fail_loc on all nodes...done.
PASS 27y (11s) |
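The symptom in the output above is that the second getstripe reports lmm_stripe_count: 1 even though OSTCOUNT=4 and all OSTs have been reactivated. A minimal sketch of how the reproducer could check this automatically instead of by eye is shown here. The helper names stripe_count and check_stripe_count are hypothetical and not part of sanity.sh; the sketch only assumes the "lmm_stripe_count: N" line format seen in the ticket output.

```shell
#!/bin/bash
# Hypothetical helpers, not part of sanity.sh or test-framework.sh.

# Read getstripe-style text on stdin and print the value of the first
# "lmm_stripe_count:" line.
stripe_count() {
    awk '$1 == "lmm_stripe_count:" { print $2; exit }'
}

# Compare the stripe count found on stdin with the expected value ($1).
# Returns non-zero on mismatch so a test can fail on it.
check_stripe_count() {
    local expected=$1
    local actual
    actual=$(stripe_count)
    if [ "$actual" != "$expected" ]; then
        echo "FAIL: expected $expected stripes, got ${actual:-none}"
        return 1
    fi
    echo "OK: $actual stripes"
}
```

In the reproducer this might be used as `$GETSTRIPE $testfile | check_stripe_count $OSTCOUNT`, which would pass for the first getstripe above and fail for the second.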
| Comments |
| Comment by Alex Zhuravlev [ 04/Mar/13 ] |
|
are you saying OST000[1-3] are still inactive after 27y ? |
| Comment by Emoly Liu [ 04/Mar/13 ] |
|
No, at least the command "lfs osts" said no. I talked to wangdi and he thought the test script was OK. We need to investigate whether there is another problem in the lod policy. |
| Comment by Zhenyu Xu [ 04/Mar/13 ] |
|
Patch tracking at http://review.whamcloud.com/5573
Commit message:
LU-2899 lod: get lod_stripenr correctly
Current code relies on lod_statfs_and_check() to count the number of activated LOD targets, but lod::lod_stripenr is derived before lod_statfs_and_check() is called, which makes lod::lod_stripenr inaccurate.
This patch makes sure lod_statfs_and_check() is called before ::lod_stripenr is updated. |
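The ordering problem described in the commit message can be illustrated with a toy model. This is not Lustre code: the function names only echo the ones in the ticket, and their bodies, the variables cached_active and real_active, and the rule that a stripe count of -1 means "all active targets" capped at the cached count are assumptions made for illustration.

```shell
#!/bin/bash
# Toy model only: names echo the ticket, bodies are invented for illustration.

cached_active=1   # stale count left over from when only OST0 was active
real_active=4     # targets that are actually active again after test_27y

# Stand-in for lod_statfs_and_check(): refresh the cached active count.
statfs_and_check() {
    cached_active=$real_active
}

# Stand-in for the stripe count derivation: -1 means "all active targets",
# and any request is capped at the cached active count.
get_stripecnt() {
    local want=$1
    if [ "$want" -eq -1 ] || [ "$want" -gt "$cached_active" ]; then
        want=$cached_active
    fi
    echo "$want"
}

# Order described as buggy: derive the count first, refresh afterwards.
buggy_order() {
    local n
    n=$(get_stripecnt -1)
    statfs_and_check
    echo "$n"
}

# Order the patch aims for: refresh first, then derive the count.
fixed_order() {
    statfs_and_check
    get_stripecnt -1
}

# Each call runs in its own subshell, so both start from the stale state.
echo "buggy: $(buggy_order)"   # prints "buggy: 1"
echo "fixed: $(fixed_order)"   # prints "fixed: 4"
```

Under these assumptions the buggy order yields 1 stripe and the fixed order yields 4, which matches the 1 versus 4 stripe counts seen in the test output.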
| Comment by Alex Zhuravlev [ 04/Mar/13 ] |
|
in general I'm against this approach. can you explain why statfs() is not enough ? |
| Comment by Zhenyu Xu [ 04/Mar/13 ] |
|
First, lod_qos_statfs_update() skips the update when the last statfs check is recent enough. Second, before lod_alloc_xx() is called, the total number of stripes available has already been determined by lod_get_stripecnt(), which happens before any lod_statfs_and_check() call. |
| Comment by Alex Zhuravlev [ 04/Mar/13 ] |
|
lod_statfs_and_check() is called to probe every OSP, so I don't quite follow the first argument. |
| Comment by Zhenyu Xu [ 04/Mar/13 ] |
|
Sorry, you are right, the first argument does not hold. I need to change the commit message. |
| Comment by Zhenyu Xu [ 04/Mar/13 ] |
|
Maybe I only need to change lod_get_stripecnt() to ignore lod->lod_desc.ld_active_tgt_count? |
| Comment by Alex Zhuravlev [ 04/Mar/13 ] |
|
this is not about the commit message.. I'm still not convinced this is the right solution. please clarify on the second argument: lod_alloc_xxx() is called when all OSTs are marked active (I think), so subsequent lod_statfs_and_check() should be able to use them ? |
| Comment by Alex Zhuravlev [ 04/Mar/13 ] |
|
ah, I see what you mean. so we get stripes, but later discover new number of active OSTs. |
| Comment by Andreas Dilger [ 11/Mar/13 ] |
|
Alex, can you please confirm that the current version of http://review.whamcloud.com/5573 is fine, and that your earlier comments about not liking the solution are outdated? |
| Comment by Alex Zhuravlev [ 11/Mar/13 ] |
|
Andreas, I've already inspected the patch with + |
| Comment by Jodi Levi (Inactive) [ 11/Mar/13 ] |
|
After discussing with Sarah, this is not blocking her testing and does not need to be tracked as a blocker. Reducing priority. |
| Comment by Peter Jones [ 12/Mar/13 ] |
|
Landed for 2.4 |