HSM _not only_ small fixes and to do list goes here (LU-3647)

[LU-3852] sanity-hsm test_251: client26-vm "dd: no space left on device" Created: 28/Aug/13  Updated: 03/Apr/16  Resolved: 07/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.5.0, Lustre 2.7.0

Type: Technical task Priority: Blocker
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM

Issue Links:
Blocker
Related
is related to LU-6055 sanity-hsm file_creation_failure send... Resolved
is related to LU-4178 Test failure on test suite sanity-hsm... Resolved
Rank (Obsolete): 9973

 Description   

This issue was created by maloo for Minh Diep <minh.diep@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/3c70376c-0fbb-11e3-bb21-52540035b04c.

The sub-test test_251 failed with the following error:

request on sanity-hsm is not @@@@@@

== sanity-hsm test 251: Coordinator request timeout == 00:59:34 (1377676774)
Purging archive
Starting copytool
lhsmtool_posix --hsm-root /tmp/arc --daemon --bandwidth 1 /mnt/lustre
dd: writing `/mnt/lustre2/d0.sanity-hsm/d251/f.sanity-hsm.251': No space left on device
93+0 records in
92+0 records out
96620544 bytes (97 MB) copied, 16.7614 s, 5.8 MB/s
CMD: client-27vm3,client-27vm4,client-27vm5,client-27vm6.lab.whamcloud.com /usr/sbin/lctl dk > /logdir/test_logs/2013-08-27/lustre-reviews-el6-x86_64-review-2_4_1_17733_-70153022088660-162844/sanity-hsm.test_251.debug_log.\$(hostname -s).1377676791.log;
dmesg > /logdir/test_logs/2013-08-27/lustre-reviews-el6-x86_64-review-2_4_1_17733_-70153022088660-162844/sanity-hsm.test_251.dmesg.\$(hostname -s).1377676791.log
CMD: client-27vm3 /usr/sbin/lctl set_param mdt.lustre-MDT0000.hsm_control=disabled
mdt.lustre-MDT0000.hsm_control=disabled
Info required for matching: sanity-hsm 251



 Comments   
Comment by Jinshan Xiong (Inactive) [ 28/Aug/13 ]

patch is at http://review.whamcloud.com/7484

Comment by Jian Yu [ 29/Aug/13 ]

Another instance:
https://maloo.whamcloud.com/test_sets/faa96ad6-0f85-11e3-9bce-52540035b04c

Comment by Bobbie Lind (Inactive) [ 03/Sep/13 ]

Another instance, just in case it's needed: https://maloo.whamcloud.com/test_sets/9a710002-0fcb-11e3-a63c-52540035b04c

Comment by nasf (Inactive) [ 09/Sep/13 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/a4ed5d50-189f-11e3-aa54-52540035b04c

Comment by James Nunez (Inactive) [ 26/Sep/13 ]

Reopening this ticket due to 'No space left on device' failures seen in sanity-hsm again. All tests fail with error 'request on sanity-hsm is not @@@@@@'.
https://maloo.whamcloud.com/test_sets/16a0a15e-2639-11e3-8d26-52540035b04c - tests 28, 104, 110b and 251
https://maloo.whamcloud.com/test_sets/96ba3174-265d-11e3-8d26-52540035b04c - tests 104, 107, 110b and 251
https://maloo.whamcloud.com/test_sets/872eeff2-26b6-11e3-94b1-52540035b04c - tests 28, 30b, 31b and 251

Comment by Jinshan Xiong (Inactive) [ 26/Sep/13 ]

Hi James,

Does your environment include the patch in LU-3852?

LU-3852 tests: remove large files created by HSM test

Delete files created by HSM test cases with size bigger than 10M.

Jinshan
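
For illustration, the cleanup described in that commit message could be sketched as below. This is only a sketch under assumptions, not the code from change 7484: the helper name cleanup_large_files and its arguments are hypothetical, and it relies on the -size and -delete options of GNU/BSD find.

```shell
#!/bin/bash
# Hypothetical helper, not taken from change 7484.
# Removes regular files larger than a threshold under a test directory.
cleanup_large_files() {
    local dir=$1
    local min_size=${2:-10M}   # find(1) size syntax; "+10M" matches files over 10 MiB

    # Nothing to do if the directory does not exist.
    [ -d "$dir" ] || return 0

    find "$dir" -type f -size "+$min_size" -delete
}
```

A test could call cleanup_large_files on its own working directory when it finishes, so that later sub-tests do not inherit a nearly full filesystem.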

Comment by James Nunez (Inactive) [ 26/Sep/13 ]

Yes, I'm using 2.4.93 build # 1687 and that includes the LU-3852 patch.

Comment by Jinshan Xiong (Inactive) [ 30/Sep/13 ]

After checking with James, it turned out there was an issue with the setup. I will close this ticket again.

Comment by Doug Oucharek (Inactive) [ 05/Dec/13 ]

I just looked at a Maloo failure which has test_251 failing this way. Maloo stats indicate only a 50% pass rate right now. Looks like this issue is back.

Here is the Maloo failure I was looking at: https://maloo.whamcloud.com/test_sessions/250897b4-5d1c-11e3-956b-52540035b04c

Comment by Bruno Faccini (Inactive) [ 09/Dec/13 ]

Again, the root cause of the "sanity-hsm test_251: @@@@@@ FAIL: request on sanity-hsm is not @@@@@@" failure in this ticket is the "dd: writing `/mnt/lustre2/d0.sanity-hsm/d251/f.sanity-hsm.251': No space left on device" error.

The last occurrence before Change #7484 landed was on 2013-09-25 21:03:00. Since then, Maloo stats report only 5 occurrences, all since 2013-12-05 13:03:17: 2 on client-26vm2 and 3 on client-26vm6.

Comment by Bruno Faccini (Inactive) [ 16/Dec/13 ]

More occurrences, still only on client-26vm*, where the Lustre file system is much smaller than on other nodes and is likely to fill up when test_251 creates its large (103MB) file:

bruno@brent:~$ ssh root@client-26vm6 df /mnt/lustre
root@client-26vm6's password: 
Filesystem           1K-blocks      Used Available Use% Mounted on
client-26vm3@tcp:/lustre
                       1464484    191516   1194876  14% /mnt/lustre
bruno@brent:~$ ssh root@client-27vm6 df /mnt/lustre
root@client-27vm6's password: 
Filesystem           1K-blocks      Used Available Use% Mounted on
client-27vm3@tcp:/lustre
                      14449456    797740  12917724   6% /mnt/lustre
bruno@brent:~$ 

Opened TEI-1289 for this issue.
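
The comparison above suggests checking the available space before a sub-test creates a large file. A minimal sketch, assuming POSIX df -P -k output (which keeps each filesystem on a single line, unlike the wrapped output shown above); the helper names avail_kb and has_room are hypothetical and are not taken from sanity-hsm:

```shell
#!/bin/bash
# Hypothetical helper: print the available KiB on the filesystem holding a path.
# "df -P" forces the POSIX one-line-per-filesystem format, so column 4 of the
# second line is always the "Available" figure.
avail_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed if at least need_kb KiB are available under path.
has_room() {
    local path=$1 need_kb=$2
    local free_kb

    free_kb=$(avail_kb "$path") || return 1
    [ -n "$free_kb" ] && [ "$free_kb" -ge "$need_kb" ]
}
```

For the 103MB file mentioned above, a caller might run has_room /mnt/lustre 105472 and skip the sub-test when it fails. Note that df on a Lustre client reports an aggregate figure, so a file stored on a single OST may still run out of space earlier; this sketch does not account for that.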

Comment by Andreas Dilger [ 26/Dec/13 ]

This subtest has been disabled at the autotest level, so a regular test run will skip it. You need to explicitly request testing on the subtest - hopefully Test-Parameters works.

Comment by Bruno Faccini (Inactive) [ 27/Dec/13 ]

Oops, sorry, I think I forgot to mention that I created TEI-1289 to address the very small Lustre filesystem on client-26vm*.

Comment by Bruno Faccini (Inactive) [ 04/Jan/14 ]

Andreas, I am not sure that test_251 has been "fully" disabled at the autotest level because I still see runs of it for recent patch submissions, and there has been at least one more failure for this same issue reported on December 28th at https://maloo.whamcloud.com/test_sets/87584406-6fbd-11e3-9a1b-52540035b04c.

I will therefore raise the priority of TEI-1289 to get an update.

Comment by Andreas Dilger [ 24/Jan/14 ]

The sanity-hsm test_251 may only be disabled for tests on master, which is OK I think.

Could you please submit a patch to add it to ALWAYS_EXCEPT in the script? When that lands we can remove it from the autotest config so that it will be possible to re-enable it for any patch that is trying to fix the problem.
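
As a rough illustration of the request above, excluding a sub-test through ALWAYS_EXCEPT might look like the sketch below. The exact line used in the eventual patch is not quoted in this ticket, so the assignment is an assumption, and the is_excepted helper is hypothetical and only there to show how a space-separated exclusion list can be checked.

```shell
#!/bin/bash
# Sketch only: the exact change made to sanity-hsm is not quoted in this ticket.
# SANITY_HSM_EXCEPT lets a caller supply extra exclusions from the environment.
# bug number for skipped test:  LU-3852
ALWAYS_EXCEPT="${SANITY_HSM_EXCEPT:-} 251"

# Hypothetical helper, for illustration: is the given sub-test excluded?
is_excepted() {
    local t
    for t in $ALWAYS_EXCEPT; do
        [ "$t" = "$1" ] && return 0
    done
    return 1
}
```

Keeping the exclusion in the script rather than in the autotest configuration means a patch that tries to fix the problem can re-enable the sub-test simply by editing that line.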

Comment by Bruno Faccini (Inactive) [ 27/Jan/14 ]

A patch to disable test_251 internally in sanity-hsm has been pushed at http://review.whamcloud.com/9014.
Once it lands, I suppose I will need to open a TEI to request that the current autotest config settings be reverted, as in the similar situation with LU-3939/TEI-570/TEI-1041?

Comment by Bruno Faccini (Inactive) [ 11/Mar/14 ]

Requested that the Gerrit gatekeeper land patch #9014. The next actions will be to re-allow test_251 in the autotest config (through a new TEI?), and then to push a patch that handles the "small FS size vs. file big enough for timing needs" requirement and, in the same patch, re-enables test_251 (and also test_[200,221,223b], disabled for the same issue in LU-4178).

Comment by Jodi Levi (Inactive) [ 12/Mar/14 ]

Test has been retriggered: unclear what happened to results

Comment by Bruno Faccini (Inactive) [ 21/Mar/14 ]

Patch #9014 to disable test_251 internally in sanity-hsm has landed. I can now work on a definitive patch to protect test_251 against a too-small Lustre FS.

Comment by Bruno Faccini (Inactive) [ 28/Oct/14 ]

Sorry for being so late on this...
A patch to strengthen the sanity-hsm sub-tests that use large files, and are therefore easy victims of autotest configurations with a reduced Lustre FS size, is at http://review.whamcloud.com/12456.

Comment by Bruno Faccini (Inactive) [ 02/Dec/14 ]

I forgot to mention that change #12456 also re-enables the sanity-hsm/test_[200,221,223b,251] sub-tests.

Comment by Gerrit Updater [ 17/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12456/
Subject: LU-3852 tests: skip tests with large file when no room
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3a63fe4aa37ed72dc2f9a2c85c35035c2c2619ba

Comment by Bruno Faccini (Inactive) [ 07/Jan/15 ]

Patch has landed.

Comment by Gerrit Updater [ 19/Feb/15 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: http://review.whamcloud.com/13803
Subject: LU-3852 tests: skip tests with large file when no room
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: d6b1797ac0e0f9ba89688e1fcb986406b65e3418
