[LU-1400] obdfilter-survey test_1c failed but still green in report Created: 13/May/12  Updated: 17/Apr/13  Resolved: 29/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Major
Reporter: Mikhail Pershin Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-352 obdecho don't work with verify mode Resolved
Severity: 3
Rank (Obsolete): 4479

 Description   

Looking through the obdfilter-survey test 1c results, I noticed that a run usually takes more than 3000s to complete, but often takes only about 1700s. All of the short runs actually ended with errors but were still reported as green, so the failures went unnoticed:

== obdfilter-survey test 1c: Object Storage Targets survey, big batch ================================ 22:23:42 (1336800222)
+ NETTYPE=tcp thrlo=128 nobjhi=1 thrhi=128 size=8192 case=disk rslt_loc=/tmp targets="10.10.4.77:lustre-OST0000 10.10.4.77:lustre-OST0001 10.10.4.77:lustre-OST0002 10.10.4.77:lustre-OST0003 10.10.4.77:lustre-OST0004 10.10.4.77:lustre-OST0005 10.10.4.77:lustre-OST0006" /usr/bin/obdfilter-survey
Fri May 11 22:23:48 PDT 2012 Obdfilter-survey for case=disk from fat-intel-1vm2
ost  7 sz 58720256K rsz 1024K obj    7 thr  896 write  248.99             ERROR rewrite 119900.58             ERROR read  176.78 [   0.00, 239.23] 
done!

The same applies to test 2a, which lasts only 2s:

== obdfilter-survey test 2a: Stripe F/S over the Network == 23:11:25 (1336025485)
+ NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=netdisk rslt_loc=/tmp targets="172.29.3.12:lustre-OST0000 172.29.3.12:lustre-OST0001 172.29.3.12:lustre-OST0002 172.29.3.12:lustre-OST0003 172.29.3.12:lustre-OST0004 172.29.3.12:lustre-OST0005 172.29.3.12:lustre-OST0006" /usr/bin/obdfilter-survey
Wed May  2 23:11:25 PDT 2012 Obdfilter-survey for case=netdisk from iu-3vm1.lab.whamcloud.com
ost  7 sz  7340032K rsz 1024K obj    7 thr   56 write 260954.00             ERROR rewrite 266418.29             ERROR read 269207.02             ERROR 
ost  7 sz  7340032K rsz 1024K obj    7 thr  112 write 162454.32             ERROR rewrite 162440.28             ERROR read 162512.28             ERROR 
done!

https://maloo.whamcloud.com/sub_tests/ccc4725c-9c20-11e1-8837-52540035b04c

Only interop runs with a 2.1 client work normally (check the 2a results, as test 1c is not in 2.1 yet):
https://maloo.whamcloud.com/sub_tests/97ddb736-9321-11e1-9e8b-525400d2bfa6

Therefore we have two issues here: 1) an obdfilter-survey reporting issue: the test always stays green; 2) an echo client issue that causes the errors.
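
The reporting half of this can be illustrated with a minimal sketch, assuming the test wrapper has the survey's result lines available as text. The helper names `survey_has_errors` and `count_error_lines` and the `sample_output` variable are hypothetical and are not taken from the test suite. The only things borrowed from the logs above are the two test 2a result lines and the literal `ERROR` token they contain.

```shell
# Hypothetical helper: returns 0 if any line of the captured survey
# output contains the ERROR token seen in the logs above.
survey_has_errors() {
    # $1: captured survey output
    printf '%s\n' "$1" | grep -q 'ERROR'
}

# Hypothetical helper: prints the number of output lines that contain ERROR.
# "|| true" keeps the function's status at 0 when the count is zero.
count_error_lines() {
    printf '%s\n' "$1" | grep -c 'ERROR' || true
}

# Result lines copied from the test 2a log above.
sample_output='ost  7 sz  7340032K rsz 1024K obj    7 thr   56 write 260954.00             ERROR rewrite 266418.29             ERROR read 269207.02             ERROR
ost  7 sz  7340032K rsz 1024K obj    7 thr  112 write 162454.32             ERROR rewrite 162440.28             ERROR read 162512.28             ERROR
done!'

if survey_has_errors "$sample_output"; then
    echo "survey reported errors; the test should be marked as failed"
fi
```

A wrapper built along these lines would treat a run as failed whenever the output contains `ERROR`, independent of the exit status of the survey script itself.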



 Comments   
Comment by Peter Jones [ 18/May/12 ]

Keith

Could you please look into this one?

Thanks

Peter

Comment by Keith Mannthey (Inactive) [ 26/Jul/12 ]

Sorry for the delay.
I have submitted a possible fix for the always-green issue: http://review.whamcloud.com/#change,3482. It still needs to be tested, but the return codes were not being handled properly.

I don't quite understand the second issue, "2) echo client issue causing errors". Can you clarify what you mean?

Comment by Keith Mannthey (Inactive) [ 09/Aug/12 ]

It looks as though the test itself is working correctly, but /usr/bin/obdfilter-survey is not handling an ENOSPC error correctly. The systems are running out of disk space, but the /usr/bin/obdfilter-survey script does not pass the failure along. I have added a flag to make the survey itself stop on an error, and I am working on testing the change.
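
As a rough illustration of the stop-on-error behaviour described in the comment above, here is a minimal sketch. It is not the actual patch: `run_steps` and the three `step_*` functions are hypothetical stand-ins, and the failing step only simulates a failure such as running out of disk space.

```shell
# Hypothetical driver: run each step in order, stop at the first failure,
# and return that step's exit status to the caller.
run_steps() {
    for step in "$@"; do
        "$step" || {
            rc=$?
            echo "step '$step' failed with status $rc" >&2
            return "$rc"
        }
    done
    return 0
}

# Hypothetical steps, for illustration only.
step_write()   { echo "write ok"; }
step_rewrite() { echo "rewrite: no space left on device" >&2; return 1; }
step_read()    { echo "read ok"; }

# Capture the overall status instead of discarding it.
status=0
run_steps step_write step_rewrite step_read || status=$?
echo "overall status: $status"
```

In this sketch `step_read` never runs, because `run_steps` returns as soon as `step_rewrite` fails, and the non-zero status reaches the caller so a test harness can act on it.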

Comment by Keith Mannthey (Inactive) [ 21/Aug/12 ]

http://review.whamcloud.com/#change,3591 is looking good for acceptance.

Comment by Keith Mannthey (Inactive) [ 29/Aug/12 ]

There is a patch in master that fixes the always-green issue. Please reopen if there is still an error.

Generated at Sat Feb 10 01:16:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.