[LU-10663] obdfilter-survey Created: 14/Feb/18  Updated: 12/Mar/18  Resolved: 27/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.3
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: John Hammond
Resolution: Fixed Votes: 0
Labels: None

Attachments: File obd.debug.out1.gz    
Issue Links:
Related
is related to LU-10566 parallel-scale-nfsv4 test_metabench: ... Reopened
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When I run obdfilter-survey, it only does one test at a time. After some debugging I traced the issue to the destroy_objects function. Here are the errors:

remote_shell localhost lctl --device 17 destroy 444 1
error: destroy: invalid objid '444'
destroy OST object <objid> [num [verbose]]
usage: destroy <num> objects, starting at objid <objid>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>
remote_shell localhost lctl --device 18 destroy 444 1
error: destroy: invalid objid '444'
destroy OST object <objid> [num [verbose]]
usage: destroy <num> objects, starting at objid <objid>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>
remote_shell localhost lctl --device 19 destroy 444 1
error: destroy: invalid objid '444'
destroy OST object <objid> [num [verbose]]
usage: destroy <num> objects, starting at objid <objid>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>
remote_shell localhost lctl --device 20 destroy 444 1
error: destroy: invalid objid '444'
destroy OST object <objid> [num [verbose]]
usage: destroy <num> objects, starting at objid <objid>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>
remote_shell localhost lctl --device 21 destroy 444 1
error: destroy: invalid objid '444'
destroy OST object <objid> [num [verbose]]
usage: destroy <num> objects, starting at objid <objid>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>

 

Any ideas what could be causing this?



 Comments   
Comment by Bruno Faccini (Inactive) [ 14/Feb/18 ]

Hello Mahmoud,
Did you interrupt the obdfilter_survey script before getting these errors?
I am asking because I am currently reworking the obdfilter_survey framework (as part of LU-9730) to strengthen it, particularly its automatic cleanup after both normal and interrupted runs.

Can you also detail the command line and parameters you used, and the configuration (single-node setup? direct run on the OSS? ...)?

Did you also check for any leftover "/tmp//obdfilter_survey_*" files?

Comment by Mahmoud Hanafi [ 14/Feb/18 ]

This started when I updated from 2.10.1 to 2.10.3. I had run obdfilter-survey both local (OSS) and netdisk (2 OSSes), and I had interrupted it before in 2.10.1.

I checked the files in /tmp and didn't find anything.

Command line:

rszlo=1024 rszhi=4096 size=5000 obdfilter-survey 

When I run obdfilter-survey on a new OST running 2.10.3, the test only runs one time. I copied obdfilter-survey and iokit-libecho from a 2.10.1 server and still had the issue.

I downgraded the OSS back to 2.10.1 and obdfilter-survey runs without errors.

nbp7-mds2 ~ # rszlo=1024 rszhi=4096 size=10000 tests_str="write" obdfilter-survey

Wed Feb 14 09:32:41 PST 2018 Obdfilter-survey for case=disk from nbp7-mds2
ost 3 sz 30720000K rsz 1024K obj 3 thr 3 write 1096.05 [ 335.99, 448.97] 
ost 3 sz 30720000K rsz 1024K obj 3 thr 6 write 1163.57 [ 377.99, 754.98] 
ost 3 sz 30720000K rsz 1024K obj 3 thr 12 write 1163.78 [ 383.00, 776.98] 
ost 3 sz 30720000K rsz 1024K obj 3 thr 24 write 1164.07 [ 385.99, 778.96] 
ost 3 sz 30720000K rsz 1024K obj 3 thr 48 write 1164.01 [ 384.99, 777.39] 
....

So the bug is in 2.10.3 lctl!

Can I get the priority of this case increased by one level?

Comment by Mahmoud Hanafi [ 14/Feb/18 ]

I copied lctl from a 2.10.1 server to a 2.10.3 server, and obdfilter-survey works. Attaching debug output.

Comment by Gerrit Updater [ 14/Feb/18 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31305
Subject: LU-10663 utils: clear errno before check
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 70e9e7bcf4505cba8853117e2eeb92a01e399eec
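
For readers unfamiliar with the failure class the patch subject names ("clear errno before check"), here is a minimal sketch in C. It is not the lctl source, which this ticket does not show; parse_objid and its clear_errno flag are hypothetical names used only for illustration. The assumption is that the objid argument is parsed with a strtoull()-style call and that errno is tested afterwards. Since strtoull() does not reset errno on success, a stale value left by an earlier call can make a valid argument such as '444' look invalid.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not the actual lctl code: parse a non-zero object ID.
 * With clear_errno == false it mimics a check that trusts whatever value
 * errno already held before the conversion.
 */
static bool parse_objid(const char *arg, bool clear_errno, uint64_t *out)
{
	char *end;
	unsigned long long id;

	if (clear_errno)
		errno = 0;	/* strtoull() does not reset errno on success */

	id = strtoull(arg, &end, 0);

	/* Reject empty input, trailing junk, zero, and conversion errors. */
	if (end == arg || *end != '\0' || id == 0 || errno != 0)
		return false;

	*out = id;
	return true;
}
```

If errno is left at, say, ERANGE by an unrelated earlier call, parse_objid("444", false, &id) reports failure even though the string is a valid number, while parse_objid("444", true, &id) succeeds. That would be consistent with the "invalid objid '444'" errors in the description, though the exact code path in lctl is an assumption here.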

Comment by Gerrit Updater [ 27/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31305/
Subject: LU-10663 utils: clear errno before check
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9e488fe9413184e61dcf405c9c87ca348dd6824a

Comment by Peter Jones [ 27/Feb/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 27/Feb/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31430
Subject: LU-10663 utils: clear errno before check
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 800da3bd685aa2ad4c4ed730ac86d79f7693bfc1

Comment by Gerrit Updater [ 12/Mar/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31430/
Subject: LU-10663 utils: clear errno before check
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: d140cab6f9bbef3d7f77b91628fe8202517fa185

Generated at Sat Feb 10 02:37:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.