[LU-16861] Janitor Testing Fails to copy latest obdfilter-survey (Uses old obdfilter-survey) Created: 01/Jun/23  Updated: 18/Jan/24  Resolved: 18/Jan/24

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Arshad Hussain Assignee: Arshad Hussain
Resolution: Fixed Votes: 0
Labels: None
Environment:

Client: 4.18.0-372.9.1.el8(8.5)
Server: 4.18.0-425.3.1.el8(8.7)


Issue Links:
Gantt End to Start
has to be done before LU-16827 obdfilter-survey: /usr/bin/obdfilter-... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

While testing and fixing LU-16827, it was observed that the run was failing in janitor testing but passing in maloo. The required changes to lustre-iokit/obdfilter-survey/obdfilter-survey were not being picked up under janitor. As a result, janitor always failed, while the maloo run passed because it used the latest (modified) script. It looks like janitor always uses the old script (at least for obdfilter-survey).

Here are a few relevant logs. Let me know if more information is required. (All logs are under https://review.whamcloud.com/c/fs/lustre-release/+/51035)

CASE 1: The test is modified to use a specific path for obdfilter-survey so that janitor passes.

From logs : https://testing.whamcloud.com/gerrit-janitor/31629/testresults/obdfilter-survey-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/obdfilter-survey.test_1a.test_log.oleg368-client.log

This is the obdfilter-survey that janitor always uses:

ls -ali /usr/bin/obdfilter-survey
920857 -rwxr-xr-x. 1 root root 16279 Jun 4 2016 /usr/bin/obdfilter-survey

This is the obdfilter-survey that it is supposed to use. (There are changes within this code, which maloo correctly picks up.) Please note the file size and date stamp of both.

 
ls -ali /home/green/git/lustre-release/lustre/../lustre-iokit/obdfilter-survey
total 42
...
2278 -rwxr-xr-x 1 green green 15632 May 31 07:47 obdfilter-survey
...

 

test_1a in obdfilter-survey.sh was modified to use a specific OBDSURVEY instead of the generic system path version (which is /usr/bin/obdfilter-survey). When this is done, the test passes in janitor. Otherwise it fails.

export PATH=$PATH:/home/green/git/lustre-release/lustre/../lustre-iokit/obdfilter-survey
OBDSURVEY=/home/green/git/lustre-release/lustre/../lustre-iokit/obdfilter-survey/obdfilter-survey
obdflter_survey_run disk
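
The workaround above hard-codes one developer's checkout path. A minimal sketch of a more general form is shown below, assuming the in-tree script lives at ../lustre-iokit/obdfilter-survey/obdfilter-survey relative to the lustre directory (as in the listing above). The helper name pick_obdsurvey is hypothetical and is not part of test-framework.sh.

```shell
# Hypothetical helper (not from the Lustre tree): print the path of the
# obdfilter-survey script to run, preferring the in-tree copy over whatever
# is installed in the system PATH.
pick_obdsurvey() {
    pick_lustre_dir=$1
    pick_in_tree="$pick_lustre_dir/../lustre-iokit/obdfilter-survey/obdfilter-survey"
    if [ -x "$pick_in_tree" ]; then
        # In-tree script exists and is executable: use it.
        printf '%s\n' "$pick_in_tree"
    else
        # Fall back to the installed copy (e.g. /usr/bin/obdfilter-survey).
        command -v obdfilter-survey
    fi
}

# Possible use in a test, assuming $LUSTRE points at the lustre directory:
#   OBDSURVEY=$(pick_obdsurvey "$LUSTRE")
```

With a helper like this, a stale /usr/bin/obdfilter-survey would only be used when no in-tree copy is present.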

CASE 2: The failing case

This uses the old /usr/bin/obdfilter-survey, so the new changes are not reflected.

 

<snip>
+ eval NETTYPE=tcp thrlo=2 nobjhi=1 thrhi=4 size=1024 case=disk rslt_loc=/tmp 'targets="192.168.203.104:lustre-OST0000' '192.168.203.104:lustre-OST0001"' /usr/bin/obdfilter-survey
++ NETTYPE=tcp 
++ thrlo=2 
++ nobjhi=1 
++ thrhi=4 
++ size=1024 
++ case=disk 
++ rslt_loc=/tmp
++ targets='192.168.203.104:lustre-OST0000 192.168.203.104:lustre-OST0001' 
++ /usr/bin/obdfilter-survey
Warning: Permanently added '192.168.203.104' (ECDSA) to the list of known hosts. 
bash: lctl: command not found 
/usr/bin/obdfilter-survey: line 242: ( << 16) | ( << 8) | : syntax error: operand expected (error token is "<< 16) | ( << 8) | ") 
/usr/bin/obdfilter-survey: line 254: [: -lt: unary operator expected 
bash: lctl: command not found 
bash: lctl: command not found 
OST lustre-OST0000 not setup
<snip>

 Comments   
Comment by Andreas Dilger [ 09/Jan/24 ]

This is failing 100% of runs on master. It looks like something is wrong with the quoting of the targets (note the extra double quotes before each of the targets):

+ NETTYPE=tcp thrlo=2 nobjhi=1 thrhi=4 size=1024 case=disk rslt_loc=/tmp targets=""10.240.26.105:lustre-OST0000 "10.240.26.105:lustre-OST0001 "10.240.26.105:lustre-OST0002 "10.240.26.105:lustre-OST0003 "10.240.26.105:lustre-OST0004 "10.240.26.105:lustre-OST0005 "10.240.26.105:lustre-OST0006 "10.240.26.105:lustre-OST0007" /usr/bin/obdfilter-survey
/usr/lib64/lustre/tests/obdfilter-survey.sh: line 77: 10.240.26.105:lustre-OST0001 10.240.26.105:lustre-OST0002: command not found
cat: '/tmp/obdfilter_survey*': No such file or directory
 obdfilter-survey test_1a: @@@@@@ FAIL: /usr/bin/obdfilter-survey failed: 127 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:6947:error()
  = /usr/lib64/lustre/tests/obdfilter-survey.sh:81:obdflter_survey_run()
  = /usr/lib64/lustre/tests/obdfilter-survey.sh:85:test_1a()
  = /usr/lib64/lustre/tests/test-framework.sh:7287:run_one()
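
A minimal sketch of why a stray leading quote turns part of the target list into a command name, assuming the test builds one command string and passes it to eval (as the "+ eval" trace in the description suggests). The function name survey_status and the 10.0.0.1 address are placeholders, and "true" stands in for /usr/bin/obdfilter-survey.

```shell
# Hypothetical reproduction, not the test-framework code.
# Builds a command string with four targets, evals it in a subshell,
# and prints the resulting exit status.
survey_status() {
    demo_nid=$1
    demo_targets="$demo_nid:lustre-OST0000 $demo_nid:lustre-OST0001"
    demo_targets="$demo_targets $demo_nid:lustre-OST0002 $demo_nid:lustre-OST0003"
    # "true" stands in for /usr/bin/obdfilter-survey
    demo_cmd="targets=\"$demo_targets\" true"
    demo_rc=0
    (eval "$demo_cmd") 2>/dev/null || demo_rc=$?
    echo "$demo_rc"
}

# Clean NID: the whole list stays inside one quoted value.
survey_status '10.0.0.1'
# NID with a leading quote: the quotes pair up differently, so the shell
# tries to run "10.0.0.1:lustre-OST0001 10.0.0.1:lustre-OST0002" as a command.
survey_status '"10.0.0.1'
```

The first call prints 0 and the second prints 127, the same "command not found" status reported in the failure above.
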
Comment by Arshad Hussain [ 09/Jan/24 ]

I am checking.

Comment by Arshad Hussain [ 09/Jan/24 ]

It looks like the failing line is:

lctl get_param osc.lustre-OST0000-osc-ffff94bac34eb800.import | awk '/current_connection:/ {sub(/@.*/,""); print $2}'
"192.168.50.95

I am preparing the patch.

Comment by Gerrit Updater [ 09/Jan/24 ]

"Arshad Hussain <arshad.hussain@aeoncomputing.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53620
Subject: LU-16861 obdfilter: Exclude quotes when getting NID's
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cf525604b66c6d2a5f0e5158727f319d48c6f076

Comment by Arshad Hussain [ 09/Jan/24 ]

Andreas Dilger wrote: "This is failing 100% of runs on master. It looks like something is wrong with the quoting of the targets (note the extra double quotes before each of the targets):"

Andreas, thanks for the hint/pointer. When picking up the NID, the code was including the quotes, which caused the failure.
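
One way to drop the quotes while extracting the address is sketched below. This is an illustration only and not necessarily what the patch does. The helper name nid_from_import is hypothetical, and the sample line assumes the import file prints the value as a quoted string ending in @tcp; only the leading quote is confirmed by the output shown earlier.

```shell
# Hypothetical helper: read "import" text on stdin and print the address
# part of current_connection without the @network suffix or any quotes.
nid_from_import() {
    awk '/current_connection:/ { sub(/@.*/, ""); gsub(/"/, ""); print $2 }'
}

# Assumed sample line; the real one comes from
#   lctl get_param osc.<target>.import
sample='    current_connection: "192.168.50.95@tcp"'
printf '%s\n' "$sample" | nid_from_import
```

For the sample line this prints 192.168.50.95 with no leading quote, and an unquoted line gives the same result.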

 

Comment by Gerrit Updater [ 18/Jan/24 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/53620/
Subject: LU-16861 obdfilter: Exclude quotes when getting NIDs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c265e1c7b045bf1f9e5b2919c282b63086929ab6

Comment by Peter Jones [ 18/Jan/24 ]

Landed for 2.16

Generated at Sat Feb 10 03:30:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.