[LU-9259] sanity test_17o failed with 'stat file should fail' Created: 27/Mar/17  Updated: 29/Jun/17  Resolved: 19/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment: review-dne


Issue Links:
Duplicate
is duplicated by LU-9246 sanity test_17o: @@@@@@ FAIL: stat fi... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 17o is failing with the error message

'stat file should fail' 

sanity test 17o touches a file, fails over the MDS, sets fail_loc 0x194 (OBD_FAIL_OSD_LMA_INCOMPAT) on that MDS, and then expects a stat of the file to fail. Here’s the code:

        local WDIR=$DIR/${tdir}o
        local mdt_index
        local rc=0

        test_mkdir -p $WDIR
        mdt_index=$($LFS getstripe -M $WDIR)
        mdt_index=$((mdt_index+1))

        touch $WDIR/$tfile

        #fail mds will wait the failover finish then set
        #following fail_loc to avoid interfer the recovery process.
        fail mds${mdt_index}

        #define OBD_FAIL_OSD_LMA_INCOMPAT 0x194
        do_facet mds${mdt_index} lctl set_param fail_loc=0x194
        ls -l $WDIR/$tfile && rc=1
        do_facet mds${mdt_index} lctl set_param fail_loc=0
        [[ $rc -ne 0 ]] && error "stat file should fail"

Nothing in the console logs explains why the stat succeeds.

So far, I have only seen this failure in review-dne.

This test failed with this error message a few times last year, stopped failing, and has recently started failing again. Here are the most recent failures:
2017-03-25 - https://testing.hpdd.intel.com/test_sets/438b1b98-116a-11e7-8920-5254006e85c2
2017-03-25 - https://testing.hpdd.intel.com/test_sets/0053e2ca-1146-11e7-8920-5254006e85c2
2017-03-06 - https://testing.hpdd.intel.com/test_sets/5dcc4dd2-0293-11e7-8394-5254006e85c2
2016-10-21 - https://testing.hpdd.intel.com/test_sets/b1d77bc4-980b-11e6-9e8a-5254006e85c2
2016-10-19 - https://testing.hpdd.intel.com/test_sets/06c986de-961f-11e6-9722-5254006e85c2
2016-09-27 - https://testing.hpdd.intel.com/test_sets/8c0c48aa-85b1-11e6-91aa-5254006e85c2



 Comments   
Comment by Joseph Gmitter (Inactive) [ 27/Mar/17 ]

Hi Fan Yong,

Can you please have a look into this issue?

Thanks.
Joe

Comment by Andreas Dilger [ 27/Mar/17 ]

Judging from the logs (2017-03-06 at least), the fail_loc check is not being hit. I'm not sure whether that is because the file is not being created on the MDS where the fail_loc is set, or because the file attributes are cached on the client and the MDS isn't involved in the lookup.

The test itself could be improved a bit:

        test_mkdir -p $WDIR
        mdt_index=$($LFS getstripe -M $WDIR)
        mdt_index=$((mdt_index+1))

        touch $WDIR/$tfile

The mdt_index should be taken from the file after it is created rather than from the directory: the directory is striped 2 ways by default, and "getstripe -M" on a striped directory only returns the stripe0/master index, which isn't necessarily where the inode will be allocated (that depends on the filename and hash function).

Also, the client MDC DLM lock cache should be flushed so that the client is sure to do a lookup on the MDS.
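
A rough sketch of how the two suggestions above might look in the test body. This is an illustration only, not the merged patch: the function name test_17o_sketch is hypothetical, the helpers (test_mkdir, fail, do_facet, cancel_lru_locks, error) and the variables $LFS, $DIR, $tdir and $tfile are assumed to come from the Lustre test framework, and the exact placement of the lock cancel relative to the failover is a guess.

```shell
# Sketch only: expects the Lustre test-framework helpers and variables
# (test_mkdir, fail, do_facet, cancel_lru_locks, error, $LFS, $DIR,
# $tdir, $tfile) to be defined by the caller.
test_17o_sketch() {
	local WDIR=$DIR/${tdir}o
	local mdt_index
	local rc=0

	test_mkdir -p $WDIR
	touch $WDIR/$tfile

	# Query the MDT index of the file itself, not of the (possibly
	# striped) parent directory, so fail_loc lands on the right MDT.
	mdt_index=$($LFS getstripe -M $WDIR/$tfile)
	mdt_index=$((mdt_index+1))

	# Wait for failover to finish before setting fail_loc.
	fail mds${mdt_index}

	# Drop cached MDC DLM locks so the stat below has to go to the MDS.
	cancel_lru_locks mdc

	#define OBD_FAIL_OSD_LMA_INCOMPAT 0x194
	do_facet mds${mdt_index} lctl set_param fail_loc=0x194
	ls -l $WDIR/$tfile && rc=1
	do_facet mds${mdt_index} lctl set_param fail_loc=0
	[ $rc -ne 0 ] && error "stat file should fail"
	return 0
}
```

The only changes from the current test are the order of touch and getstripe, the getstripe target, and the added cancel_lru_locks call. The rest is left as it is.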

Comment by Gerrit Updater [ 28/Mar/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26225
Subject: LU-9259 tests: set fail_loc on the right MDT
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e6a944d82642bc6994d4ee5e44ff9bc25604ce2f

Comment by Gerrit Updater [ 19/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26225/
Subject: LU-9259 tests: set fail_loc on the right MDT
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a0a812d2b019b97356b0d6a1a8debd7d46fed00b
