[LU-1688] recovery-small: test_58 failed with 1 Created: 27/Jul/12  Updated: 22/Dec/12  Resolved: 22/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.2
Fix Version/s: Lustre 2.4.0, Lustre 2.1.4

Type: Bug Priority: Minor
Reporter: Jay Lan (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None
Environment:

https://github.com/jlan/lustre-nas/tree/nas-2.1.2
recovery-small test_58
mds: service337
oss1: service361
oss2: service362
clients: service331, service332


Attachments: File recovery-small.test_58.test_log.service331.log.dbg     File recovery-small.test_58.tgz    
Severity: 3
Rank (Obsolete): 4423

 Description   

== recovery-small test 58: Eviction in the middle of open RPC reply processing ======================= 14:48:37 (1343339317)
-rw-r--r-- 1 root root 0 Jul 26 14:48 /mnt/nbp0-1/f58
fail_loc=0x80000801
fail_loc=0
fail_loc=0x305
fail_loc=0
df: `/mnt/nbp0-1': Interrupted system call
df: no file systems processed
recovery-small test_58: @@@@@@ FAIL: test_58 failed with 1

Attached two files:
recovery-small.test_58.tgz - tarball of the test_log files
recovery-small.test_58.test_log.service331.log.dbg: output of the test run with shell debugging ("set -x") enabled. The log shows the test passing, but that is a false positive: the 'df' failed with 1, yet the subsequent "set +x" reset the return value to 0.
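
The masking effect described above can be reproduced in isolation. This is a minimal sketch, not taken from the test script: failing_df is a hypothetical stand-in for the failing 'df' call, and the variable names are illustrative only.

```shell
#!/bin/bash
# errexit off (the bash default) so the script continues past the failing call.
set +e

# Hypothetical stand-in for the failing 'df'; always exits with status 1.
failing_df() { return 1; }

# Pattern that hides the failure: $? is read only after 'set +x' has run,
# so it holds the status of 'set +x' (success), not of the failing command.
set -x
failing_df
set +x
masked_rc=$?

# Pattern that preserves the failure: capture $? immediately after the command.
set -x
failing_df
saved_rc=$?
set +x

echo "masked_rc=$masked_rc saved_rc=$saved_rc"
```

Saving the status into a variable before turning tracing off keeps the debug output without changing what the test reports.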



 Comments   
Comment by Peter Jones [ 27/Jul/12 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 30/Jul/12 ]

The eviction of this client is caused only by the revalidate request that "df" issues on the root inode, and that is what triggers this issue.
This bug should be fixed by waiting some time before calling "df", or by doing something else to trigger the eviction/recovery.

Comment by Hongchao Zhang [ 01/Aug/12 ]

A possible patch is tracked at http://review.whamcloud.com/#change,3506.

Hi Jay, is this issue reproducible? If so, could you please help test the patch?

Comment by Jay Lan (Inactive) [ 01/Aug/12 ]

== recovery-small test 58: Eviction in the middle of open RPC reply processing ======================= 11:54:50 (1343847290)
-rw-r--r-- 1 root root 0 Aug 1 11:54 /mnt/nbp0-1/f58
fail_loc=0x80000801
fail_loc=0
fail_loc=0x305
fail_loc=0
df: `/mnt/nbp0-1': Interrupted system call
df: no file systems processed
Filesystem 1K-blocks Used Available Use% Mounted on
service337@o2ib:/lustre
3937056 209208 3527720 6% /mnt/nbp0-1
Resetting fail_loc on all nodes...done.
PASS 58 (40s)

From the above log, you can see the first 'df' failed and the second 'df' passed with 'df' output!

Comment by Jay Lan (Inactive) [ 01/Aug/12 ]

It would be nice if you could check the status of the first 'df' command, and perform the second 'df' only if the first returns failure.
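
A rough sketch of that suggestion, assuming a small helper around the command. This is not the patch under review: run_with_one_retry, RETRY_DELAY, and flaky_df are hypothetical names used only for illustration, and flaky_df simulates a 'df' that fails once and then succeeds.

```shell
#!/bin/bash
# Hypothetical helper: run a command and retry it once, but only if the
# first attempt fails. RETRY_DELAY (seconds) is an assumed tunable.
run_with_one_retry() {
    local rc
    "$@" && return 0          # first attempt succeeded, nothing more to do
    rc=$?                     # status of the failed first attempt
    echo "first attempt failed with $rc, retrying: $*" >&2
    sleep "${RETRY_DELAY:-1}"
    "$@"                      # the helper returns the second attempt's status
}

# Simulated 'df': fails on the first call, succeeds from the second call on.
attempts=0
flaky_df() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 2 ]
}

RETRY_DELAY=0
run_with_one_retry flaky_df
echo "rc=$? attempts=$attempts"
```

In the test itself the call would presumably look like run_with_one_retry df followed by the mount point, so that a first 'df' interrupted by the eviction does not fail the test while a second failure still does.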

Comment by Hongchao Zhang [ 03/Aug/12 ]

Hi Jay, the patch has been updated as per your advice, thanks a lot!

Comment by Jay Lan (Inactive) [ 06/Aug/12 ]

The new patch looks good to me, and the test passed. Thanks!

Comment by Jay Lan (Inactive) [ 04/Sep/12 ]

Can we complete the review and land the patch? Thanks!

Comment by Peter Jones [ 22/Dec/12 ]

Landed for 2.1.4 and 2.4
