[LU-171] Test failure on test suite runtests Created: 28/Mar/11  Updated: 06/Apr/11  Resolved: 06/Apr/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: Jian Yu
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 10271

 Description   

This issue was created by maloo for Prakash Surya <surya1@llnl.gov>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/30252906-594f-11e0-a272-52540025f9af.

The cache is not being flushed to disk before the Lustre filesystem is unmounted. After the Lustre unmount and remount, the files on the Lustre filesystem are all empty. This is what causes the file diffs to fail.

If I add the command 'echo 3 > /proc/sys/vm/drop_caches' just before the command to unmount Lustre, the tests will pass.



 Comments   
Comment by Peter Jones [ 30/Mar/11 ]

Yu Jian

Could you please look into this one?

Thanks

Peter

Comment by Jian Yu [ 01/Apr/11 ]

Hi Prakash,

Did you run 'sync' before 'echo 3 > /proc/sys/vm/drop_caches'? The latter command is just to drop clean pagecache, dentries and inodes from memory. It does not free dirty objects, nor flush the data out to disk.
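
The ordering described above can be sketched as a small shell helper. This is only an illustration, not part of the test suite: the function name flush_and_drop and its optional path argument are hypothetical, added so the sequence can be tried against a scratch file without root.

```shell
#!/bin/sh
# Sketch: write dirty data back first, then drop the now-clean cache objects.
# flush_and_drop and its optional argument are hypothetical names for this example.
flush_and_drop() {
    # Default to the real procfs control file; pass another path for a dry run.
    drop_target=${1:-/proc/sys/vm/drop_caches}

    # Step 1: sync asks the kernel to write dirty pages out to storage.
    sync || return 1

    # Step 2: writing 3 frees clean pagecache plus dentries and inodes.
    # Writing to the real control file requires root.
    echo 3 > "$drop_target"
}
```

Run as root, 'flush_and_drop' with no argument performs both steps against the real control file. Without the sync step, dirty pages would stay in memory, because drop_caches only releases objects that are already clean.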

In addition, I saw this message in the test output:
cp: cannot create regular file `/sbin/./mount.lustre': Permission denied
Did you run something special before re-mounting the Lustre filesystem?

Comment by Prakash Surya (Inactive) [ 01/Apr/11 ]

Yu Jian,

As long as 'echo 3 > /proc/sys/vm/drop_caches' is being called, the test will pass whether sync is called or not. I have tried both calling sync beforehand and not calling it at all, and it makes no difference. Also, the test will not pass without 'echo 3 > /proc/sys/vm/drop_caches', even if 'sync' is still called.

Just in case this proves useful, here are the results of the four tests:
Without drop_caches or sync: https://maloo.whamcloud.com/test_sets/7ec49f8e-5c8c-11e0-a272-52540025f9af
With only drop_caches (no call to sync): https://maloo.whamcloud.com/test_sets/8e1e510a-5c8c-11e0-a272-52540025f9af
With sync before drop_caches: https://maloo.whamcloud.com/test_sets/946183ac-5c8c-11e0-a272-52540025f9af
With only sync (no call to drop_caches): https://maloo.whamcloud.com/test_sets/9b0749a8-5c8c-11e0-a272-52540025f9af
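
The four runs above form a two-by-two matrix over the two commands. A minimal sketch of how those combinations could be enumerated is shown here; the helper name pre_unmount_steps is hypothetical, and it only prints the commands that would run before the unmount rather than executing them.

```shell
#!/bin/sh
# Hypothetical helper: print the pre-unmount commands for one combination
# of the two switches (1 = enabled, 0 = disabled). Nothing is executed.
pre_unmount_steps() {
    do_sync=$1
    do_drop=$2
    steps=""
    if [ "$do_sync" -eq 1 ]; then
        steps="sync"
    fi
    if [ "$do_drop" -eq 1 ]; then
        # Append after "sync; " when sync is also enabled.
        steps="${steps:+$steps; }echo 3 > /proc/sys/vm/drop_caches"
    fi
    printf '%s\n' "${steps:-(nothing)}"
}

# Enumerate all four combinations.
for do_sync in 0 1; do
    for do_drop in 0 1; do
        printf 'sync=%s drop_caches=%s: ' "$do_sync" "$do_drop"
        pre_unmount_steps "$do_sync" "$do_drop"
    done
done
```

As described in this comment, only the two combinations that include drop_caches passed.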

As for the 'cp: cannot create regular file `/sbin/./mount.lustre': Permission denied' error, I believe that is because of the way our node is configured. I am running these tests on a diskless node, and the /sbin directory is mounted read-only, which causes the copy to fail. That said, the mount.lustre binary is already installed in /sbin on the diskless image (which explains the 'Permission denied' error, rather than a 'Read-only file system' error).

Comment by Jian Yu [ 02/Apr/11 ]

Thanks Prakash for the tests.

From the Maloo reports, I found the kernel version was 2.6.32-14chaos. The failure reported in this ticket is very similar to the one in bug 23064. There are two patches for this bug:

  1. attachment 32045
  2. attachment 32564

The first patch was pushed to the master branch on Nov. 4, 2010. The second one was ported to the master branch and landed on Mar. 24, 2011 with the other patches in http://review.whamcloud.com/307.

Could you please check whether the Lustre code you used has the patches or not? If yes, could you please set "PTLDEBUG=-1", reproduce the issue again and upload the lctl debug log file (gathered by the test script right after the test failed) to this ticket?

FYI, the issue could not be reproduced on RHEL6/x86_64 with kernel 2.6.32-71.18.2.el6 against the latest master code on our test node. Here is the successful report:
https://maloo.whamcloud.com/test_sets/f4ffd0c6-6031-11e0-a2b4-52540025f9af

Comment by Prakash Surya (Inactive) [ 06/Apr/11 ]

Thanks for the info Yu Jian.

I pulled down the latest master branch this morning and have not been able to reproduce the issue. Both patches you refer to in your previous comment are definitely in this tree.

I believe the previous tree I was working on did not have the patches from http://review.whamcloud.com/307, which is likely the reason for the failed test.

Here are the new successful test results:
Run #1: https://maloo.whamcloud.com/test_sets/d39dafe6-606a-11e0-a2b4-52540025f9af
Run #2: https://maloo.whamcloud.com/test_sets/d7174b32-606a-11e0-a2b4-52540025f9af

Thanks for the help! I imagine this ticket can be marked as resolved.

Comment by Peter Jones [ 06/Apr/11 ]

Thanks for letting us know Prakash!
