[LU-1783] sanity test_39l failed: atime is not updated Created: 23/Aug/12  Updated: 08/Apr/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0, Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Bob Glossman (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

server: lustre-master-tag2.2.93 RHEL6
client: lustre-master-tag2.2.93 SLES11


Issue Links:
Related
is related to LU-4765 Failure on test suite sanity test_39l... Resolved
is related to LU-15728 'relatime' is not working properly Resolved
Severity: 3
Bugzilla ID: 24198
Rank (Obsolete): 4424

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/ed029fec-ec3b-11e1-ba25-52540035b04c.

The sub-test test_39l failed with the following error:

atime is not updated from future: 1377057515, should be 1345521515<atime<1345521515

20:58:37:== sanity test 39l: directory atime update =========================================================== 20:58:35 (1345521515)
20:58:37:CMD: client-23vm3 lctl get_param -n mdd.*.atime_diff
20:58:38:flink
20:58:38:flink2
20:58:38:fopen
20:58:38:frename2
20:58:38:funlink
20:58:38: sanity test_39l: @@@@@@ FAIL: atime is not updated from future: 1377057515, should be 1345521515<atime<1345521515 
20:58:38:  Trace dump:
20:58:38:  = /usr/lib64/lustre/tests/test-framework.sh:3614:error_noexit()
20:58:38:  = /usr/lib64/lustre/tests/test-framework.sh:3636:error()
20:58:38:  = /usr/lib64/lustre/tests/sanity.sh:2503:test_39l()
20:58:38:  = /usr/lib64/lustre/tests/test-framework.sh:3869:run_one()
20:58:38:  = /usr/lib64/lustre/tests/test-framework.sh:3898:run_one_logged()
20:58:38:  = /usr/lib64/lustre/tests/test-framework.sh:3772:run_test()
20:58:38:  = /usr/lib64/lustre/tests/sanity.sh:2529:main()
20:58:38:Dumping lctl log to /logdir/test_logs/2012-08-20/lustre-master-el6-x86_64-sles11-x86_64__787__-7fad9a6fd1d8/sanity.test_39l.*.1345521516.log
20:58:38:CMD: client-23vm1,client-23vm2,client-23vm3,client-23vm4 /usr/sbin/lctl dk > /logdir/test_logs/2012-08-20/lustre-master-el6-x86_64-sles11-x86_64__787__-7fad9a6fd1d8/sanity.test_39l.debug_log.\$(hostname -s).1345521516.log;
20:58:38:         dmesg > /logdir/test_logs/2012-08-20/lustre-master-el6-x86_64-sles11-x86_64__787__-7fad9a6fd1d8/sanity.test_39l.dmesg.\$(hostname -s).1345521516.log
20:58:44:FAIL 39l (8s)
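
For context, the check that fails here verifies that a directory atime set into the future is pulled back to the current time on the next access. Below is a minimal sketch of that check, assuming GNU touch/stat and an illustrative mount point; the real logic lives in sanity.sh test_39l and differs in detail.

# illustrative only; not the exact sanity.sh test_39l implementation
DIR=${DIR:-/mnt/lustre}
TDIR=$DIR/d39l
mkdir -p $TDIR

before=$(date +%s)
touch -a -d "@$((before + 365*24*3600))" $TDIR   # push the dir atime ~1 year into the future
ls $TDIR > /dev/null                             # access should trigger an atime update
lctl set_param -n ldlm.namespaces.*mdc*.lru_size=clear   # flush client locks to the MDS
atime=$(stat -c %X $TDIR)
after=$(date +%s)

if [ "$atime" -lt "$before" ] || [ "$atime" -gt "$after" ]; then
    echo "FAIL: atime is not updated from future: $atime, should be $before<atime<$after"
fi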


 Comments   
Comment by Peter Jones [ 25/Aug/12 ]

As far as I can see this has only happened once, and another issue is responsible for most of the failures on sanity. I am dropping the priority for now, but we can reinstate it as a blocker if we start seeing this more frequently. Please speak up if I have missed something.

Comment by Sarah Liu [ 04/Sep/12 ]

another failure: https://maloo.whamcloud.com/test_sets/666b0c34-f24f-11e1-9def-52540035b04c

Comment by Jian Yu [ 17/Sep/12 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_3/19
Distro/Arch: SLES11SP1/x86_64(client), RHEL6.3/x86_64(server)

The same issue occurred again: https://maloo.whamcloud.com/test_sets/005ddd18-005a-11e2-9f3c-52540035b04c

Comment by Peter Jones [ 17/Sep/12 ]

Bob will look into this one

Comment by Bob Glossman (Inactive) [ 18/Sep/12 ]

I have managed to reproduce this failure in the simplest possible context: 1 VM (el6) serving MGS/MDT/OSTs, 1 VM a sles11 client. Have confirmed that the failure is specific to a sles client, doesn't happen with an el6 client. As far as I can see there is a behavior difference between sles and rhel clients. It appears the 'lctl set_param -n ldlm.namespaces.*.lru_size=clear' called from the cancel_lru_locks() function in the test flushes & updates the atime on the target dir in the rhel client, but not in the sles client. I haven't been able to figure out why that is. I have dmesg and debug logs from the failing test that I can post here.
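
For reference, cancel_lru_locks() from test-framework.sh reduces to the lru_size=clear call mentioned above; a rough shell equivalent, with the namespace glob being an approximation:

# rough equivalent of cancel_lru_locks() from test-framework.sh;
# the exact parameter glob in the framework may differ
cancel_lru_locks() {
    lctl set_param -n ldlm.namespaces.*$1*.lru_size=clear
}
cancel_lru_locks mdc   # drop cached MDC locks so a dirty atime is flushed to the MDS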

Comment by Bob Glossman (Inactive) [ 19/Sep/12 ]

I've tried adding some extra CDEBUG() to the atime case in ll_update_inode(), and only convinced myself that doing 'ls <dirname>' on a lustre dir doesn't go down that code path. I can see the debug if I do for instance 'touch <dirname>', but that isn't what the failing test case does.

Looking at the call sequence down from linux, ls <dir> appears to go thru vfs_readdir() -> file->f_op->readdir() -> ll_readdir(). Immediately on return from ll_readdir() linux calls file_accessed(), which at the very least updates atime in-memory. This is true in both el6 and sles11 source, which look quite similar in this area. I haven't been able to spot any obvious differences.

Spent a lot of time examining client debug logs from both el6 and sles11. Haven't been able to dig out what the significant difference is, if any. Have reduced the failing test case down to a small subset:

stat -c %X /mnt/lustre/tdir    #see atime of test dir before starting

#wait for a few seconds

lctl clean
ls /mnt/lustre/tdir     #cause atime update
cancel_lru_locks mdc    #should flush out the time change
stat -c %X /mnt/lustre/tdir   #see atime after the test
lctl dk dklog.$(date +%s)  #collect debug log of just the test activity

On el6 client this behaves correctly with the atime on the test dir updated. On sles11 the atime stays at same value as before starting, it doesn't advance.

Restricting the logs collected to just this minimum keeps them manageably small, even with debug set to -1. In spite of that still can't see where the significant difference is.

Comment by Bob Glossman (Inactive) [ 20/Sep/12 ]

More or less determined that it's something inside sles guts that denies atime updates. Since this problem is only seen in sles11 SP1 clients and sles11 SP1 is already out of support, for now we will just skip this subtest in sles11 SP1 clients. I plan to submit a patch to do that in sanity.sh soon.

May revisit this issue if it still shows up in sles11 SP2.

Comment by Bob Glossman (Inactive) [ 20/Sep/12 ]

http://review.whamcloud.com/4062

Comment by Keith Mannthey (Inactive) [ 20/Sep/12 ]

Bob have you confirmed you don't see this issue with SP2?

Comment by Bob Glossman (Inactive) [ 20/Sep/12 ]

Keith, No I haven't. Don't yet have an instance of SP2 to test with. If I ever get one I will check.

Comment by Peter Jones [ 22/Sep/12 ]

Dropping priority. Will aim to re-test this for SLES11 SP2

Comment by Sarah Liu [ 02/Nov/12 ]

another failure on lustre-master build# 1011:
https://maloo.whamcloud.com/test_sets/d0072b9c-2534-11e2-9e7c-52540035b04c

Comment by Bob Glossman (Inactive) [ 16/Nov/12 ]

another failure, this one in SLES11 SP2
https://maloo.whamcloud.com/test_sets/405b4216-2fcb-11e2-866f-52540035b04c

Will have to extend the test fix to skip this test for SP2 as well.
Or else a kernel expert will need to dig in to see how SLES differs from RHEL with respect to atime updates.

Comment by Bob Glossman (Inactive) [ 16/Nov/12 ]

test revision to skip the subtest for all SLES11 SPs
http://review.whamcloud.com/4605

Comment by Andreas Dilger [ 20/Nov/12 ]

Bob, is it possible that SLES clients mount with "relatime" enabled by default? That is a mount option that is somewhere between "atime enabled" and "noatime". With "relatime" the atime will only be updated if the file is accessed and the atime is older than "mtime" on the directory. This could be checked by explicitly mounting the SLES client with the "atime" option.
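
One way to check this on a client (mount point, MGS node, and fsname below are placeholders):

# see which atime option the running mount actually carries
grep lustre /proc/mounts

# remount with explicit atime semantics for comparison
umount /mnt/lustre
mount -t lustre -o atime mgsnode@tcp:/lustre /mnt/lustre
# or, to force updates on every access:
# mount -t lustre -o strictatime mgsnode@tcp:/lustre /mnt/lustre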

Comment by Bob Glossman (Inactive) [ 20/Nov/12 ]

I will revisit to make sure, but I think I tried all permutations of explicit mount options: {no}atime and {no}relatime. Mounting with -o relatime showed that option in mount -v; it didn't show in a default mount. None of the options made the atime flush as it does in the el6 client.

Comment by Bob Glossman (Inactive) [ 20/Nov/12 ]

Andreas, turns out you are correct. In SuSE it looks like a lustre mount always mounts with relatime. This option shows up only in 'cat /proc/mounts', not in mount -v.

In addition, in spite of being listed in the mount(8) man page, none of the command line options atime, strictatime, or {no}relatime has any effect on the mount. Some aren't even accepted and cause the mount command to fail with Invalid argument. Others are silently accepted but have no effect; cat /proc/mounts still shows the lustre mount has the relatime option set.

All of those options appear to really work as advertised in el6.

Comment by Bob Glossman (Inactive) [ 20/Nov/12 ]

I think this problem in SLES may be a build problem in mount.lustre. If I add an explicit #include of <linux/fs.h> to the source, additional conditional options get compiled into the SLES version of mount.lustre. Now a mount with no options has relatime not set, and all the extra options defined in the man page get recognized and actually work.

I will submit a patch with this 1 line fix and tear out the skip of test 39l.
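
A quick before/after check for the rebuilt mount.lustre might look like this (node and fsname are placeholders):

# default mount: with the fix, relatime should no longer be set implicitly
mount -t lustre mgsnode@tcp:/lustre /mnt/lustre
grep lustre /proc/mounts
umount /mnt/lustre

# explicit options: strictatime etc. should now be accepted and take effect
mount -t lustre -o strictatime mgsnode@tcp:/lustre /mnt/lustre
grep lustre /proc/mounts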

Comment by Andreas Dilger [ 28/Nov/12 ]

Seems that this was originally fixed for RHEL6 with https://bugzilla.lustre.org/show_bug.cgi?id=24198 and is now hit on SLES11SP2 as well. The problem is that the kernel defaults to MS_RELATIME, and this causes test_39l to fail to reset an atime in the future.

Forcing MS_STRICTATIME on all Lustre clients is undesirable, since it adds to the overhead on the client, but since atime handling is done in the VFS we need an upstream patch.

It looks like Yang Sheng sent a patch upstream for this in http://lkml.org/lkml/2011/1/4/88 22 months ago, which got generally favorable reviews but had a small typo in the comment. That patch needs to be refreshed against the latest kernel, the typo fixed, and preferably the "future atime check" wrapped in unlikely(), since this will indeed be very unlikely to happen.

Andrew Morton also requested that the patch have a better commit comment:

Relatime should update the inode atime if it is more than a day in the
future.  The original problem seen was a tarball that had a bad atime,
but could also happen if someone fat-fingers a "touch".  The future
atime will never be fixed.  Before the relatime patch, the future atime
would be updated back to the current time on the next access.

Only update the atime if it is more than one day in the future.  That
avoids thrashing the atime if the clocks on clients of a network fs are
only slightly out of sync, but still allows fixing bogus atimes.

Please also CC stable@vger.kernel.org so that the patch is more likely to be backported to vendor kernels.
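
The behaviour described in that commit comment is easy to demonstrate on any relatime mount (file name is illustrative):

# on a filesystem mounted with relatime (the kernel default)
touch -a -d "@$(( $(date +%s) + 365*24*3600 ))" somefile   # atime one year in the future
cat somefile > /dev/null                                   # read the file
stat -c %X somefile   # atime is still in the future; relatime never corrects it
# with the proposed patch, an atime more than a day ahead would be reset on the next access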

Comment by Robert Read (Inactive) [ 05/May/16 ]

It's been a while since this has been updated. Has the fix landed upstream now?

Comment by Andreas Dilger [ 05/May/16 ]

I tried to submit an updated patch upstream (http://www.spinics.net/lists/linux-fsdevel/msg60515.html), but it wasn't accepted due to minor complaints about the patch description and implementation. I didn't get around to resubmitting it at the time, and I suspect there will be a general apathy about changing the behaviour of a patch that landed 7 years ago.

It probably makes more sense to remove forcing MS_STRICTATIME in mount.lustre, change the MDD to default to MAX_ATIME_DIFF=24*3600 to match the kernel relatime behaviour (which will improve normal filesystem performance due to reduced atime updates), and then change sanity.sh test_39l to remove the check for setting atime backward from the future on normal access, since it seems people don't care about this very much.
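
The server-side knob referred to here is the mdd atime_diff parameter already probed by the test script; assuming it is expressed in seconds, matching the kernel's one-day relatime window would look like this on the MDS:

lctl get_param mdd.*.atime_diff         # current threshold for on-disk atime updates
lctl set_param mdd.*.atime_diff=86400   # 24*3600, matching the kernel relatime window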
