  Lustre / LU-4765

Failure on test suite sanity test_39l: atime is not updated

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • None
    • Affects Version/s: Lustre 2.6.0, Lustre 2.9.0, Lustre 2.10.0, Lustre 2.11.0
    • None
    • Environment: server: lustre-master build # 1937
      client: SLES11 SP3
    • Severity: 3
    • 13108

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/5b796592-aa08-11e3-b4b1-52540035b04c.

      The sub-test test_39l failed with the following error:

      atime is not updated from future: 1426094124, 1394558125<atime<1394558125

      == sanity test 39l: directory atime update =========================================================== 10:15:24 (1394558124)
      CMD: shadow-8vm12 lctl get_param -n mdd.*MDT0000*.atime_diff
       sanity test_39l: @@@@@@ FAIL: atime is not updated from future: 1426094124, 1394558125<atime<1394558125 
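      For context, a rough sketch of the scenario this sub-test exercises (illustrative shell only, not the actual sanity.sh code; the mount point and directory name are assumptions):

      DIR=${DIR:-/mnt/lustre}            # assumed client mount point
      TDIR=$DIR/d39l.sketch              # hypothetical test directory
      mkdir -p $TDIR
      touch -a -d 'next year' $TDIR      # push the directory atime into the future
      before=$(date +%s)
      ls $TDIR > /dev/null               # accessing the directory should pull atime back to now
      atime=$(stat -c %X $TDIR)
      after=$(date +%s)
      # the failure reported above: atime stays at the future value instead of
      # landing in the [before, after] window
      [ "$atime" -ge "$before" ] && [ "$atime" -le "$after" ] ||
              echo "atime is not updated from future: $atime"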
      

      Attachments

        Issue Links

          Activity


            jay Jinshan Xiong (Inactive) added a comment -

            I took a look at the atime problem in Lustre when I was investigating the POSIX compliance test failure at http://review.whamcloud.com/11844.

            The conclusion is that POSIX atime is not strictly honored when the data being read is already in cache. To fix this we would need to send an RPC to the MDT to update atime for every read syscall, which would introduce a huge performance loss.

            This issue can be easily reproduced as follows:

            [root@mercury lustre]# cat passwd > /dev/null
            [root@mercury lustre]# stat passwd 
              File: `passwd'
              Size: 1533      	Blocks: 8          IO Block: 4194304 regular file
            Device: 2c54f966h/743766374d	Inode: 144115205255725455  Links: 1
            Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
            Access: 2014-09-15 07:26:39.000000000 -0700
            Modify: 2014-09-15 06:34:54.000000000 -0700
            Change: 2014-09-15 06:34:54.000000000 -0700
            [root@mercury lustre]# cat passwd > /dev/null
            [root@mercury lustre]# stat passwd 
              File: `passwd'
              Size: 1533      	Blocks: 8          IO Block: 4194304 regular file
            Device: 2c54f966h/743766374d	Inode: 144115205255725455  Links: 1
            Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
            Access: 2014-09-15 07:26:48.000000000 -0700
            Modify: 2014-09-15 06:34:54.000000000 -0700
            Change: 2014-09-15 06:34:54.000000000 -0700
            [root@mercury lustre]# lctl set_param ldlm.namespaces.*.lru_size=0
            ldlm.namespaces.MGC172.30.24.132@tcp.lru_size=0
            ldlm.namespaces.lustre-MDT0000-lwp-MDT0000.lru_size=0
            ldlm.namespaces.lustre-MDT0000-lwp-OST0000.lru_size=0
            ldlm.namespaces.lustre-MDT0000-lwp-OST0001.lru_size=0
            ldlm.namespaces.lustre-MDT0000-mdc-ffff88014ce49400.lru_size=0
            ldlm.namespaces.lustre-OST0000-osc-MDT0000.lru_size=0
            ldlm.namespaces.lustre-OST0000-osc-ffff88014ce49400.lru_size=0
            ldlm.namespaces.lustre-OST0001-osc-MDT0000.lru_size=0
            ldlm.namespaces.lustre-OST0001-osc-ffff88014ce49400.lru_size=0
            [root@mercury lustre]# stat passwd 
              File: `passwd'
              Size: 1533      	Blocks: 8          IO Block: 4194304 regular file
            Device: 2c54f966h/743766374d	Inode: 144115205255725455  Links: 1
            Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
            Access: 2014-09-15 07:26:39.000000000 -0700
            Modify: 2014-09-15 06:34:54.000000000 -0700
            Change: 2014-09-15 06:34:54.000000000 -0700
            

            The atime from the second "cat passwd > /dev/null" was not honored: it was only updated in memory on the client and never written back to the OST. As a result, once the locks were cancelled and the attributes refreshed, the file still showed the old atime from the first read.
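            For reference, how far atime may drift before the MDT persists it is governed by the mdd atime_diff tunable queried in the test log above; a quick look/adjust sketch (the wider value below is purely illustrative):

            # current setting on the MDS, in seconds; atime changes smaller than
            # this interval are not written back to disk
            lctl get_param -n mdd.*MDT0000*.atime_diff
            # e.g. widen the window to 60 seconds (illustrative value only)
            lctl set_param mdd.*MDT0000*.atime_diff=60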


            bogl Bob Glossman (Inactive) added a comment -

            Currently not reproducible in our 2.6 builds anymore.

            klibc? Do you mean glibc? As far as I know the relevant #include files are part of glibc-devel. I believe the current version of glibc-devel in sles11sp3 is 2.11.3, so if you have 1.<anything> you are seriously out of date.

            Are you rolling your own builds? Again, as far as I know, you should not be having problems if you just install and use our prebuilt rpms.
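            If it helps, the installed toolchain versions are easy to confirm on the client with a plain rpm query (package names below are the usual SLES ones):

            rpm -q glibc-devel klibc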

            mmansk Mark Mansk added a comment -

            We're seeing this failure as well for 2.6. Not sure if/when we'll be updating our sles11sp3. Our current klibc is at 1.5.18.

            Any way to work around this besides commenting out the test?

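            For what it's worth, the test framework can skip individual sub-tests without editing the script, along these lines (assuming the usual test-framework environment is already set up; exact variable handling can vary between releases):

            # skip sub-test 39l when running the suite directly
            EXCEPT=39l bash lustre/tests/sanity.sh
            # some versions also honor a suite-specific exclusion variable
            SANITY_EXCEPT=39l bash lustre/tests/sanity.sh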

            bogl Bob Glossman (Inactive) added a comment -

            https://maloo.whamcloud.com/test_sets/a9add412-d0ac-11e3-b9d4-52540035b04c shows test 39l now passing. The only remaining failures in sanity.sh on sles11sp3 clients are known bugs LU-4713 and LU-4341.


            bogl Bob Glossman (Inactive) added a comment -

            Test-only patch to verify the TEI-2052 changes:
            http://review.whamcloud.com/10156


            bogl Bob Glossman (Inactive) added a comment -

            Following a suspicion that mount options on SP3 were out of whack, I took a close look at some #include files currently on our sles11sp3 builder nodes. I discovered that the #include files there are in fact seriously out of date. In particular, in mount.h on the builder node MS_RDONLY is #defined but MS_STRICTATIME isn't. I'm pretty sure this causes the mount.lustre built in our sles11sp3 artifacts to lack the special override code that makes MS_STRICTATIME the default for all mounts. This would explain why test 39l is failing in autotest runs but not in my manual local runs, where mount.lustre is built with more up-to-date tools and #includes.

            I will file a TEI ticket requesting that the environment on the SP3 builders be updated. I'm fairly certain that will make this problem go away.
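            A quick way to confirm what a builder's headers actually provide is a plain grep over the usual glibc/kernel header locations (no Lustre involvement; paths may differ per distro):

            # if this comes back empty, mount.lustre is being built without
            # MS_STRICTATIME support
            grep MS_STRICTATIME /usr/include/sys/mount.h /usr/include/linux/fs.h 2>/dev/null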

            bogl Bob Glossman (Inactive) added a comment - edited

            Seeing a lot of these in autotest runs with sles11sp3 clients. I have been able to reproduce this bug manually, but only if I explicitly mount a client with the -o relatime option. As far as I can tell this option isn't typically used; mount.lustre sets MS_STRICTATIME by default. Is there something in autotest that is forcing the relatime option in client mounts? I haven't found anything in the logs that would indicate that is happening.

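            For anyone trying to confirm this locally, the effective atime mode of a client mount is visible in /proc/mounts, and the failure can be provoked deliberately with an explicit relatime mount (mount target and mount point below are illustrative):

            # relatime/noatime show up explicitly in the options field;
            # their absence means strict atime behaviour
            grep ' lustre ' /proc/mounts
            # reproduce the failure on purpose on a scratch client
            mount -t lustre -o relatime mgsnode@tcp:/lustre /mnt/lustre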
            bogl Bob Glossman (Inactive) added a comment -

            Another instance: https://maloo.whamcloud.com/test_sets/63de2e74-cf07-11e3-a250-52540035b04c

            adilger Andreas Dilger added a comment -

            This may relate to LU-1783, which was also a problem on newer kernels with atime updates from the future. Yang Sheng had made a patch for upstream, but it wasn't accepted for a variety of minor reasons. It should probably be refreshed and tried again.


            bogl Bob Glossman (Inactive) added a comment -

            There seems to be something wrong with the client/server combo being run here. The server CentOS rpms are from a recent build, lustre-master #1937, but the sles11sp3 client rpms appear to be much older. Is this a version interop run of some sort? If not, why aren't the client and server rpms from the same build?

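            When in doubt, the build each node is actually running can be checked directly (standard commands; output format varies by release):

            # Lustre version string on any node with the modules loaded
            lctl get_param version
            # installed Lustre packages on the client
            rpm -qa | grep -i lustre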

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: maloo Maloo
