[LU-3849] Client cache directory information is not updated Created: 28/Aug/13  Updated: 16/Oct/13  Resolved: 11/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sebastien Buisson (Inactive) Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-3240 The link count is not updated after t... Resolved
Severity: 3
Rank (Obsolete): 9968

 Description   

Hi,

On a customer cluster, we noticed that Lustre client cache directory information is not updated when directory owner or access mode changes.

If a user tries to change owner/group or directory mode and then looks for the change from another node, the change is not visible. Moreover, the more we look, the more the cache update is delayed.

We were suspecting troubles related to LU-3671, but we checked that sync_permission is set to 1.

This problem is easy to reproduce:
Node 1 : make directory
with umask 0027
mkdir <lustre filesystem>/<new directory>

Node 2 : add directory in node2 cache
cd <lustre filesystem>/<new directory>
ls -lia

Node 1 : update directory
chmod 770 <lustre filesystem>/<new directory>
or
chown
or both

Node 2 : look for the change
ls -lia (don't wait more than 10s)

Thanks,
Sebastien.



 Comments   
Comment by Peter Jones [ 28/Aug/13 ]

Lai

Could you please advise on this one?

Thanks

Peter

Comment by Lai Siyao [ 29/Aug/13 ]

This issue should have existed for quite some time:

  • open dir by `ls -lia` fetches CR LOOKUP|OPEN lock.
  • setattr by `chown ...` enqueues a PW LOOKUP lock.

But CR and PW mode are compatible, so the first lock is not canceled, and client get old stats.

Changing setattr lock mode from PW to EX can fix this, I'll commit a fix later.

Comment by Alexey Lyashkov [ 29/Aug/13 ]

dup of LU-3240

Comment by Sebastien Buisson (Inactive) [ 29/Aug/13 ]

Alex,

You mean patch at http://review.whamcloud.com/6460 will fix this issue?

Comment by Alexey Lyashkov [ 29/Aug/13 ]

i'm not a fully sure, but that patch disable matching a some locks's locally, a specially client will skip open lock if they need attributes for an inode (http://review.whamcloud.com/#/c/6460/3/lustre/mdc/mdc_locks.c).

also readdir isn't create a lock to match to get an valid attributes, so i suggest to ask you to test that patch before Liang will create a new patch.

Comment by Patrick Valentin (Inactive) [ 30/Aug/13 ]

I have reproduced this error on lustre 2.1.6 and lustre 2.4.0.

I have backported this patch on lustre 2.4 and I will test it next week.

I have made some additional tests, and I have noticed that the information printed by "ls" is correct if we specify the directory.
"ls -la ." or "ls -la $PWD" on "node 2" prints the current (correct) inode information.

But if we issue a chmod/chown/chgrp/mkdir/rmdir on "node 1", and then a "ls" on "node 2", without specifying a "directory name", and whatever option is specified, then the inode information printed does not reflect the last change made on "node 1". Moreover, if we issue a second "ls" command, but this time adding "." or "$PWD", the printed inode information is still bad.

Looking at "ls" source code, it appears that the code starts with an "opendir()" and a loop of "readdir()" on the current directory, when no file or directory is specified. If a name is specified, then the command starts by "stat()" on this name.

I wrote the following simple command, allowing to show that the culprit is readdir().

The command first issues an opendir() of "." and a loop of readdir()/printf() , but only if an argument is specified.
It then issues a stat() on ".", and prints some inode fields.

#include <sys/types.h>
#include <errno.h>
#include <sys/stat.h>
#include <dirent.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

main(int argc, char **argv)
{
    struct stat dir_stat;
    DIR *dirp;
    struct dirent *next;

    if (argc > 1) {
        dirp = opendir(".");
        if (! dirp) {
            perror("opendir: ");
            exit(1);
        }
        while (next = readdir(dirp)) {
            printf("%s ", next->d_name);
        }
        printf("\n");
        closedir(dirp);
    }

    if (stat(".", &dir_stat) < 0) {
        perror("stat: ");
        exit(1);
    }
    printf("st_nlink %d, st_mode 0%ho, st_uid  0x%hx, st_gid  0x%hx\n", dir_stat.st_nlink, dir_stat.st_mode, dir_stat.st_uid, dir_stat.st_gid);

    exit(0);
}

To reproduce the error, you have to run the chmod/chown/chgrp/mkdir/rmdir commands on "node 1", and after each one, run "my_stat --" on "node 2".
The changes made on "node 1" are not reflected on "node 2".

Now re-issue one of the commands on "node 1", and run "my_stat" without argument on "node 2". All the changes made previously are now visible.

I have also checked that the problem described in LU-3240 is indentical, but only with mkdir/rmdir when only one node is in use.

for i in `seq 1 10`; do
    mkdir d$i
    my_stat --
done
mkdir last
my_stat

During the loop, the st_nlink is not correct (always 2, instead of 3, 4, 5, ...).
At the end of the loop, after issueing a new mkdir, and then a stat() without readdir(), the st_nlink=13 is the correct value.

Comment by Lai Siyao [ 03/Sep/13 ]

In my test I found another issue:

  • ll_get_dir_page() may enqueue UPDATE lock, but it doesn't inode attributes, so these informations may look stale on client side with UPDATE lock held.
Comment by Alexey Lyashkov [ 03/Sep/13 ]

Lyang,

problem is - ll_get_dir_page take a lock but don't take a intent body so don't update inode with info from server.

> During the loop, the st_nlink is not correct (always 2, instead of 3, 4, 5, ...).
take a patch, it's looks same problem and test looks same please look to sanity 39o test.

Comment by Patrick Valentin (Inactive) [ 05/Sep/13 ]

Alexey, Lai,
I tested set 4 of patch http://review.whamcloud.com/6460 on a lustre 2.4.0 client, and it fixes the issue described above, as well as the issue described in LU-3240.

Comment by Alexey Lyashkov [ 05/Sep/13 ]

Thanks to testing! I hope Intel will found time to land it patch.

Comment by Christopher Morrone [ 06/Sep/13 ]

Alexey, the patch in question has a -1 code review which will need to be resolved (additional explanation, code fix, whatever is needed) before before landing can be expected.

Comment by Alexey Lyashkov [ 07/Sep/13 ]

Christopher,

I think we need to choose a different reviewer because questions is very strange and can't be implemented without MDT rewrite (a specially mdt_getattr_name_lock don't able to return lock without look to parent).
If IT_READIR was don't drop on 2.x protocol it's more easy to fix, but now it's need too much compatibility code.

Comment by Nathaniel Clark [ 18/Sep/13 ]

Alexey,

Can you respond to Fan Young's comments in gerrit and maybe we can find a way to move forward with the patch?

Comment by Andreas Dilger [ 11/Oct/13 ]

This problem is very similar to LU-3240 and will be fixed by the patch on that bug.

Generated at Sat Feb 10 01:37:26 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.