[LU-13360] getdents() against empty striped directory always returns 1365 dirents Created: 13/Mar/20  Updated: 16/Jun/20  Resolved: 16/Jun/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.8, Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Lai Siyao
Resolution: Won't Fix Votes: 0
Labels: llnl
Environment:

Client: opal174 with lustre-2.10.8_5.chaos-1.ch6.x86_64
Servers: zwicky with lustre-2.12.4_2.chaos-1.ch6.x86_64
Both running TOSS 3.5-9rc1 based on RHEL 7.7

Our Lustre tags are on github:
https://github.com/LLNL/lustre/releases/tag/2.10.8_5.chaos
https://github.com/LLNL/lustre/releases/tag/2.12.4_2.chaos

zfs-0.7 based servers


Attachments: File debug.toss-4558.a.tar.gz    

 Description   

Created directory with "lfs mkdir -c2 <thepath>" on Lustre 2.10.8 based client.
May have attempted to create subdirs or files (not sure; I'd run a series of mdtest runs across multiple directories and canceled via Ctrl-C partway through).
ls <thepath> appears to hang.
Running under strace shows that the getdents() calls always return the same contents and never return EOF.

No error messages are reported on the console of the servers or the client.

[faaland1@opal174 branch:master src] $ps -f
UID        PID  PPID  C STIME TTY          TIME CMD
faaland1 52122 52121  0 09:58 pts/0    00:00:00 -bash
faaland1 59859 52122 99 10:16 pts/0    01:20:38 ls -l /p/lforge/faaland1/mdtest/mdt0 /p/lforge/faaland1/mdtest/mdt1 /p
faaland1 64327 52122  0 11:36 pts/0    00:00:00 ps -f
[faaland1@opal174 branch:master src] $strace -p 59859 2>&1 | head -n4
strace: Process 59859 attached
getdents(3, /* 1365 entries */, 32768)  = 32760
getdents(3, /* 1365 entries */, 32768)  = 32760
getdents(3, /* 1365 entries */, 32768)  = 32760
[faaland1@opal174 branch:master src] $ls -l /proc/59859/fd
total 0
lrwx------ 1 faaland1 faaland1 64 Mar 13 10:41 0 -> /dev/pts/0
lrwx------ 1 faaland1 faaland1 64 Mar 13 10:41 1 -> /dev/pts/0
lrwx------ 1 faaland1 faaland1 64 Mar 13 10:16 2 -> /dev/pts/0
lr-x------ 1 faaland1 faaland1 64 Mar 13 10:41 3 -> /p/lforge/faaland1/mdtest/mdtcount2

The 1365 entries returned are always the same - 676 entries for ".." and 689 entries for "."

    676 {d_ino=144115272398143489, d_off=0, d_reclen=24, d_name="..", d_type=DT_DIR}
    689 {d_ino=144115339574197893, d_off=0, d_reclen=24, d_name=".", d_type=DT_DIR}
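
The numbers above are self-consistent: every record has d_reclen=24, so a 32768-byte buffer holds at most 1365 such records (32760 bytes), and 676 + 689 = 1365. A minimal Python sketch of that arithmetic, and of one way a per-name tally could be produced from verbose strace text, follows. The helper name tally_dirents and the regular expression are illustrative assumptions and are not part of strace or Lustre.

```python
import re
from collections import Counter

# Values taken from the strace output in this ticket.
BUF_SIZE = 32768   # third argument passed to getdents()
RECLEN = 24        # d_reclen of every "." and ".." record

max_records = BUF_SIZE // RECLEN        # how many 24-byte records fit
bytes_returned = max_records * RECLEN   # size of a completely filled reply

# Hypothetical helper: pull every d_name value out of verbose strace text.
DIRENT_NAME = re.compile(r'd_name="([^"]*)"')

def tally_dirents(strace_text):
    """Return a Counter mapping each d_name to its number of occurrences."""
    return Counter(DIRENT_NAME.findall(strace_text))
```

With the values above, max_records matches the 1365 entries per call and bytes_returned matches the 32760 return value, which suggests each call fills the buffer completely with repeated "." and ".." records.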

I mounted the filesystem on another Lustre 2.10.8 node and ls of that directory produces the same symptoms.

I mounted the file system on a Lustre 2.12.4 node and ls of that directory behaves normally: getdents() is called twice. The first call returns 2 entries, and the second returns 0 entries and 0 bytes (end of directory).
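
The difference between the two clients can be sketched as a read loop with a guard. This is an illustration only, not how ls or the Lustre client is implemented: read_batch is a hypothetical stand-in for one getdents() call, and the guard against identical consecutive batches is an assumption chosen to match the symptom seen on 2.10.8.

```python
def drain_directory(read_batch, max_calls=1000):
    """Call read_batch() until it returns an empty list (end of directory).

    read_batch stands in for one getdents() call and returns a list of
    (d_name, d_off) tuples. Raises RuntimeError if a non-empty batch is
    identical to the previous one, which is the symptom in this ticket.
    """
    entries = []
    previous = None
    for _ in range(max_calls):
        batch = read_batch()
        if not batch:            # 0 entries, 0 bytes: end of directory
            return entries
        if batch == previous:    # same contents again: stream is not advancing
            raise RuntimeError("directory stream is not advancing")
        entries.extend(batch)
        previous = batch
    raise RuntimeError("no end of directory after max_calls reads")
```

On the 2.12.4 client the loop would end after the second call. On the 2.10.8 client the second call would return the same batch as the first, so the guard would fire instead of spinning the way ls does.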



 Comments   
Comment by Olaf Faaland [ 13/Mar/20 ]

For my tracking purposes, my internal ticket is TOSS4558

Comment by Olaf Faaland [ 13/Mar/20 ]

Contents of the two FIDs reported by lfs getdirstripe, in case it helps.

lfs getdirstripe:

$lfs getdirstripe /p/lforge/faaland1/mdtest/mdtcount2
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx		 FID[seq:oid:ver]
     0		 [0x200002340:0x7:0x0]		
     1		 [0x280002340:0x7:0x0]
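
For scripting the per-MDT lookups that follow, the stripe table above can be turned into (mdtidx, seq, oid, ver) tuples. The sketch below assumes the output layout shown in this comment; parse_dirstripe is a hypothetical helper, not an lfs feature.

```python
import re

# Matches data rows such as "     0   [0x200002340:0x7:0x0]".
STRIPE_ROW = re.compile(
    r'^\s*(\d+)\s+\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]'
)

def parse_dirstripe(text):
    """Return a list of (mdtidx, seq, oid, ver) tuples, one per stripe row."""
    stripes = []
    for line in text.splitlines():
        m = STRIPE_ROW.match(line)
        if m:  # header and summary lines do not match and are skipped
            mdtidx = int(m.group(1))
            seq, oid, ver = (int(g, 16) for g in m.groups()[1:])
            stripes.append((mdtidx, seq, oid, ver))
    return stripes
```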

MDT0:

[root@zwicky1:toss4558]# find . -name '*0x200002340:0x7:0x0*' 2> /dev/null
./oi.23/0x200000417:0x1:0x0/mdtest/mdtcount2/[0x200002340:0x7:0x0]:0
./oi.64/0x200002340:0x7:0x0
[root@zwicky1:toss4558]# ls -l ./oi.64/0x200002340:0x7:0x0
ls: cannot access ./oi.64/0x200002340:0x7:0x0: Input/output error
[root@zwicky1:toss4558]# ls -al ./oi.23/0x200000417:0x1:0x0/mdtest/mdtcount2/[0x200002340:0x7:0x0]:0
total 22
drwx------ 2 faaland1 faaland1 2 Mar 13 10:10 .
drwx------ 4 faaland1 faaland1 2 Mar 13 10:10 ..

MDT1:

[root@zwicky2:toss4558]# find . -name "*0x280002340:0x7:0x0*" 2> /dev/null
./oi.64/0x280002340:0x7:0x0
./REMOTE_PARENT_DIR/0x280002340:0x7:0x0
[root@zwicky2:toss4558]# ls -al ./oi.64/0x280002340:0x7:0x0 ./REMOTE_PARENT_DIR/0x280002340:0x7:0x0
ls: cannot access ./REMOTE_PARENT_DIR/0x280002340:0x7:0x0: Input/output error
./oi.64/0x280002340:0x7:0x0:
total 22
drwx------ 2 faaland1 faaland1 2 Mar 13 10:10 .
drwxr-xr-x 0 root     root     0 Dec 31  1969 ..

Comment by Peter Jones [ 13/Mar/20 ]

Lai

Could you please investigate?

Peter

Comment by Olaf Faaland [ 13/Mar/20 ]

Hello Lai,

I've attached debug.toss-4558.a.tar.gz which contains debug logs from two clients.  debug was +rpctrace and +vfstrace.  An ls -al of the same striped directory was performed on each of the two nodes.

opal64: Lustre 2.12, the ls is successful and shows "." and ".." and exits.
opal174: Lustre 2.10, the ls is stuck in the loop calling getdents() as described above.

Comment by Lai Siyao [ 16/Mar/20 ]

2.12 contains several fixes for readdir of striped directory:
https://review.whamcloud.com/#/c/27663
https://review.whamcloud.com/#/c/28548
https://review.whamcloud.com/#/c/32180

You can apply them and try again. Enabling 'trace' in the debug log would help identify the exact cause, but IMO it should have been fixed by the above patches.

Comment by Olaf Faaland [ 03/Jun/20 ]

Hi Lai,
I cherry-picked those patches and tried to push them to gerrit against b2_10 fortestonly, but the push is being rejected in a way I haven't seen before. I'm not sure if there's something wrong with my patch stack, or something else, maybe a permissions problem bubbling up with a non-obvious message.

The patch stack is here:
https://github.com/ofaaland/lustre/tree/b-toss-4558-stripedir

and the error I get is:

[faaland1@oslic5 branch:b-toss-4558-stripedir lustre-210] $git push wcrev HEAD:refs/for/b2_10
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev': 
Counting objects: 126, done.
Delta compression using up to 36 threads.
Compressing objects: 100% (67/67), done.
Writing objects: 100% (95/95), 87.24 KiB | 0 bytes/s, done.
Total 95 (delta 69), reused 36 (delta 28)
remote: Resolving deltas: 100% (69/69)
remote: Processing changes: refs: 1, done    
To ssh://review.whamcloud.com/fs/lustre-release
 ! [remote rejected] HEAD -> refs/for/b2_10 (not Signed-off-by author/committer/uploader in commit message footer)
error: failed to push some refs to 'ssh://review.whamcloud.com/fs/lustre-release'

The message looks straightforward, but I do not see any patches in the stack whose Signed-off-by fails to match the author, and I also found that pushing a branch with just one commit produces the same result:
https://github.com/ofaaland/lustre/tree/b-test-b210-push
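
A local pre-push check along the following lines may help narrow down this kind of rejection. It is only a sketch under stated assumptions: the server-side rule is not shown in this ticket, so the code assumes that the "footer" is the last paragraph of the commit message and that the Signed-off-by email must match the author's. The function name and the example addresses are hypothetical.

```python
def footer_has_signoff(message, email):
    """Return True if the last paragraph of message has a Signed-off-by
    line whose address matches email (case-insensitive).

    Assumption: the footer is the final blank-line-separated paragraph.
    """
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    footer = paragraphs[-1]
    wanted = "<" + email.lower() + ">"
    for line in footer.splitlines():
        key, sep, value = line.partition(":")
        if (sep and key.strip().lower() == "signed-off-by"
                and value.strip().lower().endswith(wanted)):
            return True
    return False
```

Under these assumptions, a sign-off that is followed by another paragraph would not count as part of the footer, which is one way a commit that visibly carries a Signed-off-by line could still fail the check.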

Can you either try pushing that branch for me, or help me troubleshoot this? I don't know if the error message is coming from a script I can inspect to understand what's going on. It doesn't seem to be from anything under contrib in lustre.

thanks

Comment by Lai Siyao [ 04/Jun/20 ]

I tried pushing the first patch, and it appears to work: https://review.whamcloud.com/#/c/38826/.

Comment by Olaf Faaland [ 04/Jun/20 ]

Thanks Lai, that helped me figure it out. Looks like it was the format of my commit messages.

Comment by Olaf Faaland [ 16/Jun/20 ]

We're far enough along on our updates to 2.12 that I'm going to abandon this. Thank you for your help, though.
