[LU-13360] getdents() against empty striped directory always returns 1365 dirents Created: 13/Mar/20 Updated: 16/Jun/20 Resolved: 16/Jun/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.8, Lustre 2.12.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Lai Siyao |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
Client: opal174 with lustre-2.10.8_5.chaos-1.ch6.x86_64 Our Lustre tags are on github: zfs-0.7 based servers |
||
| Attachments: |
|
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Created directory with "lfs mkdir -c2 <thepath>" on Lustre 2.10.8 based client. No error messages are reported on the console of the servers or the client.

[faaland1@opal174 branch:master src] $ps -f
UID PID PPID C STIME TTY TIME CMD
faaland1 52122 52121 0 09:58 pts/0 00:00:00 -bash
faaland1 59859 52122 99 10:16 pts/0 01:20:38 ls -l /p/lforge/faaland1/mdtest/mdt0 /p/lforge/faaland1/mdtest/mdt1 /p
faaland1 64327 52122 0 11:36 pts/0 00:00:00 ps -f

[faaland1@opal174 branch:master src] $strace -p 59859 2>&1 | head -n4
strace: Process 59859 attached
getdents(3, /* 1365 entries */, 32768) = 32760
getdents(3, /* 1365 entries */, 32768) = 32760
getdents(3, /* 1365 entries */, 32768) = 32760

[faaland1@opal174 branch:master src] $ls -l /proc/59859/fd
total 0
lrwx------ 1 faaland1 faaland1 64 Mar 13 10:41 0 -> /dev/pts/0
lrwx------ 1 faaland1 faaland1 64 Mar 13 10:41 1 -> /dev/pts/0
lrwx------ 1 faaland1 faaland1 64 Mar 13 10:16 2 -> /dev/pts/0
lr-x------ 1 faaland1 faaland1 64 Mar 13 10:41 3 -> /p/lforge/faaland1/mdtest/mdtcount2

The 1365 entries returned are always the same - 676 entries for ".." and 689 entries for "."
676 {d_ino=144115272398143489, d_off=0, d_reclen=24, d_name="..", d_type=DT_DIR}
689 {d_ino=144115339574197893, d_off=0, d_reclen=24, d_name=".", d_type=DT_DIR}
I mounted the filesystem on another Lustre 2.10.8 node, and ls of that directory produces the same symptoms. I mounted the file system on a Lustre 2.12.4 node, and ls of that directory behaves normally: getdents() is called twice. The first call returns 2 entries; the second returns 0 entries and 0 bytes (end of directory). |
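The figures in the strace output are mutually consistent, which suggests that each call fills the buffer with as many 24-byte records as will fit. A minimal arithmetic sketch follows, using only values copied from the trace; the variable names are illustrative and not taken from any Lustre or kernel source.

```python
# Values copied from the strace output in the description.
BUF_SIZE = 32768      # third argument passed to getdents()
D_RECLEN = 24         # d_reclen reported for both "." and ".."
DOTDOT_COUNT = 676    # entries named ".."
DOT_COUNT = 689       # entries named "."

# How many whole 24-byte records fit in the buffer, and how many bytes they use.
entries_per_call = BUF_SIZE // D_RECLEN
bytes_returned = entries_per_call * D_RECLEN

print(entries_per_call)          # 1365
print(bytes_returned)            # 32760
print(DOTDOT_COUNT + DOT_COUNT)  # 1365
```

Every record in the trace also reports d_off=0. That may be related to the reader never advancing past the first entries, but the ticket does not establish a root cause.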
| Comments |
| Comment by Olaf Faaland [ 13/Mar/20 ] |
|
For my tracking purposes, my internal ticket is TOSS4558 |
| Comment by Olaf Faaland [ 13/Mar/20 ] |
|
Contents of the two FIDs reported by lfs getdirstripe, in case it helps.

lfs getdirstripe:
$lfs getdirstripe /p/lforge/faaland1/mdtest/mdtcount2
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx FID[seq:oid:ver]
0 [0x200002340:0x7:0x0]
1 [0x280002340:0x7:0x0]

MDT0:
[root@zwicky1:toss4558]# find . -name '*0x200002340:0x7:0x0*' 2> /dev/null
./oi.23/0x200000417:0x1:0x0/mdtest/mdtcount2/[0x200002340:0x7:0x0]:0
./oi.64/0x200002340:0x7:0x0
[root@zwicky1:toss4558]# ls -l ./oi.64/0x200002340:0x7:0x0
ls: cannot access ./oi.64/0x200002340:0x7:0x0: Input/output error
[root@zwicky1:toss4558]# ls -al ./oi.23/0x200000417:0x1:0x0/mdtest/mdtcount2/[0x200002340:0x7:0x0]:0
total 22
drwx------ 2 faaland1 faaland1 2 Mar 13 10:10 .
drwx------ 4 faaland1 faaland1 2 Mar 13 10:10 ..

MDT1:
[root@zwicky2:toss4558]# find . -name "*0x280002340:0x7:0x0*" 2> /dev/null
./oi.64/0x280002340:0x7:0x0
./REMOTE_PARENT_DIR/0x280002340:0x7:0x0
[root@zwicky2:toss4558]# ls -al ./oi.64/0x280002340:0x7:0x0 ./REMOTE_PARENT_DIR/0x280002340:0x7:0x0
ls: cannot access ./REMOTE_PARENT_DIR/0x280002340:0x7:0x0: Input/output error
./oi.64/0x280002340:0x7:0x0:
total 22
drwx------ 2 faaland1 faaland1 2 Mar 13 10:10 .
drwxr-xr-x 0 root root 0 Dec 31 1969 .. |
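For anyone scripting the same lookup, the stripe lines in the lfs getdirstripe output can be split into an MDT index and the three FID fields before searching each MDT. The sketch below assumes the output layout shown in this comment; parse_stripes and STRIPE_RE are hypothetical helper names, not part of any Lustre tool.

```python
import re

# Matches a stripe line such as "0 [0x200002340:0x7:0x0]".
STRIPE_RE = re.compile(
    r"^\s*(\d+)\s+\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]\s*$"
)

def parse_stripes(text):
    """Return a list of (mdtidx, seq, oid, ver) tuples found in text."""
    stripes = []
    for line in text.splitlines():
        m = STRIPE_RE.match(line)
        if m:
            mdtidx = int(m.group(1))
            seq, oid, ver = (int(g, 16) for g in m.group(2, 3, 4))
            stripes.append((mdtidx, seq, oid, ver))
    return stripes

# Output copied from the comment above.
SAMPLE = """\
lmv_stripe_count: 2 lmv_stripe_offset: 0 lmv_hash_type: fnv_1a_64
mdtidx FID[seq:oid:ver]
0 [0x200002340:0x7:0x0]
1 [0x280002340:0x7:0x0]
"""

for mdtidx, seq, oid, ver in parse_stripes(SAMPLE):
    print(f"MDT{mdtidx}: seq={seq:#x} oid={oid:#x} ver={ver:#x}")
```

Header lines that do not match the stripe pattern are skipped, so the same function can be fed the full command output.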
| Comment by Peter Jones [ 13/Mar/20 ] |
|
Lai, could you please investigate? Peter |
| Comment by Olaf Faaland [ 13/Mar/20 ] |
|
Hello Lai, I've attached debug.toss-4558.a.tar.gz which contains debug logs from two clients. debug was +rpctrace and +vfstrace. An ls -al of the same striped directory was performed on each of the two nodes. opal64: Lustre 2.12, the ls is successful and shows "." and ".." and exits. |
| Comment by Lai Siyao [ 16/Mar/20 ] |
|
2.12 contains several fixes for readdir of striped directories: You can apply them and try again. Enabling 'trace' in the debug log would help identify the exact cause; however, IMO it should already be fixed by the above patches. |
| Comment by Olaf Faaland [ 03/Jun/20 ] |
|
Hi Lai,
The patch stack is here: and the error I get is:

[faaland1@oslic5 branch:b-toss-4558-stripedir lustre-210] $git push wcrev HEAD:refs/for/b2_10
Enter passphrase for key '/g/g0/faaland1/.ssh/swdev':
Counting objects: 126, done.
Delta compression using up to 36 threads.
Compressing objects: 100% (67/67), done.
Writing objects: 100% (95/95), 87.24 KiB | 0 bytes/s, done.
Total 95 (delta 69), reused 36 (delta 28)
remote: Resolving deltas: 100% (69/69)
remote: Processing changes: refs: 1, done
To ssh://review.whamcloud.com/fs/lustre-release
 ! [remote rejected] HEAD -> refs/for/b2_10 (not Signed-off-by author/committer/uploader in commit message footer)
error: failed to push some refs to 'ssh://review.whamcloud.com/fs/lustre-release'

It looks straightforward, but I do not see patches without the Signed-off-by matching the author, and I also found that even just pushing a branch with just one commit produces the same result:

Can you either try pushing that branch for me, or help me troubleshoot this? I don't know if the error message is coming from a script I can inspect to understand what's going on. It doesn't seem to be from anything under contrib in lustre.

thanks |
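The rejection message says the Signed-off-by line must appear in the commit message footer, so a line that is present but sits outside the last paragraph could plausibly trigger it. The server-side check is not shown in this ticket; the sketch below is a hypothetical local approximation, and the function names, sample messages, and author address are all invented for illustration.

```python
import re

# A trailer line of the form "Signed-off-by: Name <email>".
SOB_RE = re.compile(r"^Signed-off-by:\s*(.+?)\s*<([^<>]+)>\s*$")

def footer_signoffs(message):
    """Return (name, email) pairs from Signed-off-by lines in the last paragraph."""
    paragraphs = [p for p in re.split(r"\n\s*\n", message.strip()) if p.strip()]
    if not paragraphs:
        return []
    found = []
    for line in paragraphs[-1].splitlines():
        m = SOB_RE.match(line.strip())
        if m:
            found.append((m.group(1), m.group(2).lower()))
    return found

def has_matching_signoff(message, author_email):
    """True if the footer has a Signed-off-by line with the author's email."""
    wanted = author_email.lower()
    return any(email == wanted for _, email in footer_signoffs(message))

# Hypothetical commit messages; only the position of the trailer differs.
GOOD = (
    "LU-0000 example: short subject\n\n"
    "Body text.\n\n"
    "Signed-off-by: Jane Doe <jane@example.com>\n"
    "Change-Id: I0000\n"
)
BAD = (
    "LU-0000 example: short subject\n\n"
    "Signed-off-by: Jane Doe <jane@example.com>\n\n"
    "Trailing notes after the sign-off.\n"
)

print(has_matching_signoff(GOOD, "jane@example.com"))  # True
print(has_matching_signoff(BAD, "jane@example.com"))   # False
```

Running a check like this over each commit in a stack before pushing would show which message has its sign-off outside the footer, assuming the server applies a similar rule.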
| Comment by Lai Siyao [ 04/Jun/20 ] |
|
I tried the first patch, and it appears to work: https://review.whamcloud.com/#/c/38826/. |
| Comment by Olaf Faaland [ 04/Jun/20 ] |
|
Thanks Lai, that helped me figure it out. Looks like it was the format of my commit messages. |
| Comment by Olaf Faaland [ 16/Jun/20 ] |
|
We're moving along well enough on our updates to 2.12 that I'm going to abandon this. Thank you for your help, though. |