[LU-3029] Directory listings are unreliable Created: 25/Mar/13  Updated: 20/May/13  Resolved: 20/May/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.4
Fix Version/s: Lustre 2.4.0, Lustre 2.1.6

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File LU-3029.tar.bz2     Text File htree_output.txt     Text File rzstagg0_lustre_log_normal2.txt    
Severity: 3
Rank (Obsolete): 7394

 Description   

We are using lustre 2.1.4-3chaos on our server clusters.

Running a test application, one of our archive storage folks discovered that Lustre's directory listings are rather unreliable. The first thing she noticed is that directory entries can appear multiple times:

> cd /p/lscratchrza/apotts/divt_rzstagg0/htar_1st_27475
> find . -type f > ../test.lst0 ; echo $? ; wc -l ../test.lst0
0
34339 ../test.lst0
> find . -type f > ../test.lst1 ; echo $? ; wc -l ../test.lst1
0
35006 ../test.lst1

When the two directory listings are sorted and run through uniq, there are only 34339 unique entries.

One of our sysadmins investigated further and found that sometimes directory entries are missing altogether. But when the missing files are checked with an ls, they are present.

This has been noticed with the above find command, and also using "/bin/ls -laR .". Both files and subdirectories have appeared twice in the directory listing.

The Lustre clients that have reproduced this behaviour are running 2.1.2-4chaos and 1.8.5.0-6chaos.



 Comments   
Comment by Andreas Dilger [ 26/Mar/13 ]

Do you have the actual directory listings? Can you please attach them, along with a full debug log of the misbehaving "ls -l" run on the client?

Comment by Peter Jones [ 26/Mar/13 ]

Fan Yong

Could you please comment on this one?

Thanks

Peter

Comment by Oleg Drokin [ 26/Mar/13 ]

I assume the directory was not undergoing any sort of ongoing modifications at the ls time?

Comment by Christopher Morrone [ 27/Mar/13 ]

No modifications are under way when this happens. This is a bit of a Heisen-bug. Every time I try to get reasonably detailed logs I fail to reproduce it. I finally got one with "+ rpctrace vfstrace" on the 1.8 client.

See the file LU-3029.tar.bz2 attached to this ticket. I haven't even had a chance to look through and see if there is anything useful in the log. I'll look more closely tomorrow.

The tar ball inside the tar ball is the directory tree of empty files that we used to reproduce the problem. Note that the path where that tree lived looked like:

/p/lscratchrza/py/TOSS-1982.DIVT_data_corruption_seen_in_OCF/dst/find-test-dir.1000.10_dir

"/p/lscratchrza" is the mount point for the lustre filesystem.

Comment by Christopher Morrone [ 27/Mar/13 ]

Has anyone had a chance to look through the logs yet? This code is all new to me...for instance, why "off", which one might naively think is the offset into the directory, jumps from 0 to 188 million. But I suppose this is some kind of hash, and not really a byte offset.

Comment by Christopher Morrone [ 28/Mar/13 ]

So, if I understand correctly, the server's reply for the current directory page includes a pointer to the next page. That is not a real byte offset into the directory listing, but instead a hash key...which may be the first filename in the next directory page run through the hashing function.

Does this 2.1.4 server code support fuzzy hash lookups, or does that hash key need to match the next page exactly? If it is fuzzy, then perhaps there is some RPC buffer that the server is failing to correctly initialize or some such, and the hash key is sometimes incorrect. Then when the client passes that incorrect hash key back to the server, the server just finds the closest match, which can result in duplicate listings rather than failing outright.
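
The following is a minimal user-space sketch (hypothetical code, not Lustre's; all names are made up) of the paging scheme and failure mode speculated about above: the client walks the directory by hash cookie, and if the server matches cookies fuzzily, a wrong cookie silently restarts inside an earlier page and produces duplicates instead of an error.

/*
 * Hypothetical model, NOT Lustre code: server_read_page() stands in for one
 * readdir RPC.  It resumes at the first entry whose hash is >= the requested
 * cookie (a fuzzy match) and reports the cookie for the next page.  If the
 * client ever presents a slightly wrong cookie, the fuzzy match restarts
 * inside an earlier page and the listing contains duplicates with no error.
 */
#include <stdio.h>
#include <stdint.h>

#define NENTRIES   12
#define PAGE_SLOTS 4
#define END_OFF    UINT64_MAX           /* stands in for MDS_DIR_END_OFF */

static const uint64_t dir_hashes[NENTRIES] = {
        10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120
};

/* Index of the first entry whose hash is >= start (fuzzy lookup). */
static int server_locate(uint64_t start)
{
        int i;

        for (i = 0; i < NENTRIES; i++)
                if (dir_hashes[i] >= start)
                        break;
        return i;
}

/* Fill one "page" and report the cookie the client should send next. */
static void server_read_page(uint64_t start, uint64_t *next)
{
        int i = server_locate(start);
        int n;

        for (n = 0; i < NENTRIES && n < PAGE_SLOTS; i++, n++)
                printf("  entry hash %llu\n", (unsigned long long)dir_hashes[i]);
        *next = (i < NENTRIES) ? dir_hashes[i] : END_OFF;
}

int main(void)
{
        uint64_t pos = 0, next;

        while (pos != END_OFF) {
                printf("readpage at cookie %llu\n", (unsigned long long)pos);
                server_read_page(pos, &next);
                /* A corrupted cookie here, e.g. "pos = next - 15", would
                 * silently re-list entries already returned. */
                pos = next;
        }
        return 0;
}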

It is not clear to me yet if this was a regression when we upgraded our servers from 2.1.2-4chaos to 2.1.4-3chaos, but the timing of the discovery coincides.

Note that we're running our 2.1.4 servers on RHEL6.4.

Comment by nasf (Inactive) [ 28/Mar/13 ]

Lustre uses the name-hash as the readdir() "cookie". Currently, it is a 64-bit hash, which does not guarantee that the hash value is unique. But it is impossible to have so many hash collisions for such a small set of files. So there must be something wrong. As a first step, we need to make clear where the issue is: on the client side or the MDS side.

Would you please mount the MDT device as ldiskfs, and try to list the directory directly to check whether the issue can be reproduced or not. If yes, then the issue is in ldiskfs; otherwise, it is quite possible that the issue is related to the client-side directory page cache.

BTW, what is the client/server architecture? x86 or ppc? 64bit, right?

Comment by Lai Siyao [ 28/Mar/13 ]

To my understanding, ext4 looks up by fuzzy hash (the entry with the largest hash value that is <= the given hash is returned), and the MDS server just calls ext4 readdir. The log file doesn't show that a hash collision occurred.
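
As a rough illustration of that fuzzy lookup, here is a simplified user-space sketch (not the ext4 implementation) that picks the leaf block whose starting hash is the largest value <= the requested hash; the sample hash values are taken from the log excerpts above.

/*
 * Illustrative only: find the leaf block to resume readdir in, given a
 * requested hash, by choosing the largest block-start hash <= the request.
 */
#include <stdio.h>
#include <stdint.h>

/* Starting hash of each leaf block, sorted ascending (block 0 starts at 0). */
static const uint32_t block_start[] = { 0, 225459756, 466119302, 678250957 };
#define NBLOCKS (sizeof(block_start) / sizeof(block_start[0]))

/* Return the index of the leaf block whose range contains "hash". */
static unsigned int dx_find_block(uint32_t hash)
{
        unsigned int lo = 0, hi = NBLOCKS - 1;

        while (lo < hi) {
                unsigned int mid = (lo + hi + 1) / 2;

                if (block_start[mid] <= hash)
                        lo = mid;       /* candidate: largest start <= hash */
                else
                        hi = mid - 1;
        }
        return lo;
}

int main(void)
{
        /* 678250957 is one of the cookies seen in the logs above. */
        printf("hash 678250957 -> block %u\n", dx_find_block(678250957));
        printf("hash 678250958 -> block %u\n", dx_find_block(678250958));
        return 0;
}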

Comment by Christopher Morrone [ 28/Mar/13 ]

This is a production filesystem, so I would like to avoid taking it off line to mount and test through ldiskfs. But if we run out of leads, it is not outside the realm of possibility.

Here is an interesting comparison of ll_dir_readpage_20() calls when the directory listing is correct, and when it contains duplicates:

Log excerpt for readdir containing duplicates

00000080:00200000:1:1364344933.644546:0:5184:0:(file.c:3334:__ll_inode_revalidate_it()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0),name=4
00000080:00200000:1:1364344945.796752:0:5184:0:(file.c:419:ll_file_open()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0), flags 304000
00000080:00200000:1:1364344945.796826:0:5184:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) pos 0/28672 32bit_api 0
00000080:00200000:1:1364344945.796841:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 0
00000080:00200000:1:1364344945.798391:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 188103822
00000080:00200000:1:1364344945.799894:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 409380093
00000080:00200000:1:1364344945.801468:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 644114478
00000080:00200000:1:1364344945.802995:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 874128427
00000080:00200000:1:1364344945.804499:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 1066856208
00000080:00200000:1:1364344945.805717:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 1302404326
00000080:00200000:1:1364344945.806960:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 1507621530
00000080:00200000:1:1364344945.808445:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 826569011743784587
00000080:00200000:1:1364344945.810259:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 1758724032027132756
00000080:00200000:1:1364344945.811593:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 2793169164989221521
00000080:00200000:0:1364344947.340167:0:5184:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) pos 2896977738723571359/28672 32bit_api 0
00000080:00200000:0:1364344947.340220:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 3771936052708445380
00000080:00200000:0:1364344947.341422:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 4610195316726557684
00000080:00200000:0:1364344947.342567:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 5611094818773459366
00000080:00200000:0:1364344947.343556:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 6476384681056065825
00000080:00200000:0:1364344947.344793:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 7542137920889152905
00000080:00200000:0:1364344947.346138:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 853375364
00000080:00200000:0:1364344947.347646:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 1043338592
00000080:00200000:0:1364344947.348929:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 826569011743784587
00000080:00200000:0:1364344947.350249:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 1758724032027132756
00000080:00200000:0:1364344947.351281:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 2793169164989221521
00000080:00200000:0:1364344947.352432:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 3771936052708445380
00000080:00200000:0:1364344948.648125:0:5184:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) pos 3950576220225407746/28672 32bit_api 0
00000080:00200000:0:1364344948.648174:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 4610195316726557684
00000080:00200000:0:1364344948.649352:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 5611094818773459366
00000080:00200000:0:1364344948.650510:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 6476384681056065825
00000080:00200000:0:1364344948.651685:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 7542137920889152905
00000080:00200000:0:1364344948.652933:0:5184:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) off 8390415131186758359
00000080:00200000:0:1364344949.493383:0:5184:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0) pos 18446744073709551614/28672 32bit_api 0
00000080:00200000:0:1364344949.493398:0:5184:0:(file.c:256:ll_file_release()) VFS Op:inode=6488959042059244690/0(ffff810131bf65e0)

Log excerpt for readdir that is correct

00000080:00200000:3:1364493691.865625:0:30650:0:(file.c:3334:__ll_inode_revalidate_it()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560),name=4
00000080:00200000:3:1364493691.867438:0:30650:0:(file.c:419:ll_file_open()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560), flags 304000
00000080:00200000:3:1364493691.867524:0:30650:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) pos 0/28672 32bit_api 0
00000080:00200000:3:1364493691.867550:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 0
00000080:00200000:3:1364493691.868721:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 188103822
00000080:00200000:3:1364493691.869942:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 409380093
00000080:00200000:3:1364493691.871309:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 644114478
00000080:00200000:3:1364493691.872425:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 874128427
00000080:00200000:3:1364493691.873656:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 1066856208
00000080:00200000:3:1364493691.874853:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 1302404326
00000080:00200000:3:1364493691.875957:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 1507621530
00000080:00200000:3:1364493691.877072:0:30650:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) off 1755486050
00000080:00200000:3:1364493691.879309:0:30650:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560) pos 18446744073709551614/28672 32bit_api 0
00000080:00200000:3:1364493691.879326:0:30650:0:(file.c:256:ll_file_release()) VFS Op:inode=6488959042059244690/0(ffff8101fe555560)

Comment by Christopher Morrone [ 28/Mar/13 ]

See attached file rzstagg0_lustre_log_normal2.txt, which contains the lustre log of the client when the directory listing is correct.

Comment by Christopher Morrone [ 30/Mar/13 ]

Clients and server are x86_64 architecture.

Comment by nasf (Inactive) [ 01/Apr/13 ]

Currently, for 64-bit nodes, Lustre uses a 64-bit hash for the directory cookie, which contains a high-32-bit major hash and a low-32-bit minor hash. Usually, the major hash is non-zero, so the "off" values shown in the log above should be larger than 2^32. But in the correct-listing case, all the "off" values are smaller than 2^32. So I wonder whether "64bithash" is supported on your system or not. You can verify the client "connect_flags".

I have tried running a 2.1.4-3chaos server + 2.1.2-4chaos client, but failed to reproduce your issues, and the "off" values in my log are larger than 2^32.

Comment by Christopher Morrone [ 02/Apr/13 ]

Yes, I see the 64bithash flag listed for the mdc. Whether or not there are actually bits set in the upper 32 bits of the hash will depend entirely on the hashing algorithm. Since the case where only the lower 32 bits are used is the working case, I would assume that is just normal for the set of file names used, is it not?

Comment by Christopher Morrone [ 03/Apr/13 ]

FYI, the filesystems where we see this problem were originally formatted using Lustre 1.6. Could that mean that they are stuck with 32bit hashes or something, but not properly advertising that information?

Comment by Andreas Dilger [ 09/Apr/13 ]

Chris, for the directory that was used to generate the good/bad output above, could you please log into the MDS and run:

debugfs -c -R "htree /ROOT/{path to directory}" /dev/{mdsdev}

and attach the output. I'd like to see if the hash values for this directory match any of the hash values returned during readdir (both good and bad ones). If you don't have the original directory, it should be OK to untar the test files and run the debugfs "htree" command on the new directory, so long as it is on the same MDS, since it needs the same MDS UUID to seed the hash function.

The attached log file appears to have been captured on a 1.8.x client, but you report the same problem on 2.1.x clients?

Comment by Christopher Morrone [ 09/Apr/13 ]

Sure thing. I think the directory may be gone now, but I can list a tree of identical contents like you suggest.

Yes, the 2.1 clients definitely see a directory listing problem as well. For whatever reason, our 1.8 client just seems to be able to reproduce the problem more frequently, so it was a more convenient debug environment.

Comment by Christopher Morrone [ 09/Apr/13 ]

See attached file htree_output.txt.

Comment by Christopher Morrone [ 09/Apr/13 ]

I am assuming that the ext htree hashes are what somehow get mapped into the hash keys that the client sees. Is that correct? Is it the IAM stuff that does that? Can you give an overview of how this is supposed to work?

Also, I note that there is a difference between the 1.8 and 2.1 client operation that may help explain why the 1.8 client sees this problem so much more reliably. The 1.8 client seems to perform a separate RPC for every page of the directory listing, whereas the 2.1 client performs only a single RPC, and then ll_dir_page_locate() finds the pages already in local memory as the client traverses the directory listing.
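
For illustration, here is a simplified standalone sketch (the structures and names are made up, not the Lustre ones) of the cached-page lookup idea: the 2.1 client can satisfy the next cookie from a locally cached page whose hash range covers it, while the 1.8 client issues an RPC for every page.

/* Hypothetical model of a client-side directory page cache keyed by hash range. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

struct cached_page {
        uint64_t hash_start;
        uint64_t hash_end;
};

/* Return the cached page whose hash range covers "hash", or NULL on a miss. */
static const struct cached_page *
locate_page(const struct cached_page *pages, size_t n, uint64_t hash)
{
        size_t i;

        for (i = 0; i < n; i++)
                if (pages[i].hash_start <= hash && hash < pages[i].hash_end)
                        return &pages[i];
        return NULL;    /* cache miss: only now would an RPC be needed */
}

int main(void)
{
        /* Ranges taken from the "good" server trace below. */
        const struct cached_page cache[] = {
                { 0,         225459756 },
                { 225459756, 466119302 },
                { 466119302, 678250957 },
        };
        const struct cached_page *p = locate_page(cache, 3, 300000000);

        if (p)
                printf("hash 300000000 served from cached page [%llu, %llu)\n",
                       (unsigned long long)p->hash_start,
                       (unsigned long long)p->hash_end);
        return 0;
}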

Comment by Prakash Surya (Inactive) [ 15/Apr/13 ]

OK, so I'm still not sure what to make of it, but here's some info that I've collected. The client is running 1.8.5-6chaos and the server 2.1.4-4chaos. We've added this debug patch to gather data from the server:

                                                                      
diff --git a/lustre/mdd/mdd_object.c b/lustre/mdd/mdd_object.c                  
index 4fb0525..efd5c52 100644                                                   
--- a/lustre/mdd/mdd_object.c                                                   
+++ b/lustre/mdd/mdd_object.c                                                   
@@ -2358,6 +2358,11 @@ out:                                                     
                         dp->ldp_flags |= cpu_to_le32(LDF_COLLIDE);             
                 last->lde_reclen = 0; /* end mark */                           
         }                                                                      
+        CDEBUG(D_INFO,                                                         
+               "lu_dirpage ldp_hash_start "LPU64                               
+               " ldp_hash_end "LPU64" ldp_flags %u\n",                         
+               le64_to_cpu(dp->ldp_hash_start), le64_to_cpu(dp->ldp_hash_end), 
+               le32_to_cpu(dp->ldp_flags));                                    
         return result;                                                         
 }                                                                              

On a successful run, I see the following in the logs:

Server:

                                                                      
# levi-mds1 /root > cat /tmp/llog | grep -e "172.16.56.1@tcp1:37" -e "mdd_dir_page_build" | less
...                                                                             
00000100:00100000:1.0:1366065858.326421:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058556:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.326780:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 0 ldp_hash_end 225459756 ldp_flags 0
00000100:00100000:1.0:1366065858.327282:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058556:12345-172.16.56.1@tcp1:37 Request procesed in 866us (894us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.327773:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058557:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.328063:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 225459756 ldp_hash_end 466119302 ldp_flags 0
00000100:00100000:1.0:1366065858.328758:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058557:12345-172.16.56.1@tcp1:37 Request procesed in 991us (1018us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.329259:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058558:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.329567:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 466119302 ldp_hash_end 678250957 ldp_flags 0
00000100:00100000:1.0:1366065858.329975:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058558:12345-172.16.56.1@tcp1:37 Request procesed in 721us (765us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.330489:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058559:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.330776:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 678250957 ldp_hash_end 885661942 ldp_flags 0
00000100:00100000:1.0:1366065858.331279:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058559:12345-172.16.56.1@tcp1:37 Request procesed in 795us (820us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.331789:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058560:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.332121:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 885661942 ldp_hash_end 1071938997 ldp_flags 0
00000100:00100000:1.0:1366065858.332578:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058560:12345-172.16.56.1@tcp1:37 Request procesed in 793us (819us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.333089:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058561:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.333381:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1071938997 ldp_hash_end 1255106162 ldp_flags 0
00000100:00100000:1.0:1366065858.333841:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058561:12345-172.16.56.1@tcp1:37 Request procesed in 757us (787us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.334455:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058562:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.334720:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1255106162 ldp_hash_end 1502393138 ldp_flags 0
00000100:00100000:1.0:1366065858.335113:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058562:12345-172.16.56.1@tcp1:37 Request procesed in 663us (691us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.335605:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058563:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.335889:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1502393138 ldp_hash_end 1713543192 ldp_flags 0
00000100:00100000:1.0:1366065858.336477:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058563:12345-172.16.56.1@tcp1:37 Request procesed in 878us (903us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.336974:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058564:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.337258:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1713543192 ldp_hash_end 1968026834 ldp_flags 0
00000100:00100000:1.0:1366065858.337682:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058564:12345-172.16.56.1@tcp1:37 Request procesed in 712us (737us total) trans 0 rc 0/0
00000100:00100000:1.0:1366065858.338224:0:4488:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+8:17092:x1384931791058565:12345-172.16.56.1@tcp1:37
00000004:00000040:1.0:1366065858.338450:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1968026834 ldp_hash_end 2145666659 ldp_flags 2
00000100:00100000:1.0:1366065858.339019:0:4488:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_08:7a28b574-36a7-8318-6710-e07c4d33fe7f+7:17092:x1384931791058565:12345-172.16.56.1@tcp1:37 Request procesed in 801us (826us total) trans 0 rc 0/0
...                                                                             

Client:

                                                                      
# rzstagg0 /root > cat /tmp/llog | grep -e "172.16.69.140@tcp1:37" -e ll_dir_readpage_20 -e ll_readdir_20 | less
00000080:00200000:1:1366065858.333206:0:17092:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) pos 0/28672 32bit_api 0
00000080:00200000:1:1366065858.333736:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 0
00000100:00100000:1:1366065858.333786:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058556:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.335007:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058556:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.335120:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 225459756
00000100:00100000:1:1366065858.335165:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058557:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.336471:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058557:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.336581:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 466119302
00000100:00100000:1:1366065858.336626:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058558:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.337684:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058558:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.337793:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 678250957
00000100:00100000:1:1366065858.337838:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058559:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.338996:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058559:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.339107:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 885661942
00000100:00100000:1:1366065858.339151:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058560:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.340287:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058560:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.340400:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 1071938997
00000100:00100000:1:1366065858.340444:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058561:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.341565:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058561:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.341676:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 1255106162
00000100:00100000:1:1366065858.341720:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058562:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.342825:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058562:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.342947:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 1502393138
00000100:00100000:1:1366065858.342991:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058563:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.344190:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058563:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.344297:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 1713543192
00000100:00100000:1:1366065858.344344:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058564:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.345424:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058564:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.345531:0:17092:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) off 1968026834
00000100:00100000:1:1366065858.345578:0:17092:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058565:172.16.69.140@tcp1:37
00000100:00100000:1:1366065858.346731:0:17092:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc checkdir:7a28b574-36a7-8318-6710-e07c4d33fe7f:17092:x1384931791058565:172.16.69.140@tcp1:37
00000080:00200000:1:1366065858.352504:0:17092:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6510459259514373082/0(ffff81011d1d9920) pos 18446744073709551614/28672 32bit_api 0

That looks reasonable. Compare that with the debugging information I got from a failed iteration:

Server:

                                                                      
00000100:00100000:8.0:1365797000.191235:0:4132:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_03:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112744:12345-172.16.56.1@tcp1:37
00000004:00000040:8.0:1365797000.191659:0:4132:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 0 ldp_hash_end 225459756 ldp_flags 0
00000100:00100000:8.0:1365797000.192620:0:4132:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_03:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112745:12345-172.16.56.1@tcp1:37
00000004:00000040:8.0:1365797000.192979:0:4132:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 225459756 ldp_hash_end 466119302 ldp_flags 0
00000100:00100000:8.0:1365797000.193949:0:4132:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_03:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112746:12345-172.16.56.1@tcp1:37
00000004:00000040:8.0:1365797000.194312:0:4132:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 466119302 ldp_hash_end 678250957 ldp_flags 0
00000004:00000040:8.0:1365797000.195133:0:4132:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 901471152 ldp_hash_end 1817119367 ldp_flags 2
00000100:00100000:10.0:1365797000.195262:0:4492:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_12:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112747:12345-172.16.56.1@tcp1:37
00000004:00000040:10.0:1365797000.195510:0:4492:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 4294967296 ldp_hash_end 972141167607355620 ldp_flags 0
00000100:00100000:8.0:1365797000.196513:0:4132:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_03:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112748:12345-172.16.56.1@tcp1:37
00000004:00000040:8.0:1365797000.196876:0:4132:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1190281883 ldp_hash_end 1359387480 ldp_flags 0
00000100:00100000:8.0:1365797000.197841:0:4132:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_03:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112749:12345-172.16.56.1@tcp1:37
00000004:00000040:8.0:1365797000.198202:0:4132:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1359387480 ldp_hash_end 1642720203 ldp_flags 0
00000100:00100000:4.0:1365797000.199181:0:4504:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_24:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112750:12345-172.16.56.1@tcp1:37
00000004:00000040:4.0:1365797000.199570:0:4504:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1642720203 ldp_hash_end 1852935977 ldp_flags 0
00000100:00100000:4.0:1365797000.200569:0:4504:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_24:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112751:12345-172.16.56.1@tcp1:37
00000004:00000040:4.0:1365797000.200947:0:4504:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1852935977 ldp_hash_end 2074638760 ldp_flags 0
00000100:00100000:10.0:1365797000.201900:0:4492:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_12:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112752:12345-172.16.56.1@tcp1:37
00000004:00000040:10.0:1365797000.202099:0:4492:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 4294967296 ldp_hash_end 972141167607355620 ldp_flags 0
00000100:00100000:10.0:1365797000.203058:0:4492:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_12:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112753:12345-172.16.56.1@tcp1:37
00000004:00000040:10.0:1365797000.203280:0:4492:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 972141167607355620 ldp_hash_end 2012764699691245150 ldp_flags 0
00000100:00100000:4.0:1365797000.204215:0:4504:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_24:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112754:12345-172.16.56.1@tcp1:37
00000004:00000040:4.0:1365797000.204542:0:4504:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 227591098 ldp_hash_end 471243777 ldp_flags 0
00000100:00100000:4.0:1365797000.205489:0:4504:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_24:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112755:12345-172.16.56.1@tcp1:37
00000004:00000040:4.0:1365797000.205810:0:4504:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 471243777 ldp_hash_end 680804604 ldp_flags 0
00000100:00100000:4.0:1365797000.206662:0:4504:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_24:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112756:12345-172.16.56.1@tcp1:37
00000004:00000040:8.0:1365797000.206741:0:4132:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 901471152 ldp_hash_end 1817119367 ldp_flags 2

Client:

                                                                      
00000080:00200000:2:1365797000.198307:0:16638:0:(dir.c:849:ll_readdir_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) pos 0/28672 32bit_api 0
00000080:00200000:2:1365797000.198326:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 0
00000100:00100000:2:1365797000.198354:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112744:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.199570:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112744:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.199631:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 225459756
00000100:00100000:2:1365797000.199656:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112745:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.200897:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112745:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.200954:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 466119302
00000100:00100000:2:1365797000.200979:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112746:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.202160:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112746:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.202216:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 678250957
00000100:00100000:2:1365797000.202241:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112747:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.203412:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112747:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.203472:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 972141167607355620
00000100:00100000:2:1365797000.203498:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112748:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.204752:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112748:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.204800:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 1359387480
00000100:00100000:2:1365797000.204824:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112749:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.206119:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112749:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.206175:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 1642720203
00000100:00100000:2:1365797000.206199:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112750:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.207456:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112750:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.207510:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 1852935977
00000100:00100000:2:1365797000.207535:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112751:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.208804:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112751:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.208859:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 2074638760
00000100:00100000:2:1365797000.208883:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112752:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.209951:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112752:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.210012:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 972141167607355620
00000100:00100000:2:1365797000.210038:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112753:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.211135:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112753:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.211199:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 2012764699691245150
00000100:00100000:2:1365797000.211224:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112754:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.212377:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112754:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.212424:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 471243777
00000100:00100000:2:1365797000.212449:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112755:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.213640:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112755:172.16.69.140@tcp1:37
00000080:00200000:2:1365797000.213694:0:16638:0:(dir.c:600:ll_dir_readpage_20()) VFS Op:inode=6510459259514373082/0(ffff81015fd30560) off 680804604
00000100:00100000:2:1365797000.213720:0:16638:0:(client.c:2102:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112756:172.16.69.140@tcp1:37
00000100:00100000:2:1365797000.214881:0:16638:0:(client.c:2207:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc perl_test:7a28b574-36a7-8318-6710-e07c4d33fe7f:16638:x1384931780112756:172.16.69.140@tcp1:37

In this "bad" case, things seem to go wrong with transaction x1384931780112746. Two calls to mdd_dir_page_build occur when only a single call should have been made (I think, comparing with the good trace). And then in the next transaction (x1384931780112747) where the client requests the hash value 678250957, the server returns "garbage" (again, when compared to the "good" test):

                                                                      
00000100:00100000:10.0:1365797000.195262:0:4492:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_12:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112747:12345-172.16.56.1@tcp1:37
00000004:00000040:10.0:1365797000.195510:0:4492:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 4294967296 ldp_hash_end 972141167607355620 ldp_flags 0
00000100:00100000:8.0:1365797000.196513:0:4132:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_03:7a28b574-36a7-8318-6710-e07c4d33fe7f+2767:16638:x1384931780112748:12345-172.16.56.1@tcp1:37

So the root cause appears to be related to that extra call to mdd_dir_page_build occurring on the server.
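
To make that concrete, here is a small standalone sketch (illustrative only) of the consistency check implied by the comparison above, using the hash values from the two traces: the page returned for the bad request starts far past the cookie the client asked for.

/*
 * The lu_dirpage returned for a readdir RPC should start at (or just before,
 * given fuzzy matching) the hash the client asked for.  The values below are
 * taken from the traces above; the check flags the bad page, whose
 * hash_start (4294967296) does not match the requested hash (678250957).
 */
#include <stdio.h>
#include <stdint.h>

struct dirpage_range {
        uint64_t requested;     /* cookie the client sent */
        uint64_t hash_start;    /* ldp_hash_start in the reply */
        uint64_t hash_end;      /* ldp_hash_end in the reply */
};

static void check_page(const struct dirpage_range *p)
{
        if (p->hash_start > p->requested)
                printf("suspicious: asked for %llu but page starts at %llu\n",
                       (unsigned long long)p->requested,
                       (unsigned long long)p->hash_start);
        else
                printf("ok: page [%llu, %llu) covers request %llu\n",
                       (unsigned long long)p->hash_start,
                       (unsigned long long)p->hash_end,
                       (unsigned long long)p->requested);
}

int main(void)
{
        /* Good run, xid x1384931791058559. */
        struct dirpage_range good = { 678250957, 678250957, 885661942 };
        /* Bad run, xid x1384931780112747. */
        struct dirpage_range bad = { 678250957, 4294967296ULL, 972141167607355620ULL };

        check_page(&good);
        check_page(&bad);
        return 0;
}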

Comment by nasf (Inactive) [ 16/Apr/13 ]

I could not find the 1.8.5-6chaos code, so I checked the 1.8.5.0-6chaos code instead:
lustre/llite/llite_lib.c

        data->ocd_connect_flags = OBD_CONNECT_VERSION | OBD_CONNECT_IBITS      |
                                  OBD_CONNECT_JOIN    | OBD_CONNECT_ATTRFID    |
                                  OBD_CONNECT_NODEVOH | OBD_CONNECT_CANCELSET  |
                                  OBD_CONNECT_AT      | OBD_CONNECT_FID        |
                                  OBD_CONNECT_VBR     | OBD_CONNECT_LOV_V3     |
                                  OBD_CONNECT_64BITHASH;

That means the 1.8.5.0-6chaos client supports 64-bit dirhash.

I also checked 2.1.4-4chaos code:
lustre/mdt/mdt_handler.c

        if (info->mti_exp->exp_connect_flags & OBD_CONNECT_64BITHASH)
                rdpg->rp_attrs |= LUDA_64BITHASH;

lustre/osd-ldiskfs/osd_handler.c

        if (attr & LUDA_64BITHASH)
		it->oie_file.f_mode |= FMODE_64BITHASH;
        else
		it->oie_file.f_mode |= FMODE_32BITHASH;

That means the 2.1.4-4chaos server supports 64-bit dirhash as well. So I think the dirhash values listed above should be 64-bit. But they are only 32-bit, in both the successful and the failure cases. I do not know what happened.

As for the additional mdd_dir_page_build() call, according to the 2.1.4-4chaos implementation, it is possible:

        for (i = 0, nob = rdpg->rp_count; rc == 0 && nob > 0;
             i++, nob -= CFS_PAGE_SIZE) {
                struct lu_dirpage *dp;

                LASSERT(i < rdpg->rp_npages);
                pg = rdpg->rp_pages[i];
                dp = cfs_kmap(pg);
#if CFS_PAGE_SIZE > LU_PAGE_SIZE
repeat:
#endif
                rc = mdd_dir_page_build(env, mdd, dp,
                                        min_t(int, nob, LU_PAGE_SIZE),
                                        iops, it, rdpg->rp_attrs);
                if (rc > 0) {
                        /*
                         * end of directory.
                         */
                        dp->ldp_hash_end = cpu_to_le64(MDS_DIR_END_OFF);
                        nlupgs++;
                } else if (rc < 0) {
                        CWARN("build page failed: %d!\n", rc);
                } else {
                        nlupgs++;
#if CFS_PAGE_SIZE > LU_PAGE_SIZE
                        dp = (struct lu_dirpage *)((char *)dp + LU_PAGE_SIZE);
                        if ((unsigned long)dp & ~CFS_PAGE_MASK)
                                goto repeat;
#endif
                }
                cfs_kunmap(pg);
        }

So as long as "rdpg->rp_count" is larger than CFS_PAGE_SIZE, there will be multiple mdd_dir_page_build() calls for a single readdir RPC. Usually, "rdpg->rp_count" is assigned from a client-side parameter via mdt_body::nlink, which should be CFS_PAGE_SIZE. But if something wrongly overwrites that field, then confusing cases will happen. Considering the unclear 64-bit dirhash issues, I suspect there may be some interoperability issue.
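
A tiny standalone sketch of that arithmetic (example values only, ignoring early termination at the end of the directory) shows how rp_count and the page-size constants determine the number of mdd_dir_page_build() calls per RPC:

#include <stdio.h>

/* Upper bound on mdd_dir_page_build() calls for one readdir RPC. */
static unsigned int build_calls(unsigned int rp_count,
                                unsigned int cfs_page_size,
                                unsigned int lu_page_size)
{
        unsigned int npages = (rp_count + cfs_page_size - 1) / cfs_page_size;

        /* Each CFS page holds up to CFS_PAGE_SIZE / LU_PAGE_SIZE lu_dirpages. */
        return npages * (cfs_page_size / lu_page_size);
}

int main(void)
{
        /* Expected x86_64 case: one call per RPC. */
        printf("rp_count 4096: %u call(s)\n", build_calls(4096, 4096, 4096));
        /* If rp_count were somehow 8192, there would be two calls. */
        printf("rp_count 8192: %u call(s)\n", build_calls(8192, 4096, 4096));
        return 0;
}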

Comment by Prakash Surya (Inactive) [ 16/Apr/13 ]

So as long as "rdpg->rp_count" is larger than CFS_PAGE_SIZE, there will be multiple mdd_dir_page_build() calls for a single readdir RPC. Usually, "rdpg->rp_count" is assigned from a client-side parameter via mdt_body::nlink, which should be CFS_PAGE_SIZE. But if something wrongly overwrites that field, then confusing cases will happen. Considering the unclear 64-bit dirhash issues, I suspect there may be some interoperability issue.

Right, that's the only way I can see mdd_dir_page_build getting called a second time as well. I'm unsure why rp_count would be anything other than CFS_PAGE_SIZE when using a 1.8.5 based client though.

Also, I don't quite understand the values printed in the second call to mdd_dir_page_build of the "bad" case when compared to the "good" case. Specifically, in xid x1384931791058559 of the "good" test, we ask for hash 678250957 and it builds a dirpage like so:

00000004:00000040:1.0:1366065858.330776:0:4488:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 678250957 ldp_hash_end 885661942 ldp_flags 0

But when that same hash is asked for in the "bad" test, in xid x1384931780112747, the server builds a dirpage with these values:

00000004:00000040:10.0:1365797000.195510:0:4492:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 4294967296 ldp_hash_end 972141167607355620 ldp_flags 0

I'm curious why the server is returning different results for the same operation of looking up hash value 678250957. Even with the extra call to mdd_dir_page_build, it looks like the client still asks for the correct next hash of 678250957, but gets a completely different answer from the server.

How are these hash values being built up? I used debugfs to look at the ldiskfs htree hash values for this directory (as Andreas suggested above), but I'm not sure how those values map to the lustre hash indices.

Comment by nasf (Inactive) [ 19/Apr/13 ]

For an ldiskfs-based MDS, the cookie (or position, or client-visible 'dirhash') returned to the client for readdir() is composed of two ldiskfs internal hash values: a major hash (32 bits) and a minor hash (32 bits). If the client supports '64bithash', then the cookie is 64 bits containing both the major hash and the minor hash; otherwise only the major hash is used to generate the cookie. All of this is processed inside ldiskfs and does NOT involve IAM.
I have tested a lustre-2.1.4-4chaos server + lustre-1.8.5.0-6chaos patchless client, with a newly formatted ldiskfs (NOT upgraded from lustre-1.6). I found that '64bithash' is supported, the dir cookies used for communication between the client and the MDS are 64-bit, and each time only one page is fetched from the MDS for readdir(). All as expected. So I have NOT reproduced your issues.
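
As an illustration of that layout (a simplified sketch, not the actual ldiskfs code; the minor hash value below is made up), a 64-bit cookie can be thought of as the major hash in the high 32 bits and the minor hash in the low 32 bits, which is why a non-zero major hash yields cookies larger than 2^32:

#include <stdio.h>
#include <stdint.h>

/* Illustrative cookie composition, not the exact ldiskfs implementation. */
static uint64_t dir_cookie(uint32_t major, uint32_t minor, int client_64bithash)
{
        if (client_64bithash)
                return ((uint64_t)major << 32) | minor; /* high: major, low: minor */
        return major;                                   /* major hash only */
}

int main(void)
{
        /* 188103822 is one of the "off" values from the client log above;
         * the minor hash here is an arbitrary example value. */
        uint32_t major = 188103822, minor = 12345;

        printf("64-bit cookie: %llu (>= 2^32 when the major hash is non-zero)\n",
               (unsigned long long)dir_cookie(major, minor, 1));
        printf("32-bit cookie: %llu\n",
               (unsigned long long)dir_cookie(major, minor, 0));
        return 0;
}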

According to your system log, there are several questionable points:

1) Your MDT device was formatted with Lustre-1.6. Is it possible that the hash_seed (ldiskfs_super_block::s_hash_seed) was NOT set correctly or was corrupted while the system was in use? Would you please add some debug info (or just a printk) in ldiskfs to print out that hash_seed?

2) On the other hand, we can NOT explain why mdd_dir_page_build() was called twice in the failure cases. I think that CFS_PAGE_SIZE is equal to LU_PAGE_SIZE on your system; otherwise it should repeat mdd_dir_page_build() in the successful cases also. So I suspect some RPC parameters were NOT correctly set, or were corrupted, for the failed readdir() cases. Would you please add some debug info in lustre/mdt/mdt_handler.c as follows:

static int mdt_readpage(struct mdt_thread_info *info)
...
                                PTLRPC_MAX_BRW_SIZE);
        rdpg->rp_npages = (rdpg->rp_count + CFS_PAGE_SIZE - 1) >>
                          CFS_PAGE_SHIFT;
+        CDEBUG(D_INFO,
+               "readdir: hash = %Lx, attr = %x, count = %u, npages = %u\n",
+               rdpg->rp_hash, rdpg->rp_attrs, rdpg->rp_count, rdpg->rp_npages);
        OBD_ALLOC(rdpg->rp_pages, rdpg->rp_npages * sizeof rdpg->rp_pages[0]);
        if (rdpg->rp_pages == NULL)
                RETURN(-ENOMEM);

Comment by Andreas Dilger [ 19/Apr/13 ]

Is it possible this problem relates to LU-3182 and http://review.whamcloud.com/6071?

Comment by Ned Bass [ 19/Apr/13 ]

Andreas, they seem to be unrelated (other than having similar symptoms). LU-3182 was introduced by an lmv cleanup patch between 2.3.58 and 2.3.59, while this bug is affecting 2.1 and 1.8 systems. Also, LU-3182 should only affect PPC clients due to large page size.

Comment by nasf (Inactive) [ 23/Apr/13 ]

Can anybody help reproduce the issue with the debug information mentioned in my comment of 19/Apr/13 4:39 PM? Thanks very much!

Comment by Ned Bass [ 23/Apr/13 ]

nasf, I checked the MDTs with tunefs.ldiskfs -l, and the directory hash seeds look valid, i.e.

Directory Hash Seed: 9b006d00-8c7b-4ec9-995a-c99307066c69

I assumed but didn't verify that output prints ldiskfs_super_block::s_hash_seed.

I believe Prakash has been trying to reproduce the issue with that patch and other debugging in place, but hasn't gotten a hit yet.

Comment by Prakash Surya (Inactive) [ 23/Apr/13 ]

Sorry for the lack of communication. I've been trying to get the issue to hit with the debug info from nasf's comment above, but hadn't reproduced it until today.

Using systemtap I've probed a few different places in the call path of interest. Specifically, the entry and exit points of mdd_readpage, osd_it_ea_load, osd_it_ea_key_size, osd_it_ea_store, osd_it_ea_rec, and osd_it_ea_next.

The entry point of mdd_readpage now prints:

                                                                      
        CDEBUG(D_INFO,                                                          
                "LU-3029: rdpg = %p, hash = "LPU64", attr = %x, count = %u, npages = %u\n",
                rdpg, rdpg->rp_hash, rdpg->rp_attrs, rdpg->rp_count, rdpg->rp_npages);

The exit point of mdd_readpage prints:

                                                                      
        CDEBUG(D_INFO,                                                          
                "LU-3029: rdpg = %p, hash = "LPU64", attr = %x, count = %u, npages = %u\n",
                rdpg, rdpg->rp_hash, rdpg->rp_attrs, rdpg->rp_count, rdpg->rp_npages);
                                                                                
        for (i = 0, nob = rdpg->rp_count; i < rdpg->rp_npages && nob > 0;          
             i++, nob -= CFS_PAGE_SIZE) {                                       
                struct lu_dirpage *dp = cfs_kmap(rdpg->rp_pages[i]);            
                CDEBUG(D_INFO,                                                  
                        "LU-3029: rdpg: %p, lu_dirpage ldp_hash_start "LPU64       
                        " ldp_hash_end "LPU64" ldp_flags %u\n",                 
                        rdpg, le64_to_cpu(dp->ldp_hash_start),                  
                        le64_to_cpu(dp->ldp_hash_end),                          
                        le32_to_cpu(dp->ldp_flags));                            
                cfs_kunmap(rdpg->rp_pages[i]);                                  
        }                                                                       

And the osd_it_ea_* probes print their trace information, and some dump the osd_it_ea structure to D_INFO using:

                                                                      
        struct osd_it_ea *ea = (struct osd_it_ea *)STAP_ARG_osd_it_ea;          
        CDEBUG(D_INFO, "LU-3029: "                                              
                "osd_it_ea(%p) = { .oie_obj = %p, "                             
                                  ".oie_file = {...}, "                         
                                  ".oie_rd_dirent = %i, "                       
                                  ".oie_it_dirent = %i, "                       
                                  ".oie_dirent = %p, "                          
                                  ".oie_buf = %p }\n",                          
                                ea,                                             
                                ea->oie_obj,                                    
                                /*ea->oie_file,*/                               
                                ea->oie_rd_dirent,                              
                                ea->oie_it_dirent,                              
                                ea->oie_dirent,                                 
                                ea->oie_buf);                                   

and:

                                                                      
        struct osd_it_ea *ea = (struct osd_it_ea *)STAP_ARG_osd_it_ea;          
        CDEBUG(D_INFO, "LU-3029: osd_it_ea(%p)->oie_dirent(%p) = "              
                "{ .oied_fid = "DFID", "                                        
                  ".oied_ino = "LPU64", "                                       
                  ".oied_off = "LPU64", "                                       
                  ".oied_namelen = %i, "                                        
                  ".oied_type = %i, "                                           
                  ".oied_name = %s }\n",                                        
                ea, ea->oie_dirent,                                             
                PFID(&ea->oie_dirent->oied_fid),                                
                ea->oie_dirent->oied_ino,                                       
                ea->oie_dirent->oied_off,                                       
                ea->oie_dirent->oied_namelen,                                   
                ea->oie_dirent->oied_type,                                      
                ea->oie_dirent->oied_name);                                     

I'll attach the full systemtap script and logs to this bug in a follow-up comment, but I wanted to give the debug messages below some context first.

So, in the "good" run (readdir produces the correct results), the calls to mdd_readpage look like so on the MDS:

                                                                      
00000001:00000040:4.0:1366671762.749937:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 0, attr = 7, count = 4096, npages = 1 
00000001:00000040:4.0:1366671762.755066:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 225459756, attr = 7, count = 4096, npages = 1 
00000001:00000040:4.0:1366671762.760068:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 466119302, attr = 7, count = 4096, npages = 1 
00000001:00000040:4.0:1366671762.764754:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 678250957, attr = 7, count = 4096, npages = 1 
00000001:00000040:6.0:1366671762.769742:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 885661942, attr = 7, count = 4096, npages = 1 
00000001:00000040:6.0:1366671762.774808:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1071938997, attr = 7, count = 4096, npages = 1 
00000001:00000040:4.0:1366671762.779841:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1255106162, attr = 7, count = 4096, npages = 1 
00000001:00000040:4.0:1366671762.784969:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1502393138, attr = 7, count = 4096, npages = 1 
00000001:00000040:4.0:1366671762.789711:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1713543192, attr = 7, count = 4096, npages = 1 
00000001:00000040:4.0:1366671762.794429:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1968026834, attr = 7, count = 4096, npages = 1 

But in the "bad" case (readdir produces incorrect results), the calls look like this:

                                                                      
00000001:00000040:9.0:1366734926.847245:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 0, attr = 7, count = 4096, npages = 1 
00000001:00000040:10.0:1366734926.852728:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 225459756, attr = 7, count = 4096, npages = 1 
00000001:00000040:10.0:1366734926.858534:0:4501:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff880403b28b28, hash = 466119302, attr = 7, count = 4096, npages = 1 
00000001:00000040:11.0:1366734926.862934:0:4501:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff880403b28b28, hash = 678250957, attr = 7, count = 4096, npages = 1 
00000001:00000040:6.0:1366734926.872498:0:4501:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff880403b28b28, hash = 885661942, attr = 7, count = 4096, npages = 1 
00000001:00000040:15.0:1366734926.887832:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 1071938997, attr = 7, count = 4096, npages = 1 
00000001:00000040:13.0:1366734926.894694:0:4501:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff880403b28b28, hash = 1255106162, attr = 7, count = 4096, npages = 1 
00000001:00000040:12.0:1366734926.904717:0:31865:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff8801da30fb28, hash = 1502393138, attr = 7, count = 4096, npages = 1 
00000001:00000040:7.0:1366734926.910386:0:31865:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff8801da30fb28, hash = 972141167607355620, attr = 7, count = 4096, npages = 1 
00000001:00000040:5.0:1366734926.916303:0:31865:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff8801da30fb28, hash = 2012764699691245150, attr = 7, count = 4096, npages = 1 
00000001:00000040:5.0:1366734926.925118:0:31865:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff8801da30fb28, hash = 2920144773176655063, attr = 7, count = 4096, npages = 1 
00000001:00000040:0.0:1366734927.917902:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 3811381997883059066, attr = 7, count = 4096, npages = 1 
00000001:00000040:0.0:1366734927.922300:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 1366741674, attr = 7, count = 4096, npages = 1 
00000001:00000040:0.0:1366734927.926697:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 1643777690, attr = 7, count = 4096, npages = 1 
00000001:00000040:0.0:1366734927.931104:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 1854943286, attr = 7, count = 4096, npages = 1 
00000001:00000040:0.0:1366734927.935557:0:4495:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88022acfe328, hash = 2076961697, attr = 7, count = 4096, npages = 1 

It looks pretty obvious that things go wrong in the transition from hash 1502393138 to hash 972141167607355620. But that's nothing new. So now, let's look a little closer at some of the more interesting RPCs and the call paths they take.

OK, let's look at the two RPCs where the client is asking for hash 1255106162. In both cases things seem to work just fine and are consistent with each other. Brief traces of the "good" and "bad" cases are below:

"good" case:

                                                                      
00000100:00100000:4.0:1366671762.779825:0:4513:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_33:119d1f74-b5ac-9653-7a11-c72daccd58b7+11:5521:x1432697691423683:12345-172.16.56.1@tcp1:37
00000001:00000040:4.0:1366671762.779841:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1255106162, attr = 7, count = 4096, npages = 1
00000004:00000040:4.0:1366671762.783816:0:4513:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1255106162 ldp_hash_end 1502393138 ldp_flags 0
00000001:00000040:4.0:1366671762.783835:0:4513:0:(stap_27848_src.c:842:function_mdd_readpage_return()) LU-3029: rdpg = ffff88082cd89b28, hash = 1255106162, attr = 7, count = 4096, npages = 1
00000001:00000040:4.0:1366671762.783837:0:4513:0:(stap_27848_src.c:852:function_mdd_readpage_return()) LU-3029: rdpg: ffff88082cd89b28, lu_dirpage ldp_hash_start 1255106162 ldp_hash_end 1502393138 ldp_flags 0
00000100:00100000:4.0:1366671762.784444:0:4513:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_33:119d1f74-b5ac-9653-7a11-c72daccd58b7+10:5521:x1432697691423683:12345-172.16.56.1@tcp1:37 Request procesed in 4626us (4676us total) trans 0 rc 0/0

"bad" case:

                                                                      
00000100:00100000:13.0:1366734926.894677:0:4501:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_21:119d1f74-b5ac-9653-7a11-c72daccd58b7+177:1767:x1432697831399328:12345-172.16.56.1@tcp1:37
00000001:00000040:13.0:1366734926.894694:0:4501:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff880403b28b28, hash = 1255106162, attr = 7, count = 4096, npages = 1
00000004:00000040:13.0:1366734926.898359:0:4501:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1255106162 ldp_hash_end 1502393138 ldp_flags 0
00000001:00000040:13.0:1366734926.898375:0:4501:0:(stap_31938_src.c:842:function_mdd_readpage_return()) LU-3029: rdpg = ffff880403b28b28, hash = 1255106162, attr = 7, count = 4096, npages = 1
00000001:00000040:13.0:1366734926.898376:0:4501:0:(stap_31938_src.c:852:function_mdd_readpage_return()) LU-3029: rdpg: ffff880403b28b28, lu_dirpage ldp_hash_start 1255106162 ldp_hash_end 1502393138 ldp_flags 0
00000100:00100000:15.0:1366734926.898804:0:4501:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_21:119d1f74-b5ac-9653-7a11-c72daccd58b7+176:1767:x1432697831399328:12345-172.16.56.1@tcp1:37 Request procesed in 4136us (4887us total) trans 0 rc 0/0

All good. Now let's look at the same information for the two RPCs where the client is asking for hash 1502393138.

"good" case:

                                                                      
00000100:00100000:4.0:1366671762.784957:0:4513:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_33:119d1f74-b5ac-9653-7a11-c72daccd58b7+11:5521:x1432697691423684:12345-172.16.56.1@tcp1:37
00000001:00000040:4.0:1366671762.784969:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1502393138, attr = 7, count = 4096, npages = 1
00000004:00000040:4.0:1366671762.788669:0:4513:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 1502393138 ldp_hash_end 1713543192 ldp_flags 0
00000001:00000040:4.0:1366671762.788682:0:4513:0:(stap_27848_src.c:842:function_mdd_readpage_return()) LU-3029: rdpg = ffff88082cd89b28, hash = 1502393138, attr = 7, count = 4096, npages = 1
00000001:00000040:4.0:1366671762.788684:0:4513:0:(stap_27848_src.c:852:function_mdd_readpage_return()) LU-3029: rdpg: ffff88082cd89b28, lu_dirpage ldp_hash_start 1502393138 ldp_hash_end 1713543192 ldp_flags 0
00000100:00100000:4.0:1366671762.789180:0:4513:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_33:119d1f74-b5ac-9653-7a11-c72daccd58b7+10:5521:x1432697691423684:12345-172.16.56.1@tcp1:37 Request procesed in 4228us (4255us total) trans 0 rc 0/0

"bad" case:

                                                                      
00000100:00100000:12.0:1366734926.904700:0:31865:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_43:119d1f74-b5ac-9653-7a11-c72daccd58b7+177:1767:x1432697831399329:12345-172.16.56.1@tcp1:37
00000001:00000040:12.0:1366734926.904717:0:31865:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff8801da30fb28, hash = 1502393138, attr = 7, count = 4096, npages = 1
00000004:00000040:12.0:1366734926.908287:0:31865:0:(mdd_object.c:2365:mdd_dir_page_build()) lu_dirpage ldp_hash_start 4294967296 ldp_hash_end 972141167607355620 ldp_flags 0
00000001:00000040:12.0:1366734926.908306:0:31865:0:(stap_31938_src.c:842:function_mdd_readpage_return()) LU-3029: rdpg = ffff8801da30fb28, hash = 1502393138, attr = 7, count = 4096, npages = 1
00000001:00000040:12.0:1366734926.908308:0:31865:0:(stap_31938_src.c:852:function_mdd_readpage_return()) LU-3029: rdpg: ffff8801da30fb28, lu_dirpage ldp_hash_start 1502393138 ldp_hash_end 972141167607355620 ldp_flags 0
00000100:00100000:12.0:1366734926.908847:0:31865:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_43:119d1f74-b5ac-9653-7a11-c72daccd58b7+176:1767:x1432697831399329:12345-172.16.56.1@tcp1:37 Request procesed in 4158us (9365us total) trans 0 rc 0/0

Whoa, this is strange: the two RPCs don't look the same at all here. In the "good" case a lu_dirpage is built with hash_start 1502393138 and hash_end 1713543192. In the "bad" case the lu_dirpage is built with hash_start 4294967296 and hash_end 972141167607355620.

So why is the "bad" case building a lu_dirpage with these "wrong" dirents and hash_start/hash_end values? The client then uses these bogus values verbatim even though they don't appear to be correct.

Digging into the RPC asking for hash 1502393138 in greater detail, it looks like the osd_it_ea_load call isn't loading the correct iterator information in the "bad" case. When "good" asks to load hash = 1502393138, the iterator is loaded with oied_off = 1502393138. But when "bad" attempts to load hash = 1502393138, it gets oied_off = 4294967296 (that is 0x100000000, or 2^32 — the hash_start value of its lu_dirpage, and the value osd_it_ea_store returns in the trace below).

"good" case:

                                                                      
00000100:00100000:4.0:1366671762.784957:0:4513:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_33:119d1f74-b5ac-9653-7a11-c72daccd58b7+11:5521:x1432697691423684:12345-172.16.56.1@tcp1:37
00000001:00000040:4.0:1366671762.784969:0:4513:0:(stap_27848_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff88082cd89b28, hash = 1502393138, attr = 7, count = 4096, npages = 1
00000001:00000040:4.0:1366671762.784973:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_load entering
00000001:00000040:4.0:1366671762.784974:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 0, .oie_it_dirent = 0, .oie_dirent = (null), .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.784976:0:4513:0:(stap_27848_src.c:741:function_dinfo_osd_it_ea_dirent()) LU-3029: ea->oie_dirent == NULL
00000001:00000040:4.0:1366671762.785057:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6000) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466162723, .oied_off = 1502393138, .oied_namelen = 5, .oied_type = 8, .oied_name = 01319YZ }
00000001:00000040:4.0:1366671762.785059:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff8808341e6000, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785062:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_load leaving (rc = return=0x1)
00000001:00000040:4.0:1366671762.785064:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size entering
00000001:00000040:4.0:1366671762.785066:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size leaving (rc = return=0x5)
00000001:00000040:4.0:1366671762.785068:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store entering
00000001:00000040:4.0:1366671762.785070:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store leaving (rc = return=0x598cb332)
00000001:00000040:4.0:1366671762.785072:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec entering
00000001:00000040:4.0:1366671762.785073:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff8808341e6000, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785075:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6000) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466162723, .oied_off = 1502393138, .oied_namelen = 5, .oied_type = 8, .oied_name = 01319YZ }
00000001:00000040:4.0:1366671762.785082:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6000) = { .oied_fid = [0x5a59cc33e0:0x1c51a:0x0], .oied_ino = 466162723, .oied_off = 1502393138, .oied_namelen = 5, .oied_type = 8, .oied_name = 01319YZ }
00000001:00000040:4.0:1366671762.785084:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff8808341e6000, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785087:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec leaving (rc = return=0x0)
00000001:00000040:4.0:1366671762.785089:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next entering
00000001:00000040:4.0:1366671762.785089:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff8808341e6000, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785091:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6000) = { .oied_fid = [0x5a59cc33e0:0x1c51a:0x0], .oied_ino = 466162723, .oied_off = 1502393138, .oied_namelen = 5, .oied_type = 8, .oied_name = 01319YZ }
00000001:00000040:4.0:1366671762.785094:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6030) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466164047, .oied_off = 1502967108, .oied_namelen = 5, .oied_type = 8, .oied_name = 01907 }
00000001:00000040:4.0:1366671762.785096:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff8808341e6030, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785099:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next leaving (rc = return=0x0)
00000001:00000040:4.0:1366671762.785101:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size entering
00000001:00000040:4.0:1366671762.785102:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size leaving (rc = return=0x5)
00000001:00000040:4.0:1366671762.785104:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store entering
00000001:00000040:4.0:1366671762.785106:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store leaving (rc = return=0x59957544)
00000001:00000040:4.0:1366671762.785107:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec entering
00000001:00000040:4.0:1366671762.785108:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff8808341e6030, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785110:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6030) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466164047, .oied_off = 1502967108, .oied_namelen = 5, .oied_type = 8, .oied_name = 01907 }
00000001:00000040:4.0:1366671762.785117:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6030) = { .oied_fid = [0x5a59cc33e0:0x1c766:0x0], .oied_ino = 466164047, .oied_off = 1502967108, .oied_namelen = 5, .oied_type = 8, .oied_name = 01907 }
00000001:00000040:4.0:1366671762.785119:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff8808341e6030, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785122:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec leaving (rc = return=0x0)
00000001:00000040:4.0:1366671762.785124:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next entering
00000001:00000040:4.0:1366671762.785124:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff8808341e6030, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785126:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6030) = { .oied_fid = [0x5a59cc33e0:0x1c766:0x0], .oied_ino = 466164047, .oied_off = 1502967108, .oied_namelen = 5, .oied_type = 8, .oied_name = 01907 }
00000001:00000040:4.0:1366671762.785129:0:4513:0:(stap_27848_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88082cd8a400)->oie_dirent(ffff8808341e6060) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466163711, .oied_off = 1508584937, .oied_namelen = 5, .oied_type = 8, .oied_name = 01737 }
00000001:00000040:4.0:1366671762.785131:0:4513:0:(stap_27848_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88082cd8a400) = { .oie_obj = ffff88037ddc6d80, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 3, .oie_dirent = ffff8808341e6060, .oie_buf = ffff8808341e6000 }
00000001:00000040:4.0:1366671762.785134:0:4513:0:(stap_27848_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next leaving (rc = return=0x0)

"bad" case:

                                                                      
00000100:00100000:12.0:1366734926.904700:0:31865:0:(service.c:1724:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt_rdpg_43:119d1f74-b5ac-9653-7a11-c72daccd58b7+177:1767:x1432697831399329:12345-172.16.56.1@tcp1:37
00000001:00000040:12.0:1366734926.904717:0:31865:0:(stap_31938_src.c:800:function_mdd_readpage_call()) LU-3029: rdpg = ffff8801da30fb28, hash = 1502393138, attr = 7, count = 4096, npages = 1
00000001:00000040:12.0:1366734926.904726:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_load entering
00000001:00000040:12.0:1366734926.904728:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 0, .oie_it_dirent = 0, .oie_dirent = (null), .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904730:0:31865:0:(stap_31938_src.c:741:function_dinfo_osd_it_ea_dirent()) LU-3029: ea->oie_dirent == NULL
00000001:00000040:12.0:1366734926.904831:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396000) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466160119, .oied_off = 4294967296, .oied_namelen = 2, .oied_type = 4, .oied_name = .. }
00000001:00000040:12.0:1366734926.904835:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff880229396000, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904838:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_load leaving (rc = return=0x1)
00000001:00000040:12.0:1366734926.904841:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size entering
00000001:00000040:12.0:1366734926.904843:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size leaving (rc = return=0x2)
00000001:00000040:12.0:1366734926.904845:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store entering
00000001:00000040:12.0:1366734926.904848:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store leaving (rc = return=0x100000000)
00000001:00000040:12.0:1366734926.904850:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec entering
00000001:00000040:12.0:1366734926.904851:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff880229396000, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904853:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396000) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466160119, .oied_off = 4294967296, .oied_namelen = 2, .oied_type = 4, .oied_name = .. }
00000001:00000040:12.0:1366734926.904860:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396000) = { .oied_fid = [0x5a59cc33e0:0x1c3d9:0x0], .oied_ino = 466160119, .oied_off = 4294967296, .oied_namelen = 2, .oied_type = 4, .oied_name = .. }
00000001:00000040:12.0:1366734926.904863:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff880229396000, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904865:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec leaving (rc = return=0x0)
00000001:00000040:12.0:1366734926.904867:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next entering
00000001:00000040:12.0:1366734926.904868:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 1, .oie_dirent = ffff880229396000, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904870:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396000) = { .oied_fid = [0x5a59cc33e0:0x1c3d9:0x0], .oied_ino = 466160119, .oied_off = 4294967296, .oied_namelen = 2, .oied_type = 4, .oied_name = .. }
00000001:00000040:12.0:1366734926.904873:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396028) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466164003, .oied_off = 950193342486109, .oied_namelen = 5, .oied_type = 8, .oied_name = 01877!
00000001:00000040:12.0:1366734926.904875:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff880229396028, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904877:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next leaving (rc = return=0x0)
00000001:00000040:12.0:1366734926.904879:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size entering
00000001:00000040:12.0:1366734926.904882:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_key_size leaving (rc = return=0x5)
00000001:00000040:12.0:1366734926.904883:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store entering
00000001:00000040:12.0:1366734926.904886:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_store leaving (rc = return=0x3603220a5965d)
00000001:00000040:12.0:1366734926.904887:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec entering
00000001:00000040:12.0:1366734926.904888:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff880229396028, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904890:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396028) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466164003, .oied_off = 950193342486109, .oied_namelen = 5, .oied_type = 8, .oied_name = 01877!
00000001:00000040:12.0:1366734926.904897:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396028) = { .oied_fid = [0x5a59cc33e0:0x1c748:0x0], .oied_ino = 466164003, .oied_off = 950193342486109, .oied_namelen = 5, .oied_type = 8, .oied_name = 01877!
00000001:00000040:12.0:1366734926.904899:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff880229396028, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904902:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_rec leaving (rc = return=0x0)
00000001:00000040:12.0:1366734926.904904:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next entering
00000001:00000040:12.0:1366734926.904904:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 2, .oie_dirent = ffff880229396028, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904906:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396028) = { .oied_fid = [0x5a59cc33e0:0x1c748:0x0], .oied_ino = 466164003, .oied_off = 950193342486109, .oied_namelen = 5, .oied_type = 8, .oied_name = 01877!
00000001:00000040:12.0:1366734926.904908:0:31865:0:(stap_31938_src.c:759:function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396058) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466162984, .oied_off = 1048734590608755, .oied_namelen = 5, .oied_type = 8, .oied_name = 01430 }
00000001:00000040:12.0:1366734926.904911:0:31865:0:(stap_31938_src.c:701:function_dinfo_osd_it_ea()) LU-3029: osd_it_ea(ffff88011cdee400) = { .oie_obj = ffff8803188fd300, .oie_file = {...}, .oie_rd_dirent = 106, .oie_it_dirent = 3, .oie_dirent = ffff880229396058, .oie_buf = ffff880229396000 }
00000001:00000040:12.0:1366734926.904913:0:31865:0:(stap_31938_src.c:649:function_dinfo()) LU-3029: osd_it_ea_next leaving (rc = return=0x0)
Comment by Ned Bass [ 23/Apr/13 ]

I noticed this ext4 patch appeared in kernel-2.6.32-343.el6. We first saw this bug after updating to an ldiskfs build with this change.

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d1f5273e9adb40724a85272f248f210dc4ce919a

I don't have any direct evidence to implicate it, but given the suspicious 32-bit vs. 64-bit nature of this bug, it looks worth investigating. In particular, the iterator looks like it's being "reset" to .. in the bad case:

function_dinfo_osd_it_ea_dirent()) LU-3029: osd_it_ea(ffff88011cdee400)->oie_dirent(ffff880229396000) = { .oied_fid = [0x0:0x0:0x0], .oied_ino = 466160119, .oied_off = 4294967296, .oied_namelen = 2, .oied_type = 4, .oied_name = ..

which made these hunks of the patch jump out at me:

@@ -441,13 +552,13 @@ static int ext4_dx_readdir(struct file *filp,
        int     ret;
 
        if (!info) {
-               info = ext4_htree_create_dir_info(filp->f_pos);
+               info = ext4_htree_create_dir_info(filp, filp->f_pos);
                if (!info)
                        return -ENOMEM;
                filp->private_data = info;
        }
 
-       if (filp->f_pos == EXT4_HTREE_EOF)
+       if (filp->f_pos == ext4_get_htree_eof(filp))
                return 0;       /* EOF */
 
        /* Some one has messed with f_pos; reset the world */
@@ -455,8 +566,8 @@ static int ext4_dx_readdir(struct file *filp,
                free_rb_tree_fname(&info->root);
                info->curr_node = NULL;
                info->extra_fname = NULL;
-               info->curr_hash = pos2maj_hash(filp->f_pos);
-               info->curr_minor_hash = pos2min_hash(filp->f_pos);
+               info->curr_hash = pos2maj_hash(filp, filp->f_pos);
+               info->curr_minor_hash = pos2min_hash(filp, filp->f_pos);
        }
 
        /*
Comment by Ned Bass [ 23/Apr/13 ]

I peeked at filp->f_mode in ldiskfs_readdir() with systemtap. It is always 0x600 on the server where we see this problem, so both FMODE_32BITHASH and FMODE_64BITHASH are set. On a server where we've never seen this bug, it is 0x400, i.e. only FMODE_64BITHASH is set. I'm not sure how it's getting set that way, but that explains why the major hash is always zero.
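
For reference, here is a minimal user-space sketch decoding those f_mode values, assuming the flag values match the upstream fs.h definitions added with the 64-bit hash support (0x200 and 0x400 — an assumption, not read from our kernel):

#include <stdio.h>

/* Assumed to match the patched kernel's include/linux/fs.h. */
#define FMODE_32BITHASH 0x200   /* return 32-bit directory hashes */
#define FMODE_64BITHASH 0x400   /* return full 64-bit directory hashes */

int main(void)
{
        unsigned int bad = 0x600, good = 0x400;   /* observed f_mode values */

        printf("bad  server: 32BITHASH=%d 64BITHASH=%d\n",
               !!(bad & FMODE_32BITHASH), !!(bad & FMODE_64BITHASH));
        printf("good server: 32BITHASH=%d 64BITHASH=%d\n",
               !!(good & FMODE_32BITHASH), !!(good & FMODE_64BITHASH));
        return 0;
}

With those definitions, 0x600 decodes as both bits set and 0x400 as only FMODE_64BITHASH, which matches the observation above.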

Comment by Ned Bass [ 24/Apr/13 ]

It turns out an earlier version of the patch I linked to was carried in the ldiskfs patch series for earlier kernels.

Comment by Ned Bass [ 24/Apr/13 ]

Reviewing this bit of code:

osd_handler.c
static struct dt_it *osd_it_ea_init(const struct lu_env *env,
3475                                     struct dt_object *dt,
3476                                     __u32 attr,
3477                                     struct lustre_capa *capa)
3478 {
3479         struct osd_object       *obj  = osd_dt_obj(dt);
3480         struct osd_thread_info  *info = osd_oti_get(env);
3481         struct osd_it_ea        *it   = &info->oti_it_ea;
3482         struct lu_object        *lo   = &dt->do_lu;
3483         struct dentry           *obj_dentry = &info->oti_it_dentry;
3484         ENTRY;
3485         LASSERT(lu_object_exists(lo));
3486
3487         obj_dentry->d_inode = obj->oo_inode;
3488         obj_dentry->d_sb = osd_sb(osd_obj2dev(obj));
3489         obj_dentry->d_name.hash = 0;
3490
3491         it->oie_rd_dirent       = 0;
3492         it->oie_it_dirent       = 0;
3493         it->oie_dirent          = NULL;
3494         it->oie_buf             = info->oti_it_ea_buf;
3495         it->oie_obj             = obj;
3496         it->oie_file.f_pos      = 0;
3497         it->oie_file.f_dentry   = obj_dentry;
3498         if (attr & LUDA_64BITHASH)
3499                 it->oie_file.f_mode |= FMODE_64BITHASH;
3500         else
3501                 it->oie_file.f_mode |= FMODE_32BITHASH;

Shouldn't lines 3499 and 3501 just use an assignment operator = rather than |=?

It seems the iterator context is reused between requests handled by the mdt_rdpg_xx threads, so once set, the bits are never cleared. If a mix of 32- and 64-bit clients connect, we will end up with both FMODE_64BITHASH and FMODE_32BITHASH set, as we are seeing. (I don't think we have any 32-bit clients, so it's still a mystery why we're seeing that.)

Prior to this change it was an assignment:

http://review.whamcloud.com/#patch,sidebyside,3467,7,lustre/osd-ldiskfs/osd_handler.c

This bug surfaced after we started running 2.1.4-4chaos which included that change. I don't think it introduced the bug, but it may have unmasked it.

Comment by nasf (Inactive) [ 24/Apr/13 ]

Agreed, the "|=" will be changed back to "=".

Comment by nasf (Inactive) [ 24/Apr/13 ]

The situation should be like this:

1) When you upgraded your MDS with the patch http://review.whamcloud.com/#patch,sidebyside,3467,7,lustre/osd-ldiskfs/osd_handler.c, the 64bithash/32bithash issue was introduced into your system, because the "osd_thread_info" is reused without being fully reset when switching from one RPC to the next.

2) For an old client, whether 32-bit or 64-bit, as long as it did NOT claim the OBD_CONNECT_64BITHASH flag when it connected to the MDS, a readdir RPC from that client would cause "osd_thread_info::oti_it_ea::oie_file::f_mode" to be set to "FMODE_32BITHASH". Once such a readdir RPC happened even once, the "FMODE_32BITHASH" flag on the related "osd_thread_info" would not be cleared until the RPC service was restarted on the MDS.

3) As long as "FMODE_32BITHASH" was set, any directory hash processed by that RPC service thread would use only the 32-bit major hash. That is why we saw 32-bit dir hashes returned.

4) Not every RPC service thread's "osd_thread_info::oti_it_ea::oie_file::f_mode" had "FMODE_32BITHASH" set; it depends on whether old clients ever caused that thread to serve a readdir RPC. A thread without "FMODE_32BITHASH" generates 64-bit hashes, which is why we also saw some 64-bit dir hashes returned.

5) A readdir RPC from a client can be served by any readpage service thread. So sometimes a readdir RPC is served by a thread with "FMODE_32BITHASH" set, and sometimes by a thread without it. For a large directory, one "ls -l dir" command may trigger several readdir RPCs; if those RPCs are handled by different service threads, some with "FMODE_32BITHASH" and some without, then when the client sends a 32-bit hash to a thread that does NOT have "FMODE_32BITHASH", that thread interprets the 32-bit hash as "major = 0, minor = 32bithash", which is wrong, so it cannot locate the right position (see the sketch after this list).

6) For a 2.x client, one readdir RPC fetches back at most 256 pages, but a 1.8 client gets only a single page per RPC. So the number of readdir RPCs for the same sized directory is different, and more readdir RPCs means a higher chance of failure. That is why the failure is easier to reproduce on a 1.8 client than on a 2.x client.
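
To make point 5 concrete, here is a small stand-alone C sketch of the effect. The helpers are simplified user-space copies shaped like the pos2maj_hash()/pos2min_hash()/hash2pos() routines from the ext4 patch Ned linked; the flag values (0x200/0x400), the exact shifts, and the sample major/minor hash are assumptions for illustration, not taken from our ldiskfs source:

#include <stdio.h>
#include <stdint.h>

/* Assumed flag values, matching the upstream fs.h additions. */
#define FMODE_32BITHASH 0x200
#define FMODE_64BITHASH 0x400

/* Simplified versions of the ext4 helpers from the patch above. */
static uint64_t hash2pos(unsigned mode, uint32_t major, uint32_t minor)
{
        if (mode & FMODE_32BITHASH)
                return major >> 1;                      /* collapsed 32-bit cookie */
        return ((uint64_t)(major >> 1) << 32) | minor;  /* full 64-bit cookie */
}

static uint32_t pos2maj_hash(unsigned mode, uint64_t pos)
{
        if (mode & FMODE_32BITHASH)
                return (pos << 1) & 0xffffffff;
        return ((pos >> 32) << 1) & 0xffffffff;
}

static uint32_t pos2min_hash(unsigned mode, uint64_t pos)
{
        if (mode & FMODE_32BITHASH)
                return 0;
        return pos & 0xffffffff;
}

int main(void)
{
        /* Major/minor hash of some dirent; chosen so that the collapsed
         * 32-bit cookie is 1502393138, the boundary hash seen in the traces. */
        uint32_t major = 0xb3196664u, minor = 0x1234u;

        /* A "tainted" thread (FMODE_32BITHASH takes priority) builds the page
         * boundary that the client sends back as its next start hash. */
        uint64_t cookie = hash2pos(FMODE_32BITHASH | FMODE_64BITHASH,
                                   major, minor);
        printf("cookie handed to client: %llu\n",
               (unsigned long long)cookie);

        /* An untainted 64-bit thread then decodes that 32-bit cookie. */
        printf("major = %u, minor = %u\n",
               pos2maj_hash(FMODE_64BITHASH, cookie),
               pos2min_hash(FMODE_64BITHASH, cookie));
        return 0;
}

The thread on the 32-bit branch hands back the collapsed cookie 1502393138 (matching the boundary hash in the traces), and the 64-bit thread then decodes it as major = 0, minor = 1502393138, so it seeks to major hash 0 — consistent with the iterator getting reset to ".." in the "bad" trace above.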

Comment by Ned Bass [ 24/Apr/13 ]

nasf, thanks for the analysis. It does seem plausible, except that our oldest client is lustre-1.8.5.0-6chaos, and it seems to claim OBD_CONNECT_64BITHASH:

https://github.com/chaos/lustre/blob/1.8.5.0-6chaos/lustre/llite/llite_lib.c#L284

Another unsubstantiated theory is that a swabbing bug could make a PPC client look like 32-bit. But, I checked the import file in /proc for a PPC client and it said 64bithash.

The only other idea I have is that maybe there is some race in checking the exp_connect_flags here:

https://github.com/chaos/lustre/blob/2.1.4-4chaos/lustre/mdt/mdt_handler.c#L1488
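
For context, the check in question is roughly of this shape on the MDS side (paraphrased; the variable names here are shorthand, not the exact 2.1.4-4chaos source):

        /* Sketch: mdt_readpage() chooses the hash width for the reply from
         * the export's connect flags negotiated at connect time. */
        if (exp->exp_connect_flags & OBD_CONNECT_64BITHASH)
                rdpg->rp_attrs |= LUDA_64BITHASH;

If that test raced, a 32-bit page could be built for a 64-bit client; whether that can actually happen is the open question here.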

Comment by nasf (Inactive) [ 24/Apr/13 ]

"Old client" means a client that was ever old — for example, a client upgraded from lustre-1.8.x to lustre-1.8.5.0-6chaos. If the client ever triggered a readdir RPC before the upgrade, it would cause the issues above.

As for the race in the flag check, I am not sure. Since it would be a race, it should be rare, which means most RPC service threads should NOT have "FMODE_32BITHASH" set; but the results do not support that conclusion. On the other hand, if such a race existed it should be possible for me to reproduce it locally, but I have tested with your branches thousands of times and never reproduced it. So any such "race" would have to be extremely rare...

Comment by Ned Bass [ 24/Apr/13 ]

"Old client" means a client that was ever old — for example, a client upgraded from lustre-1.8.x to lustre-1.8.5.0-6chaos. If the client ever triggered a readdir RPC before the upgrade, it would cause the issues above.

The 1.8 clients have been running the same version since long before the MDS was last rebooted.

Comment by nasf (Inactive) [ 24/Apr/13 ]

Have any liblustre clients or old b2_0 clients accessed your system since the last MDS reboot? They do not support 64-bit dirhash.

Comment by Ned Bass [ 24/Apr/13 ]

No, I don't believe so. The PPC clients are running 2.3.58; all other clients should be 2.1.x or 1.8.5.

Comment by nasf (Inactive) [ 24/Apr/13 ]

It is NOT true that all lustre-1.8.5 clients support 64bithash. I have checked your branches and found that the oldest branch which supports 64bithash is lustre-1.8.5.0-6chaos. The earlier versions, such as lustre-1.8.5.0-{1/2/3/4/5}chaos, do NOT support 64bithash.

Comment by Prakash Surya (Inactive) [ 24/Apr/13 ]
It is NOT true that all lustre-1.8.5 clients support 64bithash. I have checked your branches and found that the oldest branch which supports 64bithash is lustre-1.8.5.0-6chaos. The earlier versions, such as lustre-1.8.5.0-{1/2/3/4/5}chaos, do NOT support 64bithash.

Nice catch! We do have some clients running a 1.8.5-5chaos tag (old BG/P systems), and interestingly enough, they are only mounting the two filesystems that we see this issue on. So it all seems to add up, IMO. The 1.8.5-5chaos clients appear to have tainted a subset of the MDS threads, causing 64-bit-enabled clients to see this issue when a readdir takes more than one RPC and is serviced by a mix of tainted and untainted MDS threads.

Comment by D. Marc Stearman (Inactive) [ 24/Apr/13 ]

We do have PPC64 clients on our BG/P systems that are stuck at 1.8. The udawn front end (login) nodes are running 1.8.5.0-5chaos. The IONs are running 1.8.5-8chaos.

Comment by nasf (Inactive) [ 24/Apr/13 ]

So please apply the patch:

http://review.whamcloud.com/#change,6138

Comment by Christopher Morrone [ 24/Apr/13 ]

Actually, the BG/P systems are 32-bit (at least the I/O nodes are)! BG/Q (Sequoia and Vulcan) are the first PPC systems that are fully 64-bit.

Comment by Ned Bass [ 25/Apr/13 ]

The patch should be backported to b2_1 as well.

Comment by nasf (Inactive) [ 26/Apr/13 ]

Patch for b2_1:

http://review.whamcloud.com/#change,6176

Comment by Peter Jones [ 20/May/13 ]

A version of this patch has landed for both 2.4.0 and 2.1.6. Has LLNL been able to confirm that this work does correct the issue? Are we able to mark this issue as resolved?

Comment by Ned Bass [ 20/May/13 ]

Peter, yes I believe we've had no further reports of this issue since we rolled out the patch. Marking resolved.

Comment by Peter Jones [ 20/May/13 ]

Excellent - thanks Ned!
