Details
- Bug
- Resolution: Duplicate
- Critical
- None
- Lustre 2.1.4
- None
- client and server: lustre-2.1.4-5chaos_2.6.32_358.6.1.3chaos.ch5.1.x86_64.x86_64
- 3
- 8853
Description
We have found two directories on our lscratchc filesystem for which getdents() returns EIO. This error occurs on all clients and across reboots/remounts. Analysis below suggests a client-side page caching problem. No MDS_READPAGE RPC is ever sent.
# cab14 /root > strace -e trace=getdents ls /p/lscratchc/hood/Ag/s_local/fcc
getdents(3, 0x61fcc8, 32768) = -1 EIO (Input/output error)
ls: reading directory /p/lscratchc/hood/Ag/s_local/fcc: Input/output error
# cab14 /root > dmesg | tail
LustreError: 123961:0:(dir.c:477:ll_get_dir_page()) read cache page: [0x38353ea534c:0x1:0x0] at 0: rc -5
LustreError: 123961:0:(dir.c:648:ll_readdir()) error reading dir [0x38353ea534c:0x1:0x0] at 0: rc -5
No errors appear on the servers.
With +vfstrace +trace debugging on the client we get:
00000002:00000001:7.0:1370559009.547576:0:20012:0:(mdc_locks.c:159:mdc_set_lock_data()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:7.0:1370559009.547577:0:20012:0:(obd_class.h:2119:md_set_lock_data()) Process leaving (rc=0 : 0 : 0)
00000080:00020000:7.0:1370559009.547581:0:20012:0:(dir.c:438:ll_get_dir_page()) dir page locate: [0x38353ea534c:0x1:0x0] at 0: rc -5
00000080:00000001:7.0:1370559009.557292:0:20012:0:(dir.c:439:ll_get_dir_page()) Process leaving via out_unlock (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
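The last trace line prints the same return value three ways: unsigned decimal, signed decimal, and hex. As a small sketch of how they relate (the names decode_rc() and decode_rc_selfcheck() are hypothetical, not Lustre functions), the unsigned value is the 64-bit two's complement encoding of -5, that is, -EIO on Linux:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: reinterpret the unsigned 64-bit value printed in
 * the trace as a signed return code, without relying on an
 * implementation-defined unsigned-to-signed conversion.
 */
static int64_t decode_rc(uint64_t raw)
{
    if (raw <= (uint64_t)INT64_MAX)
        return (int64_t)raw;
    /* Two's complement: value = -(~raw) - 1, and ~raw fits in int64_t here. */
    return -(int64_t)(~raw) - 1;
}

/* Self-check against the values from the trace line above. */
static void decode_rc_selfcheck(void)
{
    assert(decode_rc(18446744073709551611ULL) == -5);
    assert(decode_rc(0xfffffffffffffffbULL) == -EIO); /* EIO is 5 on Linux */
}
```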
Corresponding code in ll_get_dir_page():
426         } else {
427                 /* for cross-ref object, l_ast_data of the lock may not be set,
428                  * we reset it here */
429                 md_set_lock_data(ll_i2sbi(dir)->ll_md_exp, &lockh.cookie,
430                                  dir, NULL);
431         }
432         ldlm_lock_dump_handle(D_OTHER, &lockh);
433
434         cfs_down(&lli->lli_readdir_sem);
435         page = ll_dir_page_locate(dir, &lhash, &start, &end);
436         if (IS_ERR(page)) {
437                 CERROR("dir page locate: "DFID" at "LPU64": rc %ld\n",
438                        PFID(ll_inode2fid(dir)), lhash, PTR_ERR(page));
439                 GOTO(out_unlock, page);
440         }
Relevant source from ll_dir_page_locate():
340         if (PageUptodate(page)) {
...
352                 CDEBUG(D_VFSTRACE, "page %lu [%llu %llu], hash "LPU64"\n",
...
369         } else {
370                 page_cache_release(page);
371                 page = ERR_PTR(-EIO);
372         }
It looks like the page found by radix_tree_gang_lookup() fails the PageUptodate(page) check, so ll_dir_page_locate() returns ERR_PTR(-EIO) and the error propagates back to getdents().
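The suspected control flow can be modeled in userspace. The following is only a toy sketch, not the Lustre code: struct toy_page and every toy_* helper are hypothetical stand-ins for struct page, ERR_PTR(), PTR_ERR(), IS_ERR() and ll_dir_page_locate(). It also assumes that a cache miss (NULL) is what would lead the caller to read the page from the MDS, which the excerpts above do not show.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a cached directory page. */
struct toy_page {
    bool uptodate;  /* models PageUptodate(page) */
    int  refcount;  /* models the page reference count */
};

#define TOY_MAX_ERRNO 4095

/* Userspace imitations of the kernel's error-pointer helpers. */
static void *toy_err_ptr(long err)      { return (void *)(intptr_t)err; }
static long  toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
static bool  toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/*
 * Toy model of the lookup: a cached page that is up to date is returned
 * with an extra reference; a cached page that is NOT up to date is
 * released and reported as -EIO; a miss returns NULL (assumed to be the
 * case where the caller would go on to fetch the page).
 */
static struct toy_page *toy_dir_page_locate(struct toy_page *cached)
{
    if (cached == NULL)
        return NULL;

    cached->refcount++;             /* the lookup takes a reference */
    if (cached->uptodate)
        return cached;

    assert(cached->refcount > 0);
    cached->refcount--;             /* models page_cache_release(page) */
    return toy_err_ptr(-EIO);
}
```

In this model a stale page left in the cache makes every lookup fail with -EIO without ever reaching the miss path, which would be consistent with no MDS_READPAGE RPC being sent.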
The FIDs of the two directories are [0x6da716ef142:0x1:0x0] and [0x38353ea534c:0x1:0x0]. Note that we have burned through a huge number of FID sequences on this filesystem due to LU-1632. I wonder if we've stumbled on a few "magic" numbers that expose a strange corner-case in the hashing code.
LLNL-bug-ID: TOSS-2026
Issue Links
- duplicates LU-2627 /bin/ls gets Input/output error (Resolved)