Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.1.4
    • Labels: None
    • Environment: client and server: lustre-2.1.4-5chaos_2.6.32_358.6.1.3chaos.ch5.1.x86_64.x86_64
    • Severity: 3
    • Rank: 8853

    Description

      We have found two directories on our lscratchc filesystem for which getdents() returns EIO. The error occurs on all clients and persists across reboots and remounts. The analysis below suggests a client-side page caching problem; no MDS_READPAGE RPC is ever sent.

      # cab14 /root > strace -e trace=getdents ls /p/lscratchc/hood/Ag/s_local/fcc
      getdents(3, 0x61fcc8, 32768)            = -1 EIO (Input/output error)
      ls: reading directory /p/lscratchc/hood/Ag/s_local/fcc: Input/output error
      # cab14 /root > dmesg | tail
      LustreError: 123961:0:(dir.c:477:ll_get_dir_page()) read cache page: [0x38353ea534c:0x1:0x0] at 0: rc -5
      LustreError: 123961:0:(dir.c:648:ll_readdir()) error reading dir [0x38353ea534c:0x1:0x0] at 0: rc -5
      

      No errors appear on the servers.
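
      For reference, the wider client debug mask can be enabled and the log dumped roughly as follows (a sketch; the exact lctl invocation may vary by version):

      # cab14 /root > lctl set_param debug="+vfstrace +trace"
      # cab14 /root > ls /p/lscratchc/hood/Ag/s_local/fcc
      # cab14 /root > lctl debug_kernel /tmp/lustre-debug.log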

      With +vfstrace +trace debugging on the client we get:

      00000002:00000001:7.0:1370559009.547576:0:20012:0:(mdc_locks.c:159:mdc_set_lock_data()) Process leaving (rc=0 : 0 : 0)
      00000080:00000001:7.0:1370559009.547577:0:20012:0:(obd_class.h:2119:md_set_lock_data()) Process leaving (rc=0 : 0 : 0)
      00000080:00020000:7.0:1370559009.547581:0:20012:0:(dir.c:438:ll_get_dir_page()) dir page locate: [0x38353ea534c:0x1:0x0] at 0: rc -5
      00000080:00000001:7.0:1370559009.557292:0:20012:0:(dir.c:439:ll_get_dir_page()) Process leaving via out_unlock (rc=18446744073709551611 : -5 : 0xfffffffffffffffb)
      

      Corresponding code in ll_get_dir_page():

       426         } else {                                                                
       427                 /* for cross-ref object, l_ast_data of the lock may not be set, 
       428                  * we reset it here */                                          
       429                 md_set_lock_data(ll_i2sbi(dir)->ll_md_exp, &lockh.cookie,       
       430                                  dir, NULL);                                    
       431         }                                                                       
       432         ldlm_lock_dump_handle(D_OTHER, &lockh);                                 
       433                                                                                 
       434         cfs_down(&lli->lli_readdir_sem);                                        
       435         page = ll_dir_page_locate(dir, &lhash, &start, &end);                   
       436         if (IS_ERR(page)) {                                                     
       437                 CERROR("dir page locate: "DFID" at "LPU64": rc %ld\n",          
       438                        PFID(ll_inode2fid(dir)), lhash, PTR_ERR(page));          
       439                 GOTO(out_unlock, page);                                         
       440         }                           
      

      Looking at the source for ll_dir_page_locate():

                                                                            
       340                 if (PageUptodate(page)) {                                       
      ...                                                                             
       352                         CDEBUG(D_VFSTRACE, "page %lu [%llu %llu], hash "LPU64"\n",
      ...                                                                             
       369                 } else {                                                        
       370                         page_cache_release(page);                               
       371                         page = ERR_PTR(-EIO);                                   
       372                 }                                                          
      

      It looks like the page that it finds using radix_tree_gang_lookup() fails the PageUptodate() check, which results in EIO being returned. A simplified paraphrase of that path follows.
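
      The sketch below is illustrative only, not the verbatim Lustre source; names are simplified and a plain index lookup stands in for the hash-based radix_tree_gang_lookup():

      #include <linux/err.h>
      #include <linux/pagemap.h>

      /* Look up a directory page in the inode's page cache.  Returns NULL
       * if the page is absent (the caller would then fetch it from the
       * MDS), the page on success, or ERR_PTR(-EIO) if a cached page is
       * found but is not marked up to date -- the failure mode seen in
       * the logs above. */
      static struct page *dir_page_find_cached(struct inode *dir, pgoff_t index)
      {
              struct page *page;

              page = find_get_page(dir->i_mapping, index);
              if (page == NULL)
                      return NULL;

              if (!PageUptodate(page)) {
                      /* Drop our reference and give up with EIO rather
                       * than re-reading the page from the server. */
                      page_cache_release(page);
                      return ERR_PTR(-EIO);
              }
              return page;
      }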

      The FIDs of the two directories are [0x6da716ef142:0x1:0x0] and [0x38353ea534c:0x1:0x0]. Note that we have burned through a huge number of FID sequences on this filesystem due to LU-1632. I wonder if we've stumbled on a few "magic" numbers that expose a strange corner-case in the hashing code.

      LLNL-bug-ID: TOSS-2026

    Activity

            [LU-3519] EIO on directory access

            Closing as duplicate of LU-2627. I will create a follow-on ticket to improve handling of large non-dx directories created as a result of LU-2638.

            nedbass Ned Bass (Inactive) added a comment

            I believe the error occurs for directories that don't have EXT4_INDEX_FL set and that span multiple extents. I'm not sure how these directories came to exist on our filesystem. Perhaps they predate our upgrade from 1.8, although I think their creation dates are more recent than the upgrade.

            I can reproduce this on a test filesystem by mounting the MDT with the dir_index feature disabled, creating a directory with about 500 files from a client, and then remounting the MDT with dir_index re-enabled. getdents() on the new directory then returns EIO as described above.
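
            Roughly, that reproduction looks like the following (a sketch; the MDT device /dev/sdb and the mount points are placeholders):

            # mds /root > umount /mnt/mdt
            # mds /root > tune2fs -O ^dir_index /dev/sdb
            # mds /root > mount -t lustre /dev/sdb /mnt/mdt
            # client /root > mkdir /mnt/testfs/bigdir
            # client /root > touch /mnt/testfs/bigdir/f.{1..500}
            # mds /root > umount /mnt/mdt
            # mds /root > tune2fs -O dir_index /dev/sdb
            # mds /root > mount -t lustre /dev/sdb /mnt/mdt
            # client /root > ls /mnt/testfs/bigdir   # getdents() returns EIO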

            nedbass Ned Bass (Inactive) added a comment

            Nevermind, it looks like those are valid rec_len values after all. The rec_len of the last entry in a block is always large enough to span to the end of the block: the entry for ag_4x4x4_7.842_4c begins at offset 0x7a0 in its 4096-byte block, and 4096 - 0x7a0 = 2144 (0x860), exactly the value observed.

            nedbass Ned Bass (Inactive) added a comment

            I found that the affected directories both have entries with a corrupt rec_len field.

            For example, the last block allocated to one directory is block 311251665. We dump that block and find the directory entry for the last file in the output of debugfs.ldiskfs ls:

            # sumom-mds1 /root > debugfs.ldiskfs -c -R 'ls /ROOT/hood/Ag/s_local/fcc' /dev/sdb  | tail -2
            debugfs.ldiskfs 1.42.3.wc1.1chaos (28-May-2012)
            /dev/sdb: catastrophic mode - not reading inode or group bitmaps
             622493978  (216) ag_4x4x4_7.78_1c    622493988  (24) ag_4x4x4_7.80_2c   
             622494005  (2144) ag_4x4x4_7.842_4c   
            # sumom-mds1 /root > printf "%x\n" 622494005
            251a8135
            
            # sumom-mds1 /root > dd if=/dev/sdb skip=311251665 bs=4k count=1 | hexdump -C | less
            [snip]
            000007a0  35 81 1a 25 60 08 11 01  61 67 5f 34 78 34 78 34  |5..%`...ag_4x4x4|
            000007b0  5f 37 2e 38 34 32 5f 34  63 80 1a 25 ca 85 1a 25  |_7.842_4c..%...%|
            000007c0  40 00 18 01 61 67 5f 34  78 34 78 34 5f 37 2e 38  |@...ag_4x4x4_7.8|
            [snip]
            

            This directory does not have the EXT4_INDEX_FL flag set, so it is using a linear array of ext4_dir_entry_2 entries:

            struct ext4_dir_entry_2 {                                                       
                    __le32  inode;                  /* Inode number */                      
                    __le16  rec_len;                /* Directory entry length */            
                    __u8    name_len;               /* Name length */                       
                    __u8    file_type;                                                      
                    char    name[EXT4_NAME_LEN];    /* File name */                         
            };      
            

            The dirent for our inode 622494005 begins at offset 000007a0 in the dump above. Note that the rec_len value is 0x860, or 2144 bytes, while the other fields appear intact. The expected value is 28 (0x1c). Repeating this analysis for the other bad directory also turned up an invalid rec_len value, 1780 (0x6f4), with the other fields intact. This pattern makes me wonder if a Lustre or ldiskfs bug is corrupting this field.
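
            For reference, a block like the one dumped above can be decoded with a small userspace walker along these lines (an illustrative sketch, assuming a little-endian host and a 4 KiB block size; this is not Lustre or ext4 source, and it applies ext4_check_dir_entry()-style bounds checks only loosely):

            #include <stdint.h>
            #include <stdio.h>

            #define BLOCK_SIZE 4096

            /* On-disk layout of a linear directory entry
             * (cf. ext4_dir_entry_2). */
            struct dirent2 {
                    uint32_t inode;     /* inode number, little-endian */
                    uint16_t rec_len;   /* distance to the next entry */
                    uint8_t  name_len;
                    uint8_t  file_type;
                    char     name[];
            };

            static int walk_dir_block(const unsigned char *block)
            {
                    unsigned int off = 0;

                    while (off < BLOCK_SIZE) {
                            const struct dirent2 *de =
                                    (const struct dirent2 *)(block + off);
                            /* 8-byte header plus the name; real ext4 also
                             * rounds this up to a multiple of 4. */
                            unsigned int need = 8 + de->name_len;

                            if (de->rec_len < need ||
                                off + de->rec_len > BLOCK_SIZE) {
                                    fprintf(stderr,
                                            "bad rec_len %u at offset %#x\n",
                                            de->rec_len, off);
                                    return -1;
                            }
                            if (de->inode != 0)  /* 0 marks a deleted entry */
                                    printf("%10u (%u) %.*s\n", de->inode,
                                           de->rec_len, de->name_len,
                                           de->name);
                            /* Note: the last live entry's rec_len
                             * legitimately spans to the end of the block,
                             * so a large value here is not corruption by
                             * itself. */
                            off += de->rec_len;
                    }
                    return 0;
            }

            int main(void)
            {
                    unsigned char block[BLOCK_SIZE];

                    if (fread(block, 1, BLOCK_SIZE, stdin) != BLOCK_SIZE) {
                            fprintf(stderr, "short read\n");
                            return 1;
                    }
                    return walk_dir_block(block) ? 1 : 0;
            }

            Piping the same dd output through it (dd if=/dev/sdb skip=311251665 bs=4k count=1 | ./dirwalk) should reproduce the listing that debugfs.ldiskfs printed above.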

            nedbass Ned Bass (Inactive) added a comment

            On closer inspection, it appears the EIO originates on the MDS. The debug data above was taken after the client had already cached the directory page. With a cold client cache, I see the error coming from the MDS code below (line 3639). So we may just have some on-disk corruption on the MDT and need to run fsck.

            00000004:00000001:10.0:1372447578.208371:0:21576:0:(osd_handler.c:3628:osd_ldiskfs_it_fill()) Process entered
            00000004:00000001:10.0:1372447578.208372:0:21576:0:(osd_handler.c:3645:osd_ldiskfs_it_fill()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00000004:00000001:10.0:1372447578.208373:0:21576:0:(osd_handler.c:3680:osd_it_ea_next()) Process leaving (rc=18446744073709551611 : -5 : fffffffffffffffb)
            00000004:00000400:10.0:1372447578.208374:0:21576:0:(mdd_object.c:2441:__mdd_readpage()) build page failed: -5!
            
            3621 static int osd_ldiskfs_it_fill(const struct dt_it *di)
            3622 {
            3623         struct osd_it_ea   *it    = (struct osd_it_ea *)di;
            3624         struct osd_object  *obj   = it->oie_obj;
            3625         struct inode       *inode = obj->oo_inode;
            3626         int                result = 0;
            3627 
            3628         ENTRY;
            3629         it->oie_dirent = it->oie_buf;
            3630         it->oie_rd_dirent = 0;
            3631 
            3632         cfs_down_read(&obj->oo_ext_idx_sem);
            3633         result = inode->i_fop->readdir(&it->oie_file, it,
            3634                                        (filldir_t) osd_ldiskfs_filldir);
            3635 
            3636         cfs_up_read(&obj->oo_ext_idx_sem);
            3637 
            3638         if (it->oie_rd_dirent == 0) {
            3639                 result = -EIO;
            3640         } else {
            3641                 it->oie_dirent = it->oie_buf;
            3642                 it->oie_it_dirent = 1;
            3643         }
            3644 
            3645         RETURN(result);
            3646 }
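
            If on-disk corruption of the MDT is suspected, a read-only consistency check (with the target unmounted) would look something like this (a sketch; the mount point is a placeholder, the device name is from the debugfs session above):

            # sumom-mds1 /root > umount /mnt/mdt
            # sumom-mds1 /root > e2fsck -fn /dev/sdb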
            
            nedbass Ned Bass (Inactive) added a comment
            pjones Peter Jones added a comment -

            Di

            Could you please comment?

            Thanks

            Peter


    People

      Assignee: di.wang Di Wang
      Reporter: nedbass Ned Bass (Inactive)
      Votes: 0
      Watchers: 3
