[LU-1322] 1.8 client hang with 1.8.4 server Created: 13/Apr/12  Updated: 22/Feb/13  Resolved: 04/Jan/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6, Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: Lustre 1.8.9

Type: Bug Priority: Major
Reporter: Peng Tao Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: emc, patch
Environment:

CentOS 5 with 1.8.6-WC1 clients and 1.8.4 servers.


Severity: 3
Epic: interoperability
Rank (Obsolete): 4028

 Description   

While running some tests, we found that the client hangs. Further investigation shows that the client is looping forever in ll_readdir_page(). The first ll_dir_entry is correct, but some of the following ll_dir_entry records are all zeros. Because lde_rec_len is 0 in those records, ll_dir_next_entry() does not advance and the loop never terminates.

crash> struct ll_dir_entry 0xffff8105f5818000
struct ll_dir_entry {
lde_inode = 748257583,
lde_rec_len = 12,
lde_name_len = 1 '\001',
lde_file_type = 2 '\002',
lde_name = ".\000\000\000\200\202\230,\f\000\002\002..\000\0000\201\231,\024\000\n\001ssciohb.nrat1\201\231,\020\000\005\001krsni8552\201\231,\020\000\005\001nticpemc3\201\231,\020\000\b\001crn.wole4\201\231,\f\000\003\001lsrt5\201\231,\024\000\n\001loita.hdal2.6\201\231,\024\000\n\001feekg.vsri9\0007\201\231,\024\000\f\001lgaumt.ggesd8\201\231,\024\000\v\001eilltn.ncsr.9\201\231,\020\000\b\001sai.blol:\201\231,\024\000\t\001rnbtmru.sing;\201\231,\020\000\006\001eta.fo64<\201\231,\f\000\003\001rkr.=\201\231,\020\000\005\001aco.d13"
}

crash> struct ll_dir_entry 0xffff8105f5818a19
struct ll_dir_entry {
lde_inode = 0,
lde_rec_len = 0,
lde_name_len = 0 '\0',
lde_file_type = 0 '\0',
lde_name
}

After applying the changes below, the tests pass smoothly and the debug message is printed many times.

diff --git a/lustre/llite/dir.c b/lustre/llite/dir.c
index 3154d32..3b9779b 100644
--- a/lustre/llite/dir.c
+++ b/lustre/llite/dir.c
@@ -327,6 +327,12 @@ static int ll_readdir_page(char *addr, __u64 base, unsigned *offset,
         de = ll_entry_at(addr, *offset);
         end = addr + CFS_PAGE_SIZE - ll_dir_rec_len(1);
         for (nr = 0 ;(char*)de <= end; de = ll_dir_next_entry(de)) {
+                if (de->lde_rec_len == 0) {
+                        printk("bergwolf debug\n");
+                        printk("de %p lde_inode %d lde_rec_len %d lde_name_len %d lde_file_type %d\n",
+                               de, de->lde_inode, de->lde_rec_len, de->lde_name_len, de->lde_file_type);
+                        break;
+                }
                 if (de->lde_inode != 0) {
                         nr++;
                         *offset = (char *)de - addr;

It may not be the right fix as I didn't figure out why the page is partially zeroed.
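
To illustrate why a zero lde_rec_len stalls the walk, here is a minimal userspace sketch of the same loop shape with the guard added. It is not the Lustre code: struct sketch_dir_entry, SKETCH_PAGE_SIZE, SKETCH_MIN_REC_LEN, sketch_put_entry() and sketch_count_entries() are hypothetical names, the layout only mimics the fields shown in the crash output, and values are kept in host byte order for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_MIN_REC_LEN 12u   /* 8-byte header + 1 name byte, rounded up to 4 */

/* Hypothetical stand-in for struct ll_dir_entry; name bytes follow the header. */
struct sketch_dir_entry {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;
};

/* Test helper (assumption, not in the ticket): write one record into the page. */
static void sketch_put_entry(unsigned char *page, unsigned offset,
                             uint32_t inode, uint16_t rec_len, const char *name)
{
    struct sketch_dir_entry de = { inode, rec_len, (uint8_t)strlen(name), 0 };

    assert(page != NULL);
    memcpy(page + offset, &de, sizeof(de));
    memcpy(page + offset + sizeof(de), name, de.name_len);
}

/*
 * Count records with a non-zero inode, starting at `offset`.
 * Returns -1 as soon as a record with rec_len == 0 is seen; without that
 * check, `offset += de.rec_len` would never advance and the loop would spin.
 */
static int sketch_count_entries(const unsigned char *page, unsigned offset)
{
    int nr = 0;

    assert(page != NULL);
    while (offset + SKETCH_MIN_REC_LEN <= SKETCH_PAGE_SIZE) {
        struct sketch_dir_entry de;

        memcpy(&de, page + offset, sizeof(de));   /* avoids unaligned reads */
        if (de.rec_len == 0)
            return -1;
        if (de.inode != 0)
            nr++;
        offset += de.rec_len;
    }
    return nr;
}
```

On a fully zeroed page, sketch_count_entries(page, 0) returns -1 immediately instead of hanging. Like the debug patch above, this only stops the loop; it does not explain why zeroed records are being read.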



 Comments   
Comment by Peng Tao [ 25/Jun/12 ]

This is also reproduced with 1.8.8-WC1 clients.

From a generated kernel dump, it appears that there are only three valid dentries in the page, starting at offsets 0, 12, and 24. However, in ll_readdir(), the kernel asks to read a dentry from page offset 90, which lies inside the third entry, and therefore reads garbage data.

crash> struct ll_dir_entry 0xffff88051bd73000
struct ll_dir_entry {
lde_inode = 200449640,
lde_rec_len = 12,
lde_name_len = 1 '\001',
lde_file_type = 2 '\002',
lde_name = ".\000\000\000\001\000\362\v\f\000\002\002..\000\000i\236\362\v\350\017\r\001validate_file\000\000\000j\236\362\v\320\017\001\001\061alidate_file
}
crash> struct ll_dir_entry 0xffff88051bd7300c
struct ll_dir_entry {
lde_inode = 200409089,
lde_rec_len = 12,
lde_name_len = 2 '\002',
lde_file_type = 2 '\002',
lde_name = "..\000\000i\236\362\v\350\017\r\001validate_file\000\000\000j\236\362\v\320\017\001\001\061alidate_file
}
crash> struct ll_dir_entry 0xffff88051bd73018
struct ll_dir_entry {
lde_inode = 200449641,
lde_rec_len = 4072,
lde_name_len = 13 '\r',
lde_file_type = 1 '\001',
lde_name = "validate_file\000\000\000j\236\362\v\320\017\001\001\061alidate_file
}

Although it doesn't make much sense for an application to seek randomly within a directory page, Lustre should handle this situation.
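
One possible way to tolerate such an offset is to walk the page from its start and snap the requested offset to the next record boundary. The sketch below illustrates that idea only; it is not taken from the uploaded patch and may differ from what was merged. sketch_record_boundary() and the SKETCH_* constants are hypothetical names, and rec_len is assumed to sit at byte 4 of each record in host byte order, matching the field order in the dumps above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE      4096u
#define SKETCH_MIN_REC_LEN    12u  /* smallest record: 8-byte header + padded name */
#define SKETCH_REC_LEN_OFFSET 4u   /* rec_len follows the 4-byte inode field */

/*
 * Return the offset of the first record boundary at or after `wanted`,
 * SKETCH_PAGE_SIZE if no record starts there, or -1 if a record shorter
 * than the minimum length (including zero) is found on the way.
 */
static int sketch_record_boundary(const unsigned char *page, unsigned wanted)
{
    unsigned offset = 0;

    assert(page != NULL);
    while (offset + SKETCH_MIN_REC_LEN <= SKETCH_PAGE_SIZE) {
        uint16_t rec_len;

        if (offset >= wanted)
            return (int)offset;
        memcpy(&rec_len, page + offset + SKETCH_REC_LEN_OFFSET, sizeof(rec_len));
        if (rec_len < SKETCH_MIN_REC_LEN)
            return -1;               /* corrupt or zeroed record */
        offset += rec_len;
    }
    return (int)SKETCH_PAGE_SIZE;
}
```

With records at offsets 0, 12, and 24 and record lengths 12, 12, and 4072 as in the dump, the lengths sum to 4096, so a request for offset 90 resolves to the end of the page (no further entries) rather than to bytes inside the third record.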

Comment by Peng Tao [ 25/Jun/12 ]

Patch has been uploaded to http://review.whamcloud.com/#change,3181

Comment by Keith Mannthey (Inactive) [ 28/Nov/12 ]

This patch has been properly acked for acceptance, but not much code is being taken into 1.8 at this point. As this is a Major bug, I am sure it is still under consideration.

Comment by Keith Mannthey (Inactive) [ 04/Jan/13 ]

This code has been merged.
