[LU-6712] Hyperion IO error revalidate FID [0x20000040c:0x1f:0x0] error: rc = -5 Created: 11/Jun/15  Updated: 10/Oct/21  Resolved: 10/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Cliff White (Inactive) Assignee: Di Wang
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Hyperion SWL test


Attachments: fid.list.txt, iwc34.fini.txt.gz, iws10.fldb.txt, iws12.fldb.txt, iws13.fldb.txt, iws14.fldb.txt, iws15.fldb.txt, iws16.fldb.txt
Severity: 3

 Description   

While setting up for the SWL test, two of the clients hit this issue:
tests fail with an I/O error when creating or listing directories.
Example:

# ls /p/l_wham/white215/
ls: cannot access /p/l_wham/white215/mybob: Input/output error
ls: cannot access /p/l_wham/white215/SWL: Input/output error
SWL  foo  mybob

Client errors:

LustreError: 47371:0:(file.c:3081:ll_inode_revalidate_fini()) lustre: revalidate FID [0x20000040c:0x1f:0x0] error: rc = -5
LustreError: 47376:0:(file.c:3081:ll_inode_revalidate_fini()) lustre: revalidate FID [0x20000040c:0x1f:0x0] error: rc = -5
LustreError: 47376:0:(file.c:3081:ll_inode_revalidate_fini()) Skipped 1 previous similar message
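
For readers decoding these messages: the FID is printed as [sequence:object id:version] in hex, and rc is a negated errno value, so rc = -5 is EIO, the same "Input/output error" that ls reports. The following is a minimal sketch, not part of Lustre; the helper names parse_fid and parse_rc are hypothetical and only illustrate how the line breaks down.

```python
import errno
import re

# A FID is printed as [seq:oid:ver], each field in hex.
FID_RE = re.compile(r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]")
RC_RE = re.compile(r"rc = (-?\d+)")


def parse_fid(text):
    """Return (seq, oid, ver) as integers for the first FID found in text."""
    match = FID_RE.search(text)
    if match is None:
        raise ValueError("no FID found in: %r" % text)
    return tuple(int(field, 16) for field in match.groups())


def parse_rc(text):
    """Return the integer return code from an 'rc = N' suffix."""
    match = RC_RE.search(text)
    if match is None:
        raise ValueError("no rc found in: %r" % text)
    return int(match.group(1))


line = ("LustreError: 47371:0:(file.c:3081:ll_inode_revalidate_fini()) "
        "lustre: revalidate FID [0x20000040c:0x1f:0x0] error: rc = -5")

seq, oid, ver = parse_fid(line)
rc = parse_rc(line)
# Kernel code returns negated errno values, so look up the symbolic name of -rc.
print("seq=%#x oid=%#x ver=%#x rc=%d (%s)" % (seq, oid, ver, rc, errno.errorcode[-rc]))
```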

Lustre dump from one client attached. There do not appear to be any errors on the MDS.
Remounting the client appears to clear the issue.



 Comments   
Comment by Di Wang [ 12/Jun/15 ]

It looks like the FLD cache on the client has a problem.

The client allocates the FID on MDT000a:

40000000:00000040:3.0:1434056723.090873:0:55019:0:(fid_request.c:382:seq_client_alloc_fid()) cli-cli-lustre-MDT000a-mdc-ffff88085864d400: Allocated FID [0x114000040e:0x14:0x0]

Then, when it looks up this FID in the FLD cache, it gets MDT000b instead:

00800000:00000002:3.0:1434056723.090888:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #b for fid=[0x114000040e:0x14:0x0]
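
The cross-check between the two log lines above can be sketched as a small script. This is only an illustration under stated assumptions: the regular expressions are written against the two message formats quoted in this comment, the MDT index in the target name (MDT000a) is taken to be hex, and find_mismatches is a hypothetical helper, not a Lustre tool.

```python
import re

# Matches e.g. "...-MDT000a-mdc-ffff88085864d400: Allocated FID [0x...:0x...:0x...]"
ALLOC_RE = re.compile(r"-MDT([0-9a-fA-F]{4})-mdc-\S+: Allocated FID (\[[^\]]+\])")
# Matches e.g. "FLD lookup got mds #b for fid=[0x...:0x...:0x...]"
LOOKUP_RE = re.compile(r"FLD lookup got mds #([0-9a-fA-F]+) for fid=(\[[^\]]+\])")


def find_mismatches(lines):
    """Return (fid, allocating_mdt, looked_up_mdt) for FIDs whose FLD lookup
    disagrees with the MDT that allocated them."""
    allocated = {}  # FID string -> index of the MDT whose sequence it came from
    mismatches = []
    for line in lines:
        match = ALLOC_RE.search(line)
        if match:
            allocated[match.group(2)] = int(match.group(1), 16)
            continue
        match = LOOKUP_RE.search(line)
        if match:
            fid = match.group(2)
            looked_up = int(match.group(1), 16)
            if fid in allocated and allocated[fid] != looked_up:
                mismatches.append((fid, allocated[fid], looked_up))
    return mismatches


log_lines = [
    "40000000:00000040:3.0:1434056723.090873:0:55019:0:"
    "(fid_request.c:382:seq_client_alloc_fid()) "
    "cli-cli-lustre-MDT000a-mdc-ffff88085864d400: Allocated FID [0x114000040e:0x14:0x0]",
    "00800000:00000002:3.0:1434056723.090888:0:55019:0:"
    "(lmv_fld.c:79:lmv_fld_lookup()) "
    "FLD lookup got mds #b for fid=[0x114000040e:0x14:0x0]",
]

for fid, alloc_mdt, lookup_mdt in find_mismatches(log_lines):
    print("%s allocated on MDT%04x but FLD lookup returned MDT%04x"
          % (fid, alloc_mdt, lookup_mdt))
```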

Cliff: Could you please do

lctl get_param fld.*MDT0000.fldb

on MDT0000 and post the result here. Thanks.

Comment by Cliff White (Inactive) [ 12/Jun/15 ]

Results of "lctl get_param fld.*MDT0000.fldb" on Hyperion MDT0

Comment by Di Wang [ 13/Jun/15 ]

According to the debug log and the FLDB (MDT0), the FLDB cache on the client side is clearly corrupted.

Client side cache

iwc34.fini.txt:00800000:00000002:3.0:1434056723.090390:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #0 for fid=[0x200000400:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090402:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #1 for fid=[0x1180000409:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090412:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #3 for fid=[0x1040000403:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090421:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #4 for fid=[0xf80000406:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090430:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #5 for fid=[0x1100000409:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090440:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #a for fid=[0xfc0000405:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090449:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #6 for fid=[0xf40000401:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090458:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #7 for fid=[0x11c0000402:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090467:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #9 for fid=[0x10c0000406:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090476:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #2 for fid=[0x1000000404:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090485:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #b for fid=[0x1140000407:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090494:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #8 for fid=[0x1080000402:0x84f:0x0]
iwc34.fini.txt:00800000:00000002:3.0:1434056723.090888:0:55019:0:(lmv_fld.c:79:lmv_fld_lookup()) FLD lookup got mds #b for fid=[0x114000040e:0x14:0x0]

On the server side

[0x0000000f40000400-0x0000000f80000400):6:mdt
[0x0000000f80000400-0x0000000fc0000400):1:mdt
[0x0000000fc0000400-0x0000001000000400):7:mdt
[0x0000001000000400-0x0000001040000400):2:mdt
[0x0000001040000400-0x0000001080000400):8:mdt
[0x0000001080000400-0x00000010c0000400):3:mdt
[0x00000010c0000400-0x0000001100000400):9:mdt
[0x0000001100000400-0x0000001140000400):4:mdt
[0x0000001140000400-0x0000001180000400):a:mdt
[0x0000001180000400-0x00000011c0000400):5:mdt
[0x00000011c0000400-0x0000001200000400):b:mdt
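
One way to compare the two listings above is to look up each client-side sequence in the server-side ranges. The sketch below does that with the values quoted in this comment. It assumes each FLDB entry reads [start-end):index:mdt with a half-open range and a hex index (the entries use "a" and "b"); parse_fldb, server_lookup and compare are hypothetical helper names.

```python
import re

# Server-side FLDB entries as quoted above: [start-end):index:mdt
FLDB_TEXT = """
[0x0000000f40000400-0x0000000f80000400):6:mdt
[0x0000000f80000400-0x0000000fc0000400):1:mdt
[0x0000000fc0000400-0x0000001000000400):7:mdt
[0x0000001000000400-0x0000001040000400):2:mdt
[0x0000001040000400-0x0000001080000400):8:mdt
[0x0000001080000400-0x00000010c0000400):3:mdt
[0x00000010c0000400-0x0000001100000400):9:mdt
[0x0000001100000400-0x0000001140000400):4:mdt
[0x0000001140000400-0x0000001180000400):a:mdt
[0x0000001180000400-0x00000011c0000400):5:mdt
[0x00000011c0000400-0x0000001200000400):b:mdt
"""

# (sequence, MDT index the client's FLD lookup returned), from the client log.
CLIENT_LOOKUPS = [
    (0x200000400, 0x0), (0x1180000409, 0x1), (0x1040000403, 0x3),
    (0xf80000406, 0x4), (0x1100000409, 0x5), (0xfc0000405, 0xa),
    (0xf40000401, 0x6), (0x11c0000402, 0x7), (0x10c0000406, 0x9),
    (0x1000000404, 0x2), (0x1140000407, 0xb), (0x1080000402, 0x8),
    (0x114000040e, 0xb),
]

RANGE_RE = re.compile(r"\[(0x[0-9a-f]+)-(0x[0-9a-f]+)\):([0-9a-f]+):mdt")


def parse_fldb(text):
    """Return a sorted list of (start, end, mdt_index); end is exclusive."""
    return sorted(
        (int(m.group(1), 16), int(m.group(2), 16), int(m.group(3), 16))
        for m in RANGE_RE.finditer(text)
    )


def server_lookup(ranges, seq):
    """Return the MDT index owning seq, or None if no quoted range covers it."""
    for start, end, index in ranges:
        if start <= seq < end:
            return index
    return None


def compare(ranges, client_lookups):
    """Return (seq, client_index, server_index) for every client lookup."""
    return [(seq, idx, server_lookup(ranges, seq)) for seq, idx in client_lookups]


RANGES = parse_fldb(FLDB_TEXT)
for seq, client_index, server_index in compare(RANGES, CLIENT_LOOKUPS):
    if server_index is None:
        verdict = "not covered by the quoted FLDB excerpt"
    elif server_index == client_index:
        verdict = "match"
    else:
        verdict = "MISMATCH (server says %x)" % server_index
    print("seq %#x: client says %x, %s" % (seq, client_index, verdict))
```

The sequence of the failing FID, 0x114000040e, falls in the range owned by MDT index a on the server, while the client lookup returned b.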

Unfortunately, I cannot find the problem by checking the code.

Cliff: Could you please tell me
1. Is there anything special about the Hyperion clients? There is only 1 filesystem, right?
2. Could you please run "lctl get_param fld.*.fldb" on all of the MDSes and post the results here?
3. Was there any recovery during the SWL tests? Or, if the SWL tests were not running, what did you run before this happened?

Comment by Cliff White (Inactive) [ 15/Jun/15 ]

Same data from all MDS
