[LU-17170] Likely at unlink: many LustreError: mdt_open.c:1217:mdt_cross_open() fsname-MDTxxxx: [FID] doesn't exist!: rc = -14
Created: 05/Oct/23  Updated: 24/Oct/23  Resolved: 06/Oct/23

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.3
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Stephane Thiell
Assignee: WC Triage
Resolution: Not a Bug
Votes: 0
Labels: None
Environment:

CentOS 7.9 kernel 3.10.0-1160.90.1.el7_lustre.pl1.x86_64


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

With 2.15.3 on Sherlock's scratch filesystem (Fir), we are seeing a LOT of the following messages on all four MDTs when files are being purged by Robinhood:

# clush -w @mds -L "journalctl -n 10 -k | grep LustreError"
fir-md1-s1: Oct 05 15:30:56 fir-md1-s1 kernel: LustreError: 32843:0:(mdt_open.c:1570:mdt_reint_open()) fir-MDT0000: name '[0x20005b5ae:0x14198:0x0]' present, but FID [0x20005b5ae:0x14198:0x0] is invalid
fir-md1-s1: Oct 05 15:31:45 fir-md1-s1 kernel: LustreError: 51313:0:(mdt_open.c:1570:mdt_reint_open()) fir-MDT0000: name '[0x20005b5b5:0x1dd34:0x0]' present, but FID [0x20005b5b5:0x1dd34:0x0] is invalid
fir-md1-s1: Oct 05 15:33:22 fir-md1-s1 kernel: LustreError: 32959:0:(mdt_open.c:1570:mdt_reint_open()) fir-MDT0000: name '[0x20005b5cf:0xff79:0x0]' present, but FID [0x20005b5cf:0xff79:0x0] is invalid
fir-md1-s2: Oct 05 15:35:57 fir-md1-s2 kernel: LustreError: 125135:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0001: [0x24007e440:0x83be:0x0] doesn't exist!: rc = -14
fir-md1-s2: Oct 05 15:35:57 fir-md1-s2 kernel: LustreError: 125135:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 605 previous similar messages
fir-md1-s2: Oct 05 15:36:06 fir-md1-s2 kernel: LustreError: 125409:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0001: [0x24007e440:0x88ad:0x0] doesn't exist!: rc = -14
fir-md1-s2: Oct 05 15:36:06 fir-md1-s2 kernel: LustreError: 125409:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 1256 previous similar messages
fir-md1-s2: Oct 05 15:36:25 fir-md1-s2 kernel: LustreError: 125341:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0001: [0x24007e440:0x92bd:0x0] doesn't exist!: rc = -14
fir-md1-s2: Oct 05 15:36:25 fir-md1-s2 kernel: LustreError: 125341:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 3743 previous similar messages
fir-md1-s2: Oct 05 15:37:03 fir-md1-s2 kernel: LustreError: 125341:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0001: [0x24007e50d:0x15e22:0x0] doesn't exist!: rc = -14
fir-md1-s2: Oct 05 15:37:03 fir-md1-s2 kernel: LustreError: 125341:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 8438 previous similar messages
fir-md1-s2: Oct 05 15:38:18 fir-md1-s2 kernel: LustreError: 125341:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0001: [0x24007e50e:0x13804:0x0] doesn't exist!: rc = -14
fir-md1-s2: Oct 05 15:38:18 fir-md1-s2 kernel: LustreError: 125341:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 16783 previous similar messages
fir-md1-s3: Oct 05 15:01:52 fir-md1-s3 kernel: LustreError: 14993:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0002: [0x2c006c67d:0x2a0b:0x0] doesn't exist!: rc = -14
fir-md1-s3: Oct 05 15:01:52 fir-md1-s3 kernel: LustreError: 14993:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 18907 previous similar messages
fir-md1-s3: Oct 05 15:17:31 fir-md1-s3 kernel: LustreError: 12198:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0002: [0x2c006c67d:0x2950:0x0] doesn't exist!: rc = -14
fir-md1-s3: Oct 05 15:17:31 fir-md1-s3 kernel: LustreError: 12198:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 19208 previous similar messages
fir-md1-s3: Oct 05 15:46:14 fir-md1-s3 kernel: LustreError: 65665:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0002: [0x2c006c606:0x524d:0x0] doesn't exist!: rc = -14
fir-md1-s3: Oct 05 15:46:14 fir-md1-s3 kernel: LustreError: 65665:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 49094 previous similar messages
fir-md1-s3: Oct 05 15:47:29 fir-md1-s3 kernel: LustreError: 12352:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0002: [0x2c006c65c:0x145df:0x0] doesn't exist!: rc = -14
fir-md1-s3: Oct 05 15:47:29 fir-md1-s3 kernel: LustreError: 12352:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 12772 previous similar messages
fir-md1-s3: Oct 05 15:49:59 fir-md1-s3 kernel: LustreError: 14987:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0002: [0x2c006c710:0x15304:0x0] doesn't exist!: rc = -14
fir-md1-s3: Oct 05 15:49:59 fir-md1-s3 kernel: LustreError: 14987:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 32807 previous similar messages
fir-md1-s4: Oct 05 15:39:54 fir-md1-s4 kernel: LustreError: 23103:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0003: [0x280067e5f:0x1c1ab:0x0] doesn't exist!: rc = -14
fir-md1-s4: Oct 05 15:39:54 fir-md1-s4 kernel: LustreError: 23103:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 19686 previous similar messages
fir-md1-s4: Oct 05 15:40:10 fir-md1-s4 kernel: LustreError: 23395:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0003: [0x28006d889:0x18767:0x0] doesn't exist!: rc = -14
fir-md1-s4: Oct 05 15:40:10 fir-md1-s4 kernel: LustreError: 23395:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 2687 previous similar messages
fir-md1-s4: Oct 05 15:40:42 fir-md1-s4 kernel: LustreError: 23445:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0003: [0x28006d889:0x195c0:0x0] doesn't exist!: rc = -14
fir-md1-s4: Oct 05 15:40:42 fir-md1-s4 kernel: LustreError: 23445:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 6453 previous similar messages
fir-md1-s4: Oct 05 15:41:46 fir-md1-s4 kernel: LustreError: 23017:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0003: [0x28006d889:0x1cf16:0x0] doesn't exist!: rc = -14
fir-md1-s4: Oct 05 15:41:46 fir-md1-s4 kernel: LustreError: 23017:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 15651 previous similar messages
fir-md1-s4: Oct 05 15:43:54 fir-md1-s4 kernel: LustreError: 23367:0:(mdt_open.c:1217:mdt_cross_open()) fir-MDT0003: [0x28006daa0:0xd855:0x0] doesn't exist!: rc = -14
fir-md1-s4: Oct 05 15:43:54 fir-md1-s4 kernel: LustreError: 23367:0:(mdt_open.c:1217:mdt_cross_open()) Skipped 23918 previous similar messages

However, these errors seem to be harmless; at least, we have not been able to find any problem so far. We have verified that those FIDs belong to files being automatically unlinked by Robinhood (we purge after 90 days), and that the LustreError messages occur in the same second as the unlink.
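To cross-check the reported FIDs against a purge log, the kernel messages can be reduced to a per-MDT list of FIDs plus a total of rate-limited ("Skipped N") messages. The following is only a minimal sketch based on the two message shapes visible in the excerpt above; the function name `summarize` and the regular expressions are illustrative assumptions, not part of Lustre or Robinhood.

```python
import re

# Patterns derived from the log excerpt above (assumed, not an official format).
MISSING_FID = re.compile(
    r"mdt_cross_open\(\)\) (?P<mdt>\S+-MDT[0-9a-fA-F]{4}): "
    r"(?P<fid>\[0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+\]) "
    r"doesn't exist!: rc = (?P<rc>-?\d+)"
)
SKIPPED = re.compile(
    r"mdt_cross_open\(\)\) Skipped (?P<n>\d+) previous similar messages"
)


def summarize(lines):
    """Return ({mdt_name: [fid, ...]}, total_skipped) for mdt_cross_open errors."""
    fids_by_mdt = {}
    skipped = 0
    for line in lines:
        missing = MISSING_FID.search(line)
        if missing:
            fids_by_mdt.setdefault(missing.group("mdt"), []).append(missing.group("fid"))
            continue
        rate_limited = SKIPPED.search(line)
        if rate_limited:
            # The kernel only prints a sample; the rest are counted here.
            skipped += int(rate_limited.group("n"))
    return fids_by_mdt, skipped


# Two lines copied from the excerpt above.
sample = [
    "fir-md1-s2: Oct 05 15:35:57 fir-md1-s2 kernel: LustreError: 125135:0:"
    "(mdt_open.c:1217:mdt_cross_open()) fir-MDT0001: "
    "[0x24007e440:0x83be:0x0] doesn't exist!: rc = -14",
    "fir-md1-s2: Oct 05 15:35:57 fir-md1-s2 kernel: LustreError: 125135:0:"
    "(mdt_open.c:1217:mdt_cross_open()) Skipped 605 previous similar messages",
]
fids_by_mdt, skipped = summarize(sample)
```

Because the printed FIDs are only a sample of the rate-limited stream, the skipped total gives a better sense of the error rate than the number of distinct FIDs logged.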

 



 Comments   
Comment by Stephane Thiell [ 06/Oct/23 ]

I am going to close this, as it is not a Lustre issue. We had a misconfiguration where multiple Robinhood instances were not distributed correctly and were deleting the same set of files at the same time (at a very high rate). Lustre was a bit verbose in that case but reported useful information. When a deleted file is accessed by FID as root (which the program we use with Robinhood does), the error returned is "Bad address" (-14) rather than "No such file or directory" (-2).

[root@fir-rbh06 robinhood]# cat '/fir/.lustre/fid/[0x28006db3c:0x9bbb:0x0]'
cat: /fir/.lustre/fid/[0x28006db3c:0x9bbb:0x0]: Bad address
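
A tool that opens files by FID may therefore want to treat both error codes as "already deleted". The sketch below illustrates that idea under the assumption, drawn only from the observation above, that EFAULT (14) on a `.lustre/fid` path means the file is gone; the helper names `fid_path`, `is_gone`, and `open_by_fid` are hypothetical and do not come from Robinhood or Lustre.

```python
import errno

# Assumption from the comment above: when opening by FID as root, a deleted
# file may yield EFAULT (14, "Bad address") instead of ENOENT (2).
GONE_ERRNOS = {errno.ENOENT, errno.EFAULT}


def fid_path(mount, fid):
    """Build the open-by-FID path, e.g. /fir/.lustre/fid/[0x...:0x...:0x0]."""
    return f"{mount}/.lustre/fid/{fid}"


def is_gone(exc):
    """True if the exception looks like 'file no longer exists'."""
    return isinstance(exc, OSError) and exc.errno in GONE_ERRNOS


def open_by_fid(mount, fid):
    """Open a file by FID; return None if it appears to be already deleted."""
    try:
        return open(fid_path(mount, fid), "rb")
    except OSError as exc:
        if is_gone(exc):
            return None
        raise  # any other error (EACCES, EIO, ...) is still reported
```

Treating EFAULT as "gone" is a heuristic: the same code could in principle signal a different problem, so a caller that needs certainty could log such cases separately rather than silently skipping them.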
Comment by Gerrit Updater [ 11/Oct/23 ]

"Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52630
Subject: LU-17170 tests: check the system is clean
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f3f2332b4621052652a9d0f986e5ff55c94ba9ad

Comment by Sergey Cheremencev [ 24/Oct/23 ]

 "Sergey Cheremencev <scherementsev@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/52630

Placed here accidentally. The patch is intended for LU-17179.
