[LU-11857] repeated "could not delete orphan [0x200060151:0x38a8:0x0]: rc = -2" messages Created: 14/Jan/19  Updated: 16/Feb/19  Resolved: 16/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jeff Johnson (Inactive) Assignee: Alex Zhuravlev
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File orph_100pct_201901142137.txt.bz2    
Issue Links:
Related
is related to LU-11418 hung threads on MDT and MDT won't umount Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After upgrading from 2.9 to 2.12, the MDS syslog is flooded with lfs-MDD0000 "could not delete orphan" error messages. On the active MDS the orph_lfs-MDD00 thread is pegged at 100% CPU.

[Sat Jan 12 23:21:18 2019] LustreError: 14125:0:(mdd_orphans.c:327:mdd_orphan_destroy()) lfs-MDD0000: could not delete orphan [0x200060151:0x38a8:0x0]: rc = -2
[Sat Jan 12 23:21:18 2019] LustreError: 14125:0:(mdd_orphans.c:327:mdd_orphan_destroy()) Skipped 8067628 previous similar messages
[Sat Jan 12 23:31:18 2019] LustreError: 14125:0:(mdd_orphans.c:327:mdd_orphan_destroy()) lfs-MDD0000: could not delete orphan [0x200060151:0x38a8:0x0]: rc = -2
[Sat Jan 12 23:31:18 2019] LustreError: 14125:0:(mdd_orphans.c:327:mdd_orphan_destroy()) Skipped 7958773 previous similar messages
[Sat Jan 12 23:41:18 2019] LustreError: 14125:0:(mdd_orphans.c:327:mdd_orphan_destroy()) lfs-MDD0000: could not delete orphan [0x200060151:0x38a8:0x0]: rc = -2
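
For context, rc = -2 is -ENOENT: the object named by the orphan entry no longer exists. Below is a minimal, hypothetical C sketch of the pattern the flood suggests (illustrative names only, not the actual mdd_orphans.c code): if the cleanup pass treats -ENOENT as a retryable failure instead of skipping the stale entry, it re-attempts the same FID indefinitely and logs the same message each time.

/*
 * Hypothetical illustration only -- not the Lustre sources. Simulates an
 * orphan-cleanup pass that keeps retrying an entry whose backing object
 * is already gone (-ENOENT / -2), matching the repeated
 * "could not delete orphan ... rc = -2" messages above.
 */
#include <errno.h>
#include <stdio.h>

struct orphan_entry {
        const char *fid;        /* e.g. "[0x200060151:0x38a8:0x0]" */
        int         exists;     /* backing object still present? */
};

/* Mock destroy: fails with -ENOENT once the object is gone. */
static int orphan_destroy(struct orphan_entry *e)
{
        return e->exists ? 0 : -ENOENT;
}

int main(void)
{
        struct orphan_entry stale = { "[0x200060151:0x38a8:0x0]", 0 };
        int rc, attempts = 0;

        /* Buggy pattern: retry instead of skipping the -ENOENT entry.
         * Bounded to 5 attempts here; an unbounded retry like this would
         * keep a thread busy, as reported for orph_lfs-MDD00. */
        while ((rc = orphan_destroy(&stale)) != 0 && attempts < 5) {
                attempts++;
                fprintf(stderr, "could not delete orphan %s: rc = %d\n",
                        stale.fid, rc);
        }
        return 0;
}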


 Comments   
Comment by Peter Jones [ 14/Jan/19 ]

Alex

Could you please investigate?

Thanks

Peter

Comment by Jeff Johnson (Inactive) [ 14/Jan/19 ]

Update:  Performed a full shutdown of the file system and all server systems.  Rebooted and performed orderly start of the file system.

The reported orph_lfs-MDD00 thread at 100% CPU and syslog entries of

LustreError: 109872:0:(mdd_orphans.c:327:mdd_orphan_destroy()) ls15-MDD0000: could not delete orphan [0x200060151:0x38a8:0x0]

continue to occur. After 15 minutes of wall time with the file system targets mounted, the errors persist.

Let me know if you want me to upload another dump of the debug buffer.

Comment by Jeff Johnson (Inactive) [ 14/Jan/19 ]

In 13 minutes the mdd_orphans.c:327:mdd_orphan_destroy message count has climbed from 1,979,094 to 8,056,841 (6,077,747 syslog message repeats in 13 minutes).
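For scale, that works out to roughly 6,077,747 / 780 s ≈ 7,800 suppressed messages per second, i.e. the orphan-destroy path is being retried thousands of times per second against the same FID.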

 

Comment by Jeff Johnson (Inactive) [ 15/Jan/19 ]

After seven hours the orph_lfs-MDD00 thread is still pegged at 100% CPU.

Attaching orph_100pct_201901142137.txt.bz2

Comment by Alex Zhuravlev [ 15/Jan/19 ]

I think https://review.whamcloud.com/#/c/33661/ should help

Comment by Alex Zhuravlev [ 16/Jan/19 ]

aeonjeffj can you please try https://review.whamcloud.com/#/c/33661/ ?

Comment by Jeff Johnson (Inactive) [ 16/Jan/19 ]

I can. This is a production LFS. Given that there is data in place, should I? Not arguing, just applying caution and respect for end user's data.

Comment by Andreas Dilger [ 17/Jan/19 ]

The patch has passed review and testing and is scheduled to land in 2.13 shortly. This should avoid the repeated attempts to destroy the same object.
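
A minimal, hypothetical sketch of the behaviour described here (not the text of the patch at https://review.whamcloud.com/#/c/33661/): a destroy that fails with -ENOENT is treated as "object already gone", so the cleanup pass drops the stale entry and moves on rather than re-attempting the same FID.

#include <errno.h>
#include <stdio.h>

/* Mock destroy for an orphan whose backing object no longer exists. */
static int orphan_destroy(const char *fid)
{
        (void)fid;
        return -ENOENT;
}

int main(void)
{
        const char *fid = "[0x200060151:0x38a8:0x0]";
        int rc = orphan_destroy(fid);

        if (rc == -ENOENT) {
                /* Nothing left to destroy: skip the entry and carry on. */
                printf("orphan %s already gone, skipping\n", fid);
                rc = 0;
        }
        return rc;
}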

Comment by Peter Jones [ 27/Jan/19 ]

So, it seems this is believed to be a duplicate of the recently landed LU-11418 fix. Jeff, if this issue is not currently causing MSU any heartburn, and explicitly trying to prove/disprove the theory would be disruptive, is it enough to close this ticket as a duplicate of LU-11418 and reopen it if the problem is seen on a release that includes the fix (2.13, or an upcoming 2.12.x maintenance release)?

Comment by Peter Jones [ 16/Feb/19 ]

No objections, it seems.
