[LU-1829] filter_destroy_internal()) error unlinking objid after MDS recovery Created: 04/Sep/12  Updated: 07/Jan/16  Resolved: 07/Jan/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.3.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Cliff White (Inactive) Assignee: Hongchao Zhang
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

LLNL/Hyperion RHEL6 servers and clients - Lustre 2.2.94


Severity: 3
Rank (Obsolete): 10100

 Description   

While running recovery-scale, the MDS completes a recovery, then the following error sequence occurs:

Sep  4 11:14:50 hyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0003_UUID now active, resetting orphans
Sep  4 11:25:35 hyperion-dit32 kernel: Lustre: lustre-OST0003: Received new MDS connection from 192.168.127.6@o2ib1, removing former export from same NID

Sep  4 11:26:22 hyperion-dit32 kernel: Lustre: DEBUG MARKER: mds1 has failed over 7 times, and counting...
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: lustre-OST0003: received MDS connection from 192.168.127.6@o2ib1
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(lustre_log.h:474:llog_group_set_export()) lustre-OST0003: export for group 0 is changed: 0xffff88032d63a000 -> 0xffff88032d734000
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(lustre_log.h:474:llog_group_set_export()) Skipped 15 previous similar messages
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff88019a508800 - ffff880196a89800
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(llog_net.c:162:llog_receptor_accept()) Skipped 15 previous similar messages
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: Skipped 6 previous similar messages
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(lustre_log.h:474:llog_group_set_export()) lustre-OST000b: export for group 0 is changed: 0xffff88032d6f5400 -> 0xffff8801d9606800
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(lustre_log.h:474:llog_group_set_export()) Skipped 13 previous similar messages
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff880198e6b800 - ffff8801bd92e800
Sep  4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(llog_net.c:162:llog_receptor_accept()) Skipped 13 previous similar messages
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1627:filter_destroy_internal()) destroying objid 10897 ino 72359988 nlink 0 count 2
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1627:filter_destroy_internal()) Skipped 3 previous similar messages
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1633:filter_destroy_internal()) error unlinking objid 10897: rc -2
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1633:filter_destroy_internal()) Skipped 3 previous similar messages
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1627:filter_destroy_internal()) destroying objid 10833 ino 7929909 nlink 0 count 2
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1627:filter_destroy_internal()) Skipped 2 previous similar messages
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1633:filter_destroy_internal()) error unlinking objid 10833: rc -2
Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1633:filter_destroy_internal()) Skipped 1 previous similar message
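
For triage, the failing object ids and return codes can be pulled out of a console log like the one above. The following is only a minimal sketch: the function name `failed_unlinks` and the regular expression are illustrative assumptions based on the message format seen in this ticket, not part of Lustre or its tooling.

```python
import re

# Pattern derived from the "error unlinking objid" lines in this ticket;
# the exact message format is an assumption and may differ between versions.
UNLINK_ERR = re.compile(
    r"\(filter\.c:\d+:filter_destroy_internal\(\)\) "
    r"error unlinking objid (?P<objid>\d+): rc (?P<rc>-?\d+)"
)


def failed_unlinks(lines):
    """Return (objid, rc) pairs for each 'error unlinking objid' line."""
    pairs = []
    for line in lines:
        match = UNLINK_ERR.search(line)
        if match:
            pairs.append((int(match.group("objid")), int(match.group("rc"))))
    return pairs


if __name__ == "__main__":
    sample = [
        "Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:"
        "(filter.c:1633:filter_destroy_internal()) error unlinking objid 10897: rc -2",
        "Sep  4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:"
        "(filter.c:1633:filter_destroy_internal()) Skipped 3 previous similar messages",
    ]
    print(failed_unlinks(sample))
```

Note that the "Skipped N previous similar messages" lines mean some failures are rate-limited out of the log, so the extracted list is a lower bound on the number of affected objects.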


 Comments   
Comment by Peter Jones [ 04/Sep/12 ]

Hongchao

Could you please look into this one?

Thanks

Peter

Comment by Hongchao Zhang [ 05/Sep/12 ]

The inode had already been unlinked (nlink == 0, and the subsequent error is -2, i.e. -ENOENT) before filter_destroy_internal tried to destroy it.

One possible case is that filter_destroy unlinked the inode, and then filter_destroy_precreated tried to unlink the same object
while the MDT was restoring its connection with this OST.
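
The general effect of two code paths unlinking the same object can be illustrated in user space. This is only an analogy under stated assumptions: it uses an ordinary temporary file and Python's `os.unlink`, not the Lustre filter code, and the helper names `unlink_twice` and `demo` are hypothetical. It shows that the second unlink of an already-removed name fails with ENOENT, which matches the `rc -2` seen in the log.

```python
import errno
import os
import tempfile


def unlink_twice(path):
    """Unlink path twice; return each attempt's result as 0 or -errno."""
    results = []
    for _ in range(2):
        try:
            os.unlink(path)
            results.append(0)
        except OSError as exc:
            # Kernel-style negative return code, e.g. -2 for ENOENT.
            results.append(-exc.errno)
    return results


def demo():
    """Create a scratch file and attempt to remove it twice."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    return unlink_twice(path)


if __name__ == "__main__":
    print(demo())
```

Whether this is what actually happened on the OST cannot be confirmed without the debug log; the sketch only shows why a second unlink of the same object would return -ENOENT.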

Hi Cliff, is the debug log for this issue available?

Comment by Cliff White (Inactive) [ 05/Sep/12 ]

I did not get one; I can re-run the test and collect it.

Comment by Cliff White (Inactive) [ 06/Sep/12 ]

Moved to LU-1872

Comment by Cliff White (Inactive) [ 10/Sep/12 ]

Ran a further 12 hours of recovery-scale and 48 hours of SWL. The error did not reproduce; not sure why.

Comment by Cliff White (Inactive) [ 10/Sep/12 ]

Soft lockup issue moved to LU-1872

Comment by Peter Jones [ 10/Sep/12 ]

Dropping in priority because it does not reproduce

Comment by John Fuchs-Chesney (Inactive) [ 07/Jan/16 ]

Resolving as 'cannot reproduce'
~ jfc.

Generated at Sat Feb 10 01:20:03 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.