[LU-1829] filter_destroy_internal()) error unlinking objid after MDS recovery Created: 04/Sep/12 Updated: 07/Jan/16 Resolved: 07/Jan/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Cliff White (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
LLNL/Hyperion RHEL6 servers and clients - Lustre 2.2.94 |
||
| Severity: | 3 |
| Rank (Obsolete): | 10100 |
| Description |
|
Running recovery-scale, MDS completes a recovery, then error sequence occurs:

Sep 4 11:14:50 hyperion-rst6 kernel: Lustre: MDS mdd_obd-lustre-MDT0000: lustre-OST0003_UUID now active, resetting orphans
Sep 4 11:25:35 hyperion-dit32 kernel: Lustre: lustre-OST0003: Received new MDS connection from 192.168.127.6@o2ib1, removing former export from same NID
Sep 4 11:26:22 hyperion-dit32 kernel: Lustre: DEBUG MARKER: mds1 has failed over 7 times, and counting...
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: lustre-OST0003: received MDS connection from 192.168.127.6@o2ib1
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(lustre_log.h:474:llog_group_set_export()) lustre-OST0003: export for group 0 is changed: 0xffff88032d63a000 -> 0xffff88032d734000
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(lustre_log.h:474:llog_group_set_export()) Skipped 15 previous similar messages
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff88019a508800 - ffff880196a89800
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6082:0:(llog_net.c:162:llog_receptor_accept()) Skipped 15 previous similar messages
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: Skipped 6 previous similar messages
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(lustre_log.h:474:llog_group_set_export()) lustre-OST000b: export for group 0 is changed: 0xffff88032d6f5400 -> 0xffff8801d9606800
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(lustre_log.h:474:llog_group_set_export()) Skipped 13 previous similar messages
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(llog_net.c:162:llog_receptor_accept()) changing the import ffff880198e6b800 - ffff8801bd92e800
Sep 4 11:26:34 hyperion-dit32 kernel: Lustre: 6070:0:(llog_net.c:162:llog_receptor_accept()) Skipped 13 previous similar messages
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1627:filter_destroy_internal()) destroying objid 10897 ino 72359988 nlink 0 count 2
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1627:filter_destroy_internal()) Skipped 3 previous similar messages
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1633:filter_destroy_internal()) error unlinking objid 10897: rc -2
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 12031:0:(filter.c:1633:filter_destroy_internal()) Skipped 3 previous similar messages
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1627:filter_destroy_internal()) destroying objid 10833 ino 7929909 nlink 0 count 2
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1627:filter_destroy_internal()) Skipped 2 previous similar messages
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1633:filter_destroy_internal()) error unlinking objid 10833: rc -2
Sep 4 11:26:37 hyperion-dit32 kernel: LustreError: 13471:0:(filter.c:1633:filter_destroy_internal()) Skipped 1 previous similar message |
| Comments |
| Comment by Peter Jones [ 04/Sep/12 ] |
|
Hongchao,

Could you please look into this one?

Thanks,
Peter |
| Comment by Hongchao Zhang [ 05/Sep/12 ] |
|
The inode had already been unlinked (nlink == 0, and the error that follows is -2/-ENOENT) before filter_destroy_internal destroyed it. One possible case is that filter_destroy unlinked the inode, and then filter_destroy_precreated tried to unlink the same one.

Hi Cliff, is the debug log for this issue available? |
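The "rc -2" reading above can be illustrated outside Lustre. The following is only a user-space sketch, assuming an ordinary temporary file as a stand-in for the OST object; `unlink_rc` is a hypothetical helper, not a Lustre function, and it does not reproduce the filter code paths. It shows that a second unlink of an already-removed file fails with ENOENT, which in kernel-style return codes is -2.

```python
import errno
import os
import tempfile


def unlink_rc(path):
    """Unlink path; return 0 on success or a negative errno, kernel-style.

    Hypothetical helper for illustration only, not part of Lustre.
    """
    try:
        os.unlink(path)
        return 0
    except OSError as exc:
        return -exc.errno


# Assumption: a plain temporary file stands in for the object being destroyed.
fd, path = tempfile.mkstemp()
os.close(fd)

first_rc = unlink_rc(path)   # first destroy path removes the object
second_rc = unlink_rc(path)  # second path targets the same, now missing, object

print("first unlink rc:", first_rc)
print("second unlink rc:", second_rc, errno.errorcode[-second_rc])
```

If two destroy paths race on the same object as suggested, whichever runs second would be expected to see this kind of failure; the debug log would be needed to confirm that this is what actually happened here.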
| Comment by Cliff White (Inactive) [ 05/Sep/12 ] |
|
I did not get one; I can re-run the test and capture it. |
| Comment by Cliff White (Inactive) [ 06/Sep/12 ] |
|
Moved to |
| Comment by Cliff White (Inactive) [ 10/Sep/12 ] |
|
Ran a further 12 hours of recovery-scale, and 48 hours of SWL. Error did not reproduce, not sure why. |
| Comment by Cliff White (Inactive) [ 10/Sep/12 ] |
|
Soft lockup issue moved to LU-1872 |
| Comment by Peter Jones [ 10/Sep/12 ] |
|
Dropping in priority because it does not reproduce |
| Comment by John Fuchs-Chesney (Inactive) [ 07/Jan/16 ] |
|
Resolving as 'cannot reproduce' |