[LU-8967] directory entries for non existing files Created: 23/Dec/16 Updated: 10/Aug/17 Resolved: 27/Feb/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Ned Bass | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl | ||
| Environment: |
ssh://review.whamcloud.com/fs/lustre-release-fe-llnl |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
We have several directories with entries for non existing files. For example:

[root@quartz2311:~]# ls -l /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0
ls: cannot access /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0/filler.003: No such file or directory
total 3154
-rw------- 1 casses1 casses1 1048576 Dec 21 16:43 filler.000
-rw------- 1 casses1 casses1 1048576 Dec 21 16:43 filler.001
-rw------- 1 casses1 casses1 1048576 Dec 21 16:43 filler.002
-????????? ? ? ? ? ? filler.003
drwx------ 2 casses1 casses1 25600 Dec 21 16:43 ~dmtmp

The directory itself is a remote directory on one MDT:

[root@quartz2311:~]# lfs getdirstripe -d /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0
lmv_stripe_count: 0 lmv_stripe_offset: 3

We are able to get striping information for this file:

[root@quartz2311:~]# lfs getstripe /p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0/filler.003
/p/lscratchh/casses1/quartz-zinc_3/19519/dbench/quartz2322/clients/client0/filler.003
lmm_stripe_count: 1
lmm_stripe_size: 1048576
lmm_pattern: 1
lmm_layout_gen: 0
lmm_stripe_offset: 27
obdidx objid objid group
27 20538776 0x1396598 0xcc0000402
It looks like the OSS serving that OST was rebooted and the OST went through recovery around the time the missing file was created. In particular, we note that the object number falls in the range of orphan objects that were deleted:

[root@zinci:~]# grep 0xcc0000402 /var/log/conman/console.zinc*
/var/log/conman/console.zinc43:2016-12-21 16:30:56 [189484.767900] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538706 to 0xcc0000402:20541649
/var/log/conman/console.zinc43:2016-12-21 16:33:30 [189639.110247] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
/var/log/conman/console.zinc43:2016-12-21 16:35:41 [189769.704490] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
/var/log/conman/console.zinc43:2016-12-21 16:40:19 [190047.449320] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649
/var/log/conman/console.zinc43:2016-12-21 16:44:45 [190313.751155] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538820 to 0xcc0000402:20541649
/var/log/conman/console.zinc44:2016-12-21 16:49:27 [ 159.838420] Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538820 to 0xcc0000402:20541649

I will attach server console logs separately. |
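The range check described above can be done mechanically. The following is a minimal sketch, not Lustre tooling: `orphan_ranges` and `covering_ranges` are hypothetical helper names, the regular expression is derived only from the console messages quoted in this ticket, and treating both range endpoints as inclusive is an assumption.

```python
import re

# Matches the OST console message quoted in the description, e.g.
# "deleting orphan objects from 0xcc0000402:20538706 to 0xcc0000402:20541649".
ORPHAN_RE = re.compile(
    r"deleting orphan objects from (0x[0-9a-fA-F]+):(\d+) to (0x[0-9a-fA-F]+):(\d+)"
)


def orphan_ranges(log_lines):
    """Yield (sequence, first_objid, last_objid) for each orphan-cleanup message."""
    for line in log_lines:
        m = ORPHAN_RE.search(line)
        # Only accept messages whose start and end share the same sequence/group.
        if m and int(m.group(1), 16) == int(m.group(3), 16):
            yield int(m.group(1), 16), int(m.group(2)), int(m.group(4))


def covering_ranges(seq, objid, log_lines):
    """Return the (first, last) ranges in `seq` that contain `objid`.

    Endpoints are treated as inclusive, which is an assumption.
    """
    return [
        (first, last)
        for s, first, last in orphan_ranges(log_lines)
        if s == seq and first <= objid <= last
    ]


# Three of the console lines from the description, shortened to the message text.
sample = [
    "Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538706 to 0xcc0000402:20541649",
    "Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538766 to 0xcc0000402:20541649",
    "Lustre: lsh-OST001b: deleting orphan objects from 0xcc0000402:20538820 to 0xcc0000402:20541649",
]

# objid and group taken from the `lfs getstripe` output for filler.003.
hits = covering_ranges(0xCC0000402, int("0x1396598", 16), sample)
print(hits)
```

With the three sample lines, object 20538776 is covered by the ranges starting at 20538706 and 20538766, but not by the later range starting at 20538820.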
| Comments |
| Comment by Ned Bass [ 23/Dec/16 ] |
|
Also please note that we first observed this problem after our most recent Lustre update to 2.8.0_6chaos last Friday (December 16). The patches added in that update were:

* 353716b (tag: 2.8.0_6.chaos, llnl/2.8.0-llnl) LU-8753 llog: add some debug patch
* 17d469a LU-8936 llite: use percpu env correctly in ll_invalidatepage
* a15b2ef LU-8361 lfsck: detect Lustre device automatically
* 0220e0b LU-7648 man: new man pages for LFSCK commands
* 1638a07 LU-7256 tests: wait current LFSCK to exit before next test
* 1d8cfaa LU-8407 recovery: more clear message about recovery failure
* fdea0d2 LU-7732 ldlm: silence verbose "waking for gap" log messages
* 82e924c LU-8753 llog: remove lgh_write_offset
* 3a8db9a LU-8493 osp: Do not set stale for new osp obj
* 38c062b LU-7660 dne: support fs default stripe
* bc3df36 Revert "LU-8422 update: add more debug info for the ticket"
* 10170a0 Revert "LU-8422 llog: extended debug info"
* 490414a Revert "LU-6635 lfsck: more debug message for sanity-lfsck test_18e" |
| Comment by Peter Jones [ 23/Dec/16 ] |
|
Mike,

Could you please assist with this issue?

Thanks,
Peter |
| Comment by Ned Bass [ 27/Dec/16 ] |
|
I suspect this is related to |
| Comment by Mikhail Pershin [ 30/Dec/16 ] |
|
Ned, did these entries appear only once, when the OST failed over, or do they still continue to occur? Is it possible to remove them? I am checking the patches you've mentioned. |
| Comment by Ned Bass [ 30/Dec/16 ] |
|
Hi Mikhail,

Each occurrence that I've investigated happened immediately after the OST completed recovery. The object numbers of the missing files all fall at the beginning of the range of deleted orphans. It does not continue to occur when all OSTs are up.

I can remove the files as root. The rm command fails for an unprivileged user because stat() returns ENOENT and rm treats that as fatal unless you're root.

I have confirmed that I can reproduce
|
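The symptom described in this comment (a name that appears in the directory listing while stat() on it returns ENOENT) can be checked for from a client. The following is a minimal sketch assuming only standard POSIX behavior; `dangling_entries` is a hypothetical helper, not part of Lustre or of the tooling used in this ticket.

```python
import errno
import os


def dangling_entries(directory):
    """Return names listed in `directory` whose lstat() fails with ENOENT.

    lstat() is used so that ordinary broken symlinks are not reported;
    only names that the directory lists but that cannot be stat'ed at all count.
    """
    missing = []
    for name in sorted(os.listdir(directory)):
        try:
            os.lstat(os.path.join(directory, name))
        except OSError as exc:
            if exc.errno == errno.ENOENT:
                missing.append(name)
            else:
                # Permission problems and other errors are not this symptom.
                raise
    return missing
```

On a healthy directory this returns an empty list. On the directory shown in the description it would be expected to report filler.003, although that has not been verified here.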
| Comment by Mikhail Pershin [ 09/Jan/17 ] |
|
Ned, so this issue is solved by Interesting that |
| Comment by Ned Bass [ 09/Jan/17 ] |
I have tested the The remaining work to do in that area is as follows.
My best guess as to why we started seeing |
| Comment by Alex Zhuravlev [ 10/Jan/17 ] |
|
with |
| Comment by Alex Zhuravlev [ 23/Jan/17 ] |
|
A prototype is under testing; I'm going to pass it through Maloo a few more times.
| Comment by Peter Jones [ 01/Feb/17 ] |
|
This is now confirmed as a duplicate of |
| Comment by Peter Jones [ 27/Feb/17 ] |
|
AFAIK items tracked under this ticket are complete |