[LU-4395] files created with non-existent objects Created: 18/Dec/13 Updated: 29/Jan/14 Resolved: 29/Jan/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Kit Westneat (Inactive) | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12059 |
| Description |
|
KIT has run into an issue where the MDT is creating files with objects that do not exist. Some of the symptoms look similar to

On client:

[root@client scc]# touch tmp/gaga
touch: setting times of `tmp/gaga': No such file or directory
[root@client scc]# lfs getstripe tmp/gaga
tmp/gaga
lmm_stripe_count:   4
lmm_stripe_size:    1048576
lmm_layout_gen:     0
lmm_stripe_offset:  5
        obdidx           objid           objid           group
             5        65948624       0x3ee4bd0               0
            25        66739551       0x3fa5d5f               0
             9        65922640       0x3ede650               0
            24        66084357       0x3f05e05               0

LustreError: 11-0: HC3WORK-OST0005-osc-ffff8804987dec00: Communicating with 172.26.3.138@o2ib, operation ldlm_enqueue failed with -12.

[root@mds2 perftest]# ls -al
ls: cannot access eaea: Cannot allocate memory
ls: cannot access gaga: Cannot allocate memory
total 12
drwxr-xr-x  3 er2341 scc  4096 Dec 12 21:40 .
drwx------ 10 er2341 scc  4096 Sep 19 16:16 ..
-rw-r--r--  1 root   root    0 Dec 12 21:41 e
-?????????  ? ?      ?       ?            ? eaea
-rw-r--r--  1 root   root    0 Dec 12 21:41 f
-?????????  ? ?      ?       ?            ? gaga
drwxr-xr-x  2 root   root 4096 Dec 12 21:40 tmp

On OSS:

Dec 12 22:25:05 oss1 kernel: : LustreError: 14167:0:(ldlm_resource.c:1165:ldlm_resource_get()) HC3WORK-OST0005: lvbo_init failed for resource 0x3ee4bd0:0x0: rc = -2
One thing that's odd is that all the other OSTs on the system delete orphan objects around that object ID number, but not ost5:

# echo $((0x3ee4bd0))
65948624

Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0002: deleting orphan objects from 0x0:66829746 to 0x0:66830014
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0006: deleting orphan objects from 0x0:66151265 to 0x0:66151535
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0000: deleting orphan objects from 0x0:66341886 to 0x0:66342155
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767207
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0003: deleting orphan objects from 0x0:66145109 to 0x0:66145379

Another weird thing is that the OSTs seem to delete the same objects repeatedly:

Dec 9 15:04:30 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015
Dec 9 15:11:41 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015
Dec 10 09:58:31 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767047
Dec 10 16:20:25 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767079
Dec 10 16:33:00 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767111
Dec 11 15:54:57 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767143
Dec 11 16:50:03 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767175
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767207

The filesystem was put back into production by disabling the OSTs that have this symptom. Are there any suggestions for what to look at in order to further debug this issue? Any logs we should get?

Thanks, |
| Comments |
| Comment by Kit Westneat (Inactive) [ 18/Dec/13 ] |
|
I've been looking at the prealloc info on the MDS and the LAST_ID on the OST:

[root@mds1 ~]# grep . /proc/fs/lustre/osc/HC3WORK-OST0005-osc-MDT0000/pre*
/proc/fs/lustre/osc/HC3WORK-OST0005-osc-MDT0000/prealloc_last_id:65948661
/proc/fs/lustre/osc/HC3WORK-OST0005-osc-MDT0000/prealloc_last_seq:0x100050000
/proc/fs/lustre/osc/HC3WORK-OST0005-osc-MDT0000/prealloc_next_id:65948632
/proc/fs/lustre/osc/HC3WORK-OST0005-osc-MDT0000/prealloc_next_seq:0x100050000
/proc/fs/lustre/osc/HC3WORK-OST0005-osc-MDT0000/prealloc_reserved:0
/proc/fs/lustre/osc/HC3WORK-OST0005-osc-MDT0000/prealloc_status:-19

# debugfs -c -R 'dump O/0/LAST_ID /tmp/LAST_ID' /dev/mapper/ost_HC3WORK_5
# hexdump -e '1/8 "%d\n"' /tmp/LAST_ID
65988995

So it looks like the OST's LAST_ID is significantly larger than what the MDT sees. What could be causing that? A corrupted last_rcvd file on the MDT? |
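For reference, the same value can be decoded without hexdump; a minimal sketch, assuming LAST_ID is a single little-endian 64-bit integer (the same assumption the hexdump format string above makes):

```python
import struct

# /tmp/LAST_ID is the file produced by the debugfs 'dump' command above.
with open("/tmp/LAST_ID", "rb") as f:
    (last_id,) = struct.unpack("<Q", f.read(8))

print(last_id)  # prints 65988995 for this OST
```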
| Comment by Peter Jones [ 18/Dec/13 ] |
|
Niu Could you please look into this one? Thanks Peter |
| Comment by Niu Yawei (Inactive) [ 19/Dec/13 ] |
|
The LAST_ID on the OST is much larger than the last_id on the MDT; since the difference is even greater than OST_MAX_PRECREATE (20000), the orphan cleanup wasn't triggered on ost5. I'm not sure which LAST_ID is corrupted. Were there any abnormal messages on ost5 or the MDT before this happened? That could give us a clue as to how it happened. What is the id of the last physical object on ost5? Could you check? |
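To make the gap concrete, using the values reported above (a small illustrative calculation; the variable names are mine, not Lustre identifiers):

```python
# Values reported earlier in this ticket.
ost_last_id = 65988995            # LAST_ID dumped from ost5 with debugfs/hexdump
mdt_prealloc_last_id = 65948661   # prealloc_last_id seen by the MDT for OST0005
OST_MAX_PRECREATE = 20000         # precreation window mentioned above

gap = ost_last_id - mdt_prealloc_last_id
print(gap)                        # 40334
print(gap > OST_MAX_PRECREATE)    # True, so orphan cleanup is not triggered
```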
| Comment by Kit Westneat (Inactive) [ 19/Dec/13 ] |
|
The last physical object ID is 65988995, so it seems like a problem with the MDT. I'll upload the MDS and OSS logs. Do you think deleting the last_rcvd file on the MDT would fix the issue? |
| Comment by Kit Westneat (Inactive) [ 19/Dec/13 ] |
|
mds2 is acting as the client in this case. |
| Comment by Niu Yawei (Inactive) [ 20/Dec/13 ] |
|
Is the MDS running 2.4? I see some 2.1 symbols in the MDS log. I couldn't tell from the log how this happened. And yes, deleting the lov_objids file on the MDS, or just changing the bad value in lov_objids, may fix the problem. |
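If it helps while planning that change, here is a minimal sketch of how the per-OST values in that file could be inspected before editing anything, assuming the common layout of one little-endian u64 last-object-id per OST index (the dump path below is illustrative):

```python
import struct

# Dump the file from the MDT first, e.g. with debugfs:
#   debugfs -c -R 'dump lov_objid /tmp/lov_objid' <mdt device>
# then decode it as an array of little-endian 64-bit last object ids, one per OST index.
with open("/tmp/lov_objid", "rb") as f:
    data = f.read()

for ost_index in range(len(data) // 8):
    (objid,) = struct.unpack_from("<Q", data, ost_index * 8)
    print(f"OST{ost_index:04x}: last objid {objid}")
```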
| Comment by Kit Westneat (Inactive) [ 20/Dec/13 ] |
|
I think it was upgraded Dec 9, so it's possible that messages from before then contain 2.1.x symbols. Do you know why it would keep deleting the same orphan objects? It looks like object 65766910 is deleted at least 8 times. |
| Comment by Niu Yawei (Inactive) [ 23/Dec/13 ] |
You mean deleting orphans on ost4?

Dec 9 15:04:30 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015
...
Dec 9 15:11:41 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015
...
Dec 10 09:58:31 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767047
...
Dec 10 16:20:25 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767079
...
Dec 10 16:33:00 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767111
...
Dec 11 15:54:57 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767143
...
Dec 11 16:50:03 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767175
...
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767207

I think it's probably that no file striped on ost4 was created, so orphan cleanup always starts from 65766910 (everything beyond 65766910 is a precreated object).
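In other words, a simplified model of those log lines (not actual Lustre code): the start of each cleanup range is the last id the MDT has used on that OST plus one, and the end is whatever the OST has precreated since, so if nothing is ever allocated on ost4 the start never moves while the end keeps growing:

```python
# Simplified illustration of the ost4 log lines above; not actual Lustre code.
mdt_last_used = 65766909     # last object id the MDT handed out on ost4
precreated_tail = [65767015, 65767015, 65767047, 65767079,
                   65767111, 65767143, 65767175, 65767207]

for end in precreated_tail:
    # each reconnect cleans from (last used + 1) up to the current precreated tail
    print(f"deleting orphan objects from 0x0:{mdt_last_used + 1} to 0x0:{end}")
```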
| Comment by Kit Westneat (Inactive) [ 23/Dec/13 ] |
|
Hmm, I think that is unlikely, as the filesystem is in production. I also don't understand why it would keep removing 65766910. Does it not keep track of orphans already deleted? |
| Comment by Niu Yawei (Inactive) [ 24/Dec/13 ] |
|
I mean that no files striped on ost4 were created during Dec 9 ~ Dec 12; is that possible? The deleted orphans will be recreated by precreation; that's why it has to delete orphans on each reboot. |
| Comment by Oleg Drokin [ 09/Jan/14 ] |
|
Kit, can you please tell us whether Niu's idea might be right, that ost4 did not see any create activity? |
| Comment by Kit Westneat (Inactive) [ 16/Jan/14 ] |
|
It's possible, but I think unlikely. The filesystem was in use then. We were able to remove the last_rcvd file from the MDT, but that didn't work. I just reread your previous comment and realized that you were talking about the lov_objids file, oops. I'll try to get another downtime to remove the lov_objids file. Thanks, |
| Comment by Kit Westneat (Inactive) [ 29/Jan/14 ] |
|
Removing the lov_objids file seemed to fix the problem, so I think this ticket can be closed. |
| Comment by Peter Jones [ 29/Jan/14 ] |
|
ok thanks Kit! |