Details
Bug
Resolution: Fixed
Major
None
Lustre 2.4.1
None
3
12059
Description
KIT has run into an issue where the MDT is creating files with objects that do not exist. Some of the symptoms look similar to LU-4034.
On client:
[root@client scc]# touch tmp/gaga
touch: setting times of `tmp/gaga': No such file or directory
[root@client scc]# lfs getstripe tmp/gaga
tmp/gaga
lmm_stripe_count: 4
lmm_stripe_size: 1048576
lmm_layout_gen: 0
lmm_stripe_offset: 5
obdidx objid objid group
5 65948624 0x3ee4bd0 0
25 66739551 0x3fa5d5f 0
9 65922640 0x3ede650 0
24 66084357 0x3f05e05 0
LustreError: 11-0: HC3WORK-OST0005-osc-ffff8804987dec00: Communicating with 172.26.3.138@o2ib, operation ldlm_enqueue failed with -12.
[root@mds2 perftest]# ls -al
ls: cannot access eaea: Cannot allocate memory
ls: cannot access gaga: Cannot allocate memory
total 12
drwxr-xr-x 3 er2341 scc 4096 Dec 12 21:40 .
drwx------ 10 er2341 scc 4096 Sep 19 16:16 ..
-rw-r--r-- 1 root root 0 Dec 12 21:41 e
-????????? ? ? ? ? ? eaea
-rw-r--r-- 1 root root 0 Dec 12 21:41 f
-????????? ? ? ? ? ? gaga
drwxr-xr-x 2 root root 4096 Dec 12 21:40 tmp
On OSS:
Dec 12 22:25:05 oss1 kernel: : LustreError: 14167:0:(ldlm_resource.c:1165:ldlm_resource_get()) HC3WORK-OST0005: lvbo_init failed for resource 0x3ee4bd0:0x0: rc = -2
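For reference, the two return codes in the messages above (-12 from ldlm_enqueue on the client, -2 from lvbo_init on the OSS) can be decoded with a short sketch like the one below. It assumes the codes are negated Linux errno values and that it is run on a Linux host; the helper name decode_rc is made up for illustration.

```python
import errno
import os


def decode_rc(rc):
    """Map a negative kernel-style return code to (errno name, message).

    Assumption: rc is a negated Linux errno value, as is usual for
    kernel return codes. decode_rc is a hypothetical helper name.
    """
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# The two codes seen in the client and OSS messages above.
for rc in (-12, -2):
    name, text = decode_rc(rc)
    print(rc, name, text)
```

Under that assumption, -12 corresponds to ENOMEM, which matches the "Cannot allocate memory" text from ls, and -2 corresponds to ENOENT, which is consistent with the object not existing on the OST.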
One thing that's odd is that all the other OSTs on the system delete orphan objects around that object ID number, but not OST0005:
# echo $((0x3ee4bd0))
65948624
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0002: deleting orphan objects from 0x0:66829746 to 0x0:66830014
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0006: deleting orphan objects from 0x0:66151265 to 0x0:66151535
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0000: deleting orphan objects from 0x0:66341886 to 0x0:66342155
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767207
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0003: deleting orphan objects from 0x0:66145109 to 0x0:66145379
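The same check can be scripted against a larger syslog extract. The sketch below is only an illustration: it assumes the "deleting orphan objects" message format shown above, and the names ORPHAN_RE and orphan_ranges are made up. It converts the hex object ID from lfs getstripe and lists which OSTs logged an orphan cleanup.

```python
import re

# Assumed message format, taken from the oss1 syslog lines above.
ORPHAN_RE = re.compile(
    r"(?P<ost>\S+-OST[0-9a-fA-F]{4}): deleting orphan objects "
    r"from 0x0:(?P<start>\d+) to 0x0:(?P<end>\d+)"
)


def orphan_ranges(log_lines):
    """Return {ost_name: (start, end)} for the last cleanup seen per OST."""
    ranges = {}
    for line in log_lines:
        m = ORPHAN_RE.search(line)
        if m:
            ranges[m.group("ost")] = (int(m.group("start")), int(m.group("end")))
    return ranges


log = [
    "Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0002: deleting orphan objects from 0x0:66829746 to 0x0:66830014",
    "Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0006: deleting orphan objects from 0x0:66151265 to 0x0:66151535",
    "Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0000: deleting orphan objects from 0x0:66341886 to 0x0:66342155",
    "Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767207",
    "Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0003: deleting orphan objects from 0x0:66145109 to 0x0:66145379",
]

ranges = orphan_ranges(log)
objid = int("0x3ee4bd0", 16)  # object ID on OST index 5 from lfs getstripe

print("object id:", objid)
for ost in sorted(ranges):
    print(ost, ranges[ost])
print("OST0005 logged a cleanup:", "HC3WORK-OST0005" in ranges)
```

Each OST has its own object ID sequence, so the ranges are only comparable in magnitude across OSTs; the point of the listing is that HC3WORK-OST0005 does not appear at all.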
Another weird thing is that the OSTs seem to delete the same objects repeatedly:
Dec 9 15:04:30 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015
Dec 9 15:11:41 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015
Dec 10 09:58:31 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767047
Dec 10 16:20:25 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767079
Dec 10 16:33:00 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767111
Dec 11 15:54:57 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767143
Dec 11 16:50:03 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767175
Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767207
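To make the repetition easier to see, the start and end of each cleanup can be summarized per OST. This is a rough sketch that assumes the log format above; ORPHAN_RE and summarize are illustrative names, not part of any Lustre tool.

```python
import re
from collections import defaultdict

# Assumed message format, taken from the oss1 syslog lines above.
ORPHAN_RE = re.compile(
    r"(?P<ost>\S+-OST[0-9a-fA-F]{4}): deleting orphan objects "
    r"from 0x0:(?P<start>\d+) to 0x0:(?P<end>\d+)"
)


def summarize(log_lines):
    """Per OST: distinct range starts, range ends in log order, and end deltas."""
    per_ost = defaultdict(list)
    for line in log_lines:
        m = ORPHAN_RE.search(line)
        if m:
            per_ost[m.group("ost")].append(
                (int(m.group("start")), int(m.group("end")))
            )
    summary = {}
    for ost, runs in per_ost.items():
        ends = [end for _, end in runs]
        summary[ost] = {
            "starts": sorted({start for start, _ in runs}),
            "ends": ends,
            "end_steps": [b - a for a, b in zip(ends, ends[1:])],
        }
    return summary


log = [
    "Dec 9 15:04:30 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015",
    "Dec 9 15:11:41 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767015",
    "Dec 10 09:58:31 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767047",
    "Dec 10 16:20:25 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767079",
    "Dec 10 16:33:00 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767111",
    "Dec 11 15:54:57 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767143",
    "Dec 11 16:50:03 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767175",
    "Dec 12 15:01:02 oss1 kernel: : Lustre: HC3WORK-OST0004: deleting orphan objects from 0x0:65766910 to 0x0:65767207",
]

summary = summarize(log)
for ost, info in summary.items():
    print(ost, "starts:", info["starts"], "end steps:", info["end_steps"])
```

For the lines above, the start of the range never moves from 65766910, while the end grows by 32 on most reconnects.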
The filesystem was put back into production by disabling the OSTs that have this symptom. Are there any suggestions for what to look at in order to further debug this issue? Any logs we should get?
Thanks,
Kit