Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.10.4
-
server: RHEL 7.4 derivative, zfs-0.7.11-4llnl.ch6.x86_64, lustre-2.10.4_1.chaos
client: RHEL 7.4 derivative, lustre-2.8.2_4.chaos-1
We are using DNE 1 with two MDTs on two servers, porter81 and porter82
for the zfs tag, see https://github.com/LLNL/zfs/releases
for our 2.10 tag, see https://github.com/LLNL/lustre/
for our 2.8 tag, see lustre-release-fe-llnl on gerrit
-
2
Description
A directory has an entry for subdirectory "2fe", but the object ID stored for that entry does not exist:
alias ll="ls -l"
[root@catalyst101:~]# ll /p/lustre3/videousr/YLI/mmcommons/data/images_v1
ls: cannot access /p/lustre3/videousr/YLI/mmcommons/data/images_v1/2fe: No such file or directory
total 0
d????????? ? ? ? ? ? 2fe
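The symptom above (a name returned by the directory listing that cannot be stat'ed) can be searched for from a client. The following is only a minimal sketch, not a tool used in this ticket: the function name `find_dangling_entries` and its `lstat` parameter are hypothetical, and a file removed between the listing and the stat call would be reported as a false positive, so it is only meaningful on a quiescent tree.

```python
import errno
import os


def find_dangling_entries(directory, lstat=os.lstat):
    """Return names listed in `directory` whose lstat() fails with ENOENT.

    `lstat` is injectable so the check can be exercised without a
    damaged file system; by default the real os.lstat is used.
    """
    dangling = []
    for name in sorted(os.listdir(directory)):
        try:
            lstat(os.path.join(directory, name))
        except OSError as exc:
            if exc.errno == errno.ENOENT:
                # The directory lists the name, but the object behind it is gone.
                dangling.append(name)
            else:
                raise
    return dangling
```

Called on the directory from the listing above, a check like this would be expected to report "2fe".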
And when using zdb on the MDT to examine images_v1, one sees that 2fe refers to an object ID that is invalid:
[root@porter81:snap]# zdb -ddddd porter81/mdt0 533741247
Dataset porter81/mdt0 [ZPL], ID 148, cr_txg 98, 910G, 61852198 objects, rootbp DVA[0]=<4:88d9c400:200> DVA[1]=<5:25ca03c200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1214040L/1214040P fill=61852198 cksum=139cf672b7:5dc8d6146f6:f8e6add4f57c:1e27e38477f5c0

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
 533741247    2   128K    16K   231K     512   528K  100.00  ZFS directory
                                               192   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED SPILL_BLKPTR
    dnode maxblkid: 32
    path    ???<object#533741247>
    uid     0
    gid     2093
    atime   Mon Oct 8 11:01:28 2018
    mtime   Wed Oct 3 15:53:08 2018
    ctime   Wed Oct 3 15:53:08 2018
    crtime  Mon Oct 1 20:53:54 2018
    gen     1090081
    mode    42700
    size    2
    parent  533740502
    links   3
    pflags  0
    rdev    0x0000000000000000
    SA xattrs: 204 bytes, 3 entries

        trusted.lma = \000\000\000\000\000\000\000\0002@\000\000\002\000\000\000\245\037\001\000\000\000\000\000
        trusted.link = \337\361\352\021\001\000\000\0003\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\033\000\000\000\002\000\000@F\000\0001\213\000\000\000\000images_v1
        trusted.version = \022\231\236+\011\000\000\000
    Fat ZAP stats:
        Pointer table:
            1024 elements
            zt_blk: 0
            zt_numblks: 0
            zt_shift: 10
            zt_blks_copied: 0
            zt_nextblk: 0
        ZAP entries: 1
        Leaf blocks: 32
        Total blocks: 33
        zap_block_type: 0x8000000000000001
        zap_magic: 0x2f52ab2ab
        zap_salt: 0x3e3cbee7f
        Leafs with 2^n pointers:
            5: 32 ********************************
        Blocks with n*5 entries:
            0: 32 ********************************
        Blocks n/10 full:
            1: 32 ********************************
        Entries with n chunks:
            4: 1 *
        Buckets with n entries:
            0: 16383 ****************************************
            1: 1 *

        2fe = 533742980 (type: Directory)
Indirect blocks:
        0 L1 6:1a0095d000:a00 20000L/a00P F=33 B=1133009/1133009
        0 L0 4:d99372200:200 4000L/200P F=1 B=1133009/1133009
     4000 L0 4:2b78affa00:e00 4000L/e00P F=1 B=1132989/1132989
     8000 L0 4:1a409fa00:e00 4000L/e00P F=1 B=1133008/1133008
     c000 L0 4:dbecc8800:e00 4000L/e00P F=1 B=1133003/1133003
    10000 L0 4:2d07544a00:e00 4000L/e00P F=1 B=1132997/1132997
    14000 L0 5:11130c9600:e00 4000L/e00P F=1 B=1133005/1133005
    18000 L0 5:1053a11c00:e00 4000L/e00P F=1 B=1132991/1132991
    1c000 L0 4:2d07545800:e00 4000L/e00P F=1 B=1132997/1132997
    20000 L0 6:1a41dd7c00:e00 4000L/e00P F=1 B=1133002/1133002
    24000 L0 5:112ca4cc00:e00 4000L/e00P F=1 B=1133007/1133007
    28000 L0 5:559e31000:e00 4000L/e00P F=1 B=1133000/1133000
    2c000 L0 4:d91a7e000:e00 4000L/e00P F=1 B=1133004/1133004
    30000 L0 4:d99372400:e00 4000L/e00P F=1 B=1133009/1133009
    34000 L0 4:265bf62800:e00 4000L/e00P F=1 B=1132993/1132993
    38000 L0 6:134c5fcc00:e00 4000L/e00P F=1 B=1132992/1132992
    3c000 L0 5:559e31e00:e00 4000L/e00P F=1 B=1133000/1133000
    40000 L0 5:11130ca400:e00 4000L/e00P F=1 B=1133005/1133005
    44000 L0 4:dbeccac00:e00 4000L/e00P F=1 B=1133003/1133003
    48000 L0 4:2b78b02200:e00 4000L/e00P F=1 B=1132989/1132989
    4c000 L0 6:134c5ff400:e00 4000L/e00P F=1 B=1132992/1132992
    50000 L0 4:1a40a2400:e00 4000L/e00P F=1 B=1133008/1133008
    54000 L0 5:11130cb200:e00 4000L/e00P F=1 B=1133005/1133005
    58000 L0 6:19f0f10c00:e00 4000L/e00P F=1 B=1132991/1132991
    5c000 L0 4:1a40a3200:e00 4000L/e00P F=1 B=1133008/1133008
    60000 L0 7:b97b6aa00:e00 4000L/e00P F=1 B=1133004/1133004
    64000 L0 5:112ca4f400:e00 4000L/e00P F=1 B=1133007/1133007
    68000 L0 4:17f825800:e00 4000L/e00P F=1 B=1132999/1132999
    6c000 L0 6:1a2429de00:e00 4000L/e00P F=1 B=1132995/1132995
    70000 L0 6:1a41dd9a00:e00 4000L/e00P F=1 B=1133002/1133002
    74000 L0 7:129d29e800:e00 4000L/e00P F=1 B=1133007/1133007
    78000 L0 4:dbeccca00:e00 4000L/e00P F=1 B=1133003/1133003
    7c000 L0 4:17f826600:e00 4000L/e00P F=1 B=1132999/1132999
    80000 L0 5:569fa5000:e00 4000L/e00P F=1 B=1132994/1132994

        segment [0000000000000000, 0000000000084000) size 528K

[root@porter81:snap]# zdb -ddddd porter81/mdt0 533742980
Dataset porter81/mdt0 [ZPL], ID 148, cr_txg 98, 910G, 61852198 objects, rootbp DVA[0]=<4:88d9c400:200> DVA[1]=<5:25ca03c200:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=1214040L/1214040P fill=61852198 cksum=139cf672b7:5dc8d6146f6:f8e6add4f57c:1e27e38477f5c0

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
zdb: dmu_bonus_hold(533742980) failed, errno 2
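The two-step check above (read the directory's ZAP entries, then ask zdb for each referenced object) could be scripted over captured zdb output. This is a rough sketch that assumes the output format shown in this ticket (zfs-0.7.11); other zdb versions may print these lines differently. The names `zap_entries` and `missing_object` are hypothetical helpers, not part of zdb or Lustre.

```python
import re

# Matches ZAP directory entries such as "2fe = 533742980 (type: Directory)".
ZAP_ENTRY = re.compile(
    r"^\s*(?P<name>\S.*?) = (?P<obj>\d+) \(type: (?P<type>[^)]+)\)\s*$"
)
# Matches the failure line "zdb: dmu_bonus_hold(533742980) failed, errno 2".
HOLD_FAILED = re.compile(
    r"dmu_bonus_hold\((?P<obj>\d+)\) failed, errno (?P<errno>\d+)"
)


def zap_entries(zdb_text):
    """Map entry name -> (object ID, type) for a directory object's zdb dump."""
    entries = {}
    for line in zdb_text.splitlines():
        m = ZAP_ENTRY.match(line)
        if m:
            entries[m.group("name")] = (int(m.group("obj")), m.group("type"))
    return entries


def missing_object(zdb_text):
    """Return (object ID, errno) if zdb could not hold the object, else None."""
    m = HOLD_FAILED.search(zdb_text)
    return (int(m.group("obj")), int(m.group("errno"))) if m else None
```

Feeding the first dump to `zap_entries` yields the object ID to query next, and `missing_object` on the second dump flags the entry as dangling when errno is 2 (ENOENT).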
This is on a new file system that has not been used by end-users yet, but which we attempted to copy data to. More specifically:
1. We copied about 500 million files/dirs to it
2. We tried to use lfs migrate -M to move some large subtrees from one MDT to another, but that failed due to a Lustre 2.8 bug with lfs migrate
3. We deleted most of the files/dirs
- The servers did not crash, as far as I can recall, while we were performing all the copy and delete operations. But I cannot be certain of that.
- We inspected the console logs on the servers and clients but found nothing indicating that object creation or destruction had failed.