[LU-14119] FID-in-LMA [fid1] does not match the object self-fid [fid2] Created: 05/Nov/20 Updated: 16/Jul/21 Resolved: 19/Apr/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.12.7, Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Olaf Faaland | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | zfs-0.7 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
In console log on the MDS:

Nov 5 09:16:10 zinc1 kernel: LustreError: 16479:0:(osd_object.c:481:osd_check_lma()) lsh-MDT0000: FID-in-LMA [0x200044dd2:0xcbc2:0x0] does not match the object self-fid [0x20003a324:0x8267:0x0]

One FID resolves with fid2path, and a stat of the path returned is successful:

[root@oslic8:~]# lfs fid2path /p/lustre2/ 0x200044dd2:0xcbc2:0x0
/p/lustre2/settgast/GEOSX/src/cmake/blt/tests/internal/src/combine_static_library_test/Foo3.cpp
[root@oslic8:~]# stat /p/lustre2/settgast/GEOSX/src/cmake/blt/tests/internal/src/combine_static_library_test/Foo3.cpp
File: '/p/lustre2/settgast/GEOSX/src/cmake/blt/tests/internal/src/combine_static_library_test/Foo3.cpp'
Size: 445 Blocks: 56 IO Block: 4194304 regular file
Device: a2f6a642h/2734073410d Inode: 144119920358116290 Links: 1
Access: (0600/-rw-------) Uid: (56443/settgast) Gid: (56443/settgast)
Access: 2020-03-29 21:07:15.000000000 -0700
Modify: 2020-03-29 21:07:16.000000000 -0700
Change: 2020-03-29 21:07:16.000000000 -0700
Birth: -

But the other fid2path hangs:

[root@oslic8:~]# lfs fid2path /p/lustre2/ 0x20003a324:0x8267:0x0 &
[1] 53840

The hanging fid2path has this stack:

[root@oslic8:~]# cat /proc/53840/stack
[<ffffffffc17b8dd8>] ptlrpc_set_wait+0x4d8/0x800 [ptlrpc]
[<ffffffffc17b9183>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[<ffffffffc0e8fc99>] mdc_get_info+0x169/0x5c0 [mdc]
[<ffffffffc0e96f63>] mdc_iocontrol+0x2313/0x2c50 [mdc]
[<ffffffffc0e05b9a>] obd_iocontrol+0x7a/0x350 [lmv]
[<ffffffffc0e0dd8a>] lmv_fid2path+0x17a/0x850 [lmv]
[<ffffffffc0e0f750>] lmv_iocontrol+0x330/0x1a00 [lmv]
[<ffffffffc1138867>] ll_fid2path+0x4a7/0x7a0 [lustre]
[<ffffffffc112461b>] ll_dir_ioctl+0x7fb/0x6d80 [lustre]
[<ffffffff9fc75800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffff9fc75b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa01c1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

We see this on two different file systems (at least). Both were upgraded from Lustre 2.10 to Lustre 2.12 several months ago. |
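When triaging many of these console messages, it can help to pull both FIDs out of each line so they can be fed to `lfs fid2path` one at a time. The following is a minimal sketch, assuming the message format stays exactly as quoted in the description; the function names and the returned dictionary layout are hypothetical and not part of any Lustre tool.

```python
import re

# One FID as printed in the log: [sequence:object-id:version], all hex.
_FID = r"0x[0-9a-f]+:0x[0-9a-f]+:0x[0-9a-f]+"

# Matches the osd_check_lma() message quoted in the description.
LMA_MISMATCH = re.compile(
    r"(?P<target>\S+): FID-in-LMA \[(?P<lma>" + _FID + r")\] "
    r"does not match the object self-fid \[(?P<self>" + _FID + r")\]"
)


def parse_fid(text):
    """Split 'seq:oid:ver' into a tuple of three integers."""
    seq, oid, ver = (int(part, 16) for part in text.split(":"))
    return seq, oid, ver


def parse_lma_mismatch(line):
    """Return target name and both FIDs, or None if the line does not match."""
    match = LMA_MISMATCH.search(line)
    if match is None:
        return None
    return {
        "target": match.group("target"),
        "lma_fid": parse_fid(match.group("lma")),
        "self_fid": parse_fid(match.group("self")),
    }
```

Collecting the distinct `lma_fid` and `self_fid` values per target gives a short list of FIDs to check by hand, rather than re-reading the console log.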
| Comments |
| Comment by Olaf Faaland [ 05/Nov/20 ] |
|
For my record-keeping, my local ticket is TOSS4904.

There is another report, LU-13392, which describes the same error message but with some differing symptoms. In our case, the errors are being reported by MDTs.

The two file systems were created under either Lustre 2.8 or 2.10. I can probably figure out which if necessary.

I chose several messages at random (with distinct fid1 values). In all cases I could stat the file successfully. |
| Comment by Peter Jones [ 06/Nov/20 ] |
|
Lai

Could you please investigate?

Thanks

Peter |
| Comment by Lai Siyao [ 11/Nov/20 ] |
|
Can you dump the debug log on the MDS while 'lfs fid2path' is hung, along with backtraces of the processes on the MDS?

Was OI scrub enabled? If it is, osd_check_lma() should start OI scrub when it reports this error and try to fix it. |
| Comment by Olaf Faaland [ 20/Nov/20 ] |
|
Hi Lai,

I'm attaching try-0x4000034bf.0x3b1a.0x0.tar.gz which has logs from both client and server, including backtraces obtained with sysrq.

If I'm reading dmesg correctly, OI scrub has been triggered repeatedly by these errors. The currently running scrub on MDT0008:

[root@copper9:~]# lctl get_param '*.*.oi_scrub'
osd-zfs.ls1-MDT0008.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 128
status: scanning
flags: auto
param:
time_since_last_completed: 185 seconds
time_since_latest_start: 159 seconds
time_since_last_checkpoint: 39 seconds
latest_start_position: 1
last_checkpoint_position: 11420721
first_failure_position: N/A
checked: 5726328
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 3929
run_time: 159 seconds
average_speed: 36014 objects/sec
real-time_speed: 29961 objects/sec
current_position: 13838695
scrub_in_prior: no
scrub_full_speed: yes
partial_scan: no

thanks |
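The `oi_scrub` output above is a flat list of `key: value` lines, so it is easy to check mechanically for the pattern described here: an auto-triggered scrub that has completed many times without updating anything. The sketch below assumes that layout; the helper names and the `min_runs` threshold are hypothetical, and the check is a heuristic for triage, not a rule defined by Lustre.

```python
def parse_oi_scrub(text):
    """Turn `lctl get_param *.*.oi_scrub` output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # Lines without a colon (e.g. the 'osd-zfs...oi_scrub=' header)
            # carry no key/value pair.
            continue
        fields[key.strip()] = value.strip()
    return fields


def looks_like_ineffective_rescan(fields, min_runs=100):
    """Heuristic: scrub was auto-triggered, has completed many times,
    yet the current pass has updated nothing so far."""
    return (
        fields.get("flags") == "auto"
        and int(fields.get("updated", "0")) == 0
        and int(fields.get("success_count", "0")) >= min_runs
    )
```

With the values shown in this comment (`flags: auto`, `updated: 0`, `success_count: 3929`) the heuristic would flag MDT0008 as a target where scrub keeps restarting without repairing the mapping.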
| Comment by Lai Siyao [ 06/Jan/21 ] |
|
The sysrq backtrace is not complete: it doesn't include the MDT handler process, which should be calling mdt_path() to resolve the FID to a path. The debug logs also lack messages from the related processes, which may not have been captured in time. It would be good to collect more useful logs when this is seen again.

I'm reviewing the OI scrub code. |
| Comment by Lai Siyao [ 07/Jan/21 ] |
|
Can you enable the LFSCK debug log with 'lctl set_param debug=+lfsck', run 'lctl lfsck_start -A -t scrub -A' on MDT0, and dump the debug logs on all MDTs once 'lctl lfsck_query -M ...' shows that LFSCK has finished? |
| Comment by Lai Siyao [ 08/Jan/21 ] |
|
'lfs fid2path' hung because OI scrub is running: mdt_object_find() returns -EINPROGRESS, and on that failure the client keeps retrying. If OI scrub can't fix the issue, the client will retry forever, and since the fid2path RPC is not interruptible, the command gets stuck (IMO this RPC should be interruptible).

If OI scrub doesn't fix this issue, we should try an LFSCK namespace check to see whether that works. |
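The hang described above can be illustrated with a toy model of the retry loop. This is only a sketch of the behaviour as explained in this comment, not the Lustre client code: `send_rpc`, `max_attempts`, and both fake servers are hypothetical, and the real client has no such attempt limit (which is the point of the comment).

```python
import errno

EINPROGRESS = errno.EINPROGRESS


def fid2path_with_retry(send_rpc, fid, max_attempts=None):
    """Resend the request while the server answers -EINPROGRESS.

    With max_attempts=None this mirrors the behaviour described in the
    comment: if the server never stops answering -EINPROGRESS, the loop
    never ends. A limit (hypothetical) makes the caller give up instead.
    Returns (rc, path, attempts).
    """
    attempts = 0
    while True:
        rc, path = send_rpc(fid)
        attempts += 1
        if rc != -EINPROGRESS:
            return rc, path, attempts
        if max_attempts is not None and attempts >= max_attempts:
            return rc, None, attempts


def scrub_never_fixes(fid):
    """Fake server: OI scrub keeps running and never repairs the mapping."""
    return -EINPROGRESS, None


def make_scrub_fixes_after(busy_replies, path):
    """Fake server: answers -EINPROGRESS `busy_replies` times, then succeeds."""
    state = {"calls": 0}

    def send_rpc(fid):
        state["calls"] += 1
        if state["calls"] <= busy_replies:
            return -EINPROGRESS, None
        return 0, path

    return send_rpc
```

Calling `fid2path_with_retry(scrub_never_fixes, fid)` without a limit would spin indefinitely, which corresponds to the stuck `lfs fid2path` process in the description; passing `max_attempts` shows what a bounded or interruptible request would look like from the caller's side.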
| Comment by Gerrit Updater [ 14/Jan/21 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41218 |
| Comment by Gerrit Updater [ 14/Jan/21 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41219 |
| Comment by Gerrit Updater [ 18/Jan/21 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41261 |
| Comment by Gerrit Updater [ 20/Jan/21 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41274 |
| Comment by Gerrit Updater [ 03/Feb/21 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41402 |
| Comment by Andreas Dilger [ 03/Feb/21 ] |
|
In a case like this where a bad FID->inode mapping is found (i.e. FID X points to inode Y, but inode Y has a different FID Z) then it would be better if the bad FID entry was deleted from the OI file.

One problem with deleting the whole OI is that it can take a long time to rebuild the OI for a large MDT, since OI Scrub may have to walk billions of inodes and write billions of records in random 64-byte chunks (i.e. the OI files for a large MDT might be 100GB+ in size). While the OI files are being rebuilt, clients may hang waiting on -EINPROGRESS RPCs to finish (though we try to avoid that), which might take an hour or more.

Deleting one bad OI entry is a much lighter weight fix for the problem, but complete OI removal might be needed if it is totally corrupt for some reason, and I agree that "-o resetoi" is much easier for end users than doing a local mount and deleting the OI by hand.

One of the proposals when LFSCK was being developed was to save the old OI files during OI Scrub, then use them for lookup of entries that are not yet added into the new OI file. That would avoid most slowdowns for the clients during rebuild, and be a read-only backup of the OI if the main OI file is corrupted. The backup wouldn't be totally up to date for new OI entries, but could avoid many problems until OI rebuild is finished, at which time the old OI files would be deleted (or kept until the next rebuild). That is work for a different ticket, however. |
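The trade-off in this comment can be sketched with plain dictionaries standing in for the on-disk structures. Everything here is an illustrative model under stated assumptions: the OI is modelled as a `FID -> inode` dict, the LMA as an `inode -> self FID` dict, and all function names are hypothetical. It is not how osd-zfs or OI Scrub is implemented.

```python
def find_bad_oi_entries(oi, lma_by_inode):
    """Return FIDs X where oi[X] = inode Y but Y's own FID Z differs from X."""
    bad = []
    for fid, inode in oi.items():
        self_fid = lma_by_inode.get(inode)
        if self_fid is not None and self_fid != fid:
            bad.append(fid)
    return bad


def drop_bad_oi_entries(oi, lma_by_inode):
    """Lightweight fix: delete only the inconsistent entries, keep the rest."""
    bad = find_bad_oi_entries(oi, lma_by_inode)
    for fid in bad:
        del oi[fid]
    return bad


def lookup_with_backup(new_oi, old_oi, fid):
    """Proposal from the comment: during a rebuild, fall back to the saved
    (read-only) old OI for entries not yet present in the new one."""
    if fid in new_oi:
        return new_oi[fid]
    return old_oi.get(fid)


def oi_size_bytes(num_records, record_size=64):
    """Rough OI size for a given record count, using the 64-byte record
    size mentioned in the comment."""
    return num_records * record_size
```

As a rough check of the size estimate, `oi_size_bytes(2_000_000_000)` is 128,000,000,000 bytes, which is consistent with "100GB+" for a couple of billion records. The targeted delete touches only the mismatched entries, whereas a full reset would force every one of those records to be rewritten.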
| Comment by Gerrit Updater [ 24/Feb/21 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41741 |
| Comment by Gerrit Updater [ 22/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41218/ |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41219/ |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41274/ |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41741/ |
| Comment by Gerrit Updater [ 06/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41402/ |
| Comment by Gerrit Updater [ 10/Apr/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41261/ |
| Comment by Gerrit Updater [ 12/Apr/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43265 |
| Comment by Gerrit Updater [ 12/Apr/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43266 |
| Comment by Gerrit Updater [ 12/Apr/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43267 |
| Comment by Gerrit Updater [ 12/Apr/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43268 |
| Comment by Gerrit Updater [ 12/Apr/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43269 |
| Comment by Gerrit Updater [ 12/Apr/21 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43270 |
| Comment by Peter Jones [ 19/Apr/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 16/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43266/ |
| Comment by Gerrit Updater [ 16/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43270/ |
| Comment by Gerrit Updater [ 16/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43269/ |
| Comment by Gerrit Updater [ 16/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43267/ |
| Comment by Gerrit Updater [ 16/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43268/ |
| Comment by Gerrit Updater [ 16/May/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43265/ |