[LU-14119] FID-in-LMA [fid1] does not match the object self-fid [fid2] Created: 05/Nov/20  Updated: 16/Jul/21  Resolved: 19/Apr/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.12.7, Lustre 2.15.0

Type: Bug
Priority: Minor
Reporter: Olaf Faaland
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: llnl
Environment:

zfs-0.7
3.10.0-1127.18.2.1chaos.ch6.x86_64
lustre-2.12.5_5.llnl-1.ch6.x86_64


Attachments: File try-0x4000034bf.0x3b1a.0x0.tar.gz    
Issue Links:
Related
is related to LU-13392 FID-in-LMA does not match the object ... Open
is related to LU-14175 OI Scrub triggered followed by LBUG A... Resolved
is related to LU-13608 MDT stuck in WAITING, abort_recov stu... Resolved
is related to LU-13124 lfsck check for multiple linked file ... Resolved
Severity: 3

 Description   

In console log on the MDS:

Nov  5 09:16:10 zinc1 kernel: LustreError: 16479:0:(osd_object.c:481:osd_check_lma()) lsh-MDT0000: FID-in-LMA [0x200044dd2:0xcbc2:0x0] does not match the object self-fid [0x20003a324:0x8267:0x0]
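
For tallying these messages across a console log, a minimal Python sketch can pull both FIDs out of the line above. The function name parse_lma_mismatch is hypothetical and not part of any Lustre tool; the only assumption about the format is that each FID is printed as [seq:oid:ver] in hex, as it is in this message.

```python
import re

# A Lustre FID as printed in the log: [seq:oid:ver], each field in hex.
FID_RE = re.compile(r"\[(0x[0-9a-f]+):(0x[0-9a-f]+):(0x[0-9a-f]+)\]")


def parse_lma_mismatch(line):
    """Return (fid_in_lma, self_fid) as tuples of ints, or None if no match."""
    if "does not match the object self-fid" not in line:
        return None
    fids = FID_RE.findall(line)
    if len(fids) != 2:
        return None
    return tuple(tuple(int(field, 16) for field in fid) for fid in fids)


line = ("LustreError: 16479:0:(osd_object.c:481:osd_check_lma()) lsh-MDT0000: "
        "FID-in-LMA [0x200044dd2:0xcbc2:0x0] does not match the object "
        "self-fid [0x20003a324:0x8267:0x0]")

lma_fid, self_fid = parse_lma_mismatch(line)
print("FID-in-LMA:", [hex(x) for x in lma_fid])
print("self-fid:  ", [hex(x) for x in self_fid])
```

Collecting the distinct (FID-in-LMA, self-fid) pairs this way gives a list of FIDs to try with lfs fid2path.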

One FID resolves with fid2path, and a stat of the path returned is successful:

[root@oslic8:~]# lfs fid2path /p/lustre2/ 0x200044dd2:0xcbc2:0x0
/p/lustre2/settgast/GEOSX/src/cmake/blt/tests/internal/src/combine_static_library_test/Foo3.cpp
[root@oslic8:~]# stat /p/lustre2/settgast/GEOSX/src/cmake/blt/tests/internal/src/combine_static_library_test/Foo3.cpp
  File: '/p/lustre2/settgast/GEOSX/src/cmake/blt/tests/internal/src/combine_static_library_test/Foo3.cpp'
  Size: 445       	Blocks: 56         IO Block: 4194304 regular file
Device: a2f6a642h/2734073410d	Inode: 144119920358116290  Links: 1
Access: (0600/-rw-------)  Uid: (56443/settgast)   Gid: (56443/settgast)
Access: 2020-03-29 21:07:15.000000000 -0700
Modify: 2020-03-29 21:07:16.000000000 -0700
Change: 2020-03-29 21:07:16.000000000 -0700
 Birth: -

But the other fid2path hangs:

[root@oslic8:~]# lfs fid2path /p/lustre2/ 0x20003a324:0x8267:0x0 &
[1] 53840

 

The hanging fid2path has this stack:

[root@oslic8:~]# cat /proc/53840/stack
[<ffffffffc17b8dd8>] ptlrpc_set_wait+0x4d8/0x800 [ptlrpc]
[<ffffffffc17b9183>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[<ffffffffc0e8fc99>] mdc_get_info+0x169/0x5c0 [mdc]
[<ffffffffc0e96f63>] mdc_iocontrol+0x2313/0x2c50 [mdc]
[<ffffffffc0e05b9a>] obd_iocontrol+0x7a/0x350 [lmv]
[<ffffffffc0e0dd8a>] lmv_fid2path+0x17a/0x850 [lmv]
[<ffffffffc0e0f750>] lmv_iocontrol+0x330/0x1a00 [lmv]
[<ffffffffc1138867>] ll_fid2path+0x4a7/0x7a0 [lustre]
[<ffffffffc112461b>] ll_dir_ioctl+0x7fb/0x6d80 [lustre]
[<ffffffff9fc75800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffff9fc75b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa01c1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff 

We see this on two different file systems (at least). Both were upgraded from Lustre 2.10 to Lustre 2.12 several months ago.



 Comments   
Comment by Olaf Faaland [ 05/Nov/20 ]

For my record-keeping, my local ticket is TOSS4904

Another report, LU-13392, describes the same error message, but some of its symptoms differ.

In our case, the errors are being reported by MDTs. The two file systems were created under either Lustre 2.8 or 2.10. I can probably figure out which if necessary. I chose several messages at random (with distinct fid1 values). In all cases I could stat the file successfully.

Comment by Peter Jones [ 06/Nov/20 ]

Lai

Could you please investigate?

Thanks

Peter

Comment by Lai Siyao [ 11/Nov/20 ]

Can you dump the debug log on the MDS while 'lfs fid2path' is hung, along with backtraces of the processes on the MDS?

Was OI scrub enabled? If so, when osd_check_lma() reports this error it should start OI scrub, which will try to fix it.

Comment by Olaf Faaland [ 20/Nov/20 ]

Hi Lai,

I'm attaching try-0x4000034bf.0x3b1a.0x0.tar.gz which has logs from both client and server, including backtraces obtained with sysrq.

If I'm reading dmesg correctly, OI scrub has been triggered repeatedly by these errors. The currently running scrub on MDT0008 reports:

[root@copper9:~]# lctl get_param '*.*.oi_scrub'
osd-zfs.ls1-MDT0008.oi_scrub=
name: OI_scrub
magic: 0x4c5fd252
oi_files: 128
status: scanning
flags: auto
param:
time_since_last_completed: 185 seconds
time_since_latest_start: 159 seconds
time_since_last_checkpoint: 39 seconds
latest_start_position: 1
last_checkpoint_position: 11420721
first_failure_position: N/A
checked: 5726328
updated: 0
failed: 0
prior_updated: 0
noscrub: 0
igif: 0
success_count: 3929
run_time: 159 seconds
average_speed: 36014 objects/sec
real-time_speed: 29961 objects/sec
current_position: 13838695
scrub_in_prior: no
scrub_full_speed: yes
partial_scan: no 
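
To compare counters across repeated scrub runs, the output above can be read into a dictionary. This is a rough sketch: parse_oi_scrub is a hypothetical helper, and the only assumption about the format is the "key: value" layout visible above. The sample is a subset of the fields shown.

```python
def parse_oi_scrub(text):
    """Turn 'key: value' lines into a dict, skipping the parameter-name line."""
    stats = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.endswith("="):
            continue  # blank line, or the 'osd-zfs.<target>.oi_scrub=' header
        key, _, value = line.partition(":")
        stats[key.strip()] = value.strip()
    return stats


sample = """\
osd-zfs.ls1-MDT0008.oi_scrub=
name: OI_scrub
status: scanning
flags: auto
checked: 5726328
updated: 0
failed: 0
success_count: 3929
"""

stats = parse_oi_scrub(sample)
# Many completed runs, yet nothing updated or failed in this snapshot.
repaired_nothing = int(stats["updated"]) == 0 and int(stats["failed"]) == 0
print(stats["status"], stats["success_count"], repaired_nothing)
```

In this snapshot success_count is 3929 while updated and failed are both 0, which is consistent with the scrub being retriggered without repairing anything.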

thanks

Comment by Lai Siyao [ 06/Jan/21 ]

The sysrq backtrace is incomplete: it does not include the MDT service thread that should be calling mdt_path() to resolve the FID to a path. The debug logs also lack messages from the related processes, possibly because they were not captured in time. It would help to collect these logs the next time the hang is seen.

I'm reviewing the OI scrub code.

Comment by Lai Siyao [ 07/Jan/21 ]

Can you enable the LFSCK debug log with 'lctl set_param debug=+lfsck', run 'lctl lfsck_start -A -t scrub' on MDT0, and dump the debug logs on all MDTs once 'lctl lfsck_query -M ...' shows that LFSCK has finished?

Comment by Lai Siyao [ 08/Jan/21 ]

'lfs fid2path' hung because OI scrub was running: mdt_object_find() returns -EINPROGRESS, and on that failure the client keeps retrying. If OI scrub can't fix the inconsistency, the client retries forever, and since the fid2path RPC is not interruptible, the command gets stuck (IMO this RPC should be interruptible). If OI scrub doesn't fix this issue, we should try an LFSCK namespace check to see whether that works.
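
The retry behaviour described above can be modelled with a toy loop. This is only an illustration of the reasoning, not the ptlrpc or mdc code: send_rpc, signal_pending, and the -EINTR return value are all assumptions made for the sketch.

```python
import errno


def fid2path_with_retry(send_rpc, signal_pending, interruptible):
    """Resend while the server answers -EINPROGRESS.

    send_rpc() returns 0 or a negative errno; signal_pending() says whether
    the user has tried to interrupt the command. Both are stand-ins.
    """
    attempts = 0
    while True:
        attempts += 1
        rc = send_rpc()
        if rc != -errno.EINPROGRESS:
            return rc, attempts
        if interruptible and signal_pending():
            return -errno.EINTR, attempts
        # A non-interruptible request simply goes around again.


def make_server(fixed_after):
    """Server that answers -EINPROGRESS until call number fixed_after (None: never)."""
    calls = {"n": 0}

    def send_rpc():
        calls["n"] += 1
        if fixed_after is not None and calls["n"] >= fixed_after:
            return 0
        return -errno.EINPROGRESS

    return send_rpc


def make_signal(after):
    """Report a pending signal from poll number `after` onward."""
    polls = {"n": 0}

    def signal_pending():
        polls["n"] += 1
        return polls["n"] >= after

    return signal_pending


# Scrub repairs the mapping before the 4th attempt: the call completes.
print(fid2path_with_retry(make_server(4), lambda: False, interruptible=False))

# Scrub never repairs it: only an interruptible request can get out.
print(fid2path_with_retry(make_server(None), make_signal(3), interruptible=True))
```

The remaining combination (a server that never recovers and interruptible=False) is deliberately not run, since it loops forever; that is the hang seen here.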

Comment by Gerrit Updater [ 14/Jan/21 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41218
Subject: LU-14119 lfsck: replace dt_lookup() with dt_lookup_dir()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5ad96bcf8773f442c483bff7ed11fcad551eae0b

Comment by Gerrit Updater [ 14/Jan/21 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41219
Subject: LU-14119 mdc: set fid2path RPC interruptible
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1c4713ca6e239544c780c2775e54f723ee2f4bc4

Comment by Gerrit Updater [ 18/Jan/21 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41261
Subject: LU-14119 lfsck: check linkea if it's newly added
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4deb09f5ada3021370e8af95dd1e34f9f95eb87e

Comment by Gerrit Updater [ 20/Jan/21 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41274
Subject: LU-14119 osd-zfs: enable LUDA_VERIFY
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 067cad2eec1e0119a2e1adeae6cca8b5384fe582

Comment by Gerrit Updater [ 03/Feb/21 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41402
Subject: LU-14119 osd: add mount option "resetoi"
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: c509f6e78caac81adc4369fd100ed6345b136d21

Comment by Andreas Dilger [ 03/Feb/21 ]

In a case like this where a bad FID->inode mapping is found (i.e. FID X points to inode Y, but inode Y has a different FID Z) then it would be better if the bad FID entry was deleted from the OI file.
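
As a rough sketch of that idea, the OI can be modelled as a dictionary from FID to inode and the LMA as a dictionary from inode to the FID it records. Everything here is hypothetical: the function name, the inode number, and the assumption that the looked-up FID is the one logged as the object self-fid. The FID strings are borrowed from the description purely for illustration.

```python
def oi_lookup_pruning_stale(oi, lma_fid_of, fid):
    """Look up fid in the OI; drop the entry if the inode's LMA disagrees."""
    ino = oi.get(fid)
    if ino is None:
        return None
    if lma_fid_of[ino] != fid:
        # FID X -> inode Y, but inode Y records FID Z: the X entry is stale.
        del oi[fid]
        return None
    return ino


FID_X = "[0x20003a324:0x8267:0x0]"   # looked-up FID (assumed to be the self-fid)
FID_Z = "[0x200044dd2:0xcbc2:0x0]"   # FID stored in the inode's LMA
INODE_Y = 1001                       # made-up inode number

oi = {FID_X: INODE_Y, FID_Z: INODE_Y}
lma_fid_of = {INODE_Y: FID_Z}

print(oi_lookup_pruning_stale(oi, lma_fid_of, FID_Z))  # valid mapping is kept
print(oi_lookup_pruning_stale(oi, lma_fid_of, FID_X))  # stale mapping is removed
print(FID_X in oi)
```

Only the single bad entry is touched; the rest of the OI stays in place, which is the point of the lighter-weight fix.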

One problem with deleting the whole OI is that it can take a long time to rebuild the OI for a large MDT, since OI Scrub may have to walk billions of inodes and write billions of records in random 64-byte chunks (i.e. the OI files for a large MDT might be 100GB+ in size). While the OI files are being rebuilt, clients may hang waiting on -EINPROGRESS RPCs to finish (though we try to avoid that), which might take an hour or more. Deleting one bad OI entry is a much lighter weight fix for the problem, but complete OI removal might be needed if it is totally corrupt for some reason, and I agree that "-o resetoi" is much easier for end users than doing a local mount and deleting the OI by hand.
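
A back-of-the-envelope version of that estimate, as a sketch only: the 64-byte record size comes from the paragraph above and the scan rate is the average_speed reported in the earlier oi_scrub output. Applying that rate to a full rebuild is an assumption, and a rebuild that also writes records would presumably be slower.

```python
RECORD_BYTES = 64     # per-record size quoted in the comment above
SCRUB_RATE = 36014    # objects/sec, average_speed from the oi_scrub output


def oi_rebuild_estimate(inodes, record_bytes=RECORD_BYTES, rate=SCRUB_RATE):
    """Return (bytes of OI records, hours to scan) for a full rebuild."""
    return inodes * record_bytes, inodes / rate / 3600


size, hours = oi_rebuild_estimate(2_000_000_000)
print(f"{size / 1e9:.0f} GB of OI records, about {hours:.1f} h of scanning")
```

Two billion inodes at 64 bytes each already comes to 128 GB of records, in line with the "100GB+" figure.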

One of the proposals when LFSCK was being developed was to save the old OI files during OI Scrub, then use them for lookup of entries that are not yet added into the new OI file. That would avoid most slowdowns for the clients during rebuild, and be a read-only backup of the OI if the main OI file is corrupted. The backup wouldn't be totally up to date for new OI entries, but could avoid many problems until OI rebuild is finished, at which time the old OI files would be deleted (or kept until the next rebuild). That is work for a different ticket, however.
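
The lookup order in that proposal could look roughly like the following. This sketches an idea that, per the comment above, is not implemented; the function and variable names are invented for illustration.

```python
def oi_lookup_with_backup(fid, new_oi, old_oi):
    """Prefer the OI being rebuilt; fall back to the read-only saved copy."""
    ino = new_oi.get(fid)
    if ino is not None:
        return ino, "new"
    if old_oi is not None and fid in old_oi:
        return old_oi[fid], "backup"
    return None, "missing"


# Saved copy taken when the rebuild started (hypothetical contents).
old_oi = {"fid-a": 11, "fid-b": 12, "fid-c": 13}
# Rebuild has re-inserted fid-a so far; fid-d was created after the snapshot.
new_oi = {"fid-a": 11, "fid-d": 14}

for fid in ("fid-a", "fid-b", "fid-d", "fid-e"):
    print(fid, oi_lookup_with_backup(fid, new_oi, old_oi))
```

Entries not yet re-inserted (fid-b here) are served from the backup instead of making the caller wait, while entries created after the snapshot (fid-d) are found only in the new OI.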

Comment by Gerrit Updater [ 24/Feb/21 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41741
Subject: LU-14119 osd: delete stale OI mapping entry
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a82751aa6df41a4c1b9ab0d0b52c49ed8e86f9eb

Comment by Gerrit Updater [ 22/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41218/
Subject: LU-14119 lfsck: replace dt_lookup() with dt_lookup_dir()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d525ad4bd0d5d851405e4249859a1c77378f0ee3

Comment by Gerrit Updater [ 06/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41219/
Subject: LU-14119 mdc: set fid2path RPC interruptible
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: bf475262610671534b1b1a33cebb49d8380b74f7

Comment by Gerrit Updater [ 06/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41274/
Subject: LU-14119 osd-zfs: enable LUDA_VERIFY
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f5136e81957e4b67ae6ed7764d378b817fac5ee2

Comment by Gerrit Updater [ 06/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41741/
Subject: LU-14119 osd: delete stale OI mapping entry
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 99d00b97ef5f209a002f250e7772055ff1a6d6d6

Comment by Gerrit Updater [ 06/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41402/
Subject: LU-14119 osd: add mount option "resetoi"
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f37bce8a573dfc5aac1b9f51f4d5c8314ba05d30

Comment by Gerrit Updater [ 10/Apr/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41261/
Subject: LU-14119 lfsck: check linkea if it's newly added
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: afd00cacd0b6ef87282887b4e965350a9c1a6821

Comment by Gerrit Updater [ 12/Apr/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43265
Subject: LU-14119 lfsck: replace dt_lookup() with dt_lookup_dir()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 84ee579a088feb57c9e3ee9c8e99089a3a05bc31

Comment by Gerrit Updater [ 12/Apr/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43266
Subject: LU-14119 mdc: set fid2path RPC interruptible
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 77c002141b787bf393fe8ac03b972e2126c4f2a1

Comment by Gerrit Updater [ 12/Apr/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43267
Subject: LU-14119 osd-zfs: enable LUDA_VERIFY
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 53622d22f887b4ea56735a27a02548515a6dc5ea

Comment by Gerrit Updater [ 12/Apr/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43268
Subject: LU-14119 osd: delete stale OI mapping entry
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 301e2c111fdbf699572c4714c462945ec0b1765b

Comment by Gerrit Updater [ 12/Apr/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43269
Subject: LU-14119 osd: add mount option "resetoi"
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 1c878c85172025d201c8656f9c19084469fcec91

Comment by Gerrit Updater [ 12/Apr/21 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43270
Subject: LU-14119 lfsck: check linkea if it's newly added
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b54e4b3e398448a5a247f01c91acea801bf890d6

Comment by Peter Jones [ 19/Apr/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 16/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43266/
Subject: LU-14119 mdc: set fid2path RPC interruptible
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 9055a0a1f3720aed844941213c676508bd30a69e

Comment by Gerrit Updater [ 16/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43270/
Subject: LU-14119 lfsck: check linkea if it's newly added
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3e59ed26d04da1fa2891b8f5263b0481d7c69f22

Comment by Gerrit Updater [ 16/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43269/
Subject: LU-14119 osd: add mount option "resetoi"
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 8f1a3a63e3157642025ab674fe2ba4a72cfa3151

Comment by Gerrit Updater [ 16/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43267/
Subject: LU-14119 osd-zfs: enable LUDA_VERIFY
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3f6c9542373dea0321883506fa9bc2c9502629e3

Comment by Gerrit Updater [ 16/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43268/
Subject: LU-14119 osd: delete stale OI mapping entry
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 53db3716685b67406580048c74e21c5c33db262b

Comment by Gerrit Updater [ 16/May/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43265/
Subject: LU-14119 lfsck: replace dt_lookup() with dt_lookup_dir()
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5d4356ebec16e8a08b566334d94e6422e4193e51
