  Lustre / LU-17538

lov_objseq file contains 0x0BD0 constant in low bytes

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Versions: Lustre 2.14.0, Lustre 2.16.0

    Description

      While hitting the LU-16692 sequence rollover LASSERT(), which caused an MDT to reboot in a loop, one of the MDT lov_objseq files looked like it had the low bytes of all FID SEQ values replaced by LOV_MAGIC_MAGIC or a similar value (there are unfortunately a few different constants that contain "0BD0", like LUSTRE_MSG_MAGIC_V1):

      # od  -Ax -tx8 lov_objseq
      0000000                                 40c0000bd0                               4900000bd0
      0000010                                 1240000bd0                               1ac0000bd0
      0000020                                 23c0000bd0                               3100000bd0
      0000030                                 48c0000bd0                               c40000bd0
      0000040                                 3d40000bd0                               1f00000bd0
      0000050                                 1280000bd0                               1ec0000bd0
      0000060                                 4c40000bd0                               2880000bd0
      0000070                                 10c0000bd0                               12c0000bd0
      0000080                                 2680000bd0                               1100000bd0
      0000090                                 4c80000bd0                               2d00000bd0
      00000a0                                 2540000bd0                               a40000bd0
      00000b0                                 600000bd0                               2ec0000bd0
      00000c0                                 1400000bd0                               3940000bd0
      00000d0                                 4140000bd0                               1e80000bd0
      00000e0                                 940000bd0                               780000bd0
      00000f0                                 1080000bd0                               2c00000bd0
      :
      

      In contrast, the lov_objseq on another MDT looked as expected: close to the original "0x400" starting point, with some slight variation between OSTs due to usage and the assignment of different SEQ values to MDTs in slightly different orders:

      0000000                                 40c0000405                               490000040b
      0000010                                 1240000404                               1ac000040d
      0000020                                 23c000040b                               310000040b
      0000030                                 48c000040b                               c4000040b
      0000040                                 3d4000040c                               1f00000404
      0000050                                 1280000404                               1ec0000404
      0000060                                 4c40000406                               2880000409
      0000070                                 10c0000407                               12c0000404
      0000080                                 2680000409                               1100000407
      0000090                                 4c80000401                               2d00000402
      00000a0                                 254000040b                               a40000401
      00000b0                                 600000408                               2ec0000402
      00000c0                                 1400000403                               3940000401
      00000d0                                 4140000405                               1e80000404
      00000e0                                 940000402                               780000403
      00000f0                                 1080000407                               2c00000404
      :
      

      The lov_objid fields for the OSTs looked reasonable for a system running with LU-11912, which causes OST FID SEQ rollover to happen more quickly. The OID numbers in each case were fairly close to the others within the same lov_objid file, though not very close to those in the other file.
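
      To make the pattern explicit, here is a small standalone C sketch (not Lustre code; the sample values are simply copied from the two dumps above) that masks off the low 16 bits of a few SEQ entries:

      /*
       * Minimal standalone sketch (not Lustre code): the sample values are
       * copied from the two od dumps above.  Masking off the low 16 bits
       * shows the suspect file always ends in 0x0bd0 while the healthy one
       * stays near the 0x400 starting point.
       */
      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
          const uint64_t suspect[] = { 0x40c0000bd0ULL, 0x4900000bd0ULL, 0x1240000bd0ULL };
          const uint64_t healthy[] = { 0x40c0000405ULL, 0x490000040bULL, 0x1240000404ULL };

          for (int i = 0; i < 3; i++)
              printf("suspect SEQ %#12llx  low16 = %#06llx  offset from 0x400 = %llu\n",
                     (unsigned long long)suspect[i],
                     (unsigned long long)(suspect[i] & 0xffff),
                     (unsigned long long)((suspect[i] & 0xffff) - 0x400));

          for (int i = 0; i < 3; i++)
              printf("healthy SEQ %#12llx  low16 = %#06llx  offset from 0x400 = %llu\n",
                     (unsigned long long)healthy[i],
                     (unsigned long long)(healthy[i] & 0xffff),
                     (unsigned long long)((healthy[i] & 0xffff) - 0x400));

          return 0;
      }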

      dongyang, do you have any thoughts on how the lov_objseq values could be affected in this way?


          Activity

            [LU-17538] lov_objseq file contains 0x0BD0 constant in low bytes

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54191/
            Subject: LU-17538 fid: do not use allocation set for ofd
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 66e51e654aaab6d9f1641cec6a5fa71766dbf197

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54191/ Subject: LU-17538 fid: do not use allocation set for ofd Project: fs/lustre-release Branch: master Current Patch Set: Commit: 66e51e654aaab6d9f1641cec6a5fa71766dbf197

            "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54191
            Subject: LU-17538 fid: do not use allocation set for ofd
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 225058dfdfc541d42ebf1d5f0405015637783465

            gerrit Gerrit Updater added a comment - "Li Dongyang <dongyangli@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54191 Subject: LU-17538 fid: do not use allocation set for ofd Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 225058dfdfc541d42ebf1d5f0405015637783465

            adilger Andreas Dilger added a comment -

            If I recall correctly, the client SEQ WIDTH was chosen to avoid overflow from mapping 128-bit FIDs into 64-bit inode numbers. I believe it could actually be increased to 256k objects per client without causing interference between fields when they are flattened.
            dongyang Dongyang Li added a comment -

            I found out it is indeed for SEQ allocation to clients.
            e.g. when the cluster first starts we begin with 0x200000401, and for every LUSTRE_METADATA_SEQ_MAX_WIDTH (128k) file/dir creates we get a new seq, increased by 1,
            for example 0x200000402 to 0x200000403.
            If we restart the cluster, then after another 128k creates the new sequence will be 0x200000bd1,
            because the previous lowater/hiwater allocation advanced the on-disk record by 2000.
            I guess it's done this way so we won't need to do a sync commit for every 128k creates from a single client, vs. now about every 13m creates.
            It does look like we should use a different allocation set size. At the same time I wonder why the MAX_WIDTH for METADATA is only 128k; compared to the 32M width for DATA it feels very short?
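
            As a rough illustration of that last point, here is a small standalone sketch (not Lustre code; the 128k metadata width and 32M data width are the numbers quoted above, and the total create count is only an assumption) comparing how many SEQs the two widths burn through for the same number of creates:

            /*
             * Standalone sketch (not Lustre code).  The widths come from the
             * comment above: 128k creates per metadata SEQ vs 32M objects per
             * data SEQ.  The create count is an arbitrary assumption, just to
             * show how much faster metadata SEQs are consumed.
             */
            #include <stdio.h>

            int main(void)
            {
                const unsigned long long creates = 1ULL << 30;      /* ~1 billion creates (assumed) */
                const unsigned long long meta_width = 0x20000ULL;   /* LUSTRE_METADATA_SEQ_MAX_WIDTH: 128k */
                const unsigned long long data_width = 32ULL << 20;  /* DATA width quoted above: 32M */

                printf("metadata SEQs consumed: %llu\n", creates / meta_width);  /* 8192 */
                printf("data SEQs consumed:     %llu\n", creates / data_width);  /* 32   */
                return 0;
            }
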
            dongyang Dongyang Li added a comment -

            I've tried createmany to allocate lots of FIDs on a client, but that doesn't go through the lowater/hiwater allocation set code path. For now it looks like only the meta sequence is using this. I will find out more and prepare a patch to remove the lowater/hiwater for meta sequence allocation.

            adilger Andreas Dilger added a comment -

            I think the lowater/hiwater are for SEQ allocation to clients? This may happen much more frequently than MDT SEQ allocation (e.g. on every client mount and after every 128k file creates per client, so possibly thousands of times when a cluster first starts), so possibly different values should be used for the two types of allocations.
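
            A back-of-the-envelope sketch of that point (the client count and per-client workload below are purely hypothetical) showing how quickly client SEQ allocations can add up compared to MDT SEQ allocation:

            /*
             * Standalone sketch, not Lustre code.  Assumes one SEQ per client
             * mount plus one per 128k creates, as described above; the client
             * count and workload are hypothetical.
             */
            #include <stdio.h>

            int main(void)
            {
                const unsigned long long clients = 4000;                /* hypothetical */
                const unsigned long long creates_per_client = 1000000;  /* hypothetical */
                const unsigned long long meta_seq_width = 0x20000;      /* 128k creates per SEQ */

                unsigned long long seqs = clients /* one SEQ per mount */
                        + clients * (creates_per_client / meta_seq_width);

                printf("client SEQ allocations: %llu\n", seqs);  /* 32000 */
                return 0;
            }
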
            dongyang Dongyang Li added a comment -

            I think I know what's going on, it's because of the seq range allocation sets.
            When we roll over the seq from the osp, it sends an RPC to the seq server on the ofd to allocate a meta-sequence.
            The seq server on the ofd checks and allocates a super sequence, and inits the allocation sets, the lowater_set and
            hiwater_set, both to LUSTRE_SEQ_BATCH_WIDTH, which is 1000, and writes the updated range to disk.
            That's why we saw 0xbd0: e.g. we get a super sequence [0x240000400 - 0x280000400];
            after the lowater/hiwater_set the on-disk seq range now says [0x240000bd0 - 0x280000400],
            and 0xbd0 - 0x400 = 2000.

            Notice that if we don't restart the ofd/OST, the seq rollover still increases by 1, since it's now allocating from the lowater/hiwater sets. But once we restart the ofd/OST, a new seq rollover will see allocation begin from 0x240000bd0, and the on-disk seq range advances to 0x2400013a0; however we won't see the gap until another ofd/OST restart.

            Now I wonder why we need the lowater/hiwater allocation sets: to reduce the number of sync commits when the osp requests a new meta seq? But that doesn't happen frequently. Maybe we could just remove it?

            And I also found that the ofd seq server doesn't track which seqs were allocated to which osp, which means if we lose the commit to disk updating the available seq range, we do have the risk of reassigning a seq already given to one MDT to another. That also needs to be addressed.
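
            A minimal sketch of that range arithmetic (not Lustre code; the super sequence bounds and the LUSTRE_SEQ_BATCH_WIDTH value of 1000 are the numbers from the comment above):

            /*
             * Standalone sketch of the on-disk range arithmetic described
             * above.  Initialising both the lowater_set and hiwater_set takes
             * 2 * LUSTRE_SEQ_BATCH_WIDTH = 2000 SEQs off the front of the
             * super sequence before the range is written back to disk, and a
             * later restart repeats the same advance.
             */
            #include <stdio.h>
            #include <stdint.h>

            #define SEQ_BATCH_WIDTH 1000ULL

            int main(void)
            {
                uint64_t super_start = 0x240000400ULL;   /* super sequence start */
                uint64_t super_end   = 0x280000400ULL;   /* super sequence end   */

                uint64_t on_disk_start = super_start + 2 * SEQ_BATCH_WIDTH;

                printf("super sequence:        [%#llx - %#llx]\n",
                       (unsigned long long)super_start, (unsigned long long)super_end);
                printf("on-disk start now:     %#llx (gap of %llu SEQs)\n",
                       (unsigned long long)on_disk_start,
                       (unsigned long long)(on_disk_start - super_start));
                printf("after another restart: %#llx\n",
                       (unsigned long long)(on_disk_start + 2 * SEQ_BATCH_WIDTH));

                return 0;
            }
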
            dongyang Dongyang Li added a comment -

            That's my impression as well: after the super-sequence allocation, which has the 0x40000000 width, every time an MDT/osp wants to get a new seq with the ofd/OST it should just increase by 1. Sometimes other MDTs have requested a new SEQ from the same ofd/OST before us, but we should not see a gap of 2000?
            Checking where the 2000 gap comes from.

            adilger Andreas Dilger added a comment -

            I see in our internal test cluster that it appears to be doing something similar with the OST sequence number allocation:

            ai400-002: Feb 13 03:34:57 ai400-002 kernel: Lustre: ai400x-OST000d-osc-MDT0002: update sequence from 0x640000bd0 to 0x6400013a0
            ai400-002: Feb 13 05:39:35 ai400-002 kernel: Lustre: ai400x-OST001e-osc-MDT0002: update sequence from 0x9c0000406 to 0x9c0000bd0
            

            It looks like 0x0bd0 - 0x0400 = 2000 and 0x13a0 - 0x0bd0 = 2000 as well, so this doesn't appear to be random corruption, but it isn't clear whether it is intentional that the FID SEQ is being increased by 2000 each time (worth 64B objects)? I don't think that the test system has actually created that many objects, and I would expect the FID SEQ to be increased by 1 (with a message printed) only after an MDT-OST pair has created 32M objects.


            adilger Andreas Dilger added a comment -

            I found a few places where the OST FID SEQ was being updated in the logs since the installation of a version with LU-11912:

            Feb  2 23:52:40 m001 kernel: Lustre: fs00-OST000d-osc-MDT0000: update sequence from 0x1000d0000 to 0x2880000bd0
            :
            Feb  9 22:26:03 m003 kernel: Lustre: fs00-OST000d-osc-MDT0006: update sequence from 0x1000d0000 to 0x28800013a0
            :
            Feb 10 05:51:01 m000 kernel: Lustre: fs00-OST000d-osc-MDT0000: update sequence from 0x1000d0000 to 0x2880001b71
            :
            

            I would have expected that the OSTs would be giving out sequence numbers to the MDTs in smaller chunks.

            dongyang, could you please take a look at this and see if this is "normal" for sequence number allocation? It seems like a fairly large range to skip, given that each MDT should allocate about 32M objects per sequence number, so it is still possible that something is going wrong but we didn't notice it because the test filesystems are not very long lived...


            People

              Assignee: dongyang Dongyang Li
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 9
