[LU-12712] sanity-pfl tests triggering “not SEL magic on SEL file” Created: 28/Aug/19  Updated: 15/Jan/20  Resolved: 09/Oct/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.13.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10070 PFL self-extending file layout Resolved
is related to LU-13143 detect console spew during (interop) ... Open
Severity: 3

 Description   

Looking at the console log for MDSs when running sanity-pfl, we are seeing messages like

[13646.271068] Lustre: 21851:0:(lod_lov.c:1358:lod_parse_striping()) lustre-MDT0001-mdtlov: not SEL magic on SEL file [0x240014454:0x77a:0x0]: bd30bd0
[13646.274907] Lustre: 21851:0:(lod_lov.c:1358:lod_parse_striping()) lustre-MDT0001-mdtlov: not SEL magic on SEL file [0x240014454:0x77a:0x0]: bd30bd0

We are seeing this for sanity-pfl tests 19e, 20b, 20c and 20d and when cleaning up after sanity-pfl.

Examples of this message in the MDS (vm4 and vm5) console logs are at
https://testing.whamcloud.com/test_sets/fedacd8c-c944-11e9-90ad-52540065bddc
https://testing.whamcloud.com/test_sets/b04ecaf6-c953-11e9-97d5-52540065bddc



 Comments   
Comment by Patrick Farrell (Inactive) [ 28/Aug/19 ]

Having taken a look at this, these are all of the sanity-pfl tests which have SEL & replay in them.

The issue seems to be this:
https://review.whamcloud.com/#/c/35144/8/lustre/lod/lod_object.c@1551

I'm hoping Vitaly can comment on why that line is there, because I can't see why it's there at all.

Also, as noted there, I think once this warning is fixed, we should make it return an error or assert or something. It is an on-disk format discrepancy, and the warning clearly isn't enough, because I believe it's been triggering ever since it was added.

Comment by Peter Jones [ 21/Sep/19 ]

vitaly_fertman, Cory reported on the LWG call that you considered this to be a lower priority issue. As such, is it OK for this test to be added to the always-except list until it can be fixed properly in a future release?

Comment by Andreas Dilger [ 01/Oct/19 ]

Looking at this more closely, it appears that there is just a bug in how the check is done for the error message, since it is checking the magic of the component (which is always LOV_MAGIC_V3) instead of the magic for the file (which should be LOV_MAGIC_SEL).

That said, it isn't totally clear what can/should be done with this error message if it is legitimately hit in the field, as opposed to (IMHO) being spuriously printed because of a defect in the logic. Clearly it is a discrepancy in the on-disk format, but it doesn't appear to be harmful. However, it would be better if there were an LFSCK check for this and a repair, since no administrator would be able to fix this short of deleting the file, and it would otherwise just be console noise.
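
As a rough illustration of the logic defect described in this comment (not the actual lod_parse_striping() code), the sketch below contrasts a check against the per-component magic with a check against the file-level magic. All struct, field, and function names are hypothetical. The V3 value matches the "bd30bd0" printed in the console message; the SEL value is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative magic values. 0x0BD30BD0 matches the "bd30bd0" seen in the
 * console log; the SEL value is an assumption for this sketch. */
#define SKETCH_MAGIC_V3  0x0BD30BD0u
#define SKETCH_MAGIC_SEL 0x0BD80BD0u

/* Hypothetical, simplified layout: one file-level magic plus a magic and
 * an "extension" marker for each component. */
struct sketch_comp {
	uint32_t magic;        /* per-component magic, V3 in this sketch */
	bool     is_extension; /* component marks the file as a SEL file */
};

struct sketch_layout {
	uint32_t           file_magic; /* magic of the whole composite layout */
	int                comp_count;
	struct sketch_comp comps[4];
};

/* Defective check: compares the component magic, which is never the SEL
 * magic, so it warns for every extension component. */
static bool buggy_warns(const struct sketch_layout *l, int idx)
{
	assert(idx >= 0 && idx < l->comp_count);
	return l->comps[idx].is_extension &&
	       l->comps[idx].magic != SKETCH_MAGIC_SEL;
}

/* Intended check: warns only when the file-level magic is not the SEL
 * magic even though the file has an extension component. */
static bool fixed_warns(const struct sketch_layout *l, int idx)
{
	assert(idx >= 0 && idx < l->comp_count);
	return l->comps[idx].is_extension &&
	       l->file_magic != SKETCH_MAGIC_SEL;
}
```

With a well-formed SEL layout, the first function reports a mismatch for the extension component while the second stays quiet, which is consistent with the message being printed spuriously in the sanity-pfl runs.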

Comment by Gerrit Updater [ 01/Oct/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36351
Subject: LU-12712 lod: fix warning message for non-SEL file
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ecfc198003dba32d24805f2daa3fe8aa4d4706cb

Comment by Gerrit Updater [ 09/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36351/
Subject: LU-12712 lod: fix warning message for non-SEL file
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 544fd725dc5773aa8fdd931e04eb51d658ee1686

Comment by Peter Jones [ 09/Oct/19 ]

Landed for 2.13
