[LU-14098] LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Created: 31/Oct/20  Updated: 27/Oct/22  Resolved: 26/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.12.8, Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Campbell Mcleay (Inactive) Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: LTS12
Environment:

CentOS 7.6.1810


Attachments: File 2021-01-29-syslog.log.gz     File bmds1-msgs.txt.gz    
Issue Links:
Gantt End to Start
Related
is related to LU-14056 MDT filesystem full, fsck crashing Resolved
is related to LU-14165 llog_reader: Add debugging informatio... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Getting the following errors in messages log on MDS:

Oct 30 20:30:37 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 74 previous similar messages
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 233 previous similar messages


 Comments   
Comment by Andreas Dilger [ 02/Nov/20 ]

Alex, can you please comment - does the "osp_sync_declare_add()) logging isn't available" message mean that OST objects will not be destroyed? Is this affecting only a single OST, or all OSTs? Improving the error message to include the source and target device names would be very helpful.

What would be needed to repair the llog file here? I think the LFSCK message is regarding cleaning up orphans?

cmcl, could you check if creating and deleting OST objects on each OST is consuming and releasing space as expected? e.g. create a file on each OST with "lfs setstripe -i $ost_idx /path/to/file" and write 1GB there, then delete it and verify with "lfs df" before and after to confirm space is being released.
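The check Andreas describes could be scripted roughly as follows. This is only a sketch to be run on a Lustre client; the function name, OST index, and test file path are placeholders, and the function is defined here without being invoked:

```shell
# Sketch of the space-release check; run on a Lustre client.
# Arguments: OST index and a test file path on the Lustre mount.
check_ost_space_release() {
    ost_idx=$1
    testfile=$2
    lfs df "$(dirname "$testfile")"                    # space before
    lfs setstripe -i "$ost_idx" "$testfile"            # pin the file to one OST
    dd if=/dev/zero of="$testfile" bs=1M count=1024    # write 1GB
    sync
    lfs df "$(dirname "$testfile")"                    # space should be consumed
    rm "$testfile"
    sleep 5                                            # allow the async destroy to land
    lfs df "$(dirname "$testfile")"                    # space should be released
}
# Example (placeholders): check_ost_space_release 22 /mnt/lustre/spacetest
```

If the final "lfs df" does not show the space coming back on that OST, object destroys are likely not reaching it.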

Comment by Campbell Mcleay (Inactive) [ 02/Nov/20 ]

Hi Andreas,

Just want to make sure I'm doing the right thing: I can get the $ost_idx from e.g., 'lctl dl' (number in the left hand column) or 'lfs df' (number in the right hand column), and I should run this command on the client, correct?

Thanks,

Campbell

Comment by Andreas Dilger [ 03/Nov/20 ]

Campbell, the OST index is the number on the right for "lfs df" (i.e., bravo-OST0016 is OST index 0x16, or 22). It is possible to use either the hex or decimal value with "lfs setstripe -i", which should be run on the client.

Comment by Alex Zhuravlev [ 03/Nov/20 ]

adilger yes, that message from osp_sync_declare_add() means that no llog record will be made and an orphan will be left for that object. All OSTs are independent in this regard.
I'll make a patch improving the message.
And yes, LFSCK is intended to clean up the resulting orphans.
AFAIU (to be verified), the MDT should be restarted to use the llog again.

Comment by Andreas Dilger [ 04/Nov/20 ]

Alex, in the case that the llog is not working, is it possible for the MDS to continue to destroy the OST objects without the llog for recovery? Clearly this would not handle the case when the MDS is rebooted, but it can't be worse than not doing any object destroy operations?

Comment by Alex Zhuravlev [ 04/Nov/20 ]

The original idea was to destroy the OST object once the corresponding unlink is committed. Then, to save on commit callbacks, OSP tracks the last committed log id and, reading the records back from the llog (they are usually still in cache), issues the OST_DESTROY RPC.
I think it's possible to implement a commit-callback-per-object scheme, either as the primary mechanism or as a fallback.

Comment by Campbell Mcleay (Inactive) [ 04/Nov/20 ]

Currently, there are quite a lot of writes and deletions happening as this cluster catches up on backups. I'm waiting for a quiet time, when the results will be clear, to determine whether space on the OSTs is being consumed and released by writes and deletes as expected.

Comment by Andreas Dilger [ 04/Nov/20 ]

Alex, can you please explain why the llog context is not being initialized properly? I'd think there is something wrong/missing with the llog itself, but I'm not sure how to debug it.

Comment by Alex Zhuravlev [ 05/Nov/20 ]

IIRC, the most frequent reason is -EINPROGRESS: something went wrong with the OI and OSP couldn't open the llog at mount time.
It would be helpful to see the logs since boot time.

Comment by Campbell Mcleay (Inactive) [ 05/Nov/20 ]

Hi Alex,

Attached are the logs from the mds bmds1-msgs.txt.gz .

Kind regards,

Campbell

Comment by Alex Zhuravlev [ 05/Nov/20 ]

Thanks - the llog got corrupted:

Oct 30 20:29:06 bmds1 kernel: LustreError: 19089:0:(llog_osd.c:987:llog_osd_next_block()) bravo-MDT0000-osd: invalid llog tail at log id [0x1584b:0x1:0x0]:0 offset 4575232 last_rec idx 1 tail idx 0lrt len 20 read_size 4096

The easiest solution would be to stop that MDT, remove the llog manually, and remount.
Would detailed instructions help you?

Comment by Campbell Mcleay (Inactive) [ 05/Nov/20 ]

Hi Alex, detailed instructions would be a great help, thank you.

Comment by Alex Zhuravlev [ 08/Nov/20 ]

Your MDT should be unmounted; then please follow this example:

# debugfs -c <mdt device> -R "stat /O/1/d11/88139"
...
Inode: 159   Type: regular    Mode:  0644   Flags: 0x4000
Generation: 2694950587    Version: 0x00000000:00000000
User:     0   Group:     0   Project:     0   Size: 11520
File ACL: 0
....
Extended attributes:
  lma: fid=[0x1:0x1584b:0x0] compat=0 incompat=0
...

At this point, please check that the fid contains 0x1584b in the line starting with "lma: fid=...".
If it does, then you can remove that file via a directly mounted ldiskfs:

mount -t ldiskfs <mdt device> <mountpoint>
rm <mountpoint>/O/1/d11/88139
umount <mountpoint>

Now you can remount the MDT. I'd suggest running LFSCK to clean up the orphans.
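For reference, the manual steps above could be wrapped in a small helper along these lines. This is a sketch with a hypothetical function name and placeholder arguments; the MDT must be unmounted first, and the lma fid from the debugfs output must be checked by hand before deleting anything. The function is defined here, not executed:

```shell
# Sketch of the manual llog removal above. Arguments: MDT block device,
# object path relative to the ldiskfs root (e.g. O/1/d11/88139), and a
# scratch mountpoint. Verify the lma fid in the debugfs output matches
# the corrupted log id BEFORE deleting.
remove_corrupt_llog() {
    dev=$1
    obj=$2
    mntpt=$3
    # 1. Inspect the object and its extended attributes (check "lma: fid=...")
    debugfs -c "$dev" -R "stat /$obj"
    # 2. Remove it from the directly mounted ldiskfs
    mount -t ldiskfs "$dev" "$mntpt"
    rm "$mntpt/$obj"
    umount "$mntpt"
}
# Example (placeholders): remove_corrupt_llog /dev/mdt0 O/1/d11/88139 /mnt/mdt-ldiskfs
```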

Comment by Campbell Mcleay (Inactive) [ 16/Nov/20 ]

Thanks Alex, will let you know when I have done this. Still waiting for a window.

Comment by Gerrit Updater [ 25/Nov/20 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40754
Subject: LU-14098 obdclass: ignore and remove corrupted plain llogs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b7bbd8993bfb3bcf49acc8b677a8986f8d785d65

Comment by Campbell Mcleay (Inactive) [ 25/Jan/21 ]

Hi all. We're seeing another occurrence of this on another Lustre cluster. We have 6 OSTs that are now showing 100% full. I've used the recommendation on this page: https://wiki.lustre.org/Handling_Full_OSTs to set max_create_count to 0 on those. I've also e2fsck'd the MDT. Here's the main snippet from the syslog:

[Jan25 10:18] LustreError: 40160:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
[  +0.001271] LustreError: 40160:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 16497 previous similar messages
[Jan25 10:21] LustreError: 4753:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
[  +0.002111] LustreError: 4753:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 76 previous similar messages

Given the processes followed above, I've grepped through our syslog for the past while, looking for the invalid llog tail info and see the following:

2020-11-26-syslog.log:Nov 26 09:25:31 emds1 kernel: LustreError: 4811:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5919:0x1:0x0]:0 offset 1708032 bytes 4096
2020-11-26-syslog.log:Nov 26 09:25:31 emds1 kernel: LustreError: 4907:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x590f:0x1:0x0]:0 offset 4747264 last_rec idx 1811939328 tail idx 0lrt len 8925011 read_size 4096
2020-11-26-syslog.log:Nov 26 09:25:31 emds1 kernel: LustreError: 4973:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x592c:0x1:0x0]:0 offset 2904064 bytes 4096
2020-12-19-syslog.log:Dec 19 00:25:21 emds1 kernel: LustreError: 4752:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5919:0x1:0x0]:0 offset 1708032 last_rec idx 2 tail idx 0lrt len 8590435 read_size 4096
2020-12-19-syslog.log:Dec 19 00:25:21 emds1 kernel: LustreError: 4823:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5954:0x1:0x0]:0 offset 626688 bytes 4096
2020-12-19-syslog.log:Dec 19 00:25:22 emds1 kernel: LustreError: 4987:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5934:0x1:0x0]:0 offset 2920448 last_rec idx 0 tail idx 33065lrt len 88 read_size 4096
2021-01-25-syslog.log:Jan 25 13:41:45 emds1 kernel: LustreError: 4492:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5919:0x1:0x0]:0 offset 1708032 bytes 4096
2021-01-25-syslog.log:Jan 25 13:41:45 emds1 kernel: LustreError: 4589:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x590f:0x1:0x0]:0 offset 4747264 last_rec idx 9787 tail idx 0lrt len 8925011 read_size 4096

Given those entries, would I be right in thinking I should do the following:

Convert those 0x59XX values to decimal so:

0x590f is 22799
0x5919 is 22809
0x5934 is 22836
0x5954 is 22868
0x592c is 22828

I'd then run: debugfs -c <mdt device> -R "stat /O/1/d11/<insert decimal number here>"

and check the lma fid matches, then delete those files and remount?

Please let me know if that's correct or if you require any more info to get a full idea of what's going on. As in the past, we're trying to get a set of good backups through so I'll be trying to wait for those to finish happily before taking the MDT down again.
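The hex-to-decimal conversions listed above can be reproduced with printf, which accepts 0x-prefixed input:

```shell
# Map the llog object ids (hex) to the decimal file names used under /O/1/d*/
for id in 0x590f 0x5919 0x592c 0x5934 0x5954; do
    printf '%s -> %d\n' "$id" "$id"
done
```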

Comment by Campbell Mcleay (Inactive) [ 25/Jan/21 ]

(I should note that following the fsck of the MDT, I have an lfsck running as suggested by the syslog.)

Comment by Andreas Dilger [ 26/Jan/21 ]

Campbell, you could try Alex's patch: https://review.whamcloud.com/40754 "LU-14098 obdclass: ignore and remove corrupted plain llogs" on the MDS. This should clean up the broken llogs automatically when they are encountered rather than doing it by hand.

However, the process that you describe in your previous comment also looks correct, and may be more expedient for the short term.

Comment by Campbell Mcleay (Inactive) [ 27/Jan/21 ]

Thanks Andreas. For now I'm probably gonna go with the simple deletions once the current run of backups are done but we definitely look forward to seeing that patch in an upcoming release.

Comment by Campbell Mcleay (Inactive) [ 29/Jan/21 ]

I've removed those log files - they weren't in d11 but in neighbouring directories - I checked the lma fids on them and they all matched, so that seemed correct.

fsck has been run on the MDT (which made some changes) and it is now remounted. Running the lfsck --all -M echo-MDT00010 now.

Should we expect to see the space on the OSTs gradually free as the lfsck works its way through things over the next few days? Does it happen in a particular scanning-phase?

Comment by Andreas Dilger [ 30/Jan/21 ]

If you don't add an lfsck option to clean up the orphans then it will just report that they exist. If you add "-o" it will link orphan objects under $MOUNT/.lustre/lost+found/MDTxxxx for review and cleanup (i.e. "rm"). See the lctl-lfsck-start(8) man page for details.

Comment by Campbell Mcleay (Inactive) [ 01/Feb/21 ]

Would these errors in the MDS syslog be down to the lfsck sorting things out, or is this a separate issue:

[  +4.946556] LDISKFS-fs warning: 844 callbacks suppressed
[  +0.000008] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[  +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[  +0.003452] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[  +0.000005] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[  +0.002557] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[  +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[  +0.017404] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[  +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[  +0.006475] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[  +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[  +0.550317] Lustre: 46464:0:(osd_handler.c:501:osd_ldiskfs_add_entry()) echo-MDT0000: directory (inode: 769803009, FID: [0x2000207a3:0x1:0x0]) has reached maximum entry limit
[  +0.000008] Lustre: 46464:0:(osd_handler.c:501:osd_ldiskfs_add_entry()) Skipped 7907 previous similar messages
Comment by Campbell Mcleay (Inactive) [ 01/Feb/21 ]

Ah - my guess is that this is putting all the orphans in .lustre/lost+found/MDT0000, and after repairing 12M layouts (so far) it can't put any more files in the unstructured lost+found dir.

Not sure what effect that's gonna have on the ongoing lfsck. An ls or a find in the lost+found dir hangs so I think my guess is correct.

I'm gonna run a "find . -exec /bin/rm -f {} \;" to try and clear those files
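A find-driven deletion avoids ever listing the huge directory. A small demonstration on a scratch directory (the mktemp path stands in for the real lost+found; "-exec rm -f {} +" batches many files per rm invocation, which is much faster than one rm per file):

```shell
# Demonstration of batch deletion with find on a scratch directory,
# standing in for the oversized lost+found dir.
scratch=$(mktemp -d)
touch "$scratch/obj1" "$scratch/obj2" "$scratch/obj3"
# Delete every regular file; {} + passes filenames in batches to rm
find "$scratch" -type f -exec rm -f {} +
# The directory is now empty of regular files
find "$scratch" -type f | wc -l
rmdir "$scratch"
```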

Comment by Andreas Dilger [ 02/Feb/21 ]

If you are fairly confident that these objects under .lustre/lost+found/MDT0000/ are orphans, then deleting them is the right way forward.

Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ]

With the lfsck finished and the find rm finished, 5 of the 6 100% full OSTs are now near empty so I've restored their max_create_count values. One OST is still however 100% full and we're still seeing the following in dmesg:

[Feb 2 06:25] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
[  +0.001714] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 34991 previous similar messages

I've double checked all the syslogs we have going back a couple of months for occurrences of 'llog tail' and can only find the ones I listed above. Do you have any recommendations for how to go about finding the bad log file?

Comment by Andreas Dilger [ 02/Feb/21 ]

Campbell, it would be most useful to check for "LustreError:" messages reported after MDT0000 was mounted when you deleted the llog objects. If you could attach the MDS /var/log/messages file from that time it would be helpful, along with knowing which specific OST is still having the problem. It might have printed something like "echo-OST0003-osc-MDT0000: ctxt 2 lop_setup=0xffff001234567890 failed: rc = -22".

Unfortunately, the osp_sync_declare_add() message doesn't report which OST is having the problem. Even so, this function is only trying to get a reference on an existing data structure in memory, so even if it included the OST name it wouldn't be able to print anything about the specific llog file that was having the problem. That would have been loaded at startup time. If the logs from the previous mount don't show anything, it might be possible to re-trigger this by only unmounting and remounting the OST, and capturing the Lustre debug logs on the MDS:

oss# umount /mnt/<ost_mount> (or whatever HA command is used to stop the OST)
mds# lctl set_param debug=+info+config debug_mb=1024
mds# lctl clear
oss# mount /mnt/<ost_mount> (or as appropriate)
[wait a few seconds for OST to connect to the MDS]
mds# lctl dk /tmp/debug

That would be more useful than having to remount the MDT and having the logs from all of the OSTs reconnecting mixed together.

Alex, it would be useful to make a small patch (separate from patch https://review.whamcloud.com/40754 "LU-14098 obdclass: try to skip corrupted llog records") that improves the error message in osp_sync_declare_add() - add in the MDT name (at the start) and OST device name (in the message). That would at least allow debugging which OST (and which MDT in DNE mode) the problem is related to. It looks like the 40754 patch does not touch this function, so there wouldn't be any conflicts.

Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ]

The OST in question is: echo-OST0028_UUID

I'll paste below the LustreError lines from that day's syslog and I'll attach the whole thing anyway. You'll see an unmount at about 9:50am when I deleted the logs, then a reboot followed by an e4fsck, one more reboot then a 'normal' mount.

I'll try the sequence you suggest shortly.

Jan 29 00:17:42 emds1 kernel: LustreError: 21600:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110
Jan 29 02:51:00 emds1 kernel: LustreError: 4436:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.23.22.113@tcp  ns: mdt-echo-MDT0000_UUID lock: ffff9b73b28f33c0/0x3dc1d8a5f466cd2c lrc: 3/0,0 mode: PR/PR res: [0x20002bd3e:0x1728b:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.23.22.113@tcp remote: 0x2f5cecde3bdac8df expref: 440542 pid: 12166 timeout: 306605 lvb_type: 0
Jan 29 02:51:02 emds1 kernel: LustreError: 12229:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.23.22.113@tcp arrived at 1611917462 with bad export cookie 4450076092389702753
Jan 29 08:32:36 emds1 kernel: LustreError: 21599:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110
Jan 29 09:46:31 emds1 kernel: LustreError: 137-5: echo-MDT0000_UUID: not available for connect from 10.23.22.103@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 29 09:46:31 emds1 kernel: LustreError: Skipped 15 previous similar messages
Jan 29 12:00:37 emds1 kernel: LustreError: 4491:0:(llog_cat.c:395:llog_cat_id2handle()) echo-OST0003-osc-MDT0000: error opening log id [0x5919:0x1:0x0]:0: rc = -2
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
Jan 29 12:00:37 emds1 kernel: LustreError: 3562:0:(mdt_handler.c:6696:mdt_iocontrol()) echo-MDT0000: Aborting recovery for device
Jan 29 12:00:37 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:00:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:00:58 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:08 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:18 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 1 previous similar message
Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 3 previous similar messages
Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 8 previous similar messages
Jan 29 14:17:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 32 previous similar messages
Jan 29 15:40:44 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4 previous similar messages
Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages
Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 13 previous similar messages
Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 22 previous similar messages
Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 33 previous similar messages
Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 42 previous similar messages
Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages
Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1051 previous similar messages
Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 289 previous similar messages
Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1392 previous similar messages
Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2621 previous similar messages
Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 776 previous similar messages
Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4142 previous similar messages
Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 3114 previous similar messages
Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x2852010002000000: rc = -2
Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 46 previous similar messages
Jan 29 17:22:47 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xfa31010002000000: rc = -2
Jan 29 17:23:00 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:23:13 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xb331010002000000: rc = -2
Jan 29 17:23:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:24:25 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 1 previous similar message
Jan 29 17:29:52 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x7441010002000000: rc = -2
Jan 29 17:32:15 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages
Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages
Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ]

Pretty sure I should know the answer to this but I'm having a moment here: how do I tell which OSS and mount point a particular OST refers to? I need to know which mountpoint to unmount for echo-OST0028_UUID - thanks

Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ]

Ah looks like the log did have the invalid llog tail and I somehow missed it:

# grep -B1 OST0028 2021-01-29-syslog.log
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22

That one wasn't on my previous list so guessing removing that should do the trick. I'll wait for the next available downtime and remove that one.

Comment by Gerrit Updater [ 26/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40754/
Subject: LU-14098 obdclass: try to skip corrupted llog records
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 910eb97c1b43a44a9da2ae14c3b83e28ca6342fc

Comment by Peter Jones [ 26/Feb/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 26/Jul/21 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44396
Subject: LU-14098 obdclass: try to skip corrupted llog records
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a761c796b9aecbe90b9692c073e3548ca407efd8

Comment by Gerrit Updater [ 13/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44396/
Subject: LU-14098 obdclass: try to skip corrupted llog records
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3bedd69dbef38f5a3166131831b8c5279df4b0dd

Generated at Sat Feb 10 03:06:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.