[LU-14098] LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Created: 31/Oct/20 Updated: 27/Oct/22 Resolved: 26/Feb/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.12.8, Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Campbell Mcleay (Inactive) | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12 |
| Environment: | CentOS 7.6.1810 |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Getting the following errors in the messages log on the MDS:
Oct 30 20:30:37 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 74 previous similar messages
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 233 previous similar messages |
| Comments |
| Comment by Andreas Dilger [ 02/Nov/20 ] |
|
Alex, can you please comment: does the "osp_sync_declare_add()) logging isn't available" message mean that OST objects will not be destroyed? Is this affecting only a single OST, or all OSTs? Improving the error message to include the source and target device names would be very helpful. What would be needed to repair the llog file here? I think the LFSCK message is about cleaning up orphans?

cmcl, could you check whether creating and deleting OST objects on each OST consumes and releases space as expected? e.g. create a file on each OST with "lfs setstripe -i $ost_idx /path/to/file" and write 1GB there, then delete it, and verify with "lfs df" before and after to confirm space is being released. |
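A minimal sketch of that check (file name and client mountpoint are hypothetical; OST index 22 is only an example, and should be repeated for each OST):

client# lfs df /mnt/lustre                                           # record free space per OST
client# lfs setstripe -i 22 /mnt/lustre/ost22-test                   # pin a new file to OST index 22
client# dd if=/dev/zero of=/mnt/lustre/ost22-test bs=1M count=1024   # write 1GB to that OST
client# lfs df /mnt/lustre                                           # that OST should show ~1GB less free
client# rm /mnt/lustre/ost22-test
client# lfs df /mnt/lustre                                           # the space should be released again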
| Comment by Campbell Mcleay (Inactive) [ 02/Nov/20 ] |
|
Hi Andreas, just want to make sure I'm doing the right thing: I can get $ost_idx from, e.g., 'lctl dl' (the number in the left-hand column) or 'lfs df' (the number in the right-hand column), and I should run this command on the client, correct? Thanks, Campbell |
| Comment by Andreas Dilger [ 03/Nov/20 ] |
|
Campbell, the OST index is the number on the right for "lfs df" (i.e. bravo-OST0016 is OST index 0x16, or 22). Either the hex or decimal value can be used with "lfs setstripe -i", which should be run on the client. |
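For instance, both of the following place a file on the same OST (paths hypothetical):

client# lfs setstripe -i 0x16 /mnt/lustre/file-hex    # hex OST index
client# lfs setstripe -i 22 /mnt/lustre/file-dec      # equivalent decimal index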
| Comment by Alex Zhuravlev [ 03/Nov/20 ] |
|
adilger, yes, that message from osp_sync_declare_add() means that no llog record will be made and an orphan will be left for that object. All OSTs are independent in this regard. |
| Comment by Andreas Dilger [ 04/Nov/20 ] |
|
Alex, in the case that the llog is not working, is it possible for the MDS to continue to destroy the OST objects without the llog for recovery? Clearly this would not handle the case when the MDS is rebooted, but it can't be worse than not doing any object destroy operations? |
| Comment by Alex Zhuravlev [ 04/Nov/20 ] |
|
The original idea was to destroy an OST object only once the corresponding unlink is committed. To save on commit callbacks, OSP tracks the last committed log id and then, reading the records back from the llog (they are usually still in cache), issues the OST_DESTROY RPCs. |
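A rough outline of that flow (illustrative only; function names are taken from the log messages in this ticket, not verified against the source):

1. In the MDT unlink transaction, osp_sync_declare_add() reserves space for an unlink llog record, which is written in the same transaction.
2. Once the transaction commits, the OSP sync thread (osp_sync_process_queues(), seen later in this ticket) reads the committed records back from the llog, usually still cached, and sends OST_DESTROY RPCs to the OST.
3. When an OST_DESTROY completes, the corresponding llog record is cancelled so it is not processed again after a restart.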
| Comment by Campbell Mcleay (Inactive) [ 04/Nov/20 ] |
|
Currently there are quite a lot of writes and deletions happening as this cluster catches up on backups. I am waiting for a quiet time, when the results will be clear enough to determine whether space on the OSTs is consumed/released by writes/deletes as expected. |
| Comment by Andreas Dilger [ 04/Nov/20 ] |
|
Alex, can you please explain why the llog context is not being initialized properly? I'd think there is something wrong/missing with the llog itself, but I'm not sure how to debug it. |
| Comment by Alex Zhuravlev [ 05/Nov/20 ] |
|
iirc, the most frequent reason is -EINPROGRESS: something went wrong with the OI, so OSP couldn't open the llog at mount time. |
| Comment by Campbell Mcleay (Inactive) [ 05/Nov/20 ] |
|
Hi Alex, attached are the logs from the MDS: bmds1-msgs.txt.gz. Kind regards, Campbell |
| Comment by Alex Zhuravlev [ 05/Nov/20 ] |
|
Thanks, the llog got corrupted:
Oct 30 20:29:06 bmds1 kernel: LustreError: 19089:0:(llog_osd.c:987:llog_osd_next_block()) bravo-MDT0000-osd: invalid llog tail at log id [0x1584b:0x1:0x0]:0 offset 4575232 last_rec idx 1 tail idx 0lrt len 20 read_size 4096
The easiest solution would be to stop that MDT, remove the llog manually, and remount. |
| Comment by Campbell Mcleay (Inactive) [ 05/Nov/20 ] |
|
Hi Alex, detailed instructions would be a great help, thank you. |
| Comment by Alex Zhuravlev [ 08/Nov/20 ] |
|
your MDT should be unmounted, then please follow the example:
# debugfs -c <mdt device> -R "stat /O/1/d11/88139"
...
Inode: 159 Type: regular Mode: 0644 Flags: 0x4000
Generation: 2694950587 Version: 0x00000000:00000000
User: 0 Group: 0 Project: 0 Size: 11520
File ACL: 0
....
Extended attributes:
lma: fid=[0x1:0x1584b:0x0] compat=0 incompat=0
...
at this point please check that the fid in the line starting with "lma: fid=" contains 0x1584b. Then:
# mount -t ldiskfs <mdt device> <mountpoint>
# rm <mountpoint>/O/1/d11/88139
# umount <mountpoint>
Now you can remount the MDT. I'd suggest running LFSCK to clean up orphans. |
| Comment by Campbell Mcleay (Inactive) [ 16/Nov/20 ] |
|
Thanks Alex, will let you know when I have done this. Still waiting for a window |
| Comment by Gerrit Updater [ 25/Nov/20 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40754 |
| Comment by Campbell Mcleay (Inactive) [ 25/Jan/21 ] |
|
Hi all. We're seeing another occurrence of this on another Lustre cluster. We have 6 OSTs that are now showing 100% full. I've used the recommendation on this page: https://wiki.lustre.org/Handling_Full_OSTs to set max_create_count to 0 on those. I've also e2fsck'd the MDT. Here's the main snippet from the syslog:

[Jan25 10:18] LustreError: 40160:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
[ +0.001271] LustreError: 40160:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 16497 previous similar messages
[Jan25 10:21] LustreError: 4753:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
[ +0.002111] LustreError: 4753:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 76 previous similar messages

Given the processes followed above, I've grepped through our syslog for the past while, looking for the invalid llog tail info, and see the following:

2020-11-26-syslog.log:Nov 26 09:25:31 emds1 kernel: LustreError: 4811:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5919:0x1:0x0]:0 offset 1708032 bytes 4096
2020-11-26-syslog.log:Nov 26 09:25:31 emds1 kernel: LustreError: 4907:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x590f:0x1:0x0]:0 offset 4747264 last_rec idx 1811939328 tail idx 0lrt len 8925011 read_size 4096
2020-11-26-syslog.log:Nov 26 09:25:31 emds1 kernel: LustreError: 4973:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x592c:0x1:0x0]:0 offset 2904064 bytes 4096
2020-12-19-syslog.log:Dec 19 00:25:21 emds1 kernel: LustreError: 4752:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5919:0x1:0x0]:0 offset 1708032 last_rec idx 2 tail idx 0lrt len 8590435 read_size 4096
2020-12-19-syslog.log:Dec 19 00:25:21 emds1 kernel: LustreError: 4823:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5954:0x1:0x0]:0 offset 626688 bytes 4096
2020-12-19-syslog.log:Dec 19 00:25:22 emds1 kernel: LustreError: 4987:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5934:0x1:0x0]:0 offset 2920448 last_rec idx 0 tail idx 33065lrt len 88 read_size 4096
2021-01-25-syslog.log:Jan 25 13:41:45 emds1 kernel: LustreError: 4492:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5919:0x1:0x0]:0 offset 1708032 bytes 4096
2021-01-25-syslog.log:Jan 25 13:41:45 emds1 kernel: LustreError: 4589:0:(llog_osd.c:988:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x590f:0x1:0x0]:0 offset 4747264 last_rec idx 9787 tail idx 0lrt len 8925011 read_size 4096

Given those entries, would I be right in thinking I should do the following: convert those 0x59XX values to decimal (so 0x590f is 22799), then run:

debugfs -c <mdt device> -R "stat /O/1/d11/<insert decimal number here>"

and check the lma fid matches, then delete those files and remount? Please let me know if that's correct or if you require any more info to get a full idea of what's going on. As in the past, we're trying to get a set of good backups through, so I'll be trying to wait for those to finish happily before taking the MDT down again. |
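One way to script the conversion step (a sketch only; note that the d<N> component is not always d11: in the usual ldiskfs layout the object lives under /O/<seq>/d<objid % 32>, which is why objid 88139 landed in d11 in the earlier example):

for id in 0x590f 0x5919 0x592c 0x5934 0x5938 0x5954; do
    dec=$((id))                                             # e.g. 0x590f -> 22799
    echo "debugfs -c <mdt device> -R \"stat /O/1/d$((dec % 32))/$dec\""
done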
| Comment by Campbell Mcleay (Inactive) [ 25/Jan/21 ] |
|
(I should note that following the fsck of the MDT, I have an lfsck running as suggested by the syslog.) |
| Comment by Andreas Dilger [ 26/Jan/21 ] |
|
Campbell, you could try Alex's patch: https://review.whamcloud.com/40754. However, the process that you describe in your previous comment also looks correct, and may be more expedient in the short term. |
| Comment by Campbell Mcleay (Inactive) [ 27/Jan/21 ] |
|
Thanks Andreas. For now I'm probably gonna go with the simple deletions once the current run of backups is done, but we definitely look forward to seeing that patch in an upcoming release. |
| Comment by Campbell Mcleay (Inactive) [ 29/Jan/21 ] |
|
I've removed those log files (they weren't in d11 but in neighbouring directories; I checked the lma fids on them and they all matched, so that seemed correct). fsck has been run on the MDT (which made some changes) and it is now remounted. Running lfsck --all -M echo-MDT0000 now. Should we expect to see the space on the OSTs gradually free as the lfsck works its way through things over the next few days? Does it happen in a particular scanning phase? |
| Comment by Andreas Dilger [ 30/Jan/21 ] |
|
If you don't add an lfsck option to clean up the orphans then it will just report that they exist. If you add "-o" it will link orphan objects under $MOUNT/.lustre/lost+found/MDTxxxx for review and cleanup (i.e. "rm"). See the lctl-lfsck-start(8) man page for details. |
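A minimal sketch of that, using the MDT name from this thread (the client mountpoint is hypothetical; see lctl-lfsck-start(8) for the options):

mds# lctl lfsck_start -M echo-MDT0000 -t layout -o       # link orphan objects under .lustre/lost+found
client# ls /mnt/lustre/.lustre/lost+found/MDT0000        # review the linked orphans, then rm unwanted ones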
| Comment by Campbell Mcleay (Inactive) [ 01/Feb/21 ] |
|
Would these errors in the MDS syslog be down to the lfsck sorting things out, or is this a separate issue?

[ +4.946556] LDISKFS-fs warning: 844 callbacks suppressed
[ +0.000008] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[ +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[ +0.003452] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[ +0.000005] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[ +0.002557] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[ +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[ +0.017404] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[ +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[ +0.006475] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2624: inode 769803009: comm lfsck_layout: index 2: reach max htree level 2
[ +0.000004] LDISKFS-fs warning (device md0): ldiskfs_dx_add_entry:2628: Large directory feature is not enabled on this filesystem
[ +0.550317] Lustre: 46464:0:(osd_handler.c:501:osd_ldiskfs_add_entry()) echo-MDT0000: directory (inode: 769803009, FID: [0x2000207a3:0x1:0x0]) has reached maximum entry limit
[ +0.000008] Lustre: 46464:0:(osd_handler.c:501:osd_ldiskfs_add_entry()) Skipped 7907 previous similar messages |
| Comment by Campbell Mcleay (Inactive) [ 01/Feb/21 ] |
|
Ah - my guess is that this is putting all the orphans in .lustre/lost+found/MDT0000, and after repairing 12M layouts (so far) it can't put any more files in the unstructured lost+found dir. Not sure what effect that's gonna have on the ongoing lfsck. An ls or a find in the lost+found dir hangs, so I think my guess is correct. I'm gonna run a find . -exec /bin/rm -f to try and clear those files. |
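A sketch of that cleanup (mountpoint hypothetical); "-exec ... {} +" batches the removals and, unlike ls, find streams directory entries instead of sorting them, which is why it copes with a huge directory:

client# cd /mnt/lustre/.lustre/lost+found/MDT0000
client# find . -type f -exec /bin/rm -f {} +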
| Comment by Andreas Dilger [ 02/Feb/21 ] |
|
If you are fairly confident that these objects under .lustre/lost+found/MDT0000/ are orphans, then deleting them is the right way forward. |
| Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ] |
|
With the lfsck finished and the find rm finished, 5 of the 6 100%-full OSTs are now near empty, so I've restored their max_create_count values. One OST, however, is still 100% full and we're still seeing the following in dmesg:

[Feb 2 06:25] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
[ +0.001714] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 34991 previous similar messages

I've double-checked all the syslogs we have going back a couple of months for occurrences of 'llog tail' and can only find the ones I listed above. Do you have any recommendations for how to go about finding the bad log file? |
| Comment by Andreas Dilger [ 02/Feb/21 ] |
|
Campbell, it would be most useful to check for "LustreError:" messages reported after MDT0000 was mounted when you deleted the llog objects. If you could attach the MDS /var/log/messages file from that time it would be helpful, along with knowing which specific OST is still having the problem. It might have printed something like "echo-OST0003-osc-MDT0000: ctxt 2 lop_setup=0xffff001234567890 failed: rc = -22".

Unfortunately, the osp_sync_declare_add() message doesn't report which OST is having the problem. Even so, this function is only trying to get a reference on an existing data structure in memory, so even if it included the OST name it wouldn't be able to print anything about the specific llog file that was having the problem; that would have been loaded at startup time.

If the logs from the previous mount don't show anything, it might be possible to re-trigger this by only unmounting and remounting the OST, and capturing the Lustre debug logs on the MDS:

oss# umount /mnt/<ost_mount>    (or whatever HA command is used to stop the OST)
mds# lctl set_param debug=+info+config debug_mb=1024
mds# lctl clear
oss# mount /mnt/<ost_mount>     (or as appropriate)
[wait a few seconds for the OST to connect to the MDS]
mds# lctl dk /tmp/debug

That would be more useful than having to remount the MDT and having the logs from all of the OSTs reconnecting mixed together.

Alex, it would be useful to make a small patch (separate from patch https://review.whamcloud.com/40754) ... |
| Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ] |
|
The OST in question is: echo-OST0028_UUID. I'll paste below the LustreError lines from that day's syslog, and I'll attach the whole thing anyway. You'll see an unmount at about 9:50am when I deleted the logs, then a reboot followed by an e4fsck, one more reboot, then a 'normal' mount. I'll try the sequence you suggest shortly.

Jan 29 00:17:42 emds1 kernel: LustreError: 21600:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110
Jan 29 02:51:00 emds1 kernel: LustreError: 4436:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.23.22.113@tcp ns: mdt-echo-MDT0000_UUID lock: ffff9b73b28f33c0/0x3dc1d8a5f466cd2c lrc: 3/0,0 mode: PR/PR res: [0x20002bd3e:0x1728b:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.23.22.113@tcp remote: 0x2f5cecde3bdac8df expref: 440542 pid: 12166 timeout: 306605 lvb_type: 0
Jan 29 02:51:02 emds1 kernel: LustreError: 12229:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.23.22.113@tcp arrived at 1611917462 with bad export cookie 4450076092389702753
Jan 29 08:32:36 emds1 kernel: LustreError: 21599:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110
Jan 29 09:46:31 emds1 kernel: LustreError: 137-5: echo-MDT0000_UUID: not available for connect from 10.23.22.103@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
Jan 29 09:46:31 emds1 kernel: LustreError: Skipped 15 previous similar messages
Jan 29 12:00:37 emds1 kernel: LustreError: 4491:0:(llog_cat.c:395:llog_cat_id2handle()) echo-OST0003-osc-MDT0000: error opening log id [0x5919:0x1:0x0]:0: rc = -2
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
Jan 29 12:00:37 emds1 kernel: LustreError: 3562:0:(mdt_handler.c:6696:mdt_iocontrol()) echo-MDT0000: Aborting recovery for device
Jan 29 12:00:37 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:00:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:00:58 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:08 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:18 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 1 previous similar message
Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 3 previous similar messages
Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 8 previous similar messages
Jan 29 14:17:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 32 previous similar messages
Jan 29 15:40:44 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4 previous similar messages
Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages
Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 13 previous similar messages
Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 22 previous similar messages
Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 33 previous similar messages
Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 42 previous similar messages
Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages
Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1051 previous similar messages
Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 289 previous similar messages
Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1392 previous similar messages
Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2621 previous similar messages
Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 776 previous similar messages
Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4142 previous similar messages
Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 3114 previous similar messages
Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x2852010002000000: rc = -2
Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 46 previous similar messages
Jan 29 17:22:47 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xfa31010002000000: rc = -2
Jan 29 17:23:00 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:23:13 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xb331010002000000: rc = -2
Jan 29 17:23:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:24:25 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 1 previous similar message
Jan 29 17:29:52 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x7441010002000000: rc = -2
Jan 29 17:32:15 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages
Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages |
| Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ] |
|
Pretty sure I should know the answer to this, but I'm having a moment here: how do I tell which OSS and mount point a particular OST refers to? I need to know which mountpoint to unmount for echo-OST0028_UUID. Thanks. |
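One possible way to check (a sketch, not confirmed in this thread): the osd "mntdev" parameter on each OSS reports the backing device for the targets that OSS hosts, which can then be matched against the local mounts:

oss# lctl get_param osd-*.echo-OST0028.mntdev    # prints the backing device if this OSS hosts OST0028
oss# grep lustre /proc/mounts                    # match that device to its local mountpoint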
| Comment by Campbell Mcleay (Inactive) [ 02/Feb/21 ] |
|
Ah, looks like the log did have the invalid llog tail and I somehow missed it:

# grep -B1 OST0028 2021-01-29-syslog.log
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096
Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22

That one wasn't on my previous list, so I'm guessing removing that should do the trick. I'll wait for the next available downtime and remove it. |
| Comment by Gerrit Updater [ 26/Feb/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40754/ |
| Comment by Peter Jones [ 26/Feb/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 26/Jul/21 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44396 |
| Comment by Gerrit Updater [ 13/Sep/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44396/ |