Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.12.5
-
CemtOS 7.6.1810
-
3
-
9223372036854775807
Description
Getting the following errors in messages log on MDS:
Oct 30 20:30:37 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 74 previous similar messages Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 233 previous similar messages
Attachments
- 2021-01-29-syslog.log.gz
- 208 kB
- bmds1-msgs.txt.gz
- 211 kB
Issue Links
Activity
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44396
Subject: LU-14098 obdclass: try to skip corrupted llog records
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: a761c796b9aecbe90b9692c073e3548ca407efd8
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40754/
Subject: LU-14098 obdclass: try to skip corrupted llog records
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 910eb97c1b43a44a9da2ae14c3b83e28ca6342fc
Ah looks like the log did have the invalid llog tail and I somehow missed it:
# grep -B1 OST0028 2021-01-29-syslog.log Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096 Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
That one wasn't on my previous list so guessing removing that should do the trick. I'll wait for the next available downtime and remove that one.
Pretty sure I should know the answer to this but I'm having a moment here: how do I tell which OSS and mount point a particular OST refers to? I need to know which mountpoint to unmount for echo-OST0028_UUID - thanks
The OST in question is: echo-OST0028_UUID
I'll paste below the LustreError lines from that day's syslog and I'll attach the whole thing anyway. You'll see an unmount at about 9:50am when I deleted the logs, then a reboot followed by an e4fsck, one more reboot then a 'normal' mount.
I'll try the sequence you suggest shortly.
Jan 29 00:17:42 emds1 kernel: LustreError: 21600:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110 Jan 29 02:51:00 emds1 kernel: LustreError: 4436:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.23.22.113@tcp ns: mdt-echo-MDT0000_UUID lock: ffff9b73b28f33c0/0x3dc1d8a5f466cd2c lrc: 3/0,0 mode: PR/PR res: [0x20002bd3e:0x1728b:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.23.22.113@tcp remote: 0x2f5cecde3bdac8df expref: 440542 pid: 12166 timeout: 306605 lvb_type: 0 Jan 29 02:51:02 emds1 kernel: LustreError: 12229:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.23.22.113@tcp arrived at 1611917462 with bad export cookie 4450076092389702753 Jan 29 08:32:36 emds1 kernel: LustreError: 21599:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110 Jan 29 09:46:31 emds1 kernel: LustreError: 137-5: echo-MDT0000_UUID: not available for connect from 10.23.22.103@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. Jan 29 09:46:31 emds1 kernel: LustreError: Skipped 15 previous similar messages Jan 29 12:00:37 emds1 kernel: LustreError: 4491:0:(llog_cat.c:395:llog_cat_id2handle()) echo-OST0003-osc-MDT0000: error opening log id [0x5919:0x1:0x0]:0: rc = -2 Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096 Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22 Jan 29 12:00:37 emds1 kernel: LustreError: 3562:0:(mdt_handler.c:6696:mdt_iocontrol()) echo-MDT0000: Aborting recovery for device Jan 29 12:00:37 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:00:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:00:58 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:08 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:18 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 1 previous similar message Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 3 previous similar messages Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 8 previous similar messages Jan 29 14:17:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 32 previous similar messages Jan 29 15:40:44 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4 previous similar messages Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 13 previous similar messages Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 22 previous similar messages Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 33 previous similar messages Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 42 previous similar messages Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1051 previous similar messages Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 289 previous similar messages Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1392 previous similar messages Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2621 previous similar messages Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 776 previous similar messages Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4142 previous similar messages Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 3114 previous similar messages Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x2852010002000000: rc = -2 Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 46 previous similar messages Jan 29 17:22:47 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xfa31010002000000: rc = -2 Jan 29 17:23:00 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:23:13 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xb331010002000000: rc = -2 Jan 29 17:23:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:24:25 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 1 previous similar message Jan 29 17:29:52 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x7441010002000000: rc = -2 Jan 29 17:32:15 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages
Campbell, it would be most useful to check for "LustreError:" messages reported after MDT0000 was mounted when you deleted the llog objects. If you could attach the MDS /var/log/messages file from that time it would be helpful, along with knowing which specific OST is still having the problem. It might have printed something like "echo-OST0003-osc-MDT0000: ctxt 2 lop_setup=0xffff001234567890 failed: rc = -22".
Unfortunately, the osp_sync_declare_add() message doesn't report which OST is having the problem. Even so, this function is only trying to get a reference on an existing data structure in memory, so even if it included the OST name it wouldn't be able to print anything about the specific llog file that was having the problem. That would have been loaded at startup time. If the logs from the previous mount don't show anything, it might be possible to re-trigger this by only unmounting and remounting the OST, and capturing the Lustre debug logs on the MDS:
oss# umount /mnt/<ost_mount> (or whatever HA command is used to stop the OST) mds# lctl set_param debug=+info+config debug_mb=1024 mds# lctl clear oss# mount /mnt/<ost_mount> (or as appropriate) [wait a few seconds for OST to connect to the MDS] mds# lctl dk /tmp/debug
That would be more useful than having to remount the MDT and having the logs from all of the OSTs reconnecting mixed together.
Alex, it would be useful to make a small patch (separate from patch https://review.whamcloud.com/40754 "LU-14098 obdclass: try to skip corrupted llog records") that improves the error message in osp_sync_declare_add() - add in the MDT name (at the start) and OST device name (in the message). That would at least allow debugging which OST (and which MDT in DNE mode) the problem is related to. It looks like the 40754 patch does not touch this function, so there wouldn't be any conflicts.
With the lfsck finished and the find rm finished, 5 of the 6 100% full OSTs are now near empty so I've restored their max_create_count values. One OST is still however 100% full and we're still seeing the following in dmesg:
[Feb 2 06:25] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK [ +0.001714] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 34991 previous similar messages
I've double checked all the syslogs we have going back a couple of months for occurrences of 'llog tail' and can only find the ones I listed above. Do you have any recommendations for how to go about finding the bad log file?
If you are fairly confident that these objects under .lustre/lost+found/MDT0000/ are orphans, then deleting them is the right way forward.
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44396/
Subject:
LU-14098obdclass: try to skip corrupted llog recordsProject: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3bedd69dbef38f5a3166131831b8c5279df4b0dd