Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14098

LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.8, Lustre 2.15.0
    • Lustre 2.12.5
    • CemtOS 7.6.1810
    • 3
    • 9223372036854775807

    Description

      Getting the following errors in messages log on MDS:

      Oct 30 20:30:37 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
      Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
      Oct 30 20:30:38 bmds1 kernel: LustreError: 19498:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
      Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
      Oct 30 20:30:39 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
      Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
      Oct 30 20:30:41 bmds1 kernel: LustreError: 19516:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 74 previous similar messages
      Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
      Oct 30 20:30:45 bmds1 kernel: LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 233 previous similar messages
      

      Attachments

        Issue Links

          Activity

            [LU-14098] LustreError: 19003:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44396/
            Subject: LU-14098 obdclass: try to skip corrupted llog records
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 3bedd69dbef38f5a3166131831b8c5279df4b0dd

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/44396/ Subject: LU-14098 obdclass: try to skip corrupted llog records Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 3bedd69dbef38f5a3166131831b8c5279df4b0dd

            Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44396
            Subject: LU-14098 obdclass: try to skip corrupted llog records
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: a761c796b9aecbe90b9692c073e3548ca407efd8

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44396 Subject: LU-14098 obdclass: try to skip corrupted llog records Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: a761c796b9aecbe90b9692c073e3548ca407efd8
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40754/
            Subject: LU-14098 obdclass: try to skip corrupted llog records
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 910eb97c1b43a44a9da2ae14c3b83e28ca6342fc

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40754/ Subject: LU-14098 obdclass: try to skip corrupted llog records Project: fs/lustre-release Branch: master Current Patch Set: Commit: 910eb97c1b43a44a9da2ae14c3b83e28ca6342fc

            Ah looks like the log did have the invalid llog tail and I somehow missed it:

            # grep -B1 OST0028 2021-01-29-syslog.log
            Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096
            Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22

            That one wasn't on my previous list so guessing removing that should do the trick. I'll wait for the next available downtime and remove that one.

            cmcl Campbell Mcleay (Inactive) added a comment - Ah looks like the log did have the invalid llog tail and I somehow missed it: # grep -B1 OST0028 2021-01-29-syslog.log Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096 Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22 That one wasn't on my previous list so guessing removing that should do the trick. I'll wait for the next available downtime and remove that one.

            Pretty sure I should know the answer to this but I'm having a moment here: how do I tell which OSS and mount point a particular OST refers to? I need to know which mountpoint to unmount for echo-OST0028_UUID - thanks

            cmcl Campbell Mcleay (Inactive) added a comment - Pretty sure I should know the answer to this but I'm having a moment here: how do I tell which OSS and mount point a particular OST refers to? I need to know which mountpoint to unmount for echo-OST0028_UUID - thanks

            The OST in question is: echo-OST0028_UUID

            I'll paste below the LustreError lines from that day's syslog and I'll attach the whole thing anyway. You'll see an unmount at about 9:50am when I deleted the logs, then a reboot followed by an e4fsck, one more reboot then a 'normal' mount.

            I'll try the sequence you suggest shortly.

            Jan 29 00:17:42 emds1 kernel: LustreError: 21600:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110
            Jan 29 02:51:00 emds1 kernel: LustreError: 4436:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.23.22.113@tcp  ns: mdt-echo-MDT0000_UUID lock: ffff9b73b28f33c0/0x3dc1d8a5f466cd2c lrc: 3/0,0 mode: PR/PR res: [0x20002bd3e:0x1728b:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.23.22.113@tcp remote: 0x2f5cecde3bdac8df expref: 440542 pid: 12166 timeout: 306605 lvb_type: 0
            Jan 29 02:51:02 emds1 kernel: LustreError: 12229:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.23.22.113@tcp arrived at 1611917462 with bad export cookie 4450076092389702753
            Jan 29 08:32:36 emds1 kernel: LustreError: 21599:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110
            Jan 29 09:46:31 emds1 kernel: LustreError: 137-5: echo-MDT0000_UUID: not available for connect from 10.23.22.103@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
            Jan 29 09:46:31 emds1 kernel: LustreError: Skipped 15 previous similar messages
            Jan 29 12:00:37 emds1 kernel: LustreError: 4491:0:(llog_cat.c:395:llog_cat_id2handle()) echo-OST0003-osc-MDT0000: error opening log id [0x5919:0x1:0x0]:0: rc = -2
            Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096
            Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
            Jan 29 12:00:37 emds1 kernel: LustreError: 3562:0:(mdt_handler.c:6696:mdt_iocontrol()) echo-MDT0000: Aborting recovery for device
            Jan 29 12:00:37 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:00:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:00:58 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:01:08 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:01:18 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:01:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 1 previous similar message
            Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 3 previous similar messages
            Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28
            Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 8 previous similar messages
            Jan 29 14:17:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 32 previous similar messages
            Jan 29 15:40:44 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4 previous similar messages
            Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages
            Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages
            Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 13 previous similar messages
            Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 22 previous similar messages
            Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 33 previous similar messages
            Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 42 previous similar messages
            Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages
            Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1051 previous similar messages
            Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 289 previous similar messages
            Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1392 previous similar messages
            Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2621 previous similar messages
            Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 776 previous similar messages
            Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4142 previous similar messages
            Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 3114 previous similar messages
            Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x2852010002000000: rc = -2
            Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 46 previous similar messages
            Jan 29 17:22:47 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xfa31010002000000: rc = -2
            Jan 29 17:23:00 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 17:23:13 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xb331010002000000: rc = -2
            Jan 29 17:23:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 17:24:25 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 1 previous similar message
            Jan 29 17:29:52 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x7441010002000000: rc = -2
            Jan 29 17:32:15 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages
            Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2
            Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages
            cmcl Campbell Mcleay (Inactive) added a comment - The OST in question is: echo-OST0028_UUID I'll paste below the LustreError lines from that day's syslog and I'll attach the whole thing anyway. You'll see an unmount at about 9:50am when I deleted the logs, then a reboot followed by an e4fsck, one more reboot then a 'normal' mount. I'll try the sequence you suggest shortly. Jan 29 00:17:42 emds1 kernel: LustreError: 21600:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110 Jan 29 02:51:00 emds1 kernel: LustreError: 4436:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 10.23.22.113@tcp ns: mdt-echo-MDT0000_UUID lock: ffff9b73b28f33c0/0x3dc1d8a5f466cd2c lrc: 3/0,0 mode: PR/PR res: [0x20002bd3e:0x1728b:0x0].0x0 bits 0x20/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 10.23.22.113@tcp remote: 0x2f5cecde3bdac8df expref: 440542 pid: 12166 timeout: 306605 lvb_type: 0 Jan 29 02:51:02 emds1 kernel: LustreError: 12229:0:(ldlm_lockd.c:2324:ldlm_cancel_handler()) ldlm_cancel from 10.23.22.113@tcp arrived at 1611917462 with bad export cookie 4450076092389702753 Jan 29 08:32:36 emds1 kernel: LustreError: 21599:0:(upcall_cache.c:233:upcall_cache_get_entry()) acquire for key 463: error -110 Jan 29 09:46:31 emds1 kernel: LustreError: 137-5: echo-MDT0000_UUID: not available for connect from 10.23.22.103@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. Jan 29 09:46:31 emds1 kernel: LustreError: Skipped 15 previous similar messages Jan 29 12:00:37 emds1 kernel: LustreError: 4491:0:(llog_cat.c:395:llog_cat_id2handle()) echo-OST0003-osc-MDT0000: error opening log id [0x5919:0x1:0x0]:0: rc = -2 Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(llog_osd.c:1000:llog_osd_next_block()) echo-MDT0000-osd: invalid llog tail at log id [0x5938:0x1:0x0]:0 offset 2625536 bytes 4096 Jan 29 12:00:37 emds1 kernel: LustreError: 4566:0:(osp_sync.c:1273:osp_sync_thread()) echo-OST0028-osc-MDT0000: llog process with osp_sync_process_queues failed: -22 Jan 29 12:00:37 emds1 kernel: LustreError: 3562:0:(mdt_handler.c:6696:mdt_iocontrol()) echo-MDT0000: Aborting recovery for device Jan 29 12:00:37 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:00:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:00:58 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:08 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:18 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:01:48 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 1 previous similar message Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:02:28 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 3 previous similar messages Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) echo-OST0003-osc-MDT0000: can't precreate: rc = -28 Jan 29 12:03:38 emds1 kernel: LustreError: 4492:0:(osp_precreate.c:686:osp_precreate_send()) Skipped 8 previous similar messages Jan 29 14:17:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 14:17:51 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 32 previous similar messages Jan 29 15:40:44 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:45 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4 previous similar messages Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:46 emds1 kernel: LustreError: 20674:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2 previous similar messages Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:49 emds1 kernel: LustreError: 17802:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:40:53 emds1 kernel: LustreError: 14628:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 13 previous similar messages Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:41:02 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 22 previous similar messages Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:41:18 emds1 kernel: LustreError: 11297:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 33 previous similar messages Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:42:19 emds1 kernel: LustreError: 5958:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 42 previous similar messages Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:46:51 emds1 kernel: LustreError: 13249:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 6 previous similar messages Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:48:59 emds1 kernel: LustreError: 19955:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1051 previous similar messages Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 15:59:21 emds1 kernel: LustreError: 19951:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 289 previous similar messages Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:13:55 emds1 kernel: LustreError: 19940:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 1392 previous similar messages Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:30:33 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 2621 previous similar messages Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:41:48 emds1 kernel: LustreError: 34704:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 776 previous similar messages Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 16:54:22 emds1 kernel: LustreError: 34500:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 4142 previous similar messages Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK Jan 29 17:05:40 emds1 kernel: LustreError: 37541:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 3114 previous similar messages Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x2852010002000000: rc = -2 Jan 29 17:22:30 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 46 previous similar messages Jan 29 17:22:47 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xfa31010002000000: rc = -2 Jan 29 17:23:00 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:23:13 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0xb331010002000000: rc = -2 Jan 29 17:23:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:24:25 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:26:21 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 1 previous similar message Jan 29 17:29:52 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x7441010002000000: rc = -2 Jan 29 17:32:15 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:35:50 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) srv-echo-MDT0000: Cannot find sequence 0x6977010002000000: rc = -2 Jan 29 17:40:28 emds1 kernel: LustreError: 15777:0:(fld_handler.c:264:fld_server_lookup()) Skipped 2 previous similar messages

            Campbell, it would be most useful to check for "LustreError:" messages reported after MDT0000 was mounted when you deleted the llog objects. If you could attach the MDS /var/log/messages file from that time it would be helpful, along with knowing which specific OST is still having the problem. It might have printed something like "echo-OST0003-osc-MDT0000: ctxt 2 lop_setup=0xffff001234567890 failed: rc = -22".

            Unfortunately, the osp_sync_declare_add() message doesn't report which OST is having the problem. Even so, this function is only trying to get a reference on an existing data structure in memory, so even if it included the OST name it wouldn't be able to print anything about the specific llog file that was having the problem. That would have been loaded at startup time. If the logs from the previous mount don't show anything, it might be possible to re-trigger this by only unmounting and remounting the OST, and capturing the Lustre debug logs on the MDS:

            oss# umount /mnt/<ost_mount> (or whatever HA command is used to stop the OST)
            mds# lctl set_param debug=+info+config debug_mb=1024
            mds# lctl clear
            oss# mount /mnt/<ost_mount> (or as appropriate)
            [wait a few seconds for OST to connect to the MDS]
            mds# lctl dk /tmp/debug
            

            That would be more useful than having to remount the MDT and having the logs from all of the OSTs reconnecting mixed together.

            Alex, it would be useful to make a small patch (separate from patch https://review.whamcloud.com/40754 "LU-14098 obdclass: try to skip corrupted llog records") that improves the error message in osp_sync_declare_add() - add in the MDT name (at the start) and OST device name (in the message). That would at least allow debugging which OST (and which MDT in DNE mode) the problem is related to. It looks like the 40754 patch does not touch this function, so there wouldn't be any conflicts.

            adilger Andreas Dilger added a comment - Campbell, it would be most useful to check for " LustreError: " messages reported after MDT0000 was mounted when you deleted the llog objects. If you could attach the MDS /var/log/messages file from that time it would be helpful, along with knowing which specific OST is still having the problem. It might have printed something like " echo-OST0003-osc-MDT0000: ctxt 2 lop_setup=0xffff001234567890 failed: rc = -22 ". Unfortunately, the osp_sync_declare_add() message doesn't report which OST is having the problem. Even so, this function is only trying to get a reference on an existing data structure in memory, so even if it included the OST name it wouldn't be able to print anything about the specific llog file that was having the problem. That would have been loaded at startup time. If the logs from the previous mount don't show anything, it might be possible to re-trigger this by only unmounting and remounting the OST, and capturing the Lustre debug logs on the MDS: oss# umount /mnt/<ost_mount> (or whatever HA command is used to stop the OST) mds# lctl set_param debug=+info+config debug_mb=1024 mds# lctl clear oss# mount /mnt/<ost_mount> (or as appropriate) [wait a few seconds for OST to connect to the MDS] mds# lctl dk /tmp/debug That would be more useful than having to remount the MDT and having the logs from all of the OSTs reconnecting mixed together. Alex, it would be useful to make a small patch (separate from patch https://review.whamcloud.com/40754 " LU-14098 obdclass: try to skip corrupted llog records ") that improves the error message in osp_sync_declare_add() - add in the MDT name (at the start) and OST device name (in the message). That would at least allow debugging which OST (and which MDT in DNE mode) the problem is related to. It looks like the 40754 patch does not touch this function, so there wouldn't be any conflicts.

            With the lfsck finished and the find rm finished, 5 of the 6 100% full OSTs are now near empty so I've restored their max_create_count values. One OST is still however 100% full and we're still seeing the following in dmesg:

            [Feb 2 06:25] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK
            [  +0.001714] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 34991 previous similar messages

            I've double checked all the syslogs we have going back a couple of months for occurrences of 'llog tail' and can only find the ones I listed above. Do you have any recommendations for how to go about finding the bad log file?

            cmcl Campbell Mcleay (Inactive) added a comment - With the lfsck finished and the find rm finished, 5 of the 6 100% full OSTs are now near empty so I've restored their max_create_count values. One OST is still however 100% full and we're still seeing the following in dmesg: [Feb 2 06:25] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) logging isn't available, run LFSCK [ +0.001714] LustreError: 20636:0:(osp_sync.c:350:osp_sync_declare_add()) Skipped 34991 previous similar messages I've double checked all the syslogs we have going back a couple of months for occurrences of 'llog tail' and can only find the ones I listed above. Do you have any recommendations for how to go about finding the bad log file?

            If you are fairly confident that these objects under .lustre/lost+found/MDT0000/ are orphans, then deleting them is the right way forward.

            adilger Andreas Dilger added a comment - If you are fairly confident that these objects under .lustre/lost+found/MDT0000/ are orphans, then deleting them is the right way forward.

            People

              bzzz Alex Zhuravlev
              cmcl Campbell Mcleay (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: