Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12577

chlg_load failed to process llog -2 or -5 on client

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0, Lustre 2.12.10
    • Lustre 2.12.0
    • CentOS 7.6, 2.12.0 servers and clients + patches
    • 3
    • 9223372036854775807

    Description

      For the sake of completeness, in addition to LU-11205 Failure to clear the changelog for user 1 on MDT, and seen exclusively on 2.12 on the robinhood node (not with robinhood on 2.10), we hit these errors:

      [root@fir-rbh01 ~]# journalctl -n 10000 -k  | grep llog
      Jun 23 04:49:41 fir-rbh01 kernel: LustreError: 101617:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jun 24 01:16:36 fir-rbh01 kernel: LustreError: 46868:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 25 19:52:01 fir-rbh01 kernel: LustreError: 45836:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 26 00:32:08 fir-rbh01 kernel: LustreError: 61129:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jun 27 00:24:51 fir-rbh01 kernel: LustreError: 101382:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jun 27 04:41:23 fir-rbh01 kernel: LustreError: 10896:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 27 06:42:15 fir-rbh01 kernel: LustreError: 22811:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 29 14:07:41 fir-rbh01 kernel: LustreError: 94376:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 05 06:42:48 fir-rbh01 kernel: LustreError: 97545:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 05 14:25:48 fir-rbh01 kernel: Lustre: 38334:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:50 fir-rbh01 kernel: Lustre: 38337:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:52 fir-rbh01 kernel: Lustre: 38341:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 1 previous similar message
      Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 2 previous similar messages
      Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 4 previous similar messages
      Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 9 previous similar messages
      Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 18 previous similar messages
      Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 37 previous similar messages
      Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 76 previous similar messages
      Jul 05 17:04:20 fir-rbh01 kernel: LustreError: 61245:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 05 23:47:03 fir-rbh01 kernel: LustreError: 104167:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 06 05:32:40 fir-rbh01 kernel: LustreError: 123618:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jul 09 07:57:24 fir-rbh01 kernel: LustreError: 98342:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 09 16:09:56 fir-rbh01 kernel: LustreError: 16125:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 11 19:35:24 fir-rbh01 kernel: LustreError: 7262:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 14 12:55:20 fir-rbh01 kernel: LustreError: 82763:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 16 07:06:50 fir-rbh01 kernel: LustreError: 108216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 16 10:50:17 fir-rbh01 kernel: LustreError: 77729:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 21 01:06:34 fir-rbh01 kernel: LustreError: 122188:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 21 14:01:34 fir-rbh01 kernel: LustreError: 90046:0:(llog_cat.c:375:llog_cat_id2handle()) fir-MDT0003-mdc-ffff905fbf58a800: error opening log id [0xc4ff:0x1:0x0]:0: rc = -2
      Jul 21 15:15:39 fir-rbh01 kernel: LustreError: 60198:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 21 17:00:08 fir-rbh01 kernel: LustreError: 42666:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 22 14:44:51 fir-rbh01 kernel: LustreError: 88216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      

      I don't see anything on the MDS beside the usual " Failure to clear the changelog for user 1: -22"

      Attachments

        Issue Links

          Activity

            [LU-12577] chlg_load failed to process llog -2 or -5 on client

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45847/
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 3cdb4724405b066e370ef1f104198643051c82d8

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45847/ Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 3cdb4724405b066e370ef1f104198643051c82d8

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45847
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 97ce7592978090519683a0465a15e1f4f899a6c3

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45847 Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 97ce7592978090519683a0465a15e1f4f899a6c3
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43589/
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ae1404feefc1572fdafed938a3fc18131d675678

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43589/ Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: master Current Patch Set: Commit: ae1404feefc1572fdafed938a3fc18131d675678

            Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43589
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3da1702e3b73aa425b1c5c2964d41cc01efd659c

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43589 Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 3da1702e3b73aa425b1c5c2964d41cc01efd659c
            jhammond John Hammond added a comment -

            I reproduced this and captured logs. At the start of llog_osd_next_block() the current offset is equal to the size and so the while loop never executes and -EIO is returned.

            00000100:00100000:1.0:1620144981.600596:0:13184:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x169875
            8267446528:12345-0@lo:502
            00000040:00000040:1.0:1620144981.600602:0:13184:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff9d56bccb80c0 : new refcount 21
            00000040:00001000:1.0:1620144981.600606:0:13184:0:(llog_osd.c:915:llog_osd_next_block()) looking for log index 21139 (cur idx 21138 off2883584), size 2883584
            00000040:00000040:1.0:1620144981.600608:0:13184:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff9d56bccb80c0 : new refcount 20
            00010000:00000040:1.0:1620144981.600609:0:13184:0:(ldlm_lib.c:3043:target_committed_to_req()) last_committed 30122610389, transno 0, xid 1698758267446528
            00000100:00001000:1.0:1620144981.600611:0:13184:0:(import.c:1865:at_measured()) add 1 to ffff9d57bd413d78 time=61 v=32 (1 1 32 1)
            00000100:00000040:1.0:1620144981.600613:0:13184:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff9d3469391720 refcount 20 to 0@lo
            00000100:00000040:1.0:1620144981.600614:0:13184:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-0@lo
            00000100:00000040:1.0:1620144981.600618:0:13184:0:(events.c:161:reply_in_callback()) @@@ reply in flags=0 mlen=8440 offset=192 replen=8472  req@ffff9d39008c9680 x1698758267446528/t0(0) o502->es035-MDT0000-mdc
            -ffff9d5762fd4000@0@lo:12/10 lens 272/8472 e 0 to 0 dl 1620145027 ref 2 fl Rpc:R/0/ffffffff rc 0/-1
            00000100:00000040:1.0:1620144981.600637:0:13184:0:(connection.c:119:ptlrpc_connection_put()) PUT conn=ffff9d3469391720 refcount 19 to 0@lo
            00000100:00000040:1.0:1620144981.600639:0:13184:0:(lustre_net.h:2504:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff9d38e36bba80 x1698758267446528/t0(0) o502->1df03e1e-5881-ba22-d5ff-18dab6a0622a@0@lo:341/0 lens 272/8440 e 0 to 0 dl 1620145026 ref 1 fl Interpret:/0/0 rc -5/-5
            00000100:00100000:1.0:1620144981.600642:0:13184:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x1698758267446528:12345-0@lo:502 Request processed in 46us (79us total) trans 0 rc -5/-5
            
            jhammond John Hammond added a comment - I reproduced this and captured logs. At the start of llog_osd_next_block() the current offset is equal to the size and so the while loop never executes and -EIO is returned. 00000100:00100000:1.0:1620144981.600596:0:13184:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x169875 8267446528:12345-0@lo:502 00000040:00000040:1.0:1620144981.600602:0:13184:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff9d56bccb80c0 : new refcount 21 00000040:00001000:1.0:1620144981.600606:0:13184:0:(llog_osd.c:915:llog_osd_next_block()) looking for log index 21139 (cur idx 21138 off2883584), size 2883584 00000040:00000040:1.0:1620144981.600608:0:13184:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff9d56bccb80c0 : new refcount 20 00010000:00000040:1.0:1620144981.600609:0:13184:0:(ldlm_lib.c:3043:target_committed_to_req()) last_committed 30122610389, transno 0, xid 1698758267446528 00000100:00001000:1.0:1620144981.600611:0:13184:0:(import.c:1865:at_measured()) add 1 to ffff9d57bd413d78 time=61 v=32 (1 1 32 1) 00000100:00000040:1.0:1620144981.600613:0:13184:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff9d3469391720 refcount 20 to 0@lo 00000100:00000040:1.0:1620144981.600614:0:13184:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-0@lo 00000100:00000040:1.0:1620144981.600618:0:13184:0:(events.c:161:reply_in_callback()) @@@ reply in flags=0 mlen=8440 offset=192 replen=8472 req@ffff9d39008c9680 x1698758267446528/t0(0) o502->es035-MDT0000-mdc -ffff9d5762fd4000@0@lo:12/10 lens 272/8472 e 0 to 0 dl 1620145027 ref 2 fl Rpc:R/0/ffffffff rc 0/-1 00000100:00000040:1.0:1620144981.600637:0:13184:0:(connection.c:119:ptlrpc_connection_put()) PUT conn=ffff9d3469391720 refcount 19 to 0@lo 00000100:00000040:1.0:1620144981.600639:0:13184:0:(lustre_net.h:2504:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff9d38e36bba80 x1698758267446528/t0(0) o502->1df03e1e-5881-ba22-d5ff-18dab6a0622a@0@lo:341/0 lens 272/8440 e 0 to 0 dl 1620145026 ref 1 fl Interpret:/0/0 rc -5/-5 00000100:00100000:1.0:1620144981.600642:0:13184:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x1698758267446528:12345-0@lo:502 Request processed in 46us (79us total) trans 0 rc -5/-5

            We have recently upgraded Fir to 2.12.5 RC1 and we still see fail to process llog errors again. -2 actually led to failure of robinhood and changelogs stopped to be processed:

            2020/06/06 13:53:17 [86917/2] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
            

            The Lustre client (also 2.12.5 RC1) had just logged this:

            Jun 06 13:53:17 fir-rbh01 kernel: LustreError: 94456:0:(mdc_changelog.c:295:chlg_load()) fir-MDT0002-mdc-ffffa0567bfa8000: fail to process llog: rc = -2
            

            I don't see anything obvious on the MDS side, but I'm attaching the output of lctl dk just in case ( fir-md1-s3_fir-MDT0002_dk_20200607.log.gz ). It looks like there are basic log entries since the start of the server.

             

            sthiell Stephane Thiell added a comment - We have recently upgraded Fir to 2.12.5 RC1 and we still see fail to process llog errors again. -2 actually led to failure of robinhood and changelogs stopped to be processed: 2020/06/06 13:53:17 [86917/2] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory The Lustre client (also 2.12.5 RC1) had just logged this: Jun 06 13:53:17 fir-rbh01 kernel: LustreError: 94456:0:(mdc_changelog.c:295:chlg_load()) fir-MDT0002-mdc-ffffa0567bfa8000: fail to process llog: rc = -2 I don't see anything obvious on the MDS side, but I'm attaching the output of lctl dk just in case ( fir-md1-s3_fir-MDT0002_dk_20200607.log.gz  ). It looks like there are basic log entries since the start of the server.  

            Hi Alexander,

            Since we upgraded to 2.12.3 on October 16, 2019, which includes the patch that you mention, I can only see one occurence on our robinhood server that could be related to this ticket:

            Nov 06 18:50:08 fir-rbh01 kernel: LustreError: 107119:0:(mdc_changelog.c:249:chlg_load()) fir-MDT0002-mdc-ffff8b66b2630000: fail to process llog: rc = -5
            

            which seems to have stopped the changelog reader of MDT0002. Relaunching Robinhood resumed operation.

            So my feeling is that it's MUCH better with the patch.

            sthiell Stephane Thiell added a comment - Hi Alexander, Since we upgraded to 2.12.3 on October 16, 2019, which includes the patch that you mention, I can only see one occurence on our robinhood server that could be related to this ticket: Nov 06 18:50:08 fir-rbh01 kernel: LustreError: 107119:0:(mdc_changelog.c:249:chlg_load()) fir-MDT0002-mdc-ffff8b66b2630000: fail to process llog: rc = -5 which seems to have stopped the changelog reader of MDT0002. Relaunching Robinhood resumed operation. So my feeling is that it's MUCH better with the patch.
            zam Alexander Zarochentsev added a comment - - edited

            Stephane , can you try a fix "LU-11426 llog: changelog records reordering" for the server, and check whether issue still seen?

            zam Alexander Zarochentsev added a comment - - edited Stephane , can you try a fix " LU-11426 llog: changelog records reordering" for the server, and check whether issue still seen?
            pjones Peter Jones added a comment -

            Mike

            Could you please advise?

            Thanks

            Peter

            pjones Peter Jones added a comment - Mike Could you please advise? Thanks Peter

            People

              bzzz Alex Zhuravlev
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: