Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12577

chlg_load failed to process llog -2 or -5 on client

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.15.0, Lustre 2.12.10
    • Lustre 2.12.0
    • CentOS 7.6, 2.12.0 servers and clients + patches
    • 3
    • 9223372036854775807

    Description

      For the sake of completeness, in addition to LU-11205 Failure to clear the changelog for user 1 on MDT, and seen exclusively on 2.12 on the robinhood node (not with robinhood on 2.10), we hit these errors:

      [root@fir-rbh01 ~]# journalctl -n 10000 -k  | grep llog
      Jun 23 04:49:41 fir-rbh01 kernel: LustreError: 101617:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jun 24 01:16:36 fir-rbh01 kernel: LustreError: 46868:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 25 19:52:01 fir-rbh01 kernel: LustreError: 45836:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 26 00:32:08 fir-rbh01 kernel: LustreError: 61129:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jun 27 00:24:51 fir-rbh01 kernel: LustreError: 101382:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jun 27 04:41:23 fir-rbh01 kernel: LustreError: 10896:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 27 06:42:15 fir-rbh01 kernel: LustreError: 22811:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jun 29 14:07:41 fir-rbh01 kernel: LustreError: 94376:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 05 06:42:48 fir-rbh01 kernel: LustreError: 97545:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 05 14:25:48 fir-rbh01 kernel: Lustre: 38334:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:50 fir-rbh01 kernel: Lustre: 38337:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:52 fir-rbh01 kernel: Lustre: 38341:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 1 previous similar message
      Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 2 previous similar messages
      Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 4 previous similar messages
      Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 9 previous similar messages
      Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 18 previous similar messages
      Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 37 previous similar messages
      Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
      Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 76 previous similar messages
      Jul 05 17:04:20 fir-rbh01 kernel: LustreError: 61245:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 05 23:47:03 fir-rbh01 kernel: LustreError: 104167:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 06 05:32:40 fir-rbh01 kernel: LustreError: 123618:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      Jul 09 07:57:24 fir-rbh01 kernel: LustreError: 98342:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 09 16:09:56 fir-rbh01 kernel: LustreError: 16125:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 11 19:35:24 fir-rbh01 kernel: LustreError: 7262:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 14 12:55:20 fir-rbh01 kernel: LustreError: 82763:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 16 07:06:50 fir-rbh01 kernel: LustreError: 108216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 16 10:50:17 fir-rbh01 kernel: LustreError: 77729:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 21 01:06:34 fir-rbh01 kernel: LustreError: 122188:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 21 14:01:34 fir-rbh01 kernel: LustreError: 90046:0:(llog_cat.c:375:llog_cat_id2handle()) fir-MDT0003-mdc-ffff905fbf58a800: error opening log id [0xc4ff:0x1:0x0]:0: rc = -2
      Jul 21 15:15:39 fir-rbh01 kernel: LustreError: 60198:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 21 17:00:08 fir-rbh01 kernel: LustreError: 42666:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
      Jul 22 14:44:51 fir-rbh01 kernel: LustreError: 88216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
      

      I don't see anything on the MDS beside the usual " Failure to clear the changelog for user 1: -22"

      Attachments

        Issue Links

          Activity

            [LU-12577] chlg_load failed to process llog -2 or -5 on client

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45847/
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 3cdb4724405b066e370ef1f104198643051c82d8

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45847/ Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: 3cdb4724405b066e370ef1f104198643051c82d8

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45847
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 97ce7592978090519683a0465a15e1f4f899a6c3

            gerrit Gerrit Updater added a comment - "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45847 Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 97ce7592978090519683a0465a15e1f4f899a6c3
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43589/
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: ae1404feefc1572fdafed938a3fc18131d675678

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43589/ Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: master Current Patch Set: Commit: ae1404feefc1572fdafed938a3fc18131d675678

            Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43589
            Subject: LU-12577 llog: protect partial updates from readers
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 3da1702e3b73aa425b1c5c2964d41cc01efd659c

            gerrit Gerrit Updater added a comment - Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43589 Subject: LU-12577 llog: protect partial updates from readers Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 3da1702e3b73aa425b1c5c2964d41cc01efd659c
            jhammond John Hammond added a comment -

            I reproduced this and captured logs. At the start of llog_osd_next_block() the current offset is equal to the size and so the while loop never executes and -EIO is returned.

            00000100:00100000:1.0:1620144981.600596:0:13184:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x169875
            8267446528:12345-0@lo:502
            00000040:00000040:1.0:1620144981.600602:0:13184:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff9d56bccb80c0 : new refcount 21
            00000040:00001000:1.0:1620144981.600606:0:13184:0:(llog_osd.c:915:llog_osd_next_block()) looking for log index 21139 (cur idx 21138 off2883584), size 2883584
            00000040:00000040:1.0:1620144981.600608:0:13184:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff9d56bccb80c0 : new refcount 20
            00010000:00000040:1.0:1620144981.600609:0:13184:0:(ldlm_lib.c:3043:target_committed_to_req()) last_committed 30122610389, transno 0, xid 1698758267446528
            00000100:00001000:1.0:1620144981.600611:0:13184:0:(import.c:1865:at_measured()) add 1 to ffff9d57bd413d78 time=61 v=32 (1 1 32 1)
            00000100:00000040:1.0:1620144981.600613:0:13184:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff9d3469391720 refcount 20 to 0@lo
            00000100:00000040:1.0:1620144981.600614:0:13184:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-0@lo
            00000100:00000040:1.0:1620144981.600618:0:13184:0:(events.c:161:reply_in_callback()) @@@ reply in flags=0 mlen=8440 offset=192 replen=8472  req@ffff9d39008c9680 x1698758267446528/t0(0) o502->es035-MDT0000-mdc
            -ffff9d5762fd4000@0@lo:12/10 lens 272/8472 e 0 to 0 dl 1620145027 ref 2 fl Rpc:R/0/ffffffff rc 0/-1
            00000100:00000040:1.0:1620144981.600637:0:13184:0:(connection.c:119:ptlrpc_connection_put()) PUT conn=ffff9d3469391720 refcount 19 to 0@lo
            00000100:00000040:1.0:1620144981.600639:0:13184:0:(lustre_net.h:2504:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff9d38e36bba80 x1698758267446528/t0(0) o502->1df03e1e-5881-ba22-d5ff-18dab6a0622a@0@lo:341/0 lens 272/8440 e 0 to 0 dl 1620145026 ref 1 fl Interpret:/0/0 rc -5/-5
            00000100:00100000:1.0:1620144981.600642:0:13184:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x1698758267446528:12345-0@lo:502 Request processed in 46us (79us total) trans 0 rc -5/-5
            
            jhammond John Hammond added a comment - I reproduced this and captured logs. At the start of llog_osd_next_block() the current offset is equal to the size and so the while loop never executes and -EIO is returned. 00000100:00100000:1.0:1620144981.600596:0:13184:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x169875 8267446528:12345-0@lo:502 00000040:00000040:1.0:1620144981.600602:0:13184:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff9d56bccb80c0 : new refcount 21 00000040:00001000:1.0:1620144981.600606:0:13184:0:(llog_osd.c:915:llog_osd_next_block()) looking for log index 21139 (cur idx 21138 off2883584), size 2883584 00000040:00000040:1.0:1620144981.600608:0:13184:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff9d56bccb80c0 : new refcount 20 00010000:00000040:1.0:1620144981.600609:0:13184:0:(ldlm_lib.c:3043:target_committed_to_req()) last_committed 30122610389, transno 0, xid 1698758267446528 00000100:00001000:1.0:1620144981.600611:0:13184:0:(import.c:1865:at_measured()) add 1 to ffff9d57bd413d78 time=61 v=32 (1 1 32 1) 00000100:00000040:1.0:1620144981.600613:0:13184:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff9d3469391720 refcount 20 to 0@lo 00000100:00000040:1.0:1620144981.600614:0:13184:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-0@lo 00000100:00000040:1.0:1620144981.600618:0:13184:0:(events.c:161:reply_in_callback()) @@@ reply in flags=0 mlen=8440 offset=192 replen=8472 req@ffff9d39008c9680 x1698758267446528/t0(0) o502->es035-MDT0000-mdc -ffff9d5762fd4000@0@lo:12/10 lens 272/8472 e 0 to 0 dl 1620145027 ref 2 fl Rpc:R/0/ffffffff rc 0/-1 00000100:00000040:1.0:1620144981.600637:0:13184:0:(connection.c:119:ptlrpc_connection_put()) PUT conn=ffff9d3469391720 refcount 19 to 0@lo 00000100:00000040:1.0:1620144981.600639:0:13184:0:(lustre_net.h:2504:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff9d38e36bba80 x1698758267446528/t0(0) o502->1df03e1e-5881-ba22-d5ff-18dab6a0622a@0@lo:341/0 lens 272/8440 e 0 to 0 dl 1620145026 ref 1 fl Interpret:/0/0 rc -5/-5 00000100:00100000:1.0:1620144981.600642:0:13184:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x1698758267446528:12345-0@lo:502 Request processed in 46us (79us total) trans 0 rc -5/-5

            We have recently upgraded Fir to 2.12.5 RC1 and we still see fail to process llog errors again. -2 actually led to failure of robinhood and changelogs stopped to be processed:

            2020/06/06 13:53:17 [86917/2] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory
            

            The Lustre client (also 2.12.5 RC1) had just logged this:

            Jun 06 13:53:17 fir-rbh01 kernel: LustreError: 94456:0:(mdc_changelog.c:295:chlg_load()) fir-MDT0002-mdc-ffffa0567bfa8000: fail to process llog: rc = -2
            

            I don't see anything obvious on the MDS side, but I'm attaching the output of lctl dk just in case ( fir-md1-s3_fir-MDT0002_dk_20200607.log.gz ). It looks like there are basic log entries since the start of the server.

             

            sthiell Stephane Thiell added a comment - We have recently upgraded Fir to 2.12.5 RC1 and we still see fail to process llog errors again. -2 actually led to failure of robinhood and changelogs stopped to be processed: 2020/06/06 13:53:17 [86917/2] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory The Lustre client (also 2.12.5 RC1) had just logged this: Jun 06 13:53:17 fir-rbh01 kernel: LustreError: 94456:0:(mdc_changelog.c:295:chlg_load()) fir-MDT0002-mdc-ffffa0567bfa8000: fail to process llog: rc = -2 I don't see anything obvious on the MDS side, but I'm attaching the output of lctl dk just in case ( fir-md1-s3_fir-MDT0002_dk_20200607.log.gz  ). It looks like there are basic log entries since the start of the server.  

            People

              bzzz Alex Zhuravlev
              sthiell Stephane Thiell
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: