[LU-12577] chlg_load failed to process llog -2 or -5 on client Created: 23/Jul/19  Updated: 04/Jan/23  Resolved: 22/Jun/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.15.0, Lustre 2.12.10

Type: Bug Priority: Major
Reporter: Stephane Thiell Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: LTS12
Environment:

CentOS 7.6, 2.12.0 servers and clients + patches


Attachments: Text File LU-12577-reproducer.patch     File chlg_load.dk.gz     File fir-md1-s3_fir-MDT0002_dk_20200607.log.gz    
Issue Links:
Related
is related to LU-11205 Failure to clear the changelog for us... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

For the sake of completeness, in addition to LU-11205 Failure to clear the changelog for user 1 on MDT, and seen exclusively on 2.12 on the robinhood node (not with robinhood on 2.10), we hit these errors:

[root@fir-rbh01 ~]# journalctl -n 10000 -k  | grep llog
Jun 23 04:49:41 fir-rbh01 kernel: LustreError: 101617:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jun 24 01:16:36 fir-rbh01 kernel: LustreError: 46868:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
Jun 25 19:52:01 fir-rbh01 kernel: LustreError: 45836:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5
Jun 26 00:32:08 fir-rbh01 kernel: LustreError: 61129:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jun 27 00:24:51 fir-rbh01 kernel: LustreError: 101382:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jun 27 04:41:23 fir-rbh01 kernel: LustreError: 10896:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5
Jun 27 06:42:15 fir-rbh01 kernel: LustreError: 22811:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -5
Jun 29 14:07:41 fir-rbh01 kernel: LustreError: 94376:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 05 06:42:48 fir-rbh01 kernel: LustreError: 97545:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 05 14:25:48 fir-rbh01 kernel: Lustre: 38334:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:25:50 fir-rbh01 kernel: Lustre: 38337:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:25:52 fir-rbh01 kernel: Lustre: 38341:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 1 previous similar message
Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 2 previous similar messages
Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 4 previous similar messages
Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 9 previous similar messages
Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 18 previous similar messages
Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 37 previous similar messages
Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero
Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 76 previous similar messages
Jul 05 17:04:20 fir-rbh01 kernel: LustreError: 61245:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 05 23:47:03 fir-rbh01 kernel: LustreError: 104167:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 06 05:32:40 fir-rbh01 kernel: LustreError: 123618:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5
Jul 09 07:57:24 fir-rbh01 kernel: LustreError: 98342:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 09 16:09:56 fir-rbh01 kernel: LustreError: 16125:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 11 19:35:24 fir-rbh01 kernel: LustreError: 7262:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 14 12:55:20 fir-rbh01 kernel: LustreError: 82763:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 16 07:06:50 fir-rbh01 kernel: LustreError: 108216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 16 10:50:17 fir-rbh01 kernel: LustreError: 77729:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 21 01:06:34 fir-rbh01 kernel: LustreError: 122188:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 21 14:01:34 fir-rbh01 kernel: LustreError: 90046:0:(llog_cat.c:375:llog_cat_id2handle()) fir-MDT0003-mdc-ffff905fbf58a800: error opening log id [0xc4ff:0x1:0x0]:0: rc = -2
Jul 21 15:15:39 fir-rbh01 kernel: LustreError: 60198:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 21 17:00:08 fir-rbh01 kernel: LustreError: 42666:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2
Jul 22 14:44:51 fir-rbh01 kernel: LustreError: 88216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5

I don't see anything on the MDS beside the usual " Failure to clear the changelog for user 1: -22"



 Comments   
Comment by Peter Jones [ 23/Jul/19 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Alexander Zarochentsev [ 08/Nov/19 ]

Stephane , can you try a fix "LU-11426 llog: changelog records reordering" for the server, and check whether issue still seen?

Comment by Stephane Thiell [ 08/Nov/19 ]

Hi Alexander,

Since we upgraded to 2.12.3 on October 16, 2019, which includes the patch that you mention, I can only see one occurence on our robinhood server that could be related to this ticket:

Nov 06 18:50:08 fir-rbh01 kernel: LustreError: 107119:0:(mdc_changelog.c:249:chlg_load()) fir-MDT0002-mdc-ffff8b66b2630000: fail to process llog: rc = -5

which seems to have stopped the changelog reader of MDT0002. Relaunching Robinhood resumed operation.

So my feeling is that it's MUCH better with the patch.

Comment by Stephane Thiell [ 07/Jun/20 ]

We have recently upgraded Fir to 2.12.5 RC1 and we still see fail to process llog errors again. -2 actually led to failure of robinhood and changelogs stopped to be processed:

2020/06/06 13:53:17 [86917/2] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory

The Lustre client (also 2.12.5 RC1) had just logged this:

Jun 06 13:53:17 fir-rbh01 kernel: LustreError: 94456:0:(mdc_changelog.c:295:chlg_load()) fir-MDT0002-mdc-ffffa0567bfa8000: fail to process llog: rc = -2

I don't see anything obvious on the MDS side, but I'm attaching the output of lctl dk just in case ( fir-md1-s3_fir-MDT0002_dk_20200607.log.gz ). It looks like there are basic log entries since the start of the server.

 

Comment by John Hammond [ 04/May/21 ]

I reproduced this and captured logs. At the start of llog_osd_next_block() the current offset is equal to the size and so the while loop never executes and -EIO is returned.

00000100:00100000:1.0:1620144981.600596:0:13184:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x169875
8267446528:12345-0@lo:502
00000040:00000040:1.0:1620144981.600602:0:13184:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff9d56bccb80c0 : new refcount 21
00000040:00001000:1.0:1620144981.600606:0:13184:0:(llog_osd.c:915:llog_osd_next_block()) looking for log index 21139 (cur idx 21138 off2883584), size 2883584
00000040:00000040:1.0:1620144981.600608:0:13184:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff9d56bccb80c0 : new refcount 20
00010000:00000040:1.0:1620144981.600609:0:13184:0:(ldlm_lib.c:3043:target_committed_to_req()) last_committed 30122610389, transno 0, xid 1698758267446528
00000100:00001000:1.0:1620144981.600611:0:13184:0:(import.c:1865:at_measured()) add 1 to ffff9d57bd413d78 time=61 v=32 (1 1 32 1)
00000100:00000040:1.0:1620144981.600613:0:13184:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff9d3469391720 refcount 20 to 0@lo
00000100:00000040:1.0:1620144981.600614:0:13184:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-0@lo
00000100:00000040:1.0:1620144981.600618:0:13184:0:(events.c:161:reply_in_callback()) @@@ reply in flags=0 mlen=8440 offset=192 replen=8472  req@ffff9d39008c9680 x1698758267446528/t0(0) o502->es035-MDT0000-mdc
-ffff9d5762fd4000@0@lo:12/10 lens 272/8472 e 0 to 0 dl 1620145027 ref 2 fl Rpc:R/0/ffffffff rc 0/-1
00000100:00000040:1.0:1620144981.600637:0:13184:0:(connection.c:119:ptlrpc_connection_put()) PUT conn=ffff9d3469391720 refcount 19 to 0@lo
00000100:00000040:1.0:1620144981.600639:0:13184:0:(lustre_net.h:2504:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff9d38e36bba80 x1698758267446528/t0(0) o502->1df03e1e-5881-ba22-d5ff-18dab6a0622a@0@lo:341/0 lens 272/8440 e 0 to 0 dl 1620145026 ref 1 fl Interpret:/0/0 rc -5/-5
00000100:00100000:1.0:1620144981.600642:0:13184:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x1698758267446528:12345-0@lo:502 Request processed in 46us (79us total) trans 0 rc -5/-5
Comment by Gerrit Updater [ 09/May/21 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43589
Subject: LU-12577 llog: protect partial updates from readers
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3da1702e3b73aa425b1c5c2964d41cc01efd659c

Comment by Gerrit Updater [ 21/Jun/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43589/
Subject: LU-12577 llog: protect partial updates from readers
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: ae1404feefc1572fdafed938a3fc18131d675678

Comment by Peter Jones [ 22/Jun/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 14/Dec/21 ]

"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45847
Subject: LU-12577 llog: protect partial updates from readers
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 97ce7592978090519683a0465a15e1f4f899a6c3

Comment by Gerrit Updater [ 20/Sep/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45847/
Subject: LU-12577 llog: protect partial updates from readers
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3cdb4724405b066e370ef1f104198643051c82d8

Generated at Sat Feb 10 02:53:49 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.