[LU-12577] chlg_load failed to process llog -2 or -5 on client Created: 23/Jul/19 Updated: 04/Jan/23 Resolved: 22/Jun/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.15.0, Lustre 2.12.10 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stephane Thiell | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LTS12 | ||
| Environment: |
CentOS 7.6, 2.12.0 servers and clients + patches |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
For the sake of completeness, in addition to [root@fir-rbh01 ~]# journalctl -n 10000 -k | grep llog Jun 23 04:49:41 fir-rbh01 kernel: LustreError: 101617:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jun 24 01:16:36 fir-rbh01 kernel: LustreError: 46868:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5 Jun 25 19:52:01 fir-rbh01 kernel: LustreError: 45836:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5 Jun 26 00:32:08 fir-rbh01 kernel: LustreError: 61129:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jun 27 00:24:51 fir-rbh01 kernel: LustreError: 101382:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jun 27 04:41:23 fir-rbh01 kernel: LustreError: 10896:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5 Jun 27 06:42:15 fir-rbh01 kernel: LustreError: 22811:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -5 Jun 29 14:07:41 fir-rbh01 kernel: LustreError: 94376:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 05 06:42:48 fir-rbh01 kernel: LustreError: 97545:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 05 14:25:48 fir-rbh01 kernel: Lustre: 38334:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:25:50 fir-rbh01 kernel: Lustre: 38337:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:25:52 fir-rbh01 kernel: Lustre: 38341:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:25:55 fir-rbh01 kernel: Lustre: 38347:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 1 previous similar message Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:26:01 fir-rbh01 kernel: Lustre: 38358:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 2 previous similar messages Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:26:09 fir-rbh01 kernel: Lustre: 38374:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 4 previous similar messages Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:26:27 fir-rbh01 kernel: Lustre: 38407:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 9 previous similar messages Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:26:59 fir-rbh01 kernel: Lustre: 38470:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 18 previous similar messages Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:28:04 fir-rbh01 kernel: Lustre: 38597:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 37 previous similar messages Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) fir-MDT0002-mdc-ffff905fbf58a800: catlog [0x5:0xa:0x0] crosses index zero Jul 05 14:30:13 fir-rbh01 kernel: Lustre: 38868:0:(llog_cat.c:874:llog_cat_process_or_fork()) Skipped 76 previous similar messages Jul 05 17:04:20 fir-rbh01 kernel: LustreError: 61245:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 05 23:47:03 fir-rbh01 kernel: LustreError: 104167:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 06 05:32:40 fir-rbh01 kernel: LustreError: 123618:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -5 Jul 09 07:57:24 fir-rbh01 kernel: LustreError: 98342:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 09 16:09:56 fir-rbh01 kernel: LustreError: 16125:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 11 19:35:24 fir-rbh01 kernel: LustreError: 7262:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 14 12:55:20 fir-rbh01 kernel: LustreError: 82763:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 16 07:06:50 fir-rbh01 kernel: LustreError: 108216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 16 10:50:17 fir-rbh01 kernel: LustreError: 77729:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0000-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 21 01:06:34 fir-rbh01 kernel: LustreError: 122188:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0001-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 21 14:01:34 fir-rbh01 kernel: LustreError: 90046:0:(llog_cat.c:375:llog_cat_id2handle()) fir-MDT0003-mdc-ffff905fbf58a800: error opening log id [0xc4ff:0x1:0x0]:0: rc = -2 Jul 21 15:15:39 fir-rbh01 kernel: LustreError: 60198:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 21 17:00:08 fir-rbh01 kernel: LustreError: 42666:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0003-mdc-ffff905fbf58a800: fail to process llog: rc = -2 Jul 22 14:44:51 fir-rbh01 kernel: LustreError: 88216:0:(mdc_changelog.c:236:chlg_load()) fir-MDT0002-mdc-ffff905fbf58a800: fail to process llog: rc = -5 I don't see anything on the MDS beside the usual " Failure to clear the changelog for user 1: -22" |
| Comments |
| Comment by Peter Jones [ 23/Jul/19 ] |
|
Mike Could you please advise? Thanks Peter |
| Comment by Alexander Zarochentsev [ 08/Nov/19 ] |
|
Stephane , can you try a fix " |
| Comment by Stephane Thiell [ 08/Nov/19 ] |
|
Hi Alexander, Since we upgraded to 2.12.3 on October 16, 2019, which includes the patch that you mention, I can only see one occurence on our robinhood server that could be related to this ticket: Nov 06 18:50:08 fir-rbh01 kernel: LustreError: 107119:0:(mdc_changelog.c:249:chlg_load()) fir-MDT0002-mdc-ffff8b66b2630000: fail to process llog: rc = -5 which seems to have stopped the changelog reader of MDT0002. Relaunching Robinhood resumed operation. So my feeling is that it's MUCH better with the patch. |
| Comment by Stephane Thiell [ 07/Jun/20 ] |
|
We have recently upgraded Fir to 2.12.5 RC1 and we still see fail to process llog errors again. -2 actually led to failure of robinhood and changelogs stopped to be processed: 2020/06/06 13:53:17 [86917/2] ChangeLog | Error in llapi_changelog_recv(): -2: No such file or directory The Lustre client (also 2.12.5 RC1) had just logged this: Jun 06 13:53:17 fir-rbh01 kernel: LustreError: 94456:0:(mdc_changelog.c:295:chlg_load()) fir-MDT0002-mdc-ffffa0567bfa8000: fail to process llog: rc = -2 I don't see anything obvious on the MDS side, but I'm attaching the output of lctl dk just in case ( fir-md1-s3_fir-MDT0002_dk_20200607.log.gz
|
| Comment by John Hammond [ 04/May/21 ] |
|
I reproduced this and captured logs. At the start of llog_osd_next_block() the current offset is equal to the size and so the while loop never executes and -EIO is returned. 00000100:00100000:1.0:1620144981.600596:0:13184:0:(service.c:2140:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x169875 8267446528:12345-0@lo:502 00000040:00000040:1.0:1620144981.600602:0:13184:0:(lustre_log.h:367:llog_ctxt_get()) GETting ctxt ffff9d56bccb80c0 : new refcount 21 00000040:00001000:1.0:1620144981.600606:0:13184:0:(llog_osd.c:915:llog_osd_next_block()) looking for log index 21139 (cur idx 21138 off2883584), size 2883584 00000040:00000040:1.0:1620144981.600608:0:13184:0:(lustre_log.h:377:llog_ctxt_put()) PUTting ctxt ffff9d56bccb80c0 : new refcount 20 00010000:00000040:1.0:1620144981.600609:0:13184:0:(ldlm_lib.c:3043:target_committed_to_req()) last_committed 30122610389, transno 0, xid 1698758267446528 00000100:00001000:1.0:1620144981.600611:0:13184:0:(import.c:1865:at_measured()) add 1 to ffff9d57bd413d78 time=61 v=32 (1 1 32 1) 00000100:00000040:1.0:1620144981.600613:0:13184:0:(connection.c:132:ptlrpc_connection_addref()) conn=ffff9d3469391720 refcount 20 to 0@lo 00000100:00000040:1.0:1620144981.600614:0:13184:0:(niobuf.c:57:ptl_send_buf()) peer_id 12345-0@lo 00000100:00000040:1.0:1620144981.600618:0:13184:0:(events.c:161:reply_in_callback()) @@@ reply in flags=0 mlen=8440 offset=192 replen=8472 req@ffff9d39008c9680 x1698758267446528/t0(0) o502->es035-MDT0000-mdc -ffff9d5762fd4000@0@lo:12/10 lens 272/8472 e 0 to 0 dl 1620145027 ref 2 fl Rpc:R/0/ffffffff rc 0/-1 00000100:00000040:1.0:1620144981.600637:0:13184:0:(connection.c:119:ptlrpc_connection_put()) PUT conn=ffff9d3469391720 refcount 19 to 0@lo 00000100:00000040:1.0:1620144981.600639:0:13184:0:(lustre_net.h:2504:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff9d38e36bba80 x1698758267446528/t0(0) o502->1df03e1e-5881-ba22-d5ff-18dab6a0622a@0@lo:341/0 lens 272/8440 e 0 to 0 dl 1620145026 ref 1 fl Interpret:/0/0 rc -5/-5 00000100:00100000:1.0:1620144981.600642:0:13184:0:(service.c:2190:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_002:1df03e1e-5881-ba22-d5ff-18dab6a0622a+9458:1089:x1698758267446528:12345-0@lo:502 Request processed in 46us (79us total) trans 0 rc -5/-5 |
| Comment by Gerrit Updater [ 09/May/21 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/43589 |
| Comment by Gerrit Updater [ 21/Jun/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/43589/ |
| Comment by Peter Jones [ 22/Jun/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 14/Dec/21 ] |
|
"Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/45847 |
| Comment by Gerrit Updater [ 20/Sep/22 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45847/ |