Lustre / LU-11205

Failure to clear the changelog for user 1 on MDT

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0, Lustre 2.10.4, Lustre 2.10.6
    • Environment: CentOS 7.4 (3.10.0-693.2.2.el7_lustre.pl1.x86_64)

    Description

      Hello,

      We're seeing the following messages on Oak's MDT in 2.10.4:

      Aug 03 09:21:39 oak-md1-s2 kernel: Lustre: 11137:0:(mdd_device.c:1577:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      Aug 03 09:31:38 oak-md1-s2 kernel: Lustre: 11271:0:(mdd_device.c:1577:mdd_changelog_clear()) oak-MDD0000: Failure to clear the changelog for user 1: -22
      

      Robinhood (also running 2.10.4) shows this:

      2018/08/03 10:00:47 [13766/22] ChangeLog | ERROR: llapi_changelog_clear("oak-MDT0000", "cl1", 13975842301) returned -22
      2018/08/03 10:00:47 [13766/22] EntryProc | Error -22 performing callback at stage STAGE_CHGLOG_CLR.
      2018/08/03 10:00:47 [13766/16] llapi | cannot purge records for 'cl1'
      2018/08/03 10:00:47 [13766/16] ChangeLog | ERROR: llapi_changelog_clear("oak-MDT0000", "cl1", 13975842303) returned -22
      2018/08/03 10:00:47 [13766/16] EntryProc | Error -22 performing callback at stage STAGE_CHGLOG_CLR.
      2018/08/03 10:00:47 [13766/4] llapi | cannot purge records for 'cl1'
      2018/08/03 10:00:47 [13766/4] ChangeLog | ERROR: llapi_changelog_clear("oak-MDT0000", "cl1", 13975842304) returned -22
      2018/08/03 10:00:47 [13766/4] EntryProc | Error -22 performing callback at stage STAGE_CHGLOG_CLR.
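      As an aside, the -22 in both the kernel and the Robinhood messages is a negated errno value; a quick check (illustrative, not Lustre-specific):

```python
import errno

# mdd_changelog_clear() and llapi_changelog_clear() return negative errno
# values; -22 corresponds to EINVAL ("Invalid argument").
rc = -22
name = errno.errorcode[-rc]
print(name)  # EINVAL
```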
      

      Oak's MDT usage is as follows:

      [root@oak-md1-s2 ~]# df -h -t lustre
      Filesystem                  Size  Used Avail Use% Mounted on
      /dev/mapper/md1-rbod1-mdt0  1.3T  131G 1022G  12% /mnt/oak/mdt/0
      [root@oak-md1-s2 ~]# df -i -t lustre
      Filesystem                    Inodes     IUsed     IFree IUse% Mounted on
      /dev/mapper/md1-rbod1-mdt0 873332736 266515673 606817063   31% /mnt/oak/mdt/0
      

      I'm concerned that the MDT might fill up with changelogs. Could you please assist in troubleshooting this issue?
      Thanks!
      Stephane

      Attachments

        1. changelog-reader.tgz
          0.8 kB
        2. dk_ornl_20190328_1216.gz
          4.61 MB
        3. dk_ornl_20190328.gz
          4.59 MB
        4. dk.1547747365.gz
          8.56 MB
        5. dk.1547747668.gz
          8.58 MB
        6. dk.1547828521.gz
          7.96 MB
        7. f2_llog_reader_20190328.gz
          1.03 MB
        8. lu-11205-ssec.log.gz
          647 kB
        9. ornl_0x1_0xd8_0x0.gz
          1.19 MB

          Activity

            [LU-11205] Failure to clear the changelog for user 1 on MDT

            sthiell Stephane Thiell added a comment -

            We just upgraded our Lustre 2.12 servers and the Robinhood client to 2.12.3 RC1, and we are still seeing these log messages:

            fir-md1-s3: Oct 16 21:40:09 fir-md1-s3 kernel: Lustre: 18584:0:(mdd_device.c:1807:mdd_changelog_clear()) fir-MDD0002: Failure to clear the changelog for user 1: -22
            

            Not sure about the real impact, though.
            pjones Peter Jones added a comment -

            The consensus seems to be that this can be closed as a duplicate of LU-11426

            qian_wc Qian Yingjin added a comment -

            From my reading of the current changelog mechanism in the source code, changelog records will not be deleted while more than one changelog user is still consuming them.
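            The multi-user behavior can be illustrated with a toy model (plain Python, not Lustre code): each registered user tracks the highest record index it has cleared, and the MDT can only purge records below the minimum cleared index across all users.

```python
# Toy model of MDT changelog purging (illustrative only, not Lustre code):
# each registered changelog user (e.g. cl1 for Robinhood) records the
# highest index it has cleared; the MDT may purge a record only once
# every user has cleared it, i.e. up to min() over all users.

class Changelog:
    def __init__(self, users):
        self.records = []                 # pending record indexes
        self.cleared = {u: 0 for u in users}

    def append(self, idx):
        self.records.append(idx)

    def clear(self, user, endrec):
        self.cleared[user] = endrec
        low_water = min(self.cleared.values())
        self.records = [r for r in self.records if r > low_water]

log = Changelog(["cl1", "cl2"])
for i in range(1, 6):
    log.append(i)

log.clear("cl1", 5)       # cl1 is done, but cl2 is still at 0
print(len(log.records))   # 5 -- nothing purged yet
log.clear("cl2", 3)       # now min(cleared) == 3
print(len(log.records))   # 2 -- records 4 and 5 remain
```

            With a single registered user the low-water mark is just that user's own cleared index, which is why deregistering stale readers lets the records drain.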

            olaf Olaf Weber (Inactive) added a comment -

            Agreed that the proposed fix for LU-11426 should also resolve this.

            Related to the "orphan records" that James mentioned: in my testing of this patch it did not (fully) resolve that issue; the only way I could get them to go away was to rebuild the changelog by deregistering all readers.

            aboyko Alexander Boyko added a comment -

            LU-11426 would fix the ordering, so error -22 would not happen.

            aboyko Alexander Boyko added a comment -

            @Olaf Weber, the patch allows clearing of unordered records. Processing them is a different matter; I do think the consuming software should take care of it: read a number of records and process them in order (in most cases it can process unordered records too). We should not see unordered operations for the same file or directory, because those are synchronized by a parent lock.
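            A sketch of the ordering problem (again a toy model, assuming nothing about the real on-disk llog layout): a purge that cancels from the front and stops at the first record newer than endrec leaves out-of-order older records behind as "orphans", whereas scanning the whole list clears them.

```python
# Toy model (not the real llog code): records arrived out of order,
# e.g. index 7 was committed before index 6 landed on disk.
records = [1, 2, 3, 7, 6, 8]

def clear_stop_at_first(recs, endrec):
    """Naive purge: cancel from the front, stop at the first record
    with index > endrec -- the out-of-order 6 is left behind."""
    i = 0
    while i < len(recs) and recs[i] <= endrec:
        i += 1
    return recs[i:]

def clear_scan_all(recs, endrec):
    """Purge that tolerates unordered records: scan everything."""
    return [r for r in recs if r > endrec]

print(clear_stop_at_first(records, 6))  # [7, 6, 8] -- 6 orphaned
print(clear_scan_all(records, 6))       # [7, 8]
```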

            olaf Olaf Weber (Inactive) added a comment -

            It is hard to say whether it really reduces the lost-records issue, as too much changed in the cluster I tested on for a truly apples-to-apples comparison.

            simmonsja James A Simmons added a comment -

            Was it better? In our testing it did make the problem go away, though I wonder if the problem is just less common now.

            olaf Olaf Weber (Inactive) added a comment -

            When I tested the code posted in the review earlier today, I still saw out-of-order and lost changelog records. From this and other tests it appears that lost changelog records are out-of-order records that are not reliably (and often reliably not) returned to userspace.

            simmonsja James A Simmons added a comment -

            Running lfs changelog clear 0 restores changelog_size back to 4711216. There is some strange issue there, but that is independent of this patch. The testing has gone very well! Thanks, Boyko.

            simmonsja James A Simmons added a comment -

            We did some testing and saw these results. Before we started Robinhood to purge things we had:

            lctl get_param *.*.changelog_size
            mdd.f2-tds-MDT0000.changelog_size=4711216

            Afterwards we saw:

            lctl get_param *.*.changelog_size
            mdd.f2-tds-MDT0000.changelog_size=928368

            Does this patch also help resolve some of the orphan records that we couldn't clean up before?
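            For scale, the purge in the numbers above reclaimed roughly 80% of the changelog space:

```python
before = 4711216   # changelog_size before the Robinhood purge
after = 928368     # changelog_size afterwards
reclaimed = before - after
print(reclaimed)                    # 3782848
print(f"{reclaimed / before:.1%}")  # 80.3%
```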

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 3
              Watchers: 26
