[LU-12098] changelog_deregister appears not to reliably clear all changelog entries Created: 22/Mar/19  Updated: 15/Jul/19  Resolved: 21/May/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.6
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Olaf Faaland Assignee: Sebastien Buisson
Resolution: Fixed Votes: 0
Labels: ORNL, llnl
Environment:

Lustre 2.10.6_2.chaos
zfs 0.7.11-5llnl
distro toss-release-3.4-4 (RHEL 7.6 based)
kernel 3.10.0-957.5.1.3chaos.ch6.x86_64
2 MDTs, we use DNE1
See github.com/llnl/ for zfs and lustre code corresponding to these tags


Issue Links:
Related
is related to LU-11205 Failure to clear the changelog for us... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After lctl changelog_deregister, changelogs were not fully cleared:

[root@porteri:~]# pdsh -g mds 'lctl get_param "*.*.changelog_users"' | dshbak -c
----------------
eporter81
----------------
mdd.lustre3-MDT0000.changelog_users=current index: 3856956771
ID    index
----------------
eporter82
----------------
mdd.lustre3-MDT0001.changelog_users=current index: 5663390742
ID    index
[root@porteri:~]# pdsh -g mds 'lctl get_param "*.*.changelog_size"' | dshbak -c
----------------
eporter81
----------------
mdd.lustre3-MDT0000.changelog_size=15384080
----------------
eporter82
----------------
mdd.lustre3-MDT0001.changelog_size=25036880872


 Comments   
Comment by Olaf Faaland [ 22/Mar/19 ]

For our tracking, our internal ticket is TOSS-4471

Comment by Olaf Faaland [ 22/Mar/19 ]

I have not yet been able to test to see if it can be produced in master or 2.12. I will attempt that next week.

Comment by Olaf Faaland [ 22/Mar/19 ]

In the case of one of MDT0001, the 'lctl deregister_changelog' command was issued and the MDS crashed. However, MDT0000 never went down.

Comment by Olaf Faaland [ 22/Mar/19 ]

We can work around this in the future, on Lustre 2.10, by issuing an "lfs changelog_clear" first. For these MDTs, should I re-register a changelog user and then issue changelog_clear and changelog_deregister?

Comment by Peter Jones [ 22/Mar/19 ]

Sébastien

Could you please advise here?

Thanks

Peter

Comment by Sebastien Buisson [ 25/Mar/19 ]

Hi,

My understanding is that changelog_deregister is supposed to issue a clear first, so there should be no difference if clear is not called before. And I am not sure registering a new changelog user to issue a clear and then deregister would help, because the newly created changelog user will have its index pointing to the latest existing index on the MDT, hence preventing to clear older indexes.

The only way I can see to clear those remaining changelog records is to stop then restart the MDTs. Indeed, upon stop, the changelog record files are completely cleaned up.

In your case, there might be a bug which prevents former llog files from being deleted while MDT is online. Changelogs records are split into multiple llog files on MDTs, each one containing up to around 16k records. Once an llog file becomes full (ie contains ~16k records), an additional llog file is created to store the next Changelog records. When a clear is issued for index N, all llog files for which maximum record index is lower than N are removed. Only remain the llog files that have records with index greater than or equal to N. If the llog file containing record N is the current, latest one, then it cannot be removed, because it is being used. No matter how many records are in there, the space consumed on the MDT by this llog file will be freed later, when a new clear is issued.

In order to investigate, could you please access the impacted MDTs as ZFS, and run the following command to help finding the actual location of the changelog entries on the MDTs:

# llog_reader /zfs_mount/changelog_catalog

It should output something like this:

Header size : 8192
Time : Mon Mar 25 15:36:53 2019
Number of records: 4
Target uuid :
-----------------------
#01 (064)id=[0x6:0x1:0x0]:0 path=oi.1/0x1:0x6:0x0
#02 (064)id=[0x7:0x1:0x0]:0 path=oi.1/0x1:0x7:0x0
#03 (064)id=[0x8:0x1:0x0]:0 path=oi.1/0x1:0x8:0x0
#04 (064)id=[0x9:0x1:0x0]:0 path=oi.1/0x1:0x9:0x0

What is important is the number of records (4 in this example), and the path to every llog file holding changelog records. In this example the files are oi.1/0x1:0x{6,7,8,9}:0x0.

Then, for each of these files, it would be interesting to see how many records they contain:

# for i in 6 7 8 9; do llog_reader /zfs_mount/oi.1/0x1:0x${i}:0x0 | grep records ; done
Number of records: 16577
Number of records: 16373
Number of records: 16320
Number of records: 10731

Thanks.

Comment by Olaf Faaland [ 27/Mar/19 ]

My understanding is that changelog_deregister is supposed to issue a clear first, so there should be no difference if clear is not called before.

That is my understanding too, but I believe the changelog_deregister first disables the changelog user. So if the MDT goes down while the changelog_deregister is occurring, how could one resume the clearing of the changelog records? I believe that they cannot be cleared at that point, at least by the user. But I am not certain of that.

On the other hand, I believe the changelog_clear can just be invoked again after the MDT comes back up.

I haven't had a machine to test either theory on, nor time to just walk the code to see, yet.

Indeed, upon stop, the changelog record files are completely cleaned up.

Do you recall where in the code this occurs? Is this for empty changelog record files (Number of records: 0) or for orphaned records like the ones we have?

Then, for each of these files, it would be interesting to see how many records they contain:

I verified all the changelog_catalog entries point to existing files, and all those files contain changelog records. I'll post more details later today.

Comment by Olaf Faaland [ 28/Mar/19 ]

Here is a summary for each of the two MDTs. I counted the number of records of type "changelog" found within the files the changelog_catalog points to. MIN is the minimum number of records seen after looking at all the files; MAX is same except maximum number, and AVG and TOT are average and total (sum).

Note that porter81 is hosting MDT 0 and porter82 is hosting MDT 1

----------------
eporter81
----------------
Changelog_catalog header:
Header size : 8192
Number of records: 2

Missing changelog files:  0
Existing changelog files:  2
checking oi.1/0x1:0x12b4a:0x0
checking oi.1/0x1:0x12b4b:0x0
all files checked

MAX 59712 MIN 17606 TOT 77318 AVG 38659
----------------
eporter82
----------------
Changelog_catalog header:
Header size : 8192
Number of records: 2911

Missing changelog files:  0
Existing changelog files:  2911
checking oi.1/0x1:0x2e14f:0x0
checking oi.1/0x1:0x2e150:0x0
<... 2909 rows removed ...>
all files checked

MAX 64101 MIN 11590 TOT 185533101 AVG 63735
Comment by Olaf Faaland [ 28/Mar/19 ]

Hi Sebastien,
I tried stopping MDT 1 and then starting it again. changelog_size has not changed.

Comment by Sebastien Buisson [ 28/Mar/19 ]

Hi,

changelog_deregister calls mdd_changelog_user_purge(). This one will invoke mdd_changelog_user_purge_cb(), which is responsible for (a) cancelling changelog records with indexes lower than that of the changelog user if not used by other user and (b) unregistering this changelog user. After that, mdd_changelog_user_purge() calls mdd_changelog_llog_cancel() to (c) purge changelog files if necessary.

So you are right, if the MDS crashes in-between (b) and (c), you can end up with cancelled changelogs that are not freed.

By introducing a new fail_loc for test purpose, I was able to make my MDS crash between step (b) and step (c). After reboot, the changelog user was not registered any more, but all its changelog entries were still on the MDT. So I registered a new changelog user, then immediately deregistered it, and it successfully cleared the pending changelog entries.

It confirm that this technique is valid to get rid of remaining changelog entries.
It does not work to call changelog_clear after the MDS comes back up, because the changelog user is not known anymore by the MDS:

# lfs changelog_clear lustre-MDT0001 cl1 0
lfs changelog_clear: cannot purge records for 'cl1': No such file or directory (2)
changelog_clear: no changelog user: cl1
Comment by Sebastien Buisson [ 29/Mar/19 ]

>> Indeed, upon stop, the changelog record files are completely cleaned up.
> Do you recall where in the code this occurs? Is this for empty changelog record files (Number of records: 0) or for orphaned records like the ones we have?

The code path is the following:

mdd_device_shutdown
    mdd_changelog_fini
        llog_cat_close
           llog_destroy

So when an MDT is unmounted, llog_destroy is called for Changelog files that are referenced in a catalog file, and for which the number of records they contain is set to 0.
It is not a cleanup of orphan objects.

Comment by Sebastien Buisson [ 29/Mar/19 ]

The information you gave with 'lctl get_param "..changelog_users"' and 'lctl get_param "..changelog_size"' on the one hand, and the information contained in the catalog and changelogs files (number of records, etc) on the other hand, is consistent with what I have when using my reproducer that crashes between steps (b) and (c).

To sum up the situation, no more changelog users are registered, but the catalog file points to valid changelog files that are full of entries ('Number of records:' is not 0).

On my test system I was able to get rid of these remaining changelogs not used anymore by any user by registering a new changelogs user and then deregistering it.
Before registering:

# lctl get_param mdd.seb-MDT000*.changelog_size
mdd.seb-MDT0000.changelog_size=0
mdd.seb-MDT0001.changelog_size=13186960

After de-registering:

# lctl get_param mdd.seb-MDT000*.changelog_size
mdd.seb-MDT0000.changelog_size=0
mdd.seb-MDT0001.changelog_size=25384
Comment by Olaf Faaland [ 31/Mar/19 ]

Sebastien,
Thank you. I'll do that next week, probably on Wednesday.

Comment by Olaf Faaland [ 10/Apr/19 ]

Hi Sebastien,
On both the affected file systems, the register/de-register procedure worked. Thank you.

Are you planning to re-order the de-register code to prevent this problem?

Comment by Sebastien Buisson [ 11/Apr/19 ]

Hi Olaf,

This is very good news!

Regarding the way to solve this issue, the current implementation might be imposed by concurrency and/or deadlock concerns. So I was thinking about modifying 'changelog_deregister' so that it calls internally the 'changelog_clear' routines before actually proceeding to the consumer deregister. This way, if a crash occurs during the process, we could not end up with changelog entries not cleared but changelog consumer deregistered.

I will try to push a patch for this.

Thanks,
Sebastien.

Comment by Gerrit Updater [ 16/Apr/19 ]

Sebastien Buisson (sbuisson@ddn.com) uploaded a new patch: https://review.whamcloud.com/34688
Subject: LU-12098 mdd: explicitly clear changelogs on deregister
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 21bd3ab61062d5c06a3a231ff7dbadb44dfb337a

Comment by Gerrit Updater [ 21/May/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34688/
Subject: LU-12098 mdd: explicitly clear changelogs on deregister
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 83ffa859bc629e246de9fcdfc82838b14c6d0ea3

Comment by Peter Jones [ 21/May/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 21/May/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34921
Subject: LU-12098 mdd: explicitly clear changelogs on deregister
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 7904fb7e243042bce0fad2f2ac9f9fef33236e58

Comment by Gerrit Updater [ 08/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34921/
Subject: LU-12098 mdd: explicitly clear changelogs on deregister
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 5bf8c0eaf9101f98b49029a8651b73bce436db17

Generated at Sat Feb 10 02:49:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.