[LU-16232] emergency llog cleanup server scripts Created: 11/Oct/22  Updated: 31/Aug/23  Resolved: 30/May/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Improvement Priority: Minor
Reporter: Mikhail Pershin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Attachments: HTML File collect_updatelog    
Issue Links:
Related

 Description   

There can be situations where update llog or changelog files are corrupted, and today we simply remove or truncate them. Scripts are needed to remove corrupted llogs properly, together with all of their plain llogs, so that no orphaned data remains on the server.



 Comments   
Comment by Mikhail Pershin [ 12/Oct/22 ]

Copied from EX-4969:

For both scripts, the steps to clean up problematic llogs are:

  • mount the MDT filesystem locally on the server as an ldiskfs mount
  • run the script in dry-run mode first to make sure it parses the llogs as expected and that all required tools are in place:
# bash remove_(changelog|updatelog) -n <ldiskfs_mount>
  • run it again to save all llogs for analysis:
# bash remove_(changelog|updatelog) -n -z /tmp/llogs_saved <ldiskfs_mount>
  • to be sure, check that /tmp/llogs_saved.tar.gz (or whatever name prefix you used) exists and contains all the llogs:
# ls -ali /tmp/llogs_saved.tar.gz
# tar -tf /tmp/llogs_saved.tar.gz
  • finally, run the script to delete all llogs:
# bash remove_(changelog|updatelog) <ldiskfs_mount>
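
The steps above can be chained in a small wrapper. This is only a sketch, not part of the landed scripts: the function name cleanup_llogs and the RUN variable are hypothetical, and the MDT is assumed to be already mounted as ldiskfs. By default it only prints the commands it would run.

```shell
# Hypothetical wrapper around the documented steps (not part of the ticket's scripts).
# Usage: cleanup_llogs <script> <ldiskfs_mount> <archive_prefix>
# RUN defaults to "echo", so the commands are only printed; set RUN to an
# empty string to actually execute them.
cleanup_llogs() {
    script=$1
    mnt=$2
    prefix=$3
    run=${RUN-echo}

    # 1. Dry run: check that llogs are parsed and all tools are present.
    $run bash "$script" -n "$mnt" || return 1
    # 2. Dry run again, saving all llogs to <prefix>.tar.gz for analysis.
    $run bash "$script" -n -z "$prefix" "$mnt" || return 1
    # 3. Make sure the archive exists and can be listed before deleting anything.
    $run tar -tf "$prefix.tar.gz" || return 1
    # 4. Delete the llogs.
    $run bash "$script" "$mnt"
}
```

For example, "cleanup_llogs remove_updatelog /mnt/mdt /tmp/llogs_saved" prints the four commands in order; each step stops the sequence if the previous one fails.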

Note: for better llog compression, xz can be used instead; pass it to the script via the GZIP environment variable:

# GZIP=xz bash remove_(changelog|updatelog) -n -z /tmp/llogs_saved <ldiskfs_mount>

The archive name will then end with .xz instead of .gz.
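
When checking the saved archive from another script, the name has to follow the compressor. A minimal sketch, assuming the archive is named <prefix>.tar.gz by default and <prefix>.tar.xz with xz (the .tar.xz form is inferred from the note above, and the helper name saved_archive is hypothetical):

```shell
# Hypothetical helper: print the expected archive name for a given prefix.
# The second argument is the compressor; it defaults to $GZIP, then to gzip.
saved_archive() {
    prefix=$1
    compressor=${2:-${GZIP:-gzip}}
    case "$compressor" in
        xz) echo "$prefix.tar.xz" ;;   # assumed naming when GZIP=xz is used
        *)  echo "$prefix.tar.gz" ;;   # default naming from the steps above
    esac
}
```

It can then be used as: tar -tf "$(saved_archive /tmp/llogs_saved xz)"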

Comment by Gerrit Updater [ 12/Oct/22 ]

"Mikhail Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48838
Subject: LU-16232 scripts: changelog/updatelog emergency cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9129cea32def94999b67e942c69d0a5517613fd3

Comment by Gerrit Updater [ 02/Nov/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/48838/
Subject: LU-16232 scripts: changelog/updatelog emergency cleanup
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b533700add91fe4220f50d057a470e0b6f4893c9

Comment by Peter Jones [ 02/Nov/22 ]

Landed for 2.16

Comment by Andreas Dilger [ 23/Dec/22 ]

In further discussion, it was noted that it would be useful for the script to accept a "-m <mdt_index>" argument so that it selectively deletes only the update logs for the specified MDT index. Otherwise, if the script is run on a remote MDT, it will delete the update logs for all other MDTs, and those MDTs will then have issues during operation and/or later recovery.
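
One way such an option could be parsed is sketched below. This is an illustration only and not the code from the later patch: -n and -z come from the usage described earlier in this ticket, -m is the option proposed here, and the function and variable names (parse_args, DRY_RUN, ARCHIVE, MDT_LIST, MOUNT) are hypothetical.

```shell
# Hypothetical argument parser; results are left in DRY_RUN, ARCHIVE,
# MDT_LIST (space-separated MDT indices) and MOUNT.
parse_args() {
    DRY_RUN=0
    ARCHIVE=""
    MDT_LIST=""
    OPTIND=1    # reset so the function can be called more than once
    while getopts "nz:m:" opt; do
        case $opt in
            n) DRY_RUN=1 ;;
            z) ARCHIVE=$OPTARG ;;
            m) # accept only non-negative integers as MDT indices
               case $OPTARG in
                   ''|*[!0-9]*)
                       echo "invalid MDT index: $OPTARG" >&2
                       return 1 ;;
               esac
               MDT_LIST="${MDT_LIST:+$MDT_LIST }$OPTARG" ;;
            *) return 1 ;;
        esac
    done
    shift $((OPTIND - 1))
    # exactly one positional argument is expected: the ldiskfs mount point
    [ $# -eq 1 ] || { echo "usage: [-n] [-z prefix] [-m index]... <ldiskfs_mount>" >&2; return 1; }
    MOUNT=$1
}
```

With this sketch, "parse_args -n -m 1 -m 3 /mnt/mdt" leaves MDT_LIST set to "1 3", so only those indices would be cleaned, and an empty MDT_LIST could keep the current delete-everything behaviour.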

Comment by Chris Hunter (Inactive) [ 09/Jan/23 ]

I receive an error message when using the script:

./remove_updatelog.lu16232.sh: line 100: syntax error near unexpected token `<'
./remove_updatelog.lu16232.sh: line 100: `        read -r -d '' -a OPT_MDTS < <(hexdump -v -e '2/8 " %16x" 2/8 "\n"' $catlist |'

It appears to be an issue with the shell redirection. The script uses

read -r -d '' -a OPT_MDTS < <(hexdump -v -e '2/8 " %16x" 2/8 "\n"' $catlist | awk '{print "[0x"$2":0x"$1":0x0]"}')

instead of

read -r -d '' -a OPT_MDTS <<< $(hexdump -v -e '2/8 " %16x" 2/8 "\n"' $catlist | awk '{print "[0x"$2":0x"$1":0x0]"}')

Comment by Mikhail Pershin [ 09/Jan/23 ]

Chris, try running it with bash explicitly: # bash ./remove_updatelog

The '< <()' syntax is correct in bash (it is process substitution: bash exposes the command's output through a named pipe or a /dev/fd entry and redirects from it), but it may not be supported by other shells. Your proposal is a possible replacement, so it is worth updating the script that way, but for now just call bash explicitly.
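
To compare the two forms without touching a real catlist, the sketch below feeds both from a stand-in for the hexdump output. The function names and the sample values are made up for illustration. The demo is passed to bash explicitly, since both '< <()' and '<<<' are bash features.

```shell
# Hypothetical demo: fill an array using both redirection forms and print it.
llog_fid_demo() {
bash -s <<'EOF'
# Stand-in for the hexdump output: two hex fields per line (made-up values).
fake_hexdump() { printf '%s\n' '1a 200000401' '1b 200000402'; }

# Form 1: process substitution, as in the script.
# read returns non-zero at end of input with -d '', but the array is still set.
read -r -d '' -a A < <(fake_hexdump | awk '{print "[0x"$2":0x"$1":0x0]"}')

# Form 2: here-string, as proposed above.
read -r -d '' -a B <<< "$(fake_hexdump | awk '{print "[0x"$2":0x"$1":0x0]"}')"

printf '%s\n' "${A[@]}"
printf '%s\n' "${B[@]}"
EOF
}
```

Both arrays end up with the same two entries. Running the same lines through a shell that lacks process substitution will likely fail with a syntax error like the one reported above.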

Comment by Gerrit Updater [ 06/Apr/23 ]

"Mikhail Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50558
Subject: LU-16232 scripts: clean specific MDTs update llogs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a399335cdcc6561077a1400572128b523c461a16

Comment by Mikhail Pershin [ 25/Apr/23 ]

I've just attached another script to the ticket that collects update logs on a live server node using debugfs. It makes it easier to get the llogs for further analysis and replaces the multiple commands that would otherwise have to be run by hand.
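
For reference, the kind of manual commands involved might look like the sketch below. It is not the attached collect_updatelog script: the function name, the RUN variable and the on-disk names /update_log and /update_log_dir are assumptions, so check them against your MDT first. By default it only prints the commands.

```shell
# Hypothetical sketch of collecting update logs from a live MDT device.
# Usage: collect_update_logs <mdt_device> <output_dir>
# RUN defaults to "echo" so nothing is executed unless RUN is set to "".
collect_update_logs() {
    dev=$1
    out=$2
    run=${RUN-echo}

    $run mkdir -p "$out" || return 1
    # -c opens the device read-only (catastrophic mode); -R runs one request.
    # "dump" copies a single file out of the filesystem image.
    $run debugfs -c -R "dump /update_log $out/update_log" "$dev" || return 1
    # "rdump" recursively copies a directory into the destination directory.
    $run debugfs -c -R "rdump /update_log_dir $out" "$dev"
}
```

Calling "collect_update_logs /dev/sdX /tmp/updatelogs" prints the three commands so they can be reviewed before running them for real.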

Comment by Gerrit Updater [ 06/May/23 ]

"Yang Sheng <ys@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50876
Subject: LU-16232 script: fix the argument parse
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: aa0ded5d3504187b1f3fe852bb5e464569f5eb7c

Comment by Gerrit Updater [ 19/May/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50558/
Subject: LU-16232 scripts: clean specific MDTs update llogs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 8546628c23b1ba1785e42ec9b1bc4b77acced5c6

Comment by Gerrit Updater [ 31/Aug/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50876/
Subject: LU-16232 script: fix the argument parse
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 99144a595b767ef79acec058c838759bea73c579

Generated at Sat Feb 10 03:25:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.