[LU-5195] HSM: mdt_hsm_cdt_actions.c:104:cdt_llog_process() failed to process HSM_ACTIONS llog Created: 13/Jun/14  Updated: 20/Apr/15  Resolved: 27/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.1
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Major
Reporter: Patrick Farrell (Inactive) Assignee: James Nunez (Inactive)
Resolution: Fixed Votes: 0
Labels: hsm, patch

Issue Links:
Related
is related to LU-6471 Unexpected Lustre Client LBUG in llog... Resolved
Severity: 3
Rank (Obsolete): 14513

 Description   

Several times while testing HSM in a virtual environment (CentOS 6.5 + Lustre 2.5.1 on clients and servers), we've observed what may be HSM_ACTIONS llog corruption.

Here's our internal bug description:
A Lustre filesystem where HSM and changelogs were used started misbehaving. The system was rebooted, and started spewing a lot of these traces in the system log:

<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) tas01-MDT0000: failed to process HSM_ACTIONS llog (rc=-2)
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) tas01-MDD0000: error opening log id 0x1c:1:0: rc = -2
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:556:llog_cat_process_cb()) tas01-MDD0000: cannot find handle for llog 0x1c:1: -2
<3>LustreError: 2990:0:(llog_cat.c:556:llog_cat_process_cb()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) tas01-MDT0000: failed to process HSM_ACTIONS llog (rc=-2)
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) tas01-MDD0000: error opening log id 0x1c:1:0: rc = -2
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) Skipped 600 previous similar messages

At that point the MDS would not accept any HSM request, nor would it deliver any.

The MGT/MDT were unmounted and remounted as ldiskfs, and the hsm_actions file was deleted. The target was then remounted as Lustre, and HSM became usable again.

We do not have a simple reproducer for this, but it has happened several times.



 Comments   
Comment by Patrick Farrell (Inactive) [ 13/Jun/14 ]

Dump of the MDS is at:
ftp.whamcloud.com:/uploads/LU-5195/cdt_llog_process_HSM_ACTIONS_140613.tar.gz

Comment by Ryan Haasken [ 24/Jul/14 ]

This issue occurred again on the same system. Here is what led up to the incident, according to the person who was working on the system:

I was doing some archiving. The Lustre client stopped working, so I rebooted both the client and the MDS.

Now archiving is not working. On the client:

# cd /mnt/tas01/
# lfs hsm_archive fz
Cannot send HSM request (use of fz): No such file or directory

On the MDS:

# cat /proc/fs/lustre/mdt/tas01-MDT0000/hsm/actions
cat: /proc/fs/lustre/mdt/tas01-MDT0000/hsm/actions: No such file or directory

I don't think I performed any unexpected actions from an admin point of view.

At this point, I got on the system and gathered as much relevant information as I could.

I gathered full dk logs, the contents of the hsm_actions file on the MDT, the contents of the hsm proc files, and a dump of the system.

I got the system working again by following the steps in this bug's description. That is,

  1. Unmount the Lustre MDT.
  2. Mount the MDT as ldiskfs.
  3. Remove the file hsm_actions from the root of the MDT.
  4. Unmount the MDT.
  5. Remount the MDT as Lustre. The LustreError messages stopped appearing in the console, and HSM was usable again.

After I got HSM working again, I checked what would happen if I replaced the hsm_actions file on the MDT with the "unhealthy" one that was in place while HSM was not working. When I did this and remounted the MDT as Lustre, the same LustreError messages appeared in the console log again. Replacing the hsm_actions file with the previous (healthy) one got it working again.

Comment by Ryan Haasken [ 24/Jul/14 ]

The logs and dump mentioned in the above comment have been uploaded to the whamcloud ftp server.

ftp.whamcloud.com:/uploads/LU-5195/LU-5195-logs.tar.gz

That tar contains a README describing each file in it.

Comment by Frank Zago (Inactive) [ 12/Aug/14 ]

This bug can be reproduced by copying the failed hsm_actions file onto a healthy filesystem.

Proposed fix: http://review.whamcloud.com/11419

Comment by James Nunez (Inactive) [ 27/Aug/14 ]

Landed to master (2.7.0)

Comment by James Nunez (Inactive) [ 27/Aug/14 ]

Patch for b2_5 at http://review.whamcloud.com/#/c/11619/

Comment by Aurelien Degremont (Inactive) [ 21/Sep/14 ]

Is there any reason preventing the b2_5 patch from also landing? It seems like a worthwhile fix, just missing a +2...

Comment by James Nunez (Inactive) [ 22/Sep/14 ]

Aurelien,

When we start landing patches for 2.5.4, this patch will be considered for that release.

Generated at Sat Feb 10 01:49:21 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.