Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.5.1
- 3
- 14513
Description
Several times while testing HSM in a virtual environment (CentOS 6.5 with Lustre 2.5.1 on both clients and servers), we've observed what may be corruption of the HSM_ACTIONS llog.
Here's our internal bug description:
A Lustre filesystem where HSM and changelogs were used started misbehaving. The system was rebooted, and started spewing a lot of these traces in the system log:
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) tas01-MDT0000: failed to process HSM_ACTIONS llog (rc=-2)
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) tas01-MDD0000: error opening log id 0x1c:1:0: rc = -2
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:556:llog_cat_process_cb()) tas01-MDD0000: cannot find handle for llog 0x1c:1: -2
<3>LustreError: 2990:0:(llog_cat.c:556:llog_cat_process_cb()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) tas01-MDT0000: failed to process HSM_ACTIONS llog (rc=-2)
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) tas01-MDD0000: error opening log id 0x1c:1:0: rc = -2
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) Skipped 600 previous similar messages
At that point the MDS would not accept any HSM request, nor would it deliver any.
The MGT/MDT were unmounted and remounted as ldiskfs, and the hsm_actions file was deleted. The targets were then remounted as Lustre, and HSM became usable again.
We do not have a simple reproducer for this, but it has happened several times.
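For triage, the traces quoted above can be summarized mechanically. The following is a minimal sketch, not part of Lustre: the regular expressions are assumptions derived only from the log lines in this report, and the helper name `summarize` is hypothetical. It counts messages per reporting function and return code, folds each "Skipped N previous similar messages" line into the preceding message from the same function (an assumption about how the rate limiter attributes them), and collects the llog ids that could not be opened.

```python
import errno
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this report (assumptions).
LINE_RE = re.compile(r"LustreError: \d+:\d+:\(([\w.]+):(\d+):(\w+)\(\)\) (.*)")
SKIPPED_RE = re.compile(r"Skipped (\d+) previous similar messages?")
RC_RE = re.compile(r"(-\d+)\)?$")                    # trailing "-2" or "-2)"
LLOG_RE = re.compile(r"log(?: id)? (0x[0-9a-f]+:\d+)")  # e.g. 0x1c:1

SAMPLE = """\
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) tas01-MDT0000: failed to process HSM_ACTIONS llog (rc=-2)
<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) tas01-MDD0000: error opening log id 0x1c:1:0: rc = -2
<3>LustreError: 2990:0:(llog_cat.c:192:llog_cat_id2handle()) Skipped 600 previous similar messages
<3>LustreError: 2990:0:(llog_cat.c:556:llog_cat_process_cb()) tas01-MDD0000: cannot find handle for llog 0x1c:1: -2
<3>LustreError: 2990:0:(llog_cat.c:556:llog_cat_process_cb()) Skipped 600 previous similar messages
"""


def summarize(lines):
    """Return (counts, llog_ids) for an iterable of log lines.

    counts maps (function, rc) to the number of messages, including
    those reported as skipped; llog_ids is the set of llog ids seen.
    """
    counts = Counter()
    llog_ids = set()
    last_key = {}  # function name -> most recent (function, rc) key
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        func, msg = m.group(3), m.group(4).strip()
        skipped = SKIPPED_RE.match(msg)
        if skipped:
            # Attribute suppressed repeats to the last message from this function.
            if func in last_key:
                counts[last_key[func]] += int(skipped.group(1))
            continue
        rc = RC_RE.search(msg)
        key = (func, int(rc.group(1)) if rc else None)
        counts[key] += 1
        last_key[func] = key
        llog = LLOG_RE.search(msg)
        if llog:
            llog_ids.add(llog.group(1))
    return counts, llog_ids


if __name__ == "__main__":
    counts, llog_ids = summarize(SAMPLE.splitlines())
    for (func, rc), n in counts.most_common():
        name = errno.errorcode.get(-rc, "?") if rc is not None else "n/a"
        print(f"{func}: rc={rc} ({name}) x{n}")
    print("llog ids:", sorted(llog_ids))
```

On Linux, rc = -2 corresponds to ENOENT, which is consistent with the catalog referencing a plain llog (0x1c:1) that no longer exists on disk.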
Issue Links
- is related to LU-6471 Unexpected Lustre Client LBUG in llog_write() (Resolved)