[LU-5195] HSM: mdt_hsm_cdt_actions.c:104:cdt_llog_process() failed to process HSM_ACTIONS llog Created: 13/Jun/14 Updated: 20/Apr/15 Resolved: 27/Aug/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.7.0, Lustre 2.5.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick Farrell (Inactive) | Assignee: | James Nunez (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | hsm, patch | ||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 14513 | ||||||||
| Description |
|
Several times while testing HSM in a virtual environment (CentOS 6.5 + Lustre 2.5.1 on clients and servers), we've observed what may be HSM_ACTIONS llog corruption. Here's our internal bug description:

<3>LustreError: 2990:0:(mdt_hsm_cdt_actions.c:104:cdt_llog_process()) tas01-MDT0000: failed to process HSM_ACTIONS llog (rc=-2)

At that point the MDS would not accept any HSM request, nor would it deliver any. The MGT/MDT were unmounted and remounted as ldiskfs, and the hsm_actions file was deleted. Lustre was then remounted, and HSM became usable again.

We do not have a simple reproducer for this, but it has happened several times. |
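For readers unfamiliar with the rc values in Lustre console messages: they follow the kernel convention of negative errno codes, so rc=-2 is -ENOENT ("No such file or directory" on Linux). A minimal Python sketch of that lookup follows; the helper name decode_rc is hypothetical, and the result is only the standard errno meaning, not a diagnosis of what was missing in the llog.

```python
import errno
import os


def decode_rc(rc):
    """Map a kernel-style return code (negative errno) to its name and message.

    decode_rc is a hypothetical helper for illustration only.
    """
    code = -rc if rc < 0 else rc
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# rc value taken from the console message in the description above.
name, message = decode_rc(-2)
print(f"rc=-2 -> {name}: {message}")
```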
| Comments |
| Comment by Patrick Farrell (Inactive) [ 13/Jun/14 ] |
|
Dump of the MDS is at: |
| Comment by Ryan Haasken [ 24/Jul/14 ] |
|
This issue occurred again on the same system. Here is what led up to the incident, according to the person who was working on the system:
At this point, I got on the system and gathered as much relevant information as I could. I gathered full dk logs, the contents of the hsm_actions file on the MDT, the contents of the hsm proc files, and a dump of the system. I got the system working again by following the steps in this bug's description. That is,
After I got the HSM working again, I checked what would happen if I replaced the hsm_actions file on the MDT with the "unhealthy" one which was in place when HSM was not working. When I did this and remounted the MDT as Lustre, I got the same LustreErrors in the console log again. Replacing the hsm_actions file with the one which was previously in place got it working again. |
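The workaround from this bug's description (unmount the MDT, remount it as ldiskfs, remove hsm_actions, remount as Lustre) can be sketched as a dry-run command list. This is only an illustration under stated assumptions: the device and mount point names are hypothetical, a combined MGT/MDT on one device is assumed, the location of hsm_actions at the top level of the ldiskfs mount is assumed, and the backup copy is an extra precaution not mentioned in the description. Nothing is executed; the commands are only printed.

```python
import shlex


def recovery_commands(mdt_device, lustre_mnt, ldiskfs_mnt, backup_path):
    """Return the workaround as a list of argv lists; nothing is run.

    All paths are hypothetical placeholders. The hsm_actions location
    at the top of the ldiskfs mount is an assumption.
    """
    llog_path = ldiskfs_mnt.rstrip("/") + "/hsm_actions"
    return [
        ["umount", lustre_mnt],                               # stop the MDT
        ["mount", "-t", "ldiskfs", mdt_device, ldiskfs_mnt],  # raw backend mount
        ["cp", llog_path, backup_path],                       # keep the bad llog for analysis
        ["rm", llog_path],                                    # delete the HSM actions llog
        ["umount", ldiskfs_mnt],
        ["mount", "-t", "lustre", mdt_device, lustre_mnt],    # bring the MDT back
    ]


# Example with made-up device and mount points.
for argv in recovery_commands("/dev/sdb", "/mnt/mdt", "/mnt/mdt_ldiskfs",
                              "/root/hsm_actions.unhealthy"):
    print(shlex.join(argv))
```

Keeping a copy of the removed file is what makes the swap test described in this comment possible: the "unhealthy" llog can be put back to confirm that it alone triggers the error.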
| Comment by Ryan Haasken [ 24/Jul/14 ] |
|
The logs and dump mentioned in the above comment have been uploaded to the whamcloud ftp server. ftp.whamcloud.com:/uploads/ That tar contains a README describing each file in it. |
| Comment by Frank Zago (Inactive) [ 12/Aug/14 ] |
|
This bug can be reproduced by placing the failing hsm_actions file on a healthy filesystem. Proposed fix: http://review.whamcloud.com/11419 |
| Comment by James Nunez (Inactive) [ 27/Aug/14 ] |
|
Landed to master (2.7.0) |
| Comment by James Nunez (Inactive) [ 27/Aug/14 ] |
|
Patch for b2_5 at http://review.whamcloud.com/#/c/11619/ |
| Comment by Aurelien Degremont (Inactive) [ 21/Sep/14 ] |
|
Is there any reason preventing the b2_5 patch from landing as well? It seems like an interesting fix that is just missing a +2... |
| Comment by James Nunez (Inactive) [ 22/Sep/14 ] |
|
Aurelien, When we start landing patches for 2.5.4, this patch will be considered for that release. |