Details
Bug
Resolution: Duplicate
Critical
None
Lustre 2.1.1
3
4003
Description
It seems the MDT catalog file may be damaged on our test filesystem. We were doing recovery testing with the patch for LU-1352. Some time after power-cycling the MDS and letting it go through recovery, clients started getting EFAULT when writing to Lustre. These failures are accompanied by the following console errors on the MDS:
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_cat.c:81:llog_cat_new_log()) Skipped 3 previous similar messages
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(llog_obd.c:454:llog_obd_origin_add()) Skipped 3 previous similar messages
Jun 28 12:08:45 zwicky-mds2 kernel: LustreError: 11841:0:(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17 t[0x200de60af:0x17913:0x0]
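For anyone triaging similar console output, here is a minimal Python sketch that pulls the return code out of lines like the ones above and maps it to an errno name. The regular expression and the decode_rc helper are my own illustration, not part of any Lustre tooling, and the lookup uses the errno table of the host running the script, which I am assuming matches the Linux numbering on the MDS. On Linux, 28 is ENOSPC, which fits the "no free catalog slots" message.

```python
import errno
import re

# Matches the two return-code spellings seen in the console errors above:
# "failed: -28" and "rc=-28". (Illustrative pattern, not a Lustre format spec.)
RC_PATTERN = re.compile(r"(?:failed: |rc=)(-\d+)")


def decode_rc(line):
    """Return (rc, errno_name) for the first negative rc in line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    # errno.errorcode maps positive errno values to their symbolic names.
    return rc, errno.errorcode.get(-rc, "UNKNOWN")


if __name__ == "__main__":
    samples = [
        "(llog_cat.c:81:llog_cat_new_log()) no free catalog slots for log...",
        "(llog_obd.c:454:llog_obd_origin_add()) write one catalog record failed: -28",
        "(mdd_object.c:1330:mdd_changelog_data_store()) changelog failed: rc=-28 op17",
    ]
    for line in samples:
        print(decode_rc(line), "<-", line)
```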
I mentioned this in LU-1570, but I figured a new ticket was needed.
Yes, the changelogs could definitely be a factor. Once there is a registered changelog user, the changelogs are kept on disk until they are consumed. That ensures that if, e.g., Robinhood crashes or has some other problem for a day or four, it won't have to do a full scan just to recover the state again.
However, if the changelog user is not unregistered, the changelogs will be kept until they run out of space. I suspect that is the root cause here, and it should be investigated further. This bug should be CC'd to Jinshan and Aurelien Degremont, who are working on HSM these days.
Cheers, Andreas
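The retention behaviour described in the comment above can be sketched as a toy model in Python. This is only an illustration of the reasoning, not how Lustre implements llog catalogs: the ToyChangelog class, its capacity limit (standing in for catalog slots), and the user id "cl1" are all hypothetical. Records are kept until every registered user has cleared them, so a user that stays registered but never clears eventually makes new records fail with -ENOSPC.

```python
import errno


class ToyChangelog:
    """Toy model: records are retained until every registered user clears them."""

    def __init__(self, capacity):
        self.capacity = capacity   # hypothetical limit, stands in for catalog slots
        self.first = 1             # index of the oldest retained record
        self.next_index = 1        # index the next record will receive
        self.cleared = {}          # user id -> highest record index that user cleared

    def register(self, user):
        # A new user only cares about records written after registration.
        self.cleared[user] = self.next_index - 1

    def deregister(self, user):
        del self.cleared[user]
        self._purge()

    def clear(self, user, upto):
        newest = self.next_index - 1
        self.cleared[user] = max(self.cleared[user], min(upto, newest))
        self._purge()

    def _purge(self):
        # Keep everything the slowest user has not cleared yet; with no users
        # registered, nothing needs to be retained.
        newest = self.next_index - 1
        keep_from = min(self.cleared.values(), default=newest) + 1
        self.first = max(self.first, keep_from)

    def add(self):
        """Append one record; return 0 on success or -ENOSPC when full."""
        if self.next_index - self.first >= self.capacity:
            return -errno.ENOSPC
        self.next_index += 1
        self._purge()
        return 0


if __name__ == "__main__":
    log = ToyChangelog(capacity=3)
    log.register("cl1")                    # registered, but never consumes
    print([log.add() for _ in range(4)])   # the last add hits the limit
    log.deregister("cl1")                  # retained records are released
    print(log.add())
```

In this model, either clearing consumed records or deregistering the idle user frees space again, which matches the suspicion that a stale registered user is what filled the catalog here.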