[LU-13854] lustre-MDD0000: next log does not exist! Created: 05/Aug/20  Updated: 14/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Upstream
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Quentin Bouget Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Epic/Theme: changelog
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Trying to create a file after setting up changelogs on a fresh new filesystem yields an I/O error:

# lustre/tests/llmount.sh
(...)
# lctl set_param mdd.*.changelog_mask=MTIME
mdd.lustre-MDT0000.changelog_mask=MTIME
# lctl --device lustre-MDT0000 changelog_register
lustre-MDT0000: Registered changelog userid 'cl1'
# touch /mnt/lustre/file
touch: setting times of '/mnt/lustre/file': Input/output error

dmesg provides some interesting information:

# dmesg -H | tail -n 3
[Aug 5 11:23] Lustre: lustre-MDD0000: changelog on
[Aug 5 11:25] LustreError: 23303:0:(llog_cat.c:544:llog_cat_current_log()) lustre-MDD0000: next log does not exist!
[  +0.003231] LustreError: 24631:0:(llite_lib.c:1707:ll_md_setattr()) md_setattr fails: rc = -5

Happy hunting



 Comments   
Comment by Peter Jones [ 05/Aug/20 ]

Is this something that you are planning to investigate?

Comment by Quentin Bouget [ 05/Aug/20 ]

No, this was just a test system I set up to review a patch.
This is not something we hit in production (yet).

Comment by Ellis Wilson [ 14/Jul/22 ]

I believe I ran into this:

root@node:/# chmod -R 777 /lustrefs
chmod: changing permissions of '/lustrefs': Input/output error
root@node:/# dmesg -T
...
[Mon Jun 27 17:25:18 2022] LustreError: 1259:0:(llite_lib.c:1712:ll_md_setattr()) md_setattr fails: rc = -5

Looking at the original error, I see the following at the exact time of chmod failure on the MDT:
[250875.733525] LustreError: 32002:0:(llog_cat.c:544:llog_cat_current_log()) lustrefs-MDD0000: next log does not exist!
 
There are no surrounding lines suggesting a lack of free slots, or anything else indicating why this is happening.
 
The description seems to suggest the changelog mask may be involved.  On my clusters I use:
mdd.lustrefs-MDT0000.changelog_mask=all-ATIME-FLRW-GXATR-MARK-MIGRT-NOPEN-OPEN-RESYNC-XATTR
 
Looking across the many clusters, this bug is hit approximately once per day.  Are there any suggestions for ways I could help provide additional debugging information?  Also, is there a reason not to return EAGAIN here rather than EIO? (I'm not intimately familiar with this code, so I'm sure there is a good reason.)  Looking at this block, it feels like simply trying again would be reasonable; these nodes certainly do recover, it is just that single command that gets EIO back (a rough sketch of that idea follows the quoted block below).  Failing a whole job over a one-off, retryable concurrency issue is unfortunate.
 
   536   /* Sigh, the chd_next_log and chd_current_log is initialized
   537    * in declare phase, and we do not serialize the catlog
   538    * accessing, so it might be possible the llog creation
   539    * thread (see llog_cat_declare_add_rec()) did not create
   540    * llog successfully, then the following thread might
   541    * meet this situation. */
   542   if (IS_ERR_OR_NULL(cathandle->u.chd.chd_next_log)) {
   543     CERROR("%s: next log does not exist!\n",
   544            cathandle->lgh_ctxt->loc_obd->obd_name);
   545     loghandle = ERR_PTR(-EIO);
   546     if (cathandle->u.chd.chd_next_log == NULL) {
   547       /* Store the error in chd_next_log, so
   548        * the following process can get correct
   549        * failure value */
   550       cathandle->u.chd.chd_next_log = loghandle;
   551     }
   552     GOTO(out_unlock, loghandle);
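
As a rough illustration of the EAGAIN question above, a minimal sketch against the quoted block might look like the following. This is an untested assumption about how the condition could be reported as transient rather than fatal, not an actual patch; the CDEBUG/D_OTHER call is only assumed to be a suitable replacement for CERROR here.

    if (IS_ERR_OR_NULL(cathandle->u.chd.chd_next_log)) {
            /* The creation thread (see llog_cat_declare_add_rec()) has not
             * produced the next llog yet; report a transient condition so
             * the caller could retry instead of surfacing EIO. */
            CDEBUG(D_OTHER, "%s: next log not created yet\n",
                   cathandle->lgh_ctxt->loc_obd->obd_name);
            loghandle = ERR_PTR(-EAGAIN);
            /* Deliberately skip caching the error in chd_next_log, so a
             * later call can still pick up a successfully created llog. */
            GOTO(out_unlock, loghandle);
    }

Whether skipping the chd_next_log caching is safe with concurrent writers is exactly the serialization question the in-code comment warns about, so this only shows the shape of the change, not a verified fix.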
 
 
