[LU-1238] record_lcfg() failed with ENOSPC Created: 20/Mar/12  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.3.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Jian Yu
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

Lustre Tag: v2_2_0_0_RC1
Lustre Build: https://build.whamcloud.com/job/lustre-b2_2/11
Distro/Arch: RHEL6.2/x86_64 (kernel version: 2.6.32-220.4.2.el6)

OSSCOUNT=2
OSTCOUNT=2000 (with 1000 OSTs per OSS)
NETTYPE=o2ib
ENABLE_QUOTA=yes


Attachments: Text File lustre-MDT0000.log    
Severity: 3
Rank (Obsolete): 10883

 Description   

While running ost-pools test 5 with 2000 OSTs, after adding 2000 OSTs to one OST pool and then removing the OSTs from the pool, the test failed as follows:

<~snip~>
client-19-ib: Warning, OST lustre-OST041f_UUID still found in pool lustre.testpool
client-19-ib: Warning, OST lustre-OST0420_UUID still found in pool lustre.testpool
<~snip~>

The console log on the combined MGS/MDS showed:

LustreError: 16312:0:(mgs_llog.c:752:record_lcfg()) failed -28
LustreError: 16340:0:(mgs_llog.c:752:record_lcfg()) failed -28
LustreError: 16340:0:(mgs_llog.c:788:record_base()) error -28: lcfg lustre-MDT0000-mdtlov 0xce022 lustre testpool lustre-OST041f_UUID (null)
LustreError: 16340:0:(mgs_llog.c:788:record_base()) error -28: lcfg lustre-clilov 0xce022 lustre testpool lustre-OST041f_UUID (null)
LustreError: 16369:0:(mgs_llog.c:752:record_lcfg()) failed -28
LustreError: 16369:0:(mgs_llog.c:752:record_lcfg()) Skipped 5 previous similar messages
LustreError: 16369:0:(mgs_llog.c:788:record_base()) error -28: lcfg lustre-MDT0000-mdtlov 0xce022 lustre testpool lustre-OST0420_UUID (null)

Maloo report: https://maloo.whamcloud.com/test_sets/a610c4b2-71cd-11e1-9716-5254004bbbd3

By running llog_reader on the CONFIGS/lustre-MDT0000 file on the MGS/MDS node, I found that there were 63293 records in that file and that 1474 bits were not set. The last several records are:

#64763 (224)marker 2043193 (flags=0x01, v2.2.0.0) lustre-MDT0000-mdtlov 'rem lustre.testpool.lustre-OST041d_UUID' Mon Mar 19 03:01:52 2012-
#64764 (144)pool remove 0:lustre-MDT0000-mdtlov 1:lustre 2:testpool 3:lustre-OST041d_UUID
#64765 (224)marker 2043193 (flags=0x02, v2.2.0.0) lustre-MDT0000-mdtlov 'rem lustre.testpool.lustre-OST041d_UUID' Mon Mar 19 03:01:52 2012-
#64766 (224)marker 2043195 (flags=0x01, v2.2.0.0) lustre-MDT0000-mdtlov 'rem lustre.testpool.lustre-OST041e_UUID' Mon Mar 19 03:02:02 2012-
#64767 (144)pool remove 0:lustre-MDT0000-mdtlov 1:lustre 2:testpool 3:lustre-OST041e_UUID

The OST pool operations consumed most of the records and caused the record index to reach the following limit:

         /* if it's the last idx in log file, then return -ENOSPC */
         if (loghandle->lgh_last_idx >= LLOG_BITMAP_SIZE(llh) - 1)
                 RETURN(-ENOSPC);
/* (8192 - 88 - 8) * 8 = 64768 */
#define LLOG_BITMAP_SIZE(llh)  ((llh->llh_hdr.lrh_len -         \
                                 llh->llh_bitmap_offset -       \
                                 sizeof(llh->llh_tail)) * 8)
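
As a rough check of the arithmetic in the comment above, the following C sketch recomputes the bitmap size and applies the same full-log test. The functions bitmap_bits() and log_is_full() and their parameter names are hypothetical helpers introduced for illustration, not Lustre identifiers; the values 8192, 88 and 8 are taken from the quoted comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper mirroring the LLOG_BITMAP_SIZE() arithmetic:
 * every byte between the bitmap offset and the tail holds 8 index bits. */
size_t bitmap_bits(size_t hdr_len, size_t bitmap_offset, size_t tail_len)
{
        assert(hdr_len >= bitmap_offset + tail_len);
        return (hdr_len - bitmap_offset - tail_len) * 8;
}

/* Same comparison as the quoted check: non-zero once last_idx has
 * reached the final index, at which point -ENOSPC would be returned. */
int log_is_full(size_t last_idx, size_t nbits)
{
        assert(nbits > 0);
        return last_idx >= nbits - 1;
}
```

With the values from the comment, bitmap_bits(8192, 88, 8) gives 64768, so log_is_full() becomes true at index 64767. That matches the last record (#64767) in the llog_reader output above, and also the sum 63293 + 1474 = 64767.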

The attached lustre-MDT0000.log contains the output of "llog_reader lustre-MDT0000". Please advise on how to resolve this issue.



 Comments   
Comment by Andreas Dilger [ 10/Sep/12 ]

Two problems are visible in this config log:

  • There appear to be thousands of "set_timeout=20" lines that are added, possibly by test-framework.sh? That definitely doesn't help matters.
  • I'm not sure anymore why we have the marker lines in the config logs. I think these are only comments, but they aren't very useful in the case of single-line records, especially when there are two marker lines for every single record added. While dropping them won't solve the problem being seen here, it will push it 3x further away.
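
To put a rough number on the "3x" estimate: each pool change in the llog_reader output above occupies three records (a start marker, the pool record itself, and an end marker). The sketch below uses a hypothetical helper, ops_that_fit(), and assumes one index slot per record while ignoring every other kind of record in the log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate: how many single-record changes fit into
 * free_slots index slots when each change costs records_per_op
 * records (3 with both markers, 1 without them). */
size_t ops_that_fit(size_t free_slots, size_t records_per_op)
{
        assert(records_per_op > 0);
        return free_slots / records_per_op;
}
```

Assuming 64767 usable slots (the last index seen in the log above), ops_that_fit(64767, 3) is 21589 and ops_that_fit(64767, 1) is 64767, so dropping both markers would leave room for three times as many single-record changes.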

Alex has recently been reworking how config llogs are processed by the servers, and I wonder if we could simplify this for newer clients as well? Maybe we don't even need marker lines anymore, or we can figure out a way not to need them. Similarly, newer servers do not need so many lines to configure their device stack, so maybe clients could become more intelligent as well (i.e. given a record with OST+NIDs they can figure everything else out)?

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.
