[LU-414] error looking up logfile Created: 14/Jun/11  Updated: 11/Oct/17  Resolved: 11/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Major
Reporter: Christopher Morrone Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

CHAOS4.4 (RHEL5.4), lustre 1.8.5.0-5chaos


Attachments: Text File CATALOGS_translated.txt     File console.momus-mds1.gz    
Severity: 3
Rank (Obsolete): 10221

 Description   

Our admins tried to add 8 new OSS nodes to an existing Lustre server cluster running 1.8.5.0-5chaos. There were 16 existing OSS nodes with 15 OSTs each, for a total of 240 old OSTs. There are also 15 OSTs on each of the new OSS nodes, for a total of 120 new OSTs.

When the new OSTs were brought up, it looks like at least 54 of them failed to be configured correctly on the MDS, and are stuck in the IN (inactive) state according to "lctl dl". I don't see a pattern to which OSTs on which new OSS failed.

This looks similar to bug 22658 that we have seen in the past:

2011-06-14 11:57:17 LustreError: 1432:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x10612404:0x0: rc -2
2011-06-14 11:57:17 LustreError: 1432:0:(llog_obd.c:200:llog_setup()) obd lsd-OST012e-osc ctxt 2 lop_setup=ffffffff885b3dc0 failed -2
2011-06-14 11:57:17 LustreError: 1432:0:(osc_request.c:4242:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
2011-06-14 11:57:17 LustreError: 1432:0:(osc_request.c:4258:osc_llog_init()) osc 'lsd-OST012e-osc' tgt 'lsd-MDT0000' rc=-2
2011-06-14 11:57:17 LustreError: 1432:0:(osc_request.c:4260:osc_llog_init()) logid 0x10612404:0x0
2011-06-14 11:57:17 LustreError: 1432:0:(lov_log.c:253:lov_llog_init()) error osc_llog_init idx 302 osc 'lsd-OST012e-osc' tgt 'lsd-MDT0000' (rc=-2)
2011-06-14 11:57:17 LustreError: 1432:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x62800000028:0x10612404: rc -2
2011-06-14 11:57:17 LustreError: 1432:0:(llog_obd.c:200:llog_setup()) obd lsd-OST0130-osc ctxt 2 lop_setup=ffffffff885b3dc0 failed -2
2011-06-14 11:57:17 LustreError: 1444:0:(lov_log.c:161:lov_llog_origin_connect()) error osc_llog_connect tgt 302 (-107)
2011-06-14 11:57:17 LustreError: 1444:0:(mds_lov.c:1044:__mds_lov_synchronize()) lsd-MDT0000: lsd-OST012e_UUID failed at llog_origin_connect: -107
2011-06-14 11:57:17 Lustre: lsd-OST012e_UUID: Sync failed deactivating: rc -107

The admins decided to reboot the MDS, but the MDS is still unable to activate those OSTs (at least, I assume that it is the same set of OSTs):

2011-06-14 12:49:32 LustreError: 9611:0:(lov_log.c:161:lov_llog_origin_connect()) error osc_llog_connect tgt 258 (-107)
2011-06-14 12:49:32 LustreError: 9611:0:(mds_lov.c:1044:__mds_lov_synchronize()) lsd-MDT0000: lsd-OST0102_UUID failed at llog_origin_connect: -107
2011-06-14 12:49:32 Lustre: lsd-OST0102_UUID: Sync failed deactivating: rc -107
2011-06-14 12:49:32 LustreError: 9612:0:(lov_log.c:161:lov_llog_origin_connect()) error osc_llog_connect tgt 259 (-107)
2011-06-14 12:49:32 LustreError: 9646:0:(mds_lov.c:1044:__mds_lov_synchronize()) lsd-MDT0000: lsd-OST0125_UUID failed at llog_origin_connect: -107
2011-06-14 12:49:32 LustreError: 9646:0:(mds_lov.c:1044:__mds_lov_synchronize()) Skipped 20 previous similar messages
2011-06-14 12:49:32 Lustre: lsd-OST0125_UUID: Sync failed deactivating: rc -107

Notice there is no warning about "error looking up logfile", but lov_llog_origin_connect() is still failing.

I suspect that lov_llog_origin_connect() is getting error code -107 (ENOTCONN) from llog_obd2ops(), meaning that the llog_ctxt *ctxt is NULL. I say that because, watching the logs, I see an RPC between the MDS and OSS nodes complete successfully, but I can't see any RPC being sent after the "lov_llog_origin_connect()) connect 256/360" lines in the log.
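To make that concrete, here is a toy model (my own sketch, not Lustre code) of the failure path I have in mind: if osc_llog_init() failed earlier, the OSC never got an LLOG_MDS_OST_ORIG_CTXT context, the context lookup hands back NULL, and the connect step bails out with -ENOTCONN (-107) before any RPC is even built:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for struct llog_ctxt; only models "was the context set up". */
struct toy_llog_ctxt {
        int initialized;
};

/* Stand-in for the llog context lookup on the OSC: returns NULL when
 * osc_llog_init() never set the context up. */
static struct toy_llog_ctxt *toy_get_context(struct toy_llog_ctxt *ctxt)
{
        return (ctxt != NULL && ctxt->initialized) ? ctxt : NULL;
}

/* Stand-in for the connect step: the first thing it does is map the
 * context to its operations, and a NULL context becomes -ENOTCONN. */
static int toy_llog_connect(struct toy_llog_ctxt *ctxt)
{
        if (ctxt == NULL)
                return -ENOTCONN;       /* -107, matching the console log */
        return 0;                       /* real code would go on to send the RPC */
}

int main(void)
{
        struct toy_llog_ctxt never_set_up = { .initialized = 0 };
        int rc = toy_llog_connect(toy_get_context(&never_set_up));

        printf("connect rc = %d\n", rc); /* prints -107 */
        return 0;
}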

It appears that at the ptlrpc level the MDT and OST are in fact fully connected. The import/export appear to be set up.

I am beginning to suspect that the "fix" for bug 22658, which allows the MDS to start up when there are missing log files, just lets the server get stuck at this next point in the code.

Also, I think there is pretty clearly some bug in Lustre's initial creation of the OST llog files on the MDS.

I am attaching the mds console log for now. I can package up some more detailed lustre logs tomorrow.



 Comments   
Comment by Peter Jones [ 15/Jun/11 ]

Hongchao

Could you please look at this one?

Thanks

Peter

Comment by Hongchao Zhang [ 15/Jun/11 ]

This error is a little different from bug 22658.
In this issue, the CATLOG of some OSCs doesn't exist, but there is a record for it in the MDT's CATALOGS, which causes "llog_setup" to fail in "osc_llog_init".
In 22658, the CATLOG of the OSC does exist, but the actual plain log files (containing the real operation log info) don't exist.

There are also still two "error looking up logfile" errors after the MDS reboot in your attached console log:
...
2011-06-14 12:46:52 LustreError: 8762:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x10612404:0x0: rc -2
2011-06-14 12:46:52 LustreError: 8762:0:(llog_obd.c:200:llog_setup()) obd lsd-OST012e-osc ctxt 2 lop_setup=ffffffff885aedc0 failed -2
2011-06-14 12:46:52 LustreError: 8762:0:(osc_request.c:4242:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
2011-06-14 12:46:52 LustreError: 8762:0:(osc_request.c:4258:osc_llog_init()) osc 'lsd-OST012e-osc' tgt 'lsd-MDT0000' rc=-2
2011-06-14 12:46:52 LustreError: 8762:0:(osc_request.c:4260:osc_llog_init()) logid 0x10612404:0x0
2011-06-14 12:46:52 LustreError: 8762:0:(lov_log.c:253:lov_llog_init()) error osc_llog_init idx 302 osc 'lsd-OST012e-osc' tgt 'lsd-MDT0000' (rc=-2)
2011-06-14 12:46:52 LustreError: 8762:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x62800000028:0x10612404: rc -2
2011-06-14 12:46:52 LustreError: 8762:0:(llog_obd.c:200:llog_setup()) obd lsd-OST0130-osc ctxt 2 lop_setup=ffffffff885aedc0 failed -2
2011-06-14 12:46:52 LustreError: 8762:0:(osc_request.c:4242:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
2011-06-14 12:46:52 LustreError: 8762:0:(osc_request.c:4258:osc_llog_init()) osc 'lsd-OST0130-osc' tgt 'lsd-MDT0000' rc=-2
2011-06-14 12:46:52 LustreError: 8762:0:(osc_request.c:4260:osc_llog_init()) logid 0x62800000028:0x10612404
2011-06-14 12:46:52 LustreError: 8762:0:(lov_log.c:253:lov_llog_init()) error osc_llog_init idx 304 osc 'lsd-OST0130-osc' tgt 'lsd-MDT0000' (rc=-2)
...

Is the same number of OSTs stuck in the IN (inactive) state after you rebooted the MDS?

As for the llog, I wonder whether there is a problem where some uninitialized data is read from the MDS's CATALOGS.
Are the indexes of these newly added OSTs larger than those of the existing ones?

Comment by Christopher Morrone [ 15/Jun/11 ]

Yes, as far as I can tell they are all larger than existing ones. OSTs 0000-00ef are on the original 16 nodes, and 00f0-0167 are on the new 8 nodes.

I do believe that it was the same inactive OSTs after reboot, but the logs squash some messages, making it a little hard to verify 100%. The sysadmin is pretty certain that the same count of OSTs was inactive before and after the reboot, and some spot checking of the logs verifies that at least the OST messages that were not squashed are the same.

Comment by Christopher Morrone [ 15/Jun/11 ]

See the new attachment. I translated the MDS's CATALOGS file into a readable form, and it is pretty clear which entries are incorrect. Any that look like "10612404:0" or "64000000028:1580334" are bad: they have no matching file in the OBJECTS directory, and they correspond to the OSTs that are stuck in the inactive state.
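For reference, a minimal sketch of that kind of translation (not the exact tool I used, and assuming the CATALOGS file is simply a flat array of packed struct llog_catid entries, one 32-byte slot per OST index, as declared in lustre_idl.h):

#include <stdio.h>
#include <stdint.h>

/* Assumed on-disk layout: 20-byte logid + 12 bytes of padding = 32 bytes
 * per OST index. */
struct llog_logid {
        uint64_t lgl_oid;
        uint64_t lgl_ogr;
        uint32_t lgl_ogen;
} __attribute__((packed));

struct llog_catid {
        struct llog_logid lci_logid;
        uint32_t lci_padding1;
        uint32_t lci_padding2;
        uint32_t lci_padding3;
} __attribute__((packed));

int main(int argc, char **argv)
{
        struct llog_catid cid;
        FILE *f;
        int idx = 0;

        if (argc != 2) {
                fprintf(stderr, "usage: %s CATALOGS\n", argv[0]);
                return 1;
        }
        f = fopen(argv[1], "r");
        if (f == NULL) {
                perror(argv[1]);
                return 1;
        }
        /* One line per OST index: the oid:ogen pair that should name a
         * matching llog file in the OBJECTS directory on the MDT. */
        while (fread(&cid, sizeof(cid), 1, f) == 1) {
                printf("idx %4d  logid %llx:%x\n", idx++,
                       (unsigned long long)cid.lci_logid.lgl_oid,
                       (unsigned)cid.lci_logid.lgl_ogen);
        }
        fclose(f);
        return 0;
}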

Comment by Hongchao Zhang [ 15/Jun/11 ]

Yes, the MDS's CATALOGS file is corrupted! I suspect it is caused by out-of-order calls to mds_lov_update_desc for the newly added OSTs: if an OST with a larger index is initialized before those with smaller indexes, an uninitialized area is left in the log file (each OST's entry is written at an offset derived from its index), which causes this issue.

Comment by Hongchao Zhang [ 16/Jun/11 ]

Hi Chris,
Could you please try deleting these bogus records in the MDS's CATALOGS file to check whether that fixes the problem? Thanks!

By the way, I have written the following patch and will push it to Gerrit after testing it locally:

diff --git a/lustre/obdclass/llog_lvfs.c b/lustre/obdclass/llog_lvfs.c
index 43d88a3..e6d6534 100644
--- a/lustre/obdclass/llog_lvfs.c
+++ b/lustre/obdclass/llog_lvfs.c
@@ -833,9 +833,10 @@ int llog_put_cat_list(struct obd_device *obd, struct obd_device *disk_obd,
 {
         struct lvfs_run_ctxt saved;
         struct l_file *file;
+        void *buf = NULL;
         int rc, rc1 = 0;
         int size = sizeof(*idarray) * count;
-        loff_t off = idx * sizeof(*idarray);
+        loff_t filesize, off = idx * sizeof(*idarray);
 
         if (!count)
                 GOTO(out1, rc = 0);
@@ -856,6 +857,17 @@ int llog_put_cat_list(struct obd_device *obd, struct obd_device *disk_obd,
                 GOTO(out, rc = -ENOENT);
         }
 
+        filesize = i_size_read(file->f_dentry->d_inode);
+        if (filesize < off) {
+                loff_t count = off - filesize;
+                OBD_ALLOC(buf, count);
+                if (buf == NULL)
+                        GOTO(out, rc = -ENOMEM);
+
+                fsfilt_write_record(disk_obd, file, buf, count, &filesize, 1);
+                OBD_FREE(buf, count);
+        }
+
         rc = fsfilt_write_record(disk_obd, file, idarray, size, &off, 1);
         if (rc) {
                 CDEBUG(D_INODE,"OBD filter: error writeing %s: rc %d\n",
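The idea, in userspace terms, is: before writing a slot at idx * sizeof(*idarray), explicitly zero-fill any gap between the current end of the file and that offset, so skipped indexes never hold unwritten data. A standalone illustration of just that technique (my own sketch, not Lustre code):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SLOT_SIZE 32    /* one CATALOGS entry per OST index */

/* Write one slot at idx * SLOT_SIZE, zero-filling the gap between the
 * current end of file and that offset first. */
static int write_slot(int fd, int idx, const void *slot)
{
        off_t off = (off_t)idx * SLOT_SIZE;
        struct stat st;

        if (fstat(fd, &st) < 0)
                return -1;

        if (st.st_size < off) {
                size_t gap = off - st.st_size;
                char *zeros = calloc(1, gap);

                if (zeros == NULL)
                        return -1;
                if (pwrite(fd, zeros, gap, st.st_size) != (ssize_t)gap) {
                        free(zeros);
                        return -1;
                }
                free(zeros);
        }
        return pwrite(fd, slot, SLOT_SIZE, off) == SLOT_SIZE ? 0 : -1;
}

int main(void)
{
        char slot[SLOT_SIZE];
        int fd = open("CATALOGS.test", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        memset(slot, 0xab, sizeof(slot));
        /* Write a high index first, then a lower one: the region in
         * between is explicit zeros rather than an unwritten gap. */
        write_slot(fd, 304, slot);
        write_slot(fd, 258, slot);
        close(fd);
        return 0;
}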

Comment by Christopher Morrone [ 16/Jun/11 ]

Yes, I'll zero the records and see if it will work. We have a downtime scheduled for Tuesday, June 21, so I will try it then.

Comment by Hongchao Zhang [ 20/Jun/11 ]

The RHEL5 kernel version for b1_8 was updated recently (to 2.6.18-238.12.1), and some time was spent preparing the new build and test environment, so the local test of the patch is delayed; it should be complete tomorrow.

Comment by Hongchao Zhang [ 21/Jun/11 ]

The patch is at http://review.whamcloud.com/#change,987

Comment by Christopher Morrone [ 21/Jun/11 ]

Zeroing the CATALOGS file got everything working this morning.

Comment by Hongchao Zhang [ 08/Aug/11 ]

As per the code, a newly allocated block in the CATALOGS file is initialized with zeros in the current b1_8 and master, and that is also the case for v1_8_5, so this should not be a use-of-uninitialized-data issue. But the content of the CATALOGS file does indicate that it is, and the first bogus data appears just past offset 8192 (the 256th entry, at 32 bytes per entry).

Hi Chris,
Is the block size of the MDT's device 8192? And could you please point me at your source code tree? Thanks!

Comment by Christopher Morrone [ 08/Aug/11 ]

https://github.com/chaos/lustre/tree/1.8.5.0-5chaos

I will need to check the block size.

Comment by Hongchao Zhang [ 10/Aug/11 ]

In osc_llog_init, "llog_get_cat_list" and "llog_put_cat_list" are protected by the MDT's "obd_llog_cat_process" mutex, so the problem should not be a race between OSCs; the remaining possibility is a problem in the "fsfilt" layer while writing the record into the log file.

Hi Chris,
Is your RHEL5.4 the official one? I want to run some tests on lustre 1.8.5.0-5chaos.

Comment by Christopher Morrone [ 15/Aug/11 ]

No, it is CHAOS4.4, so there are some changes from RHEL5.4.

The MDS is using a 3ware card and the array is configured as RAID10. Stripe size is 64k.

Comment by Hongchao Zhang [ 23/Aug/11 ]

Hi Chris,

Could you please tell me where I can get CHAOS4.4? Thanks!

Comment by Hongchao Zhang [ 23/Aug/11 ]

I have tested 1.8.5.0-5chaos against 2.6.18-194.17.1, using 2 OSS nodes (one with 252 OSTs, the other with 50 OSTs), and no such problem was triggered.

FYI, I encountered the error "EXTRA_DIST: variable 'bin_SCRIPTS' is used but 'bin_SCRIPTS' is undefined" while running "sh autogen.sh"; I skipped it, and no other issues showed up in the subsequent compile and test.

Comment by Christopher Morrone [ 26/Aug/11 ]

ftp://gdo-lc.ucllnl.org/pub/projects/chaos/

Can you tell me more about your test? Why did you have the OST count so imbalanced?

We saw this with only 15 OSTs on each of 8 OSS, so I wouldn't think such a high count would be necessary. Did you statically allocate the index numbers to the OSTs, and alternate them between nodes? Did you issue the mount commands in parallel?

Comment by Hongchao Zhang [ 30/Aug/11 ]

The test ran on two of my virtual nodes using loop devices: one node runs the MDT plus 252 OSTs, the other runs 50 OSTs. I didn't specify the index values of these OSTs, didn't alternate them between nodes, and the mount commands weren't issued in parallel.

The reason for testing it this way is that the CATALOGS operations "llog_get_cat_list" and "llog_put_cat_list" are both protected by the MDT's "obd_llog_cat_process" mutex, so the problem should be in the "fsfilt" layer while writing the record into the log file.

Comment by Christopher Morrone [ 30/Aug/11 ]

Ah, I think that's part of your problem.

We DO specify our index values. If you aren't doing that, then the index numbers will just be allocated sequentially as the OSTs connect for the first time. And if you are not mounting for the first time in parallel, then there is no chance of reproducing the issue we are seeing.

Comment by Hongchao Zhang [ 02/Sep/11 ]

I specified the indexes of the OSTs and mounted them out of order, and the problem still does not occur. I can't mount these OSTs in parallel in my test environment; could you please help to test that?

Comment by John Fuchs-Chesney (Inactive) [ 04/Mar/14 ]

Chris – I'm doing some cleanup work on JIRA issues.
Would you prefer that I keep this open and unresolved?
Or may I mark it as 'resolved – cannot reproduce'?
Many thanks,
~ jfc.

Comment by Christopher Morrone [ 04/Mar/14 ]

I think it is very likely that this is still a problem. We seem to hit it in every version of Lustre. Our admins now have a permanent rule of manually adding OSTs sequentially to Lustre because this code is so broken. That is not acceptable. It turns an operation that should take under a minute into one that takes well over an hour.

Comment by Hongchao Zhang [ 18/Jul/14 ]

As of b2_4 (which starts using LOD/OSP instead of LOV/OSC on the MDT), this issue is fixed in "osp_sync_llog_init": it will recreate the log file if it does not exist (previously the log file was only created when the stored log ID was zero):

static int osp_sync_llog_init(const struct lu_env *env, struct osp_device *d)
{
        ...
        ctxt = llog_get_context(obd, LLOG_MDS_OST_ORIG_CTXT);
        LASSERT(ctxt);

        if (likely(logid_id(&osi->osi_cid.lci_logid) != 0)) {
                rc = llog_open(env, ctxt, &lgh, &osi->osi_cid.lci_logid, NULL,
                               LLOG_OPEN_EXISTS);
                /* re-create llog if it is missing */
                if (rc == -ENOENT)
                        logid_set_id(&osi->osi_cid.lci_logid, 0);
                else if (rc < 0)
                        GOTO(out_cleanup, rc);
        }

        if (unlikely(logid_id(&osi->osi_cid.lci_logid) == 0)) {
                rc = llog_open_create(env, ctxt, &lgh, NULL, NULL);
                if (rc < 0)
                        GOTO(out_cleanup, rc);
                osi->osi_cid.lci_logid = lgh->lgh_id;
        }

        LASSERT(lgh != NULL);
        ctxt->loc_handle = lgh;

        rc = llog_cat_init_and_process(env, lgh);
        if (rc)
                GOTO(out_close, rc);

        rc = llog_osd_put_cat_list(env, d->opd_storage, d->opd_index, 1,
                                   &osi->osi_cid);
        ...
}