  1. Lustre
  2. LU-414

error looking up logfile


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.4.0
    • None
    • CHAOS4.4 (RHEL5.4), lustre 1.8.5.0-5chaos
    • 3
    • 10221

    Description

      Our admins tried to add 8 new OSS nodes to an existing Lustre server cluster running 1.8.5.0-5chaos. There were 16 existing OSS nodes with 15 OSTs each, for a total of 240 old OSTs. Each new OSS also has 15 OSTs, for a total of 120 new OSTs.

      When the new OSTs were brought up, at least 54 of them appear to have failed to be configured correctly on the MDS and are stuck in the IN (inactive) state according to "lctl dl". I don't see a pattern in which OSTs on which new OSS failed.

      This looks similar to bug 22658 that we have seen in the past:

      2011-06-14 11:57:17 LustreError: 1432:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x10612404:0x0: rc -2
      2011-06-14 11:57:17 LustreError: 1432:0:(llog_obd.c:200:llog_setup()) obd lsd-OST012e-osc ctxt 2 lop_setup=ffffffff885b3dc0 failed -2
      2011-06-14 11:57:17 LustreError: 1432:0:(osc_request.c:4242:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
      2011-06-14 11:57:17 LustreError: 1432:0:(osc_request.c:4258:osc_llog_init()) osc 'lsd-OST012e-osc' tgt 'lsd-MDT0000' rc=-2
      2011-06-14 11:57:17 LustreError: 1432:0:(osc_request.c:4260:osc_llog_init()) logid 0x10612404:0x0
      2011-06-14 11:57:17 LustreError: 1432:0:(lov_log.c:253:lov_llog_init()) error osc_llog_init idx 302 osc 'lsd-OST012e-osc' tgt 'lsd-MDT0000' (rc=-2)
      2011-06-14 11:57:17 LustreError: 1432:0:(llog_lvfs.c:612:llog_lvfs_create()) error looking up logfile 0x62800000028:0x10612404: rc -2
      2011-06-14 11:57:17 LustreError: 1432:0:(llog_obd.c:200:llog_setup()) obd lsd-OST0130-osc ctxt 2 lop_setup=ffffffff885b3dc0 failed -2
      2011-06-14 11:57:17 LustreError: 1444:0:(lov_log.c:161:lov_llog_origin_connect()) error osc_llog_connect tgt 302 (-107)
      2011-06-14 11:57:17 LustreError: 1444:0:(mds_lov.c:1044:__mds_lov_synchronize()) lsd-MDT0000: lsd-OST012e_UUID failed at llog_origin_connect: -107
      2011-06-14 11:57:17 Lustre: lsd-OST012e_UUID: Sync failed deactivating: rc -107
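
      When reading these messages it helps to translate the hex suffix in each OST name into the decimal index that other lines print, and to name the return codes. The following is a minimal sketch, not part of Lustre; the helper names ost_index and errno_name are hypothetical, and the errno table is hardcoded with the Linux values for the two codes that appear in this log.

```python
import re

# Linux errno values for the two return codes seen in the log above.
# Hardcoded so the sketch does not depend on the platform running it.
LINUX_ERRNO = {2: "ENOENT", 107: "ENOTCONN"}


def ost_index(name):
    """Return the decimal index encoded in a name like 'lsd-OST012e-osc'."""
    match = re.search(r"OST([0-9a-fA-F]{4})", name)
    if match is None:
        raise ValueError(f"no OST index found in {name!r}")
    return int(match.group(1), 16)


def errno_name(rc):
    """Map a negative return code from the log to its Linux errno name."""
    return LINUX_ERRNO.get(-rc, f"unknown({rc})")
```

      For example, ost_index("lsd-OST012e-osc") gives 302, which agrees with the "idx 302" and "tgt 302" fields in the messages above, and errno_name(-107) gives "ENOTCONN".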
      

      The admins decided to reboot the MDS, but it is still unable to activate those OSTs (at least, I assume that it is the same set of OSTs):

      2011-06-14 12:49:32 LustreError: 9611:0:(lov_log.c:161:lov_llog_origin_connect()) error osc_llog_connect tgt 258 (-107)
      2011-06-14 12:49:32 LustreError: 9611:0:(mds_lov.c:1044:__mds_lov_synchronize()) lsd-MDT0000: lsd-OST0102_UUID failed at llog_origin_connect: -107
      2011-06-14 12:49:32 Lustre: lsd-OST0102_UUID: Sync failed deactivating: rc -107
      2011-06-14 12:49:32 LustreError: 9612:0:(lov_log.c:161:lov_llog_origin_connect()) error osc_llog_connect tgt 259 (-107)
      2011-06-14 12:49:32 LustreError: 9646:0:(mds_lov.c:1044:__mds_lov_synchronize()) lsd-MDT0000: lsd-OST0125_UUID failed at llog_origin_connect: -107
      2011-06-14 12:49:32 LustreError: 9646:0:(mds_lov.c:1044:__mds_lov_synchronize()) Skipped 20 previous similar messages
      2011-06-14 12:49:32 Lustre: lsd-OST0125_UUID: Sync failed deactivating: rc -107
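
      One way to check whether the same set of OSTs fails before and after the reboot is to collect the OST names from the "failed at llog_origin_connect" lines of each console log and compare the two sets. The sketch below assumes the message format shown in the excerpts above; the function name failed_osts is hypothetical.

```python
import re

# Matches e.g. "lsd-OST012e_UUID failed at llog_origin_connect: -107"
SYNC_FAIL = re.compile(r"(\S+)_UUID failed at llog_origin_connect: (-?\d+)")


def failed_osts(lines):
    """Collect the names of OSTs reported as failing llog_origin_connect."""
    found = set()
    for line in lines:
        match = SYNC_FAIL.search(line)
        if match:
            found.add(match.group(1))
    return found


def compare_runs(before_lines, after_lines):
    """Return (failed in both runs, only before, only after)."""
    before = failed_osts(before_lines)
    after = failed_osts(after_lines)
    return before & after, before - after, after - before
```

      Because the console suppresses repeats ("Skipped 20 previous similar messages"), the sets gathered this way are only lower bounds; the output of "lctl dl" on the MDS is the more reliable source for the full list of inactive OSTs.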
      

      Notice that there is no longer any warning about "error looking up logfile", but lov_llog_origin_connect() is still failing.

      I suspect that lov_llog_origin_connect() is getting error code -107 (ENOTCONN) from llog_obd2ops(), meaning that the llog_ctxt *ctxt is NULL. I say that because, watching the logs, I see an RPC between the MDS and OSS nodes complete successfully, but I can't see any RPC being sent after the "lov_llog_origin_connect()) connect 256/360" lines in the log.
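
      The suspected sequence can be illustrated with a toy model. This is not Lustre code and does not reproduce the real data structures; ToyObd, toy_llog_setup and toy_llog_connect are hypothetical stand-ins that only mirror the hypothesis above: if context setup fails, no context is registered, and a later connect attempt returns -ENOTCONN without ever sending an RPC.

```python
ENOENT, ENOTCONN = 2, 107  # Linux errno values matching rc -2 and -107 in the log


class ToyObd:
    """Stand-in for an OSC device; holds llog contexts keyed by index."""

    def __init__(self, name):
        self.name = name
        self.ctxts = {}


def toy_llog_setup(obd, ctxt_idx, logfile_exists):
    """Register a context only if the log file lookup succeeds."""
    if not logfile_exists:
        return -ENOENT  # models "error looking up logfile ... rc -2"
    obd.ctxts[ctxt_idx] = object()
    return 0


def toy_llog_connect(obd, ctxt_idx, sent_rpcs):
    """Fail with -ENOTCONN before sending anything if the context is missing."""
    if obd.ctxts.get(ctxt_idx) is None:
        return -ENOTCONN
    sent_rpcs.append((obd.name, ctxt_idx))  # models the RPC to the OST
    return 0
```

      In this model, a failed toy_llog_setup(obd, 2, False) followed by toy_llog_connect(obd, 2, rpcs) returns -107 and leaves rpcs empty, which is consistent with seeing the error but no outgoing RPC in the debug log. Whether the real code takes this path would need to be confirmed against the source.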

      It appears that at the ptlrpc level, the mdt and ost are in fact fully connected. The import/export appear to be set up.

      I am beginning to suspect that the "fix" for bug 22658 that allows the mds to start up when there are missing log files just lets the server get stuck at this next point in the code.

      Also, I think there is pretty clearly some bug in Lustre's initial creation of ost llog files on the mds.

      I am attaching the mds console log for now. I can package up some more detailed lustre logs tomorrow.

            People

              hongchao.zhang Hongchao Zhang
              morrone Christopher Morrone