  Lustre / LU-2079

Error reading changelog_users file preventing successful changelog setup during init

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Blocker
    • None
    • Fix Version/s: Lustre 2.4.0
    • 3
    • 4335

    Description

      I'm having issues mounting our 2.3.51-2chaos based FS after rebooting the clients. I see the following messages on the console:

      LustreError: 5872:0:(client.c:1116:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881009598400 x1414745503563778/t0(0) o101->MGC172.20.5.2@o2ib500@172.20.5.2@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      ib0: no IPv6 routers present
      LustreError: 5900:0:(client.c:1116:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880810225800 x1414745503563780/t0(0) o101->MGC172.20.5.2@o2ib500@172.20.5.2@o2ib500:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: Skipped 1 previous similar message
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: Skipped 2 previous similar messages
      LustreError: 11-0: lstest-MDT0000-mdc-ffff881029e3c800: Communicating with 172.20.5.2@o2ib500, operation mds_connect failed with -11
      LustreError: Skipped 5 previous similar messages
      LustreError: 5954:0:(lmv_obd.c:1190:lmv_statfs()) can't stat MDS #0 (lstest-MDT0000-mdc-ffff88080ba12000), error -11
      LustreError: 4827:0:(lov_obd.c:937:lov_cleanup()) lov tgt 385 not cleaned! deathrow=0, lovrc=1
      LustreError: 4827:0:(lov_obd.c:937:lov_cleanup()) lov tgt 386 not cleaned! deathrow=1, lovrc=1
      Lustre: Unmounted lstest-client
      LustreError: 5954:0:(obd_mount.c:2332:lustre_fill_super()) Unable to mount  (-11)
      

      I haven't had time to look into the cause, but thought it might be useful to open an issue about it.

      Attachments

        Activity

          prakash Prakash Surya (Inactive) made changes -
          Resolution New: Cannot Reproduce [ 5 ]
          Status Original: In Progress [ 3 ] New: Resolved [ 5 ]

          prakash Prakash Surya (Inactive) added a comment -

          I believe the issue was a bug in earlier versions of Lustre, which is detailed in the comments on this issue. It has been resolved and is considered fixed since v2.3.52.
          prakash Prakash Surya (Inactive) made changes -
          Summary Original: Error reading New: Error reading changelog_users file preventing successful changelog setup during init
          prakash Prakash Surya (Inactive) made changes -
          Assignee Original: Alex Zhuravlev [ bzzz ] New: Prakash Surya [ prakash ]
          prakash Prakash Surya (Inactive) made changes -
          Summary Original: Client unable to mount (-11) New: Error reading
          prakash Prakash Surya (Inactive) added a comment - edited

          Alex, I reverted 4376 and 4169, and truncated the changelog_users file to zero length. Things are back online and look healthy, so I'll go ahead and resolve this issue. Thanks for the help!
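          For anyone landing here in the same state: the workaround above simply cuts the changelog_users file on the MDT's backing filesystem down to zero bytes. Below is a minimal C sketch of that single step, assuming the backing filesystem is mounted locally; the mount point and path are made up for illustration and are not taken from this ticket.

          /*
           * Minimal sketch of the "truncate changelog_users to zero length"
           * workaround described above. The mount point and file path are
           * hypothetical; adjust for the actual backing filesystem layout.
           */
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
                  const char *path = "/mnt/mdt-backfs/changelog_users"; /* hypothetical */

                  if (truncate(path, 0) != 0) {   /* cut the file to zero bytes */
                          perror("truncate");
                          return 1;
                  }
                  return 0;
          }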


          bzzz Alex Zhuravlev added a comment -

          > I'll plan to remove or truncate the file to zero length (does it matter?), and if that goes fine, we can close this ticket as "cannot reproduce"

          It should be OK to just truncate it.

          > Also, what does LMA stand for and/or what's its purpose? Just curious.

          It stands for Lustre Metadata Attributes (struct lustre_mdt_attrs).
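          For orientation, the LMA blob is small: it carries feature flags plus the object's own FID. The sketch below is based on Lustre headers of roughly this vintage and is shown only as a rough guide (fixed-width typedefs stand in for the kernel's __u32/__u64 types); it is not quoted from this ticket.

          #include <stdint.h>

          /* FID triple; matches the [seq:oid:ver] notation in the log excerpts
           * elsewhere in this ticket, e.g. [0x200000bd0:0x4f:0x0]. */
          struct lu_fid {
                  uint64_t f_seq;   /* sequence */
                  uint32_t f_oid;   /* object id within the sequence */
                  uint32_t f_ver;   /* version, normally 0 */
          };

          /* Lustre metadata attributes, stored in the object's "trusted.lma" xattr. */
          struct lustre_mdt_attrs {
                  uint32_t      lma_compat;    /* compatible feature flags */
                  uint32_t      lma_incompat;  /* incompatible feature flags */
                  struct lu_fid lma_self_fid;  /* the object's own FID */
          };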

          prakash Prakash Surya (Inactive) added a comment -

          OK. That landed between 2.3.51 and 2.3.52. We started seeing the message when we upgraded the test system to 2.3.51-Xchaos (from orion-2_3_49_92_1-72chaos). We haven't seen it on our production Grove FS, but we were much more conservative with its upgrade process, jumping from orion-2_3_49_54_2-68chaos to 2.3.54-6chaos.

          I think I'm going to chalk this up to the FIDs being shared unless we have evidence to the contrary. I'll plan to remove or truncate the file to zero length (does it matter?), and if that goes fine, we can close this ticket as "cannot reproduce".

          Also, what does LMA stand for and/or what's its purpose? Just curious.

          bzzz Alex Zhuravlev added a comment -

          >> When was the "shared FID" bug fixed? I tried to grep through the master git logs, but nothing was immediately apparent to me. I am OK with removing the file and moving on, just as long as this issue won't come up in the future, which I'd like to verify.

          commit 155e4b6cf45cc0ab21f72d94e5cccbd7a0939c58
          Author: Alex Zhuravlev <alexey.zhuravlev@intel.com>
          Date: Tue Oct 2 23:52:42 2012 +0400

          LU-2075 fld: use predefined FIDs

          and let OSD do mapping to the names internally.

          So, during the landing process we went back to the scheme where the LMA is set by the OSD itself (otherwise we would have to set it in many places, as opposed to just osd-zfs and osd-ldiskfs now). So now every object created through the OSD API is supposed to have an LMA (which can later be used by LFSCK, for example).


          prakash Prakash Surya (Inactive) added a comment -

          > how often do you see "can't get LMA" ?

          Looking back at the logs, it looks like we've seen it about 15 times on the test MDS, for 3 distinct FIDs:

          LustreError: 33431:0:(osd_object.c:410:osd_object_init()) lstest-MDT0000: can't get LMA on [0x200000bd0:0x4f:0x0]: rc = -2
          
          LustreError: 33231:0:(osd_object.c:410:osd_object_init()) lstest-MDT0000: can't get LMA on [0x200000bda:0x3:0x0]: rc = -2
          
          LustreError: 33244:0:(osd_object.c:410:osd_object_init()) lstest-MDT0000: can't get LMA on [0x200000bda:0x4:0x0]: rc = -2
          

          I see it on the production OSTs frequently:

          <ConMan> Console [grove214] log at 2012-11-16 04:00:00 PST.
          2012-11-16 04:18:37 LustreError: 5777:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can't get LMA on [0x100000000:0x10ac1:0x0]: rc = -2
          2012-11-16 04:19:10 LustreError: 7522:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can't get LMA on [0x100000000:0x10ac4:0x0]: rc = -2
          2012-11-16 04:20:16 LustreError: 7362:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can't get LMA on [0x100000000:0x10aca:0x0]: rc = -2
          2012-11-16 04:20:16 LustreError: 7362:0:(osd_object.c:410:osd_object_init()) Skipped 3 previous similar messages
          2012-11-16 04:22:37 LustreError: 5770:0:(osd_object.c:410:osd_object_init()) ls1-OST00d6: can't get LMA on [0x100000000:0x10ad7:0x0]: rc = -2
          2012-11-16 04:22:37 LustreError: 5770:0:(osd_object.c:410:osd_object_init()) Skipped 5 previous similar messages
          
          <ConMan> Console [grove214] log at 2012-11-16 05:00:00 PST.
          

          When was the "shared FID" bug fixed? I tried to grep through the master git logs, but nothing was immediately apparent to me. I am OK with removing the file and moving on, just as long as this issue won't come up in the future, which I'd like to verify.
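          As an aside, for anyone wanting to poke at this by hand: one rough way to check whether a particular object carries an LMA is to read its trusted.lma extended attribute from the backing filesystem. The sketch below assumes an ldiskfs-backed target mounted locally and uses a made-up object path; it is illustrative only, not a procedure from this ticket (reading trusted.* xattrs requires root).

          #include <errno.h>
          #include <stdio.h>
          #include <sys/xattr.h>

          int main(void)
          {
                  /* Hypothetical object path on a locally mounted ldiskfs target. */
                  const char *obj = "/mnt/ost-ldiskfs/O/0/d1/68289";
                  char buf[256];
                  ssize_t n = getxattr(obj, "trusted.lma", buf, sizeof(buf));

                  if (n >= 0)
                          printf("%s: LMA present (%zd bytes)\n", obj, n);
                  else if (errno == ENODATA)
                          printf("%s: object has no LMA xattr\n", obj);
                  else
                          perror("getxattr");
                  return 0;
          }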


          People

            Assignee: Prakash Surya (Inactive)
            Reporter: Prakash Surya (Inactive)
            Votes: 0
            Watchers: 7
