Lustre / LU-8508

kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 2.9.0
    • Lustre 2.8.0

    Description

      During Lustre DNE2 testing I noticed an issue with the latest master builds. When mounting storage targets on servers other than the one hosting the MGT, I get a kernel panic with the output below. To the best of my ability, I have verified that this is not a network problem. I have also tried an FE build, which works, and another master build (3419), which also works:

       
      [root@zlfs2-oss1 ~]# mount -vvv -t lustre /dev/nvme0n1 /mnt/MDT0000
      arg[0] = /sbin/mount.lustre
      arg[1] = -v
      arg[2] = -o
      arg[3] = rw
      arg[4] = /dev/nvme0n1
      arg[5] = /mnt/MDT0000
      source = /dev/nvme0n1 (/dev/nvme0n1), target = /mnt/MDT0000
      options = rw
      checking for existing Lustre data: found
      Reading CONFIGS/mountdata
      Writing CONFIGS/mountdata
      mounting device /dev/nvme0n1 at /mnt/MDT0000, flags=0x1000000 options=osd=osd-ldiskfs,user_xattr,errors=remount-ro,mgsnode=192.168.5.21@o2ib,virgin,update,param=mgsnode=192.168.5.21@o2ib,svname=zlfs2-MDT0000,device=/dev/nvme0n1
      mount.lustre: cannot parse scheduler options for '/sys/block/nvme0n1/queue/scheduler'
      
      Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
       kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      
      Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
       kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) LBUG
      
      Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
       kernel:Kernel panic - not syncing: LBUG
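
For readers unfamiliar with this class of failure: the assertion in lu_device_fini() fires when a device is torn down while something still holds a reference to it. Below is a minimal sketch in C of how a missed release on an error path can leave a count of 1 at teardown. All names (demo_device, demo_object_get, demo_save_leaky, and so on) are hypothetical and are not Lustre APIs, and the ticket itself does not say which reference was leaked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device that may only be torn down
 * once no object references it any more. */
struct demo_device {
    atomic_int ref; /* number of live objects pointing at this device */
};

struct demo_object {
    struct demo_device *dev;
};

void demo_device_init(struct demo_device *d)
{
    atomic_init(&d->ref, 0);
}

/* Taking an object pins the device. */
void demo_object_get(struct demo_object *o, struct demo_device *d)
{
    o->dev = d;
    atomic_fetch_add(&d->ref, 1);
}

/* Releasing the object unpins the device. */
void demo_object_put(struct demo_object *o)
{
    atomic_fetch_sub(&o->dev->ref, 1);
    o->dev = NULL;
}

/* The condition a fini-time assertion would check. */
bool demo_device_can_fini(struct demo_device *d)
{
    return atomic_load(&d->ref) == 0;
}

/* Leaky variant: the early return skips the put, so the count stays at 1. */
int demo_save_leaky(struct demo_device *d, bool fail)
{
    struct demo_object o;

    demo_object_get(&o, d);
    if (fail)
        return -1; /* BUG: reference is never dropped on this path */
    demo_object_put(&o);
    return 0;
}

/* Fixed variant: every exit path goes through the put. */
int demo_save_fixed(struct demo_device *d, bool fail)
{
    struct demo_object o;
    int rc = fail ? -1 : 0;

    demo_object_get(&o, d);
    demo_object_put(&o);
    return rc;
}
```

After a failed demo_save_leaky() call, demo_device_can_fini() reports false, which is the situation an assertion of the form atomic_read(&d->ld_ref) == 0 is designed to catch.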
      

      Attached is some debugging output and further information.

      Builds Tried:
      master b3424 - issues
      master b3423 - issues
      master b3420 - issues
      master b3419 - works
      fe 2.8 b18 - works

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.9


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22004/
            Subject: LU-8508 nodemap: improve object handling in cache saving
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 45cb603b4352a73077dcc45ec2cdea403837a7ba

            pjones Peter Jones added a comment -

            Let's see how the second review goes to see whether the refresh is needed


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hey Peter,

            No problem. I made the changes, would it be better to upload them and face the tests again, or leave it as is?

            Thanks,
            Kit

            pjones Peter Jones added a comment -

            Hi Kit

            I checked with Oleg and you are right - sorry about that - so I have requested a second reviewer so that we can get this landed

            Peter


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hey Peter,

            Are we talking about change 22004? I only see two style comments from Andreas. There are a few over 80 chars autocomments as well, but I thought we were ignoring those now to match the Linux style guide. I'll refresh it, but I want to make sure I'm not missing something.

            Thanks,
            Kit

            pjones Peter Jones added a comment -

            Kit

            I think a second reviewer is currently holding off in anticipation of another version, given that there are quite a number of comments, so I think it would be good to refresh it.

            Peter


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hey Peter,

            I wasn't planning on it since he +1'd it, unless there were other issues found, but I can if that's desired.

            Kit

            pjones Peter Jones added a comment -

            Kit

            Will you be refreshing the patch in light of Andreas's review feedback?

            Peter


            kit.westneat Kit Westneat (Inactive) added a comment -

            BTW the cause of the second bug is that if a new OST mounts before the MGC has pulled the nodemap config from the MGS, it creates a new blank config on disk. Part of that code was erroneously assuming that it was in the MGS, as normally all new records are created there and then sent to the OSTs, so it was returning an error. That's why the first OST failed to mount. When the other OSTs were mounted, the MGC was already connected to the MGS, so it was able to pull the config and save it properly. That's why the other OSTs were able to mount after rebooting, but nvme0n1 wasn't able to until the others were mounted.
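
The ordering dependency described in this comment can be modelled roughly as below. This is only an illustrative sketch in C, not the code touched by change 22004: the types, function names, and the -EINVAL return value are all assumptions made for the example. It shows a save routine that rejects creating a new (blank) record unless it runs on the MGS, next to a variant that allows a target to write its own blank config.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical roles; the real code distinguishes MGS and targets differently. */
enum sketch_role { SKETCH_ROLE_MGS, SKETCH_ROLE_TARGET };

struct sketch_node {
    enum sketch_role role;
    bool pulled_from_mgs; /* MGC has already fetched the nodemap config */
    bool config_on_disk;  /* a config (blank or pulled) has been saved */
};

/* Buggy model: a brand-new record is assumed to be created only on the MGS,
 * so a target that mounts before the config was pulled gets an error. */
int sketch_save_config_buggy(struct sketch_node *n)
{
    bool creating_blank = !n->pulled_from_mgs;

    if (creating_blank && n->role != SKETCH_ROLE_MGS)
        return -EINVAL; /* assumed error code, chosen for illustration */
    n->config_on_disk = true;
    return 0;
}

/* Fixed model: a target may save a blank config locally; it is replaced
 * later once the real config arrives from the MGS. */
int sketch_save_config_fixed(struct sketch_node *n)
{
    n->config_on_disk = true;
    return 0;
}
```

In this model the first target to mount (pulled_from_mgs == false) fails under the buggy variant, while a target mounted after the config has been pulled succeeds under both, which matches the mount ordering described in the comment.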

            People

              kit.westneat Kit Westneat (Inactive)
              adam.j.roe Adam Roe (Inactive)