Lustre / LU-8508

kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Versions: Lustre 2.9.0, Lustre 2.8.0
    • Labels: None
    • Severity: 3

    Description

      Lustre DNE2 testing: noticed some issues with the latest master builds. When mounting storage targets on servers other than the one with the MGT, I get a kernel panic with the messages below. I have validated (to the best of my ability) that this is not a network issue. I have also tried an FE build, which works, and another master build (3419), which also works:

       
      [root@zlfs2-oss1 ~]# mount -vvv -t lustre /dev/nvme0n1 /mnt/MDT0000
      arg[0] = /sbin/mount.lustre
      arg[1] = -v
      arg[2] = -o
      arg[3] = rw
      arg[4] = /dev/nvme0n1
      arg[5] = /mnt/MDT0000
      source = /dev/nvme0n1 (/dev/nvme0n1), target = /mnt/MDT0000
      options = rw
      checking for existing Lustre data: found
      Reading CONFIGS/mountdata
      Writing CONFIGS/mountdata
      mounting device /dev/nvme0n1 at /mnt/MDT0000, flags=0x1000000 options=osd=osd-ldiskfs,user_xattr,errors=remount-ro,mgsnode=192.168.5.21@o2ib,virgin,update,param=mgsnode=192.168.5.21@o2ib,svname=zlfs2-MDT0000,device=/dev/nvme0n1
      mount.lustre: cannot parse scheduler options for '/sys/block/nvme0n1/queue/scheduler'
      
      Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
       kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      
      Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
       kernel:LustreError: 3842:0:(lu_object.c:1243:lu_device_fini()) LBUG
      
      Message from syslogd@zlfs2-oss1 at Aug 16 21:52:33 ...
       kernel:Kernel panic - not syncing: LBUG
      

      Attached is some debugging / more info.

      Builds Tried:
      master b3424 - issues
      master b3423 - issues
      master b3420 - issues
      master b3419 - works
      fe 2.8 b18 - works
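
      For context, the assertion in the panic is a reference-count check in the generic lu_device teardown path: a device may only be finalized once every user has dropped its reference, and a leftover reference trips an LBUG and panics the node. Below is a minimal user-space sketch of that pattern (simplified illustration only, not the actual Lustre source; the only names reused are the ones visible in the log above):

      #include <assert.h>

      /* user-space stand-ins for the kernel atomic type and accessor */
      typedef struct { int counter; } atomic_t;
      static int atomic_read(const atomic_t *v) { return v->counter; }

      /* simplified stand-in for struct lu_device: only the refcount matters here */
      struct lu_device {
              atomic_t ld_ref;
      };

      /*
       * Teardown check: in the panic above this is the assertion that prints
       * "Refcount is 1" and then LBUGs. A non-zero ld_ref at this point means
       * some object or user of the device was never released before cleanup.
       */
      void lu_device_fini_sketch(struct lu_device *d)
      {
              assert(atomic_read(&d->ld_ref) == 0);
              /* ... release the remaining device state ... */
      }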


          Activity

            pjones Peter Jones added a comment -

            Let's see how the second review goes before deciding whether the refresh is needed.

            kit.westneat Kit Westneat (Inactive) added a comment -

            Hey Peter,

            No problem. I made the changes, would it be better to upload them and face the tests again, or leave it as is?

            Thanks,
            Kit
            pjones Peter Jones added a comment -

            Hi Kit

            I checked with Oleg and you are right - sorry about that - so I have requested a second reviewer so that we can get this landed

            Peter


            kit.westneat Kit Westneat (Inactive) added a comment -

            Hey Peter,

            Are we talking about change 22004? I only see two style comments from Andreas. There are a few over-80-chars autocomments as well, but I thought we were ignoring those now to match the Linux style guide. I'll refresh it, but I want to make sure I'm not missing something.

            Thanks,
            Kit
            pjones Peter Jones added a comment -

            Kit

            I think that at the moment a second reviewer is holding off in anticipation of another version being forthcoming, given that there are quite a number of comments, so I think it would be good to refresh it.

            Peter

            kit.westneat Kit Westneat (Inactive) added a comment -

            Hey Peter,

            I wasn't planning on it since he +1'd it, unless there were other issues found, but I can if that's desired.

            Kit
            pjones Peter Jones added a comment -

            Kit

            Will you be refreshing the patch in light of Andreas's review feedback?

            Peter


            kit.westneat Kit Westneat (Inactive) added a comment -

            BTW the cause of the second bug is that if a new OST mounts before the MGC has pulled the nodemap config from the MGS, it creates a new blank config on disk. Part of that code was erroneously assuming that it was on the MGS, as normally all new records are created there and then sent to the OSTs, so it was returning an error. That's why the first OST failed to mount. When the other OSTs were mounted, the MGC was already connected to the MGS, so it was able to pull the config and save it properly. That's why the other OSTs were able to mount after rebooting, but nvme0n1 wasn't able to until the others were mounted.
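
            To illustrate the shape of the problem described above, here is a hedged sketch (the structure and helper names are hypothetical stand-ins for illustration, not the symbols used by the actual patch): the code that sets up the on-disk nodemap config has to distinguish the MGS, where new records legitimately originate, from an MDT/OST, where an empty config only means the MGC has not pulled it from the MGS yet.

            #include <stdbool.h>
            #include <stdio.h>

            /* Hypothetical stand-ins for illustration only; not the real Lustre
             * nodemap structures or functions. */
            struct tgt {
                    bool is_mgs;      /* running on the MGS node? */
                    bool has_config;  /* nodemap config present on disk? */
            };

            static void create_blank_config(struct tgt *t)
            {
                    t->has_config = true;   /* placeholder until the real config arrives */
            }

            static int nodemap_startup_sketch(struct tgt *t)
            {
                    if (t->has_config)
                            return 0;       /* config already on disk, nothing to do */

                    if (t->is_mgs) {
                            /* On the MGS a brand-new config is expected: create it. */
                            create_blank_config(t);
                            return 0;
                    }

                    /*
                     * On an MDT/OST an empty config only means the MGC has not yet
                     * pulled the nodemap config from the MGS. Treating this case as
                     * if we were on the MGS and returning an error is what made the
                     * first non-MGS target fail to mount; instead, create the blank
                     * placeholder and let the MGC overwrite it once it connects and
                     * fetches the real config.
                     */
                    create_blank_config(t);
                    return 0;
            }

            int main(void)
            {
                    struct tgt first_ost = { .is_mgs = false, .has_config = false };
                    printf("mount rc = %d\n", nodemap_startup_sketch(&first_ost));
                    return 0;
            }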

            kit.westneat Kit Westneat (Inactive) added a comment -

            This patch is still a work in progress, but addresses both these issues.

            gerrit Gerrit Updater added a comment -

            Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/22004
            Subject: LU-8508 nodemap: improve object handling in cache saving
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6753d578f44195ff6e4476266538887f6cd07712

            yong.fan nasf (Inactive) added a comment -

            Failing to mount the OST is a different issue from the original "ASSERTION( atomic_read(&d->ld_ref) == 0 )".

            "The first target I try to mount which isn't on the same server as the MGT will fail and get stuck in this state. Not mounted but in a lock somewhere, its like it starts the service without a target."

            Have you mounted the MGS before mounting the MDT or OST? If not, please mount the MGS (or rather the MGT on the MGS node) first. Otherwise, please enable -1 level Lustre kernel debug on both the MGS and the OSS/MDS, then try again and attach the Lustre debug logs. Thanks!
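
            For reference, the sequence nasf suggests looks roughly like the following (device paths and mount points are illustrative, apart from the reporter's /dev/nvme0n1 target taken from the log above):

            # on the MGS node: mount the MGT first
            mount -t lustre /dev/<mgt-device> /mnt/mgt

            # on both the MGS and the failing OSS/MDS: enable full (-1) Lustre kernel debug
            lctl set_param debug=-1

            # retry the failing target mount, then dump the kernel debug log to a file
            mount -t lustre /dev/nvme0n1 /mnt/MDT0000
            lctl dk /tmp/lustre-debug.log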

            People

              Assignee: kit.westneat Kit Westneat (Inactive)
              Reporter: adam.j.roe Adam Roe (Inactive)
              Votes: 0
              Watchers: 6
