Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9334

LBUG lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.10.0
    • Lustre 2.10.0
    • None
    • Soak stress cluster - Lustre-master build 3554 version=2.9.55_35_gaa32cc5
    • 3
    • 9223372036854775807

    Description

      After installing build 3554, attempting to re-mount the filesystem. On MGS/MDS node:

      Apr 13 14:14:42 soak-8 kernel: LDISKFS-fs (dm-5): recovery complete
      Apr 13 14:14:42 soak-8 kernel: LDISKFS-fs (dm-5): mounted filesystem with ordered data mode. Opts: user_xattr,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      Apr 13 14:14:44 soak-8 kernel: Lustre: MGS: Connection restored to 7703d69c-e9bc-c7f1-f9af-daf27057ff03 (at 0@lo)
      Apr 13 14:14:44 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(mdd_device.c:1084:mdd_prepare()) soaked-MDD0000: failed to initialize lfsck: rc = -20
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -20
      Apr 13 14:14:44 soak-8 kernel: Lustre: Failing over soaked-MDT0000
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5553:0:(osp_object.c:527:osp_attr_get()) soaked-MDT0001-osp-MDT0000:osp_attr_get update error [0x200000009:0x1:0x0]: rc = -5
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5554:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0000-mdtlov: can't get id from catalogs: rc = -5
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5554:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0000 getting update log failed: rc = -5
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5553:0:(osp_object.c:527:osp_attr_get()) Skipped 2 previous similar messages
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lu_object.c:1224:lu_device_fini()) LBUG
      Apr 13 14:14:44 soak-8 kernel: Pid: 5015, comm: mount.lustre
      Apr 13 14:14:44 soak-8 kernel: #012Call Trace:
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0bcb7f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0bcb861>] lbug_with_loc+0x41/0xb0 [libcfs]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0fe48>] lu_device_fini+0xb8/0xc0 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf4b12>] ls_device_put+0x82/0x2a0 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf4e0d>] local_oid_storage_fini+0xdd/0x210 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa129a308>] mgs_fs_cleanup+0x88/0xa0 [mgs]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa1292b96>] mgs_device_fini+0x196/0x5b0 [mgs]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cff724>] class_cleanup+0x7f4/0xd80 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d02364>] class_process_config+0x1f84/0x2c30 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf1659>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d030ff>] class_manual_cleanup+0xef/0x810 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d332e0>] server_put_super+0xb20/0xcd0 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d3708b>] server_fill_super+0xcdb/0x184c [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0f1d8>] lustre_fill_super+0x328/0x950 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0eeb0>] ? lustre_fill_super+0x0/0x950 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffff81201b1d>] mount_nodev+0x4d/0xb0
      Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d06c78>] lustre_mount+0x38/0x60 [obdclass]
      Apr 13 14:14:44 soak-8 kernel: [<ffffffff812024c9>] mount_fs+0x39/0x1b0
      Apr 13 14:14:44 soak-8 kernel: [<ffffffff8121e24f>] vfs_kern_mount+0x5f/0xf0
      Apr 13 14:14:44 soak-8 kernel: [<ffffffff812207ae>] do_mount+0x24e/0xaa0
      Apr 13 14:14:44 soak-8 kernel: [<ffffffff81185a7e>] ? __get_free_pages+0xe/0x50
      Apr 13 14:14:44 soak-8 kernel: [<ffffffff81221096>] SyS_mount+0x96/0xf0
      Apr 13 14:14:44 soak-8 kernel: [<ffffffff81696c49>] system_call_fastpath+0x16/0x1b
      Apr 13 14:14:45 soak-8 kernel:
      Apr 13 14:14:45 soak-8 kernel: Kernel panic - not syncing: LBUG
      

      vmcore-dmesg is attached, crash dump is available on soak-8

      Attachments

        1. 191916.dump.txt.gz
          259 kB
        2. 191952.dump.txt.gz
          61 kB
        3. 192002.dump.txt.gz
          25 kB
        4. 192012.dump.txt.gz
          17 kB
        5. 192022.dump.txt.gz
          17 kB
        6. 192032.dump.txt.gz
          19 kB
        7. 192042.dump.txt.gz
          3 kB
        8. 192052.dump.txt.gz
          18 kB
        9. 192102.dump.txt.gz
          18 kB
        10. 192112.dump.txt.gz
          19 kB
        11. vmcore-dmesg.txt
          129 kB

        Issue Links

          Activity

            [LU-9334] LBUG lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26703/
            Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1b136613440bd81e284a12df97618f92c9729d71

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26703/ Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file Project: fs/lustre-release Branch: master Current Patch Set: Commit: 1b136613440bd81e284a12df97618f92c9729d71

            Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26718/
            Subject: LU-9334 lfsck: reset trace file for upgrade case
            Project: fs/lustre-release
            Branch: pfl
            Current Patch Set:
            Commit: ba4a83aa9f72d7855ff77a37897619155313b198

            gerrit Gerrit Updater added a comment - Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26718/ Subject: LU-9334 lfsck: reset trace file for upgrade case Project: fs/lustre-release Branch: pfl Current Patch Set: Commit: ba4a83aa9f72d7855ff77a37897619155313b198

            Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26718
            Subject: LU-9334 lfsck: reset trace file for upgrade case
            Project: fs/lustre-release
            Branch: pfl
            Current Patch Set: 1
            Commit: 740e28ad8fad95080da4b0275562378ef0b51cd3

            gerrit Gerrit Updater added a comment - Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26718 Subject: LU-9334 lfsck: reset trace file for upgrade case Project: fs/lustre-release Branch: pfl Current Patch Set: 1 Commit: 740e28ad8fad95080da4b0275562378ef0b51cd3

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26716
            Subject: LU-9334 lfsck: reset trace file for upgrade case
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 48672c55bf83bbadec89bbd3406489251d701db0

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26716 Subject: LU-9334 lfsck: reset trace file for upgrade case Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 48672c55bf83bbadec89bbd3406489251d701db0

            Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26715/
            Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
            Project: fs/lustre-release
            Branch: pfl
            Current Patch Set:
            Commit: e62a57f86027582c8c9b56a7d3d348fb38510f4b

            gerrit Gerrit Updater added a comment - Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26715/ Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file Project: fs/lustre-release Branch: pfl Current Patch Set: Commit: e62a57f86027582c8c9b56a7d3d348fb38510f4b

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26715
            Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
            Project: fs/lustre-release
            Branch: pfl
            Current Patch Set: 1
            Commit: 806549e6ba5e96059cb41d006290bfc70ce58709

            gerrit Gerrit Updater added a comment - Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26715 Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file Project: fs/lustre-release Branch: pfl Current Patch Set: 1 Commit: 806549e6ba5e96059cb41d006290bfc70ce58709

            Tested latest patch

            Apr 18 16:26:55 soak-8 kernel: Lustre: MGS: Connection restored to c060ba9e-ea2b-a109-ec49-a8f45fcb1eaf (at 0@lo)
            Apr 18 16:26:56 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
            Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20
            Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(mdd_device.c:1084:mdd_prepare()) soaked-MDD0000: failed to initialize lfsck: rc = -20
            Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -20
            Apr 18 16:26:56 soak-8 kernel: Lustre: Failing over soaked-MDT0000
            Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) soaked-MDT0001-osp-MDT0000:osp_attr_get update error [0x200000009:0x1:0x0]: rc = -5
            Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0000-mdtlov: can't get id from catalogs: rc = -5
            Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0000 getting update log failed: rc = -5
            Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) Skipped 2 previous similar messages
            Apr 18 16:27:02 soak-8 kernel: Lustre: 12003:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492532816/real 1492532816]  req@ffff8803fc650000 x1565032298054672/t0(0) o251->MGC192.168.1.108@o2ib10@0@lo:26/25 lens 224/224 e 0 to 1 dl 1492532822 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
            Apr 18 16:27:02 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete
            Apr 18 16:27:02 soak-8 kernel: LustreError: 12003:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount  (-20)
            Apr 18 16:27:02 soak-8 sshd[11981]: Received disconnect from 192.168.1.135: 11: disconnected by user
            Apr 18 16:27:02 soak-8 sshd[11981]: pam_unix(sshd:session): session closed for user root
            
            

            fails

            cliffw Cliff White (Inactive) added a comment - Tested latest patch Apr 18 16:26:55 soak-8 kernel: Lustre: MGS: Connection restored to c060ba9e-ea2b-a109-ec49-a8f45fcb1eaf (at 0@lo) Apr 18 16:26:56 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900 Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20 Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(mdd_device.c:1084:mdd_prepare()) soaked-MDD0000: failed to initialize lfsck: rc = -20 Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -20 Apr 18 16:26:56 soak-8 kernel: Lustre: Failing over soaked-MDT0000 Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) soaked-MDT0001-osp-MDT0000:osp_attr_get update error [0x200000009:0x1:0x0]: rc = -5 Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0000-mdtlov: can't get id from catalogs: rc = -5 Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0000 getting update log failed: rc = -5 Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) Skipped 2 previous similar messages Apr 18 16:27:02 soak-8 kernel: Lustre: 12003:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492532816/real 1492532816] req@ffff8803fc650000 x1565032298054672/t0(0) o251->MGC192.168.1.108@o2ib10@0@lo:26/25 lens 224/224 e 0 to 1 dl 1492532822 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 18 16:27:02 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete Apr 18 16:27:02 soak-8 kernel: LustreError: 12003:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount (-20) Apr 18 16:27:02 soak-8 sshd[11981]: Received disconnect from 192.168.1.135: 11: disconnected by user Apr 18 16:27:02 soak-8 sshd[11981]: pam_unix(sshd:session): session closed for user root fails

            This is on the soak performance cluster - 4 MDS, 4 MDT in a DNE configuration.
            Formatted quite some time ago, actually.
            MDTs created with older 2.9 build

            mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=1 --servicenode=192.168.1.109@o2ib0:192.168.1.108@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026d52013098p1
            mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=3 --servicenode=192.168.1.111@o2ib0:192.168.1.110@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015752012949
            mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=2 --servicenode=192.168.1.110@o2ib0:192.168.1.111@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015552012932
            mkfs.lustre --mgs --fsname=soaked --mdt --index=0 --servicenode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1
            

            To produce the failure, I just mount the MDT. Mount script halts at that point.
            Does PFL not work on an existing filesystem? Does the filesystem have to be re-created with PFL code?

            cliffw Cliff White (Inactive) added a comment - This is on the soak performance cluster - 4 MDS, 4 MDT in a DNE configuration. Formatted quite some time ago, actually. MDTs created with older 2.9 build mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=1 --servicenode=192.168.1.109@o2ib0:192.168.1.108@o2ib0 --reformat --mkfsoptions= '-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026d52013098p1 mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=3 --servicenode=192.168.1.111@o2ib0:192.168.1.110@o2ib0 --reformat --mkfsoptions= '-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015752012949 mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=2 --servicenode=192.168.1.110@o2ib0:192.168.1.111@o2ib0 --reformat --mkfsoptions= '-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015552012932 mkfs.lustre --mgs --fsname=soaked --mdt --index=0 --servicenode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --reformat --mkfsoptions= '-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 To produce the failure, I just mount the MDT. Mount script halts at that point. Does PFL not work on an existing filesystem? Does the filesystem have to be re-created with PFL code?

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26703
            Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 6886869d700a7e84f7cc3dc5765096277c84c81e

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26703 Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 6886869d700a7e84f7cc3dc5765096277c84c81e

            On the other hand, is it only reproduced on the build 3554? or on other builds also?

            yong.fan nasf (Inactive) added a comment - On the other hand, is it only reproduced on the build 3554? or on other builds also?

            Cliff,

            Thanks for collecting the logs. Unfortunately, the logs do not contain any information about the failure in the bug description. So I have to try to reproduce it by myself. I have simulated the cases of lfsck_layout_setup() failure and osp_attr_get() failure. But still cannot reproduce it. So would you please to describe your test configuration, including the MDTs count and how to format the MDT devices? And also the steps (commands) to reproduce the issue?

            Thanks!

            yong.fan nasf (Inactive) added a comment - Cliff, Thanks for collecting the logs. Unfortunately, the logs do not contain any information about the failure in the bug description. So I have to try to reproduce it by myself. I have simulated the cases of lfsck_layout_setup() failure and osp_attr_get() failure. But still cannot reproduce it. So would you please to describe your test configuration, including the MDTs count and how to format the MDT devices? And also the steps (commands) to reproduce the issue? Thanks!

            People

              yong.fan nasf (Inactive)
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: