lvcreate --snapshot of MDT hangs in ldiskfs_journal_start_sb

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Environment: CentOS-6.7
      lustre-2.5.3
      lvm2-2.02.118-3.el6_7.4
      Also note that the MDT uses an external journal device.
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Similar to LU-7616 "creation of LVM snapshot on ldiskfs based MDT hangs until MDT activity/use is halted", but opening a new case for tracking.

      The goal is to use LVM snapshots and tar to make file-level MDT backups. The procedure worked fine two or three times, then a recent attempt triggered the following problem.

      The MDS became extremely sluggish, and all MDT threads went into D state, when running the following command:

      lvcreate -l95%FREE -s -p r -n mdt_snap /dev/nbp9-vg/mdt9
      

      (the command never returned, and any further lv* commands hung as well)
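      One way to confirm the symptom described above (service threads stuck in D state) is to filter `ps` output for tasks in uninterruptible sleep. This is a minimal sketch and not part of the original report: the function names `filter_d_state` and `list_d_state` are hypothetical, and a procps-style `ps` is assumed.

```shell
# Keep only tasks whose state starts with "D" (uninterruptible sleep).
# Reads `ps -eo pid,stat,comm` style output on stdin, header line included.
# Prints: PID, state, first word of the command name.
filter_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $2, $3 }'
}

# Hypothetical helper: list D-state tasks on the local node.
list_d_state() {
    ps -eo pid,stat,comm | filter_d_state
}
```

      Piping the result through `grep mdt_` would narrow the list to MDT service threads such as the `mdt_rdpg00_042` thread named in the trace below.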

      In the logs...

      Apr 25 17:09:35 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Not tainted)
      Apr 25 17:14:45 nbp9-mds ]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      Apr 25 17:14:45 nbp9-mds kernel: ---[ end trace c9f3339c0e103edf ]---
      
      Apr 25 17:14:57 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Tainted: G        W  ---------------   )
      Apr 25 17:14:57 nbp9-mds kernel: Hardware name: AltixXE270
      Apr 25 17:14:57 nbp9-mds kernel: Modules linked in: dm_snapshot dm_bufio osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_round_robin scsi_dh_rdac dm_multipath microcode iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich mfd_core shpchp sg igb dca ptp pps_core tcp_bic ext3 jbd sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic pata_jmicron mptfc scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) memtrack(U) usb_storage radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan]
      Apr 25 17:14:57 nbp9-mds kernel: Pid: 85906, comm: mdt_rdpg00_042 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
      Apr 25 17:14:57 nbp9-mds kernel: Call Trace:
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff8107417a>] ? warn_slowpath_null+0x1a/0x20
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0a1c33e>] ? ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]
      Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0d6069f>] ? osd_trans_start+0x1df/0x660 [osd_ldiskfs]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0ef3619>] ? lod_trans_start+0x1b9/0x250 [lod]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f7af07>] ? mdd_trans_start+0x17/0x20 [mdd]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f61ece>] ? mdd_close+0x6be/0xb80 [mdd]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e48be9>] ? mdt_mfd_close+0x4a9/0x1bc0 [mdt]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0899525>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa08c07f6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa089a53e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa06ebf05>] ? class_handle2object+0x95/0x190 [obdclass]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e4b6a2>] ? mdt_close+0x642/0xa80 [mdt]
      Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt]
      Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: sdc - rdac checker reports path is down
      Apr 25 17:15:10 nbp9-mds multipathd: checker failed path 8:32 in map nbp9_MGS_MDS
      Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: remaining active paths: 1
      Apr 25 17:15:10 nbp9-mds multipathd: sdd: remove path (uevent)
      Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: failed in domap for removal of path sdd
      Apr 25 17:15:10 nbp9-mds multipathd: uevent trigger error
      Apr 25 17:15:10 nbp9-mds multipathd: sdc: remove path (uevent)
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
      Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
      
      Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: map in use
      Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: can't flush
      

      The server had to be rebooted and e2fsck run to get it back into production.
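      To see how often and when the warning fired, the syslog can be searched for the `ldiskfs_journal_start_sb` WARNING line shown above. The sketch below is illustrative only: the function names are hypothetical, and the log file path passed in (for example `/var/log/messages`) is an assumption about the local syslog setup.

```shell
# Count kernel WARNING lines raised from ldiskfs_journal_start_sb.
# $1: path to a syslog-format log file.
count_journal_start_warnings() {
    grep -c 'WARNING: at .*ldiskfs_journal_start_sb' "$1"
}

# Print the timestamp (first three syslog fields) of each such warning.
journal_start_warning_times() {
    awk '/WARNING: at .*ldiskfs_journal_start_sb/ { print $1, $2, $3 }' "$1"
}
```

      Only the WARNING header lines are matched; the `Call Trace` frames that also mention `ldiskfs_journal_start_sb` are not counted.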

      Attachments

        1. bt.all (917 kB)
        2. dmesg.out (494 kB)
        3. lostfound_nonzero.lst (80 kB)
        4. nagtest.toobig.stripes (36 kB)

          Activity

            [LU-8071] lvcreate --snapshot of MDT hangs in ldiskfs_journal_start_sb
            pjones Peter Jones added a comment -

            Thanks Jay!

            Yes, this ticket can be closed. Thanks!

            Comment above by jaylan Jay Lan (Inactive).
            pjones Peter Jones added a comment -

            Now landed for 2.9 and queued up for maintenance releases. Is there anything further you need on this ticket or can it be closed?

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/20062/
            Subject: LU-8071 ldiskfs: handle system freeze protection
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: bd40ca206881eefeeb6ad7586f93afd685bb8120

            Comment above by gerrit Gerrit Updater.

            I would recommend against using the ext4-give-warning-with-dir-htree-growing.patch as this also requires other changes to the Lustre code. The other changes are OK to use on other kernels.

            Also, are the following patches from the upstream kernel already applied on your systems?

            commit 437f88cc031ffe7f37f3e705367f4fe1f4be8b0f
            Author:     Eric Sandeen <sandeen@sandeen.net>
            AuthorDate: Sun Aug 1 17:33:29 2010 -0400
            Commit:     Theodore Ts'o <tytso@mit.edu>
            CommitDate: Sun Aug 1 17:33:29 2010 -0400
            
                ext4: fix freeze deadlock under IO
                
                Commit 6b0310fbf087ad6 caused a regression resulting in deadlocks
                when freezing a filesystem which had active IO; the vfs_check_frozen
                level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing
                through.  Duh.
                
                Changing the test to FREEZE_TRANS should let the normal freeze
                syncing get through the fs, but still block any transactions from
                starting once the fs is completely frozen.
                
                I tested this by running fsstress in the background while periodically
                snapshotting the fs and running fsck on the result.  I ran into
                occasional deadlocks, but different ones.  I think this is a
                fine fix for the problem at hand, and the other deadlocky things
                will need more investigation.
                
                Reported-by: Phillip Susi <psusi@cfl.rr.com>
                Signed-off-by: Eric Sandeen <sandeen@redhat.com>
                Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
            
            commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
            Author:     Eric Sandeen <sandeen@redhat.com>
            AuthorDate: Sun May 16 02:00:00 2010 -0400
            Commit:     Theodore Ts'o <tytso@mit.edu>
            CommitDate: Sun May 16 02:00:00 2010 -0400
            
                ext4: don't return to userspace after freezing the fs with a mutex held
                
                ext4_freeze() used jbd2_journal_lock_updates() which takes
                the j_barrier mutex, and then returns to userspace.  The
                kernel does not like this:
                
                ================================================
                [ BUG: lock held when returning to user space! ]
                ------------------------------------------------
                lvcreate/1075 is leaving the kernel with locks still held!
                1 lock held by lvcreate/1075:
                 #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
                jbd2_journal_lock_updates+0xe1/0xf0
                
                Use vfs_check_frozen() added to ext4_journal_start_sb() and
                ext4_force_commit() instead.
                
                Addresses-Red-Hat-Bugzilla: #568503
                
                Signed-off-by: Eric Sandeen <sandeen@redhat.com>
                Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
            

            I would guess yes, since they originated from Red Hat, but just wanted to confirm.

            Comment above by adilger Andreas Dilger.

            The last six patches in ldiskfs-2.6-rhel6.6.series of the master branch are:

            rhel6.3/ext4-drop-inode-from-orphan-list-if-ext4_delete_inode-fails.patch
            rhel6.6/ext4-remove-truncate-warning.patch
            rhel6.6/ext4-corrupted-inode-block-bitmaps-handling-patches.patch
            rhel6.3/ext4-notalloc_under_idatasem.patch
            rhel6.5/ext4-give-warning-with-dir-htree-growing.patch
            rhel6.6/ext4_s_max_ext_tree_depth.patch

            Only the first two patches have been picked into b2_7_fe. None of the six has been picked into b2_5_fe.

            We are running CentOS 6.6, and it seems to me these patches are important to have as well. Some of our servers run 2.5.3 and the rest run 2.7.1. Is it safe for us to pick up those missing ldiskfs kernel patches? Please advise.

            Comment above by jaylan Jay Lan (Inactive).
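            The comparison between branches described in the comment above can be reproduced by diffing two series files. This is a minimal sketch under stated assumptions: `missing_patches` is a hypothetical helper, both arguments are plain series files with one patch path per line, and whole-line matching is used, so trailing whitespace or comments would count as differences.

```shell
# Print patches listed in the reference series file ($1) that are absent
# from the series file of the branch being checked ($2).
# grep exits non-zero when nothing is missing, so the status is masked.
missing_patches() {
    grep -Fxv -f "$2" "$1" || true
}
```

            For example, calling it with the master copy of ldiskfs-2.6-rhel6.6.series as the first argument and a maintenance branch's copy as the second would list the entries still to be picked.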

            Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/20062
            Subject: LU-8071 ldiskfs: handle system freeze protection
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: b61bdfcaef1b03f8d0f082d57120681d71ab5e40

            Comment above by gerrit Gerrit Updater.

            The fsfreeze is the same as what LVM does internally, so it wouldn't change the problem being seen. We don't plumb that ioctl through the MDT mountpoint to the underlying filesystem, which is why there is an error reported.

            As for fixing this problem, I suspect that some of the upstream ext4 patches need to be backported (in particular kernel commits 437f88cc031ffe7f and 6b0310fbf087ad6) but there is also likely a need to add checks into the osd-ldiskfs code to block new IO submissions when the device is freezing/frozen, since these do not have the VFS-level checks.

            Comment above by adilger Andreas Dilger.
            ndauchy Nathan Dauchy (Inactive) added a comment - - edited

            At some point in the future we plan to migrate to RHEL/CentOS-7, but right now CentOS-6.7 is the priority. It looks like there is additional code that uses vfs_check_frozen in newer ext4/super.c source, but it is not in our 2.6.32-504.30 kernel, nor is it mentioned in the changelog for a very recent CentOS-6.7 2.6.32-573.22 kernel. Are you saying that we just need to incorporate all the existing fixes, or that a new Lustre patch is needed to perform a check which is unique to ldiskfs?

            If it would be more expedient to use some sort of external pause/resume commands before and after the lvm snapshot is created, then we could work that into the process as well.

            Also, just curious if this should work for a running MDT or ldiskfs:

            service322 ~ # fsfreeze -f /mnt/lustre/nbptest-mdt
            fsfreeze: /mnt/lustre/nbptest-mdt: freeze failed: Operation not supported

            It appears that we may be missing calls in osd-ldiskfs to check if the block device is currently (being) frozen before submitting IO to the underlying storage (i.e. all modifications blocked at a barrier while in the middle of creating a snapshot).

            In newer kernels (3.5+, e.g. RHEL7) this is handled with sb_start_write() and sb_end_write() at the upper layers, and sb_start_intwrite() and sb_end_intwrite() for filesystem internal writes, and should be taken before i_mutex (see e.g. kernel commit 8e8ad8a57c75f3b for examples).

            In older kernels (e.g. RHEL6) it uses vfs_check_frozen() to see if the device is currently frozen, with level=SB_FREEZE_WRITE to block new modifications from starting, and level=SB_FREEZE_TRANS to only block internal threads if the freeze barrier is complete.

            Comment above by adilger Andreas Dilger.
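            The kernel-version split described in the comment above can be turned into a rough check of which freeze-protection interface a node's kernel is expected to use. This is only a sketch: `freeze_api_for_kernel` is a hypothetical helper, the 3.5 threshold is taken from the comment, GNU `sort -V` is assumed, and distribution kernels may carry backports, so the result is a hint rather than a guarantee.

```shell
# Print the freeze-protection interface expected for a kernel release.
# $1: kernel release string (defaults to the running kernel's `uname -r`).
freeze_api_for_kernel() {
    release=${1:-$(uname -r)}
    version=${release%%-*}    # drop the distribution suffix after the first "-"
    # Version-sort the threshold and the kernel version; if 3.5 sorts first
    # (or ties), the kernel is 3.5 or newer.
    lowest=$(printf '%s\n%s\n' "3.5" "$version" | sort -V | head -n 1)
    if [ "$lowest" = "3.5" ]; then
        echo "sb_start_write"
    else
        echo "vfs_check_frozen"
    fi
}
```

            With the kernel release from the trace in the description, `freeze_api_for_kernel 2.6.32-504.30.3.el6.20151008.x86_64.lustre253` selects the older `vfs_check_frozen` path.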

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: ndauchy Nathan Dauchy (Inactive)
              Votes: 0
              Watchers: 12