Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.5.3
-
None
-
CentOS-6.7
lustre-2.5.3
lvm2-2.02.118-3.el6_7.4
Also note that the MDT uses an external journal device.
-
3
-
9223372036854775807
Description
Similar to LU-7616 "creation of LVM snapshot on ldiskfs based MDT hangs until MDT activity/use is halted", but opening a new case for tracking.
The goal is to use LVM snapshots and tar to make file level MDT backups. Procedure worked fine 2 or 3 times, then we triggered the following problem on a recent attempt.
The MDS became extremely sluggish, and all MDT threads went into D state, when running the following command:
lvcreate -l95%FREE -s -p r -n mdt_snap /dev/nbp9-vg/mdt9
(the command never returned, and any further lv* commands hung as well)
In the logs...
Apr 25 17:09:35 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Not tainted) Apr 25 17:14:45 nbp9-mds ] Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt] Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc] Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs] Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc] Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc] Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc] Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Apr 25 17:14:45 nbp9-mds kernel: ---[ end trace c9f3339c0e103edf ]--- Apr 25 17:14:57 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Tainted: G W --------------- ) Apr 25 17:14:57 nbp9-mds kernel: Hardware name: AltixXE270 Apr 25 17:14:57 nbp9-mds kernel: Modules linked in: dm_snapshot dm_bufio osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_round_robin scsi_dh_rdac dm_multipath microcode iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich mfd_core shpchp sg igb dca ptp pps_core tcp_bic ext3 jbd sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic pata_jmicron mptfc scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) memtrack(U) usb_storage radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan] Apr 25 17:14:57 nbp9-mds kernel: Pid: 85906, comm: mdt_rdpg00_042 Tainted: G W --------------- 2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1 Apr 25 17:14:57 nbp9-mds kernel: Call Trace: Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0 Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff8107417a>] ? warn_slowpath_null+0x1a/0x20 Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0a1c33e>] ? ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs] Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0d6069f>] ? osd_trans_start+0x1df/0x660 [osd_ldiskfs] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0ef3619>] ? lod_trans_start+0x1b9/0x250 [lod] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f7af07>] ? mdd_trans_start+0x17/0x20 [mdd] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f61ece>] ? mdd_close+0x6be/0xb80 [mdd] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e48be9>] ? mdt_mfd_close+0x4a9/0x1bc0 [mdt] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0899525>] ? lustre_msg_buf+0x55/0x60 [ptlrpc] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa08c07f6>] ? __req_capsule_get+0x166/0x710 [ptlrpc] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa089a53e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa06ebf05>] ? class_handle2object+0x95/0x190 [obdclass] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e4b6a2>] ? mdt_close+0x642/0xa80 [mdt] Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt] Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: sdc - rdac checker reports path is down Apr 25 17:15:10 nbp9-mds multipathd: checker failed path 8:32 in map nbp9_MGS_MDS Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: remaining active paths: 1 Apr 25 17:15:10 nbp9-mds multipathd: sdd: remove path (uevent) Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: failed in domap for removal of path sdd Apr 25 17:15:10 nbp9-mds multipathd: uevent trigger error Apr 25 17:15:10 nbp9-mds multipathd: sdc: remove path (uevent) Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt] Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc] Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs] Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc] Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc] Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20 Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc] Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20 Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: map in use Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: can't flush
The server had to be rebooted and e2fsck run to get it back into production.
Attachments
Issue Links
- is related to
-
LU-7616 creation of LVM snapshot on ldiskfs based MDT hangs until MDT activity/use is halted
-
- Open
-
The last six patches in ldiskfs-2.6-rhel6.6.series of the master branch are:
rhel6.3/ext4-drop-inode-from-orphan-list-if-ext4_delete_inode-fails.patch
rhel6.6/ext4-remove-truncate-warning.patch
rhel6.6/ext4-corrupted-inode-block-bitmaps-handling-patches.patch
rhel6.3/ext4-notalloc_under_idatasem.patch
rhel6.5/ext4-give-warning-with-dir-htree-growing.patch
rhel6.6/ext4_s_max_ext_tree_depth.patch
Only the first 2 patches have already been picked into b2_7_fe. All six have not been picked to b2_5_fe.
We are running Centos 6.6 and it seems to me these patches are important to have also. Some of our servers run 2.5.3 and the rest run 2.7.1. Is it safe for us to pick up those missing ldiskfs kernel patches? Please advise.