Lustre / LU-7616

creation of LVM snapshot on ldiskfs based MDT hangs until MDT activity/use is halted

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: Lustre 2.7.0
    • Severity: 3

    Description

      We are trying to create an MDT snapshot via LVM (for backups) with the following command:

      lvcreate --size 20G --snapshot --name lv-mdt0_snap /dev/vg-3015/lv-mdt0

      Immediately afterwards, errors like the following are logged:

      Dec 28 09:24:32 mds01 kernel: ------------[ cut here ]------------
      Dec 28 09:24:32 mds01 kernel: WARNING: at /var/lib/jenkins/workspace/lustre-b2_7/arch/x86_64/build_type/server/distro/el6.6/ib_stack/inkernel/BUILD/BUILD/lustre-2.7.0/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Tainted: P        W  ---------------   )
      Dec 28 09:24:32 mds01 kernel: Hardware name: ProLiant DL380p Gen8
      Dec 28 09:24:32 mds01 kernel: Modules linked in: ext2 osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) mdc(U) fid(U) lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic crc32c_intel libcfs(U) ip6table_filter ip6_tables ebtable_nat ebtables bridge stp llc ipt_REJECT ipt_LOG xt_limit nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate dm_snapshot dm_bufio dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_intel kvm iTCO_wdt iTCO_vendor_support microcode hpilo hpwdt power_meter acpi_ipmi ipmi_si ipmi_msghandler tg3 sg serio_raw lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif mlx4_ib ib_sa ib_mad ib_core ib_addr ipv6 mlx4_en ptp pps_core mlx4_core hpsa(U) qla2xxx scsi_transport_fc scsi_tgt pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_
      Dec 28 09:24:32 mds01 kernel: log dm_mod [last unloaded: scsi_wait_scan]
      Dec 28 09:24:32 mds01 kernel: Pid: 5300, comm: mdt00_005 Tainted: P        W  ---------------    2.6.32-504.8.1.el6_lustre.x86_64 #1
      Dec 28 09:24:32 mds01 kernel: Call Trace:
      Dec 28 09:24:32 mds01 kernel: [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
      Dec 28 09:24:32 mds01 kernel: [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
      Dec 28 09:24:32 mds01 kernel: [<ffffffffa0e9234e>] ? ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]
      Dec 28 09:24:32 mds01 kernel: [<ffffffffa0e6c23a>] ? ldiskfs_dirty_inode+0x2a/0x60 [ldiskfs]
      Dec 28 09:24:32 mds01 kernel: [<ffffffffa1050966>] ? osd_attr_set+0x166/0x460 [osd_ldiskfs]
      Dec 28 09:25:03 mds01 kernel: ------------[ cut here ]------------
      

      The lvcreate command does eventually return if MDT activity is halted (i.e., the clients are unmounted).
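
      For context, the intended workflow is the device-level LVM snapshot backup described in the Lustre operations manual. A minimal sketch is below; the VG, LV, and snapshot names are taken from the command above, while the mount point, backup paths, and archiving options are illustrative assumptions rather than what was actually run on this system:

      # create a snapshot of the MDT logical volume (command from this report)
      lvcreate --size 20G --snapshot --name lv-mdt0_snap /dev/vg-3015/lv-mdt0

      # mount the snapshot as plain ldiskfs and archive it
      # (mount point and backup paths are hypothetical)
      mkdir -p /mnt/mdt0_snap
      mount -t ldiskfs /dev/vg-3015/lv-mdt0_snap /mnt/mdt0_snap
      cd /mnt/mdt0_snap
      # save the extended attributes separately, then the file data
      getfattr -R -d -m '.*' -e hex -P . > /backup/mdt0-ea-$(date +%F).bak
      tar czf /backup/mdt0-$(date +%F).tgz --sparse .
      cd /
      umount /mnt/mdt0_snap

      # drop the snapshot once the archive is safe
      lvremove -f /dev/vg-3015/lv-mdt0_snap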

    Activity

            Not sure how closely tied this is, but... after the reboot we ran e2fsck on the MDT. The fsck was able to replay the journal, but during the scan it found lots of unattached inodes and moved them to /lost+found. Interestingly, most of those inodes were almost sequential:

            Unattached inode 584092565
            Connect to /lost+found<y>? yes
            Unattached inode 584092567
            Connect to /lost+found<y>? yes
            Unattached inode 584092637
            Connect to /lost+found<y>? yes
            Unattached inode 584092640
            Connect to /lost+found<y>? yes
            Unattached inode 584092643
            Connect to /lost+found<y>? yes
            Unattached inode 584092645
            Connect to /lost+found<y>? yes
            Unattached inode 584092648
            Connect to /lost+found<y>? yes
            Unattached inode 584092715
            Connect to /lost+found<y>? yes
            Unattached inode 584092717
            Connect to /lost+found<y>? yes
            Unattached inode 584092720
            Connect to /lost+found<y>? yes
            Unattached inode 584092723
            Connect to /lost+found<y>? yes
            Unattached inode 584092784
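
            For reference, the journal replay and full check described here would typically be along the lines of the sketch below; the device path is the nbp9 MDT device mentioned elsewhere in this ticket, and the exact e2fsck flags used are an assumption:

            # with the MDT unmounted, replay the journal and force a full check;
            # -y answers yes to all fixes (which is what produces the
            # "Connect to /lost+found" answers shown above)
            e2fsck -fy /dev/nbp9-vg/mdt9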
            

            We then mounted the MDT as ldiskfs and saw that all but about 20 files were owned by root and had surprisingly old date stamps:

            nbp9-mds ~ # ls -l /mnt/lustre/nbp9-mdt/lost+found/
            total 6816
            -rw------- 1 root     root   5792 Jul 26  1970 #17831719
            -rw------- 1 root     root   5792 Jul 26  1970 #17831723
            -rw------- 1 root     root   5792 Feb  4  1988 #571005637
            -rw------- 1 root     root   5792 Feb  4  1988 #571016437
            -rw------- 1 root     root   5792 Feb  4  1988 #571029133
            ...
            -rw------- 1 root     root   4832 Jul  5  1988 #584109225
            -rw------- 1 root     root   4832 Jul  5  1988 #584109227
            -rw------- 1 root     root   4832 Jul  5  1988 #584109231
            -rw------- 1 root     root   4832 Jul  5  1988 #584109233
            -rw------- 1 root     root   4832 Jul  5  1988 #584109235
            
            nbp9-mds ~ # ls -l /mnt/lustre/nbp9-mdt/lost+found | grep -v " root "
            total 47328
            -rwxr----- 1 dmenemen g26209    0 Feb 10  2014 #570436464
            -rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436465
            -rwxr----- 1 dmenemen g26209    0 Feb 10  2014 #570436466
            -rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436467
            -rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436468
            -rw-r--r-- 1 dmenemen g26209    0 Feb 10  2014 #570436469
            ...
            (another dozen or so more similar files)
            

            Ran "ll_recover_lost_found_objs" on the MDT mounted as ldiskfs, and it reported results like:

            Object /mnt/lustre/nbp9-mdt/O/22d0c13d/d23/cb410817 restored.
            Object /mnt/lustre/nbp9-mdt/O/22d0c13f/d27/cb41083b restored.
            Object /mnt/lustre/nbp9-mdt/O/22d0c145/d1/cb504661 restored.
            Object /mnt/lustre/nbp9-mdt/O/22d0c147/d27/cb50467b restored.
            Object /mnt/lustre/nbp9-mdt/O/22d0c153/d16/cb504810 restored.
            Object /mnt/lustre/nbp9-mdt/O/22d0c155/d20/cb504814 restored.
            

            (It restored all of the non-root-owned files.)
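
            For anyone in the same situation, the tool is pointed at the lost+found directory of the ldiskfs-mounted MDT; a sketch is below (the mount point matches the listings above, but the exact flags we used are not shown here, so treat them as an assumption):

            # move orphaned objects from lost+found back to their O/<seq>/...
            # locations using the FID information stored in each object's xattrs;
            # -v for verbose output, -d points at the lost+found directory
            ll_recover_lost_found_objs -v -d /mnt/lustre/nbp9-mdt/lost+found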

            I believe the rest of the files in lost+found were probably either useless or never pointed to user data... they have no xattrs:

            # getfattr -d -m ".*" -e hex /mnt/lustre/nbp9-mdt/lost+found/*
            

            (returned nothing!)
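
            (For comparison, a live MDT inode normally carries at least a trusted.lma xattr, and regular files usually trusted.lov and trusted.link as well; a spot check against a known-good object, using a hypothetical path, would look like this:)

            # hypothetical spot check against an object still linked into the
            # namespace; a healthy MDT inode should show trusted.lma at minimum
            getfattr -d -m "trusted" -e hex /mnt/lustre/nbp9-mdt/ROOT/some/file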

            So, we ran another e2fsck (which came back clean!) and put the filesystem back into production.

            ndauchy Nathan Dauchy (Inactive) added a comment

            We saw a very similar issue at NASA yesterday... and we believe there may be some bugs in journal locking when creating snapshots.

            Using CentOS-6.7, lustre-2.5.3, lvm2-2.02.118-3.el6_7.4. Also note that the MDT uses an external journal device.

            The MDS became extremely sluggish, and all MDT threads went into D state, when running the following command:

            lvcreate -l95%FREE -s -p r -n mdt_snap /dev/nbp9-vg/mdt9
            

            (the command never returned, and any further lv* commands hung as well)
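
            (One way to confirm where the MDT threads are stuck in this state, assuming the magic sysrq key is enabled, is to dump all blocked tasks; this is a generic kernel facility, not something specific to this issue:)

            # dump stack traces of all tasks in uninterruptible (D) state to the
            # kernel log, then read them back
            echo w > /proc/sysrq-trigger
            dmesg | tail -n 200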

            In the logs...

            Apr 25 17:09:35 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Not tainted)
            Apr 25 17:14:45 nbp9-mds ]
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
            Apr 25 17:14:45 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            Apr 25 17:14:45 nbp9-mds kernel: ---[ end trace c9f3339c0e103edf ]---
            
            Apr 25 17:14:57 nbp9-mds kernel: WARNING: at /usr/src/redhat/BUILD/lustre-2.5.3/ldiskfs/super.c:280 ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]() (Tainted: G        W  ---------------   )
            Apr 25 17:14:57 nbp9-mds kernel: Hardware name: AltixXE270
            Apr 25 17:14:57 nbp9-mds kernel: Modules linked in: dm_snapshot dm_bufio osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) jbd2 lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic crc32c_intel libcfs(U) sunrpc bonding ib_ucm(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) configfs ib_ipoib(U) ib_cm(U) ib_uverbs(U) ib_umad(U) dm_round_robin scsi_dh_rdac dm_multipath microcode iTCO_wdt iTCO_vendor_support i2c_i801 lpc_ich mfd_core shpchp sg igb dca ptp pps_core tcp_bic ext3 jbd sd_mod crc_t10dif sr_mod cdrom ahci pata_acpi ata_generic pata_jmicron mptfc scsi_transport_fc scsi_tgt mptsas mptscsih mptbase scsi_transport_sas mlx4_ib(U) ib_sa(U) ib_mad(U) ib_core(U) ib_addr(U) ipv6 mlx4_core(U) mlx_compat(U) memtrack(U) usb_storage radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod gru [last unloaded: scsi_wait_scan]
            Apr 25 17:14:57 nbp9-mds kernel: Pid: 85906, comm: mdt_rdpg00_042 Tainted: G        W  ---------------    2.6.32-504.30.3.el6.20151008.x86_64.lustre253 #1
            Apr 25 17:14:57 nbp9-mds kernel: Call Trace:
            Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff81074127>] ? warn_slowpath_common+0x87/0xc0
            Apr 25 17:14:57 nbp9-mds kernel: [<ffffffff8107417a>] ? warn_slowpath_null+0x1a/0x20
            Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0a1c33e>] ? ldiskfs_journal_start_sb+0xce/0xe0 [ldiskfs]
            Apr 25 17:14:57 nbp9-mds kernel: [<ffffffffa0d6069f>] ? osd_trans_start+0x1df/0x660 [osd_ldiskfs]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0ef3619>] ? lod_trans_start+0x1b9/0x250 [lod]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f7af07>] ? mdd_trans_start+0x17/0x20 [mdd]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0f61ece>] ? mdd_close+0x6be/0xb80 [mdd]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e48be9>] ? mdt_mfd_close+0x4a9/0x1bc0 [mdt]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0899525>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa08c07f6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa089a53e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa06ebf05>] ? class_handle2object+0x95/0x190 [obdclass]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e4b6a2>] ? mdt_close+0x642/0xa80 [mdt]
            Apr 25 17:15:06 nbp9-mds kernel: [<ffffffffa0e1fada>] ? mdt_handle_common+0x52a/0x1470 [mdt]
            Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: sdc - rdac checker reports path is down
            Apr 25 17:15:10 nbp9-mds multipathd: checker failed path 8:32 in map nbp9_MGS_MDS
            Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: remaining active paths: 1
            Apr 25 17:15:10 nbp9-mds multipathd: sdd: remove path (uevent)
            Apr 25 17:15:10 nbp9-mds multipathd: nbp9_MGS_MDS: failed in domap for removal of path sdd
            Apr 25 17:15:10 nbp9-mds multipathd: uevent trigger error
            Apr 25 17:15:10 nbp9-mds multipathd: sdc: remove path (uevent)
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa0e5c4e5>] ? mds_readpage_handle+0x15/0x20 [mdt]
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a90c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa05d18d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08a1a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08ab89d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c28a>] ? child_rip+0xa/0x20
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffffa08aada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
            Apr 25 17:15:10 nbp9-mds kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
            
            Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: map in use
            Apr 25 17:15:16 nbp9-mds multipathd: nbp9_MGS_MDS: can't flush
            

            The server had to be rebooted and e2fsck run to get it back into production.

            ndauchy Nathan Dauchy (Inactive) added a comment

            People

              Assignee: wc-triage WC Triage
              Reporter: thcrowe tom crowe
              Votes: 0
              Watchers: 11
