[LU-5764] Crash of MDS on "apparent buffer overflow"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.3
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 16172

    Description

      Hi,

      On a file system with 300 OSTs, "proc_file_read: Apparent buffer overflow!" messages appear in the syslog of the MDS, and after some time (between 5 and 30 minutes) the MDS crashes.

      Here is the console output:

      <3>proc_file_read: Apparent buffer overflow!
      <3>proc_file_read: Apparent buffer overflow!
      <3>proc_file_read: Apparent buffer overflow!
      <3>proc_file_read: Apparent buffer overflow!
      <3>proc_file_read: Apparent buffer overflow!
      <3>proc_file_read: Apparent buffer overflow!
      <4>------------[ cut here ]------------
      <4>WARNING: at lib/list_debug.c:48 list_del+0x6e/0xa0() (Not tainted)
      <4>Hardware name: bullx
      <4>list_del corruption. prev->next should be ffff880335386000, but was 4d2d3030332d7366
      <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_sa(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin scsi_dh_emc dm_multipath mic(U) uinput ses enclosure serio_raw compat(U) cxgb3 mdio lpfc scsi_transport_fc scsi_tgt igb i2c_algo_bit i2c_core ptp pps_core sg lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom aacraid ata_generic pata_jmicron usb_storage ahci dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: scsi_wait_scan]
      <4>Pid: 28, comm: events/1 Not tainted 2.6.32-431.29.2.el6.Bull.58.x86_64 #1
      <4>Call Trace:
      <4> [<ffffffff81070e77>] ? warn_slowpath_common+0x87/0xc0
      <4> [<ffffffff81070f66>] ? warn_slowpath_fmt+0x46/0x50
      <4> [<ffffffff8129593e>] ? list_del+0x6e/0xa0
      <4> [<ffffffff81171008>] ? free_block+0xc8/0x180
      <4> [<ffffffff811712f1>] ? drain_array+0xc1/0x100
      <4> [<ffffffff811721de>] ? cache_reap+0x8e/0x250
      <4> [<ffffffff81172150>] ? cache_reap+0x0/0x250
      <4> [<ffffffff81093d80>] ? worker_thread+0x170/0x2a0
      <4> [<ffffffff8109a300>] ? autoremove_wake_function+0x0/0x40
      <4> [<ffffffff81093c10>] ? worker_thread+0x0/0x2a0
      <4> [<ffffffff81099f56>] ? kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
      <4> [<ffffffff81099ec0>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>---[ end trace cc0bf07e83b7a669 ]---
      <4>general protection fault: 0000 [#1] SMP
      <4>last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/host12/rport-12:0-0/target12:0:0/12:0:0:19/state
      <4>CPU 1
      <4>Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) lquota(U) fid(U) fld(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipmi_devintf acpi_cpufreq freq_table mperf rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ipv6 ib_uverbs(U) ib_umad(U) mlx4_ib(U) ib_sa(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) dm_round_robin scsi_dh_emc dm_multipath mic(U) uinput ses enclosure serio_raw compat(U) cxgb3 mdio lpfc scsi_transport_fc scsi_tgt igb i2c_algo_bit i2c_core ptp pps_core sg lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom aacraid ata_generic pata_jmicron usb_storage ahci dm_mirror dm_region_hash dm_log dm_mod megaraid_sas [last unloaded: scsi_wait_scan]
      <4>
      <4>Pid: 28, comm: events/1 Tainted: G        W  ---------------    2.6.32-431.29.2.el6.Bull.58.x86_64 #1 Bull SAS bullx/X8DAH
      <4>RIP: 0010:[<ffffffff812958e0>]  [<ffffffff812958e0>] list_del+0x10/0xa0
      <4>RSP: 0018:ffff88033acf5d10  EFLAGS: 00010082
      <4>RAX: 6c2d303030305444 RBX: ffff88032d169000 RCX: 000000000000100c
      <4>RDX: ffff88033fee0340 RSI: ffff88032d174000 RDI: ffff88032d169000
      <4>RBP: ffff88033acf5d20 R08: ffff88033fee0340 R09: 0000000000000006
      <4>R10: 0000000000000001 R11: 0000000000000000 R12: 000000000000000b
      <4>R13: ffff88033ac11e58 R14: 0000000000000008 R15: ffffea0000000000
      <4>FS:  0000000000000000(0000) GS:ffff880028220000(0000) knlGS:0000000000000000
      <4>CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      <4>CR2: 00007f5482ad7000 CR3: 000000062e1c4000 CR4: 00000000000007e0
      <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      <4>Process events/1 (pid: 28, threadinfo ffff88033acf4000, task ffff88033acccb40)
      <4>Stack:
      <4> 000000000000000b ffff88063cd30400 ffff88033acf5d80 ffffffff81171008
      <4><d> ffff88033fee0340 ffff88032d169000 000000000000100c ffff88032d169b40
      <4><d> 0000000000016cc0 ffff88033ac11e00 ffff88063cd30400 000000000000000b
      <4>Call Trace:
      <4> [<ffffffff81171008>] free_block+0xc8/0x180
      <4> [<ffffffff811712f1>] drain_array+0xc1/0x100
      <4> [<ffffffff811721de>] cache_reap+0x8e/0x250
      <4> [<ffffffff81172150>] ? cache_reap+0x0/0x250
      <4> [<ffffffff81093d80>] worker_thread+0x170/0x2a0
      <4> [<ffffffff8109a300>] ? autoremove_wake_function+0x0/0x40
      <4> [<ffffffff81093c10>] ? worker_thread+0x0/0x2a0
      <4> [<ffffffff81099f56>] kthread+0x96/0xa0
      <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
      <4> [<ffffffff81099ec0>] ? kthread+0x0/0xa0
      <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
      <4>Code: 89 95 fc fe ff ff e9 ab fd ff ff 4c 8b ad e8 fe ff ff e9 db fd ff ff 90 90 90 90 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 47 08 <4c> 8b 00 4c 39 c7 75 39 48 8b 03 4c 8b 40 08 4c 39 c3 75 4c 48
      <1>RIP  [<ffffffff812958e0>] list_del+0x10/0xa0
      <4> RSP <ffff88033acf5d10>
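
      Note that, read as little-endian bytes, the corrupted pointer values above decode to ASCII fragments of Lustre target names: 4d2d3030332d7366 is "fs-300-M" and the faulting RAX value 6c2d303030305444 is "DT0000-l". This suggests device-name text was written straight over the slab freelist, which fits a proc buffer overrun. A minimal user-space C sketch of the decoding:

          #include <stdio.h>
          #include <stdint.h>

          /* Print a poisoned 64-bit pointer as the 8 bytes it occupies in
           * memory (x86-64 is little-endian), revealing any ASCII text
           * scribbled over the slab freelist. */
          static void decode(uint64_t v)
          {
                  int i;

                  for (i = 0; i < 8; i++)
                          putchar((int)((v >> (8 * i)) & 0xff));
                  putchar('\n');
          }

          int main(void)
          {
                  decode(0x4d2d3030332d7366ULL); /* prints "fs-300-M" */
                  decode(0x6c2d303030305444ULL); /* prints "DT0000-l" */
                  return 0;
          }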
      

      This issue seems to be related to LU-4483, but unfortunately there is no fix yet.
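
      For context on where the message comes from: before the seq_file conversion, a procfs read handler was handed a single page to format into, and proc_file_read printed "Apparent buffer overflow!" whenever the handler reported writing more than that. Below is a user-space simulation of that contract, using hypothetical target names patterned on the decoded strings above; the real 2.6.32-era handlers wrote with unbounded sprintf(), so the excess bytes really did land on the neighbouring slab object.

          #include <stdio.h>

          #define SIM_PAGE_SIZE 4096
          #define NUM_OSTS      300   /* as on the reporter's file system */

          /* Old-style read_proc handler: formats one line per OST import
           * into a single "page" and returns how many bytes it tried to
           * write. */
          static int ost_list_read_proc(char *page)
          {
                  int len = 0, off, i;

                  for (i = 0; i < NUM_OSTS; i++) {
                          off = len < SIM_PAGE_SIZE ? len : SIM_PAGE_SIZE;
                          /* snprintf() only counts the excess here; the
                           * legacy kernel handlers really wrote it. */
                          len += snprintf(page + off, SIM_PAGE_SIZE - off,
                                          "fs-300-OST%04x-osc-MDT0000 UP\n", i);
                  }
                  return len;   /* may exceed SIM_PAGE_SIZE */
          }

          int main(void)
          {
                  char page[SIM_PAGE_SIZE];
                  int n = ost_list_read_proc(page);

                  if (n > SIM_PAGE_SIZE)   /* the check proc_file_read does */
                          fprintf(stderr,
                                  "proc_file_read: Apparent buffer overflow!\n");
                  return 0;
          }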

      Thanks,
      Sebastien.


          Activity


            sebastien.buisson Sebastien Buisson (Inactive) added a comment -

            Hi Peter,

            I can confirm that the issue does not show up with the patch at http://review.whamcloud.com/13413.

            So this ticket can be marked as resolved.

            Thanks,
            Sebastien.

            pjones Peter Jones added a comment -

            Can this ticket be marked as resolved?


            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/13413
            Subject: LU-5764 proc: Crash of MDS on "apparent buffer overflow"
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: b2efa789d020a6f6e9aee1078fe41977ad4895a3

            ys Yang Sheng added a comment -

            Hi YuJian, I see. So I'll do this work.

            yujian Jian Yu added a comment -

            > Yujian, could you please comment if you have a plan to port http://review.whamcloud.com/#/c/7933/ to 2.5?

            Hi Yang Sheng,
            I only back-ported one LU-3319 patch in http://review.whamcloud.com/11945 in order to resolve some conflicts while back-porting the LU-3386 patches to the Lustre b2_5 branch to resolve LU-5567. However, that back-porting work was suspended in http://review.whamcloud.com/11948 because more patches were needed to resolve the regression failures. If http://review.whamcloud.com/7933 was not a requirement for resolving LU-5567, I think I would not back-port it.

            Could you please proceed to back-port the required patches to resolve this ticket?

            ys Yang Sheng added a comment -

            Many thanks, James. Yujian, could you please comment if you have a plan to port http://review.whamcloud.com/#/c/7933/ to 2.5?

            simmonsja James A Simmons added a comment -

            This problem looks just like LU-4483. The problem is that the old proc handling was limited to a single page. Some items, like this hash listing, can easily overflow the page. With the newer proc API using seq_file, the page limitation is gone.
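
            As a sketch of the seq_file pattern James describes (hypothetical proc entry and show routine, not the actual LU-3319/7933 patches), seq_printf() appends to a buffer that the seq_file core enlarges and retries when it fills, so the output is no longer capped at one page:

                #include <linux/module.h>
                #include <linux/proc_fs.h>
                #include <linux/seq_file.h>

                /* ->show() callback: if the buffer fills up, the seq_file
                 * core doubles it and calls this routine again, so listing
                 * 300 OSTs is no longer a problem. */
                static int ost_list_seq_show(struct seq_file *m, void *v)
                {
                        int i;

                        for (i = 0; i < 300; i++)
                                seq_printf(m, "fs-300-OST%04x-osc-MDT0000 UP\n", i);
                        return 0;
                }

                static int ost_list_open(struct inode *inode, struct file *file)
                {
                        return single_open(file, ost_list_seq_show, NULL);
                }

                static const struct file_operations ost_list_fops = {
                        .owner   = THIS_MODULE,
                        .open    = ost_list_open,
                        .read    = seq_read,
                        .llseek  = seq_lseek,
                        .release = single_release,
                };
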
            ys Yang Sheng added a comment -

            I think the patch is http://review.whamcloud.com/#/c/7933/. This issue mainly relates to the libcfs/libcfs/hash.c change. It looks like YuJian's ported patch http://review.whamcloud.com/11945 still does not include this part. I'll follow up and investigate.

            jfc John Fuchs-Chesney (Inactive) added a comment -

            Hello James,

            Do you have time now to address Yang Sheng's question above?

            Thanks,
            ~ jfc.

            ys Yang Sheng added a comment -

            Hi James, could you please point out which patch relates to this issue? Thanks.

            simmonsja James A Simmons added a comment - edited

            Jian Yu is working on back-porting the work from LU-3319 to address another issue impacting b2_5. That back-port will resolve this as well. Once the fires for 2.5 are put out at the lab, I will look into helping with this port.

            People

              Assignee: Yang Sheng
              Reporter: Sebastien Buisson (Inactive)
              Votes: 0
              Watchers: 8
