Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0

    Description

      Testing on RHEL7.1 kernel servers, I see this warning in the dmesg:

      [ 2083.721470] ------------[ cut here ]------------
      [ 2083.721664] WARNING: at /home/green/git/lustre-release/ldiskfs/namei.c:3167 ldiskfs_orphan_add+0x11e/0x280 [ldiskfs]()
      [ 2083.721992] Modules linked in: lustre(OF) ofd(OF) osp(OF) lod(OF) ost(OF) mdt(OF) mdd(OF) mgs(OF) osd_ldiskfs(OF) ldiskfs(OF) lquota(OF) lfsck(OF) obdecho(OF) mgc(OF) lov(OF) osc(OF) mdc(OF) lmv(OF) fid(OF) fld(OF) ptlrpc(OF) obdclass(OF) ksocklnd(OF) lnet(OF) libcfs(OF) ext4(F) loop(F) mbcache(F) jbd2(F) sha512_generic(F) rpcsec_gss_krb5(F) ata_generic(F) pata_acpi(F) ttm(F) drm_kms_helper(F) drm(F) ata_piix(F) i2c_piix4(F) virtio_blk(F) serio_raw(F) virtio_balloon(F) libata(F) pcspkr(F) floppy(F) i2c_core(F) virtio_console(F) [last unloaded: libcfs]
      [ 2083.723513] CPU: 4 PID: 8690 Comm: ll_ost_io01_002 Tainted: GF       W  O--------------   3.10.0-debug #5
      [ 2083.723827] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 2083.723997]  0000000000000000 000000001b75d70c ffff88002193fa60 ffffffff816ccb68
      [ 2083.724317]  ffff88002193fa98 ffffffff81070efb ffff8800707419f8 0000000000000000
      [ 2083.724627]  ffff880046645800 ffff8800a72affa0 ffff880046644800 ffff88002193faa8
      [ 2083.724934] Call Trace:
      [ 2083.725080]  [<ffffffff816ccb68>] dump_stack+0x19/0x1b
      [ 2083.725247]  [<ffffffff81070efb>] warn_slowpath_common+0x6b/0xb0
      [ 2083.725421]  [<ffffffff8107104a>] warn_slowpath_null+0x1a/0x20
      [ 2083.725599]  [<ffffffffa093d41e>] ldiskfs_orphan_add+0x11e/0x280 [ldiskfs]
      [ 2083.725790]  [<ffffffffa096b732>] ? ldiskfs_block_zero_page_range+0x292/0x3e0 [ldiskfs]
      [ 2083.726104]  [<ffffffffa096df08>] ldiskfs_truncate+0x1c8/0x430 [ldiskfs]
      [ 2083.726295]  [<ffffffffa09fde7f>] osd_punch+0x12f/0x5e0 [osd_ldiskfs]
      [ 2083.726481]  [<ffffffffa0c601cf>] ofd_object_punch+0x6df/0xc10 [ofd]
      [ 2083.726702]  [<ffffffffa0c517b6>] ofd_punch_hdl+0x466/0x9e0 [ofd]
      [ 2083.726913]  [<ffffffffa075a3f5>] tgt_request_handle+0x645/0xfe0 [ptlrpc]
      [ 2083.727123]  [<ffffffffa070bc31>] ptlrpc_server_handle_request+0x231/0xab0 [ptlrpc]
      [ 2083.727434]  [<ffffffffa07097a8>] ? ptlrpc_wait_event+0xb8/0x360 [ptlrpc]
      [ 2083.727635]  [<ffffffffa070fb80>] ptlrpc_main+0xae0/0x1ee0 [ptlrpc]
      [ 2083.727814]  [<ffffffff816d5b97>] ? _raw_spin_unlock_irq+0x27/0x50
      [ 2083.727989]  [<ffffffff816d38be>] ? __schedule+0x2fe/0x810
      [ 2083.728182]  [<ffffffffa070f0a0>] ? ptlrpc_register_service+0xf20/0xf20 [ptlrpc]
      [ 2083.728475]  [<ffffffff8109c00a>] kthread+0xea/0xf0
      [ 2083.728639]  [<ffffffff8109bf20>] ? kthread_create_on_node+0x140/0x140
      [ 2083.728819]  [<ffffffff816df2bc>] ret_from_fork+0x7c/0xb0
      [ 2083.728986]  [<ffffffff8109bf20>] ? kthread_create_on_node+0x140/0x140
      [ 2083.729175] ---[ end trace 44cb2ff8a147856d ]---
      [ 2083.729370] ------------[ cut here ]------------
      [ 2083.729539] WARNING: at /home/green/git/lustre-release/ldiskfs/namei.c:3249 ldiskfs_orphan_del+0x17d/0x2a0 [ldiskfs]()
      [ 2083.729866] Modules linked in: lustre(OF) ofd(OF) osp(OF) lod(OF) ost(OF) mdt(OF) mdd(OF) mgs(OF) osd_ldiskfs(OF) ldiskfs(OF) lquota(OF) lfsck(OF) obdecho(OF) mgc(OF) lov(OF) osc(OF) mdc(OF) lmv(OF) fid(OF) fld(OF) ptlrpc(OF) obdclass(OF) ksocklnd(OF) lnet(OF) libcfs(OF) ext4(F) loop(F) mbcache(F) jbd2(F) sha512_generic(F) rpcsec_gss_krb5(F) ata_generic(F) pata_acpi(F) ttm(F) drm_kms_helper(F) drm(F) ata_piix(F) i2c_piix4(F) virtio_blk(F) serio_raw(F) virtio_balloon(F) libata(F) pcspkr(F) floppy(F) i2c_core(F) virtio_console(F) [last unloaded: libcfs]
      [ 2083.731374] CPU: 4 PID: 8690 Comm: ll_ost_io01_002 Tainted: GF       W  O--------------   3.10.0-debug #5
      [ 2083.731693] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 2083.731862]  0000000000000000 000000001b75d70c ffff88002193fa50 ffffffff816ccb68
      [ 2083.732180]  ffff88002193fa88 ffffffff81070efb ffff8800707419f8 ffff8800707419f8
      [ 2083.732488]  ffff8800a72affa0 ffff880046645800 0000000000000000 ffff88002193fa98
      [ 2083.732798] Call Trace:
      [ 2083.732932]  [<ffffffff816ccb68>] dump_stack+0x19/0x1b
      [ 2083.733108]  [<ffffffff81070efb>] warn_slowpath_common+0x6b/0xb0
      [ 2083.733282]  [<ffffffff8107104a>] warn_slowpath_null+0x1a/0x20
      [ 2083.733456]  [<ffffffffa093e6bd>] ldiskfs_orphan_del+0x17d/0x2a0 [ldiskfs]
      [ 2083.733647]  [<ffffffffa096e10b>] ldiskfs_truncate+0x3cb/0x430 [ldiskfs]
      [ 2083.733834]  [<ffffffffa09fde7f>] osd_punch+0x12f/0x5e0 [osd_ldiskfs]
      [ 2083.734027]  [<ffffffffa0c601cf>] ofd_object_punch+0x6df/0xc10 [ofd]
      [ 2083.734208]  [<ffffffffa0c517b6>] ofd_punch_hdl+0x466/0x9e0 [ofd]
      [ 2083.734401]  [<ffffffffa075a3f5>] tgt_request_handle+0x645/0xfe0 [ptlrpc]
      [ 2083.734604]  [<ffffffffa070bc31>] ptlrpc_server_handle_request+0x231/0xab0 [ptlrpc]
      [ 2083.734913]  [<ffffffffa07097a8>] ? ptlrpc_wait_event+0xb8/0x360 [ptlrpc]
      [ 2083.735123]  [<ffffffffa070fb80>] ptlrpc_main+0xae0/0x1ee0 [ptlrpc]
      [ 2083.735302]  [<ffffffff816d5b97>] ? _raw_spin_unlock_irq+0x27/0x50
      [ 2083.735478]  [<ffffffff816d38be>] ? __schedule+0x2fe/0x810
      [ 2083.735661]  [<ffffffffa070f0a0>] ? ptlrpc_register_service+0xf20/0xf20 [ptlrpc]
      [ 2083.735954]  [<ffffffff8109c00a>] kthread+0xea/0xf0
      [ 2083.736127]  [<ffffffff8109bf20>] ? kthread_create_on_node+0x140/0x140
      [ 2083.736307]  [<ffffffff816df2bc>] ret_from_fork+0x7c/0xb0
      [ 2083.736474]  [<ffffffff8109bf20>] ? kthread_create_on_node+0x140/0x140
      [ 2083.736657] ---[ end trace 44cb2ff8a147856e ]---
      

      The warning in question both times is:

              WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
                           !mutex_is_locked(&inode->i_mutex));
      

      So I imagine we need some sort of extra semaphore?
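
      The i_mutex tested by that check appears to be the "extra semaphore" in question. Restated as a standalone predicate (a hypothetical helper, purely illustrative, not code from the ldiskfs tree), the warning fires whenever an orphan-list operation runs on an inode that is neither new nor being freed while i_mutex is not locked:

              #include <linux/fs.h>

              /* Illustrative restatement of the WARN_ON_ONCE() condition above;
               * hypothetical helper, not part of ldiskfs. */
              static inline bool orphan_op_would_warn(struct inode *inode)
              {
                      return !(inode->i_state & (I_NEW | I_FREEING)) &&
                             !mutex_is_locked(&inode->i_mutex);
              }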

          Activity

            [LU-6446] Warn-on in ldiskfs_orphan_add/del
            ys Yang Sheng added a comment -

            Patch landed. Close ticket.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14690/
            Subject: LU-6446 ldiskfs: remove WARN_ON from ldiskfs_orphan_add

            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3daba43a191a0a2c9358d39b3f48b1b759a41819

            gerrit Gerrit Updater added a comment -

            Yang Sheng (yang.sheng@intel.com) uploaded a new patch: http://review.whamcloud.com/14690
            Subject: LU-6446 ldiskfs: remove WARN_ON from ldiskfs_orphan_add

            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 34395e31c1a1da84f412f928eacc05a7001d56f3

            bogl Bob Glossman (Inactive) added a comment - edited

            As Yang Sheng mentioned above, holding i_mutex led to deadlocks. Instead we made a patch for ldiskfs, ext4-remove-truncate-warning.patch, to delete the warning. A similar patch was later added to the patch series for el6.6 when the same warning showed up there. I strongly suspect el7.1 has these warnings in more places than el7.0 did. We need to either extend the patch to take out more warnings or figure out a way to hold i_mutex without causing deadlocks.
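
            For reference, the hunk in question conceptually just drops the assertion quoted in the description; a rough sketch (illustrative only, not the literal contents of ext4-remove-truncate-warning.patch; the file path and hunk context are approximate):

            --- a/fs/ext4/namei.c
            +++ b/fs/ext4/namei.c
            @@ ... @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
            -        WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) &&
            -                     !mutex_is_locked(&inode->i_mutex));

            A matching hunk would cover ext4_orphan_del(); after the ldiskfs renaming these become ldiskfs_orphan_add()/ldiskfs_orphan_del(), which is where the traces above report the warnings.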

            ys Yang Sheng added a comment - edited

            We did lock i_mutex around ldiskfs_truncate(), but it caused a deadlock (LU-4252), so we added an ldiskfs patch to remove the related warning messages instead. It looks like ext4_orphan_add() and ext4_orphan_del() are only called from ldiskfs_truncate() in Lustre.

            adilger Andreas Dilger added a comment - edited

            It looks like there is a requirement to hold i_mutex when calling ext4_orphan_add() and ext4_orphan_del() because the internal locking has changed to reduce contention on s_orphan_lock (landed upstream in 3.15 but backported to RHEL7.1):

            commit d745a8c20c1f864c10ca78d0f89219633861b7e9
            Author: Jan Kara <jack@suse.cz>
            Date:   Mon May 26 11:56:53 2014 -0400
            
                ext4: reduce contention on s_orphan_lock
                
                Shuffle code around in ext4_orphan_add() and ext4_orphan_del() so that
                we avoid taking global s_orphan_lock in some cases and hold it for
                shorter time in other cases.
                
                Signed-off-by: Jan Kara <jack@suse.cz>
                Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
            

            At a minimum, we need to hold i_mutex in osd_punch() around calling ldiskfs_truncate(). There may be other callpaths that call ext4_orphan_add() or ext4_orphan_del(), but they may be hidden because this is a WARN_ON_ONCE(). They will show up once osd_punch() is fixed.
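
            A minimal sketch of that suggestion (hypothetical wrapper, for illustration only; as the comments above note, holding i_mutex across ldiskfs_truncate() led to a deadlock in LU-4252, so this is not the fix that ultimately landed):

            /* Hypothetical wrapper, illustrating the suggestion only: take i_mutex
             * around the truncate in the OSD punch path so that the assertion in
             * ldiskfs_orphan_add()/ldiskfs_orphan_del() is satisfied.
             * Not the landed fix; holding i_mutex here deadlocked (LU-4252). */
            static void osd_truncate_with_i_mutex(struct inode *inode)
            {
                    mutex_lock(&inode->i_mutex);    /* what the orphan helpers check */
                    ldiskfs_truncate(inode);        /* may call ldiskfs_orphan_add/del */
                    mutex_unlock(&inode->i_mutex);
            }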

            People

              Assignee: Yang Sheng
              Reporter: Oleg Drokin
              Votes: 0
              Watchers: 8
