[LU-5912] locking flaw generates logged errors Created: 12/Nov/14  Updated: 23/Apr/15  Resolved: 16/Jan/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Major
Reporter: Bob Glossman (Inactive) Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 1
Labels: None
Environment:

el6.6


Issue Links:
Duplicate
is duplicated by LU-6118 locking flaw generates logged errors Resolved
Related
Severity: 3
Rank (Obsolete): 16507

 Description   

we are violating the vfs locking scheme in el6.6 during our direct calling of f_op->fsync(). The call sequence cfs_tracefile_dump_all_pages() > filp_fsync() -> fp>f_op->fsync() -> ext4_sync_file() -> ext4_flush_unwritten_io() is triggering the following code:

        WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex) &&
                     !(inode->i_state & I_FREEING));

Comment in ext4_sync_file() says:

  • i_mutex lock is held when entering and exiting this function

We don't conform to that requirement, thus hitting the WARN_ON_ONCE() check in ext4_flush_unwritten_io().

Not sure how to repair this problem. We can't just put a mutex_lock/mutex_unlock inside or around our filp_fsync() as there are other kernel versions we actively support that lock i_mutex inside their ext4_sync_file() routines, at least el7 and sles11sp3, maybe others too.

While this isn't a fatal problem and doesn't cause panics & crashes, it is a regression.

As an example a typical instance looks like this in the syslog:

Nov 12 08:59:21 centos65-2 kernel: WARNING: at fs/ext4/inode.c:3929 ext4_flush_unwritten_io+0x74/0x80 [ext4]() (Not tainted)
Nov 12 08:59:21 centos65-2 kernel: Hardware name: VMware Virtual Platform
Nov 12 08:59:21 centos65-2 kernel: Modules linked in: jbd sha512_generic crc32c_intel libcfs(U) rfcomm sco bridge bnep l2cap autofs4 8021q garp stp llc ipv6 fuse uinput microcode vmware_balloon btusb bluetooth rfkill snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 sg i2c_piix4 i2c_core shpchp ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptspi mptscsih mptbase scsi_transport_spi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: lnet]
Nov 12 08:59:21 centos65-2 kernel: Pid: 106487, comm: lctl Not tainted 2.6.32.504.1.3.l1111 #1
Nov 12 08:59:21 centos65-2 kernel: Call Trace:
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
Nov 12 08:59:21 centos65-2 kernel: [<ffffffffa00e8bb4>] ? ext4_flush_unwritten_io+0x74/0x80 [ext4]
Nov 12 08:59:21 centos65-2 kernel: [<ffffffffa00e4fc8>] ? ext4_sync_file+0x88/0x1d0 [ext4]
Nov 12 08:59:21 centos65-2 kernel: [<ffffffffa03a3cf8>] ? cfs_tracefile_dump_all_pages+0x178/0x2c0 [libcfs]
Nov 12 08:59:21 centos65-2 kernel: [<ffffffffa03a3ecb>] ? cfs_trace_dump_debug_buffer_usrstr+0x8b/0x90 [libcfs]
Nov 12 08:59:21 centos65-2 kernel: [<ffffffffa039a583>] ? __proc_dump_kernel+0x23/0x30 [libcfs]
Nov 12 08:59:21 centos65-2 kernel: [<ffffffffa0399efb>] ? lprocfs_call_handler+0x2b/0x70 [libcfs]
Nov 12 08:59:21 centos65-2 kernel: [<ffffffffa0399f95>] ? proc_dump_kernel+0x25/0x30 [libcfs]
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff81204157>] ? proc_sys_call_handler+0x97/0xd0
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff812041a4>] ? proc_sys_write+0x14/0x20
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff8118e4a8>] ? vfs_write+0xb8/0x1a0
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff8118ee71>] ? sys_write+0x51/0x90
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff810e5ece>] ? __audit_syscall_exit+0x25e/0x290
Nov 12 08:59:21 centos65-2 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Nov 12 08:59:21 centos65-2 kernel: ---[ end trace f63906e52fe1f987 ]---


 Comments   
Comment by Andreas Dilger [ 13/Nov/14 ]

Bob, can you check what patch introduced this WARN_ON(), so we can see what the implications of this change are. It seems unlikely that RHEL would have changed the locking in the middle of a stable release, but I've been surprised before.

As for the solution, we need a configure check that determines whether the lock is held before calling fsync(), and make this locking conditional in Lustre. Ideally there would be an easy way to check this, but if not then we've resorted to grepping in the sources to check if the mutex is held in the caller or not.

On the client we don't have any expectation of access to the source tree, only the headers, so it is hopefully something we can detect from there.

Comment by Bob Glossman (Inactive) [ 13/Nov/14 ]

the obvious grep for the specific comment line about holding i_mutex won't work as a configure test. the ext4 source in sles11sp3 has an exactly similar line, but it lies. in sles11sp3 it actually does the locking inside ext4 code in spite of the comment saying otherwise.

i'm having a hard time thinking up a configure test that can make the correct determination.

Comment by Bob Glossman (Inactive) [ 13/Nov/14 ]

I think the WARN_ON_ONCE() check went into kernel code with the following commit:

commit c278531d39f3158bfee93dc67da0b77e09776de2
Author: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Fri Oct 5 11:31:55 2012 -0400

ext4: fix ext4_flush_completed_IO wait semantics

BUG #1) All places where we call ext4_flush_completed_IO are broken
because buffered io and DIO/AIO goes through three stages
1) submitted io,
2) completed io (in i_completed_io_list) conversion pended
3) finished io (conversion done)
And by calling ext4_flush_completed_IO we will flush only
requests which were in (2) stage, which is wrong because:
1) punch_hole and truncate must wait for all outstanding unwritten io
regardless to it's state.
2) fsync and nolock_dio_read should also wait because there is
a time window between end_page_writeback() and ext4_add_complete_io(
As result integrity fsync is broken in case of buffered write
to fallocated region:
fsync blkdev_completion
->filemap_write_and_wait_range
->ext4_end_bio
->end_page_writeback
<-- filemap_write_and_wait_range return
->ext4_flush_completed_IO
sees empty i_completed_io_list but pended
conversion still exist
->ext4_add_complete_io

BUG #2) Race window becomes wider due to the 'ext4: completed_io
locking cleanup V4' patch series

This patch make following changes:
1) ext4_flush_completed_io() now first try to flush completed io and when
wait for any outstanding unwritten io via ext4_unwritten_wait()
2) Rename function to more appropriate name.
3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
prevent endless wait

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>

Not exactly sure what kernel.org kernel version it first appears in.

Comment by Bob Glossman (Inactive) [ 14/Nov/14 ]

I believe the mod when the check went in may not be the issue. When the locking of i_mutex moved from the caller of a filesystem's fsync routine to inside it may be what we want to find & do a configure test for. Before then it was safe to do the locking in the caller, after then it is not.

As far as I can tell i_mutex locking is done inside the fsync as early as 3.0 (sles11sp3) kernels and is still done in the most recent kernels I have looked at.

It is only not done in 2.6.32-xxx kernels (6.5 & 6.6). It is only checked for in 6.6 kernels and some later kernels, but the check is gone from the most recent.

Comment by Bob Glossman (Inactive) [ 14/Nov/14 ]

I think the kernel mod that moved i_mutex locking into .fsync routines was this one:

commit 02c24a82187d5a628c68edfe71ae60dc135cd178
Author: Josef Bacik <josef@redhat.com>
Date: Sat Jul 16 20:44:56 2011 -0400

fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers

Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Comment by Bob Glossman (Inactive) [ 14/Nov/14 ]

As far as I can see it looks like the versions needing i_mutex locking outside ext4_file_sync() exactly matches versions where ->fsync() routines take 3 args. There is already configure tests & #ifdef's for that. I think I should be able to come up with a patch that will fit in the existing conditionals for our filp_fsync() definitions.

Comment by Bob Glossman (Inactive) [ 14/Nov/14 ]

proposed fix:
http://review.whamcloud.com/12731

Comment by Gerrit Updater [ 14/Nov/14 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/12731
Subject: LU-5912 libcfs: add i_mutex locking around fsync calls
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 71555e0963c94857a56e282c869075bcd9782dd0

Comment by Gerrit Updater [ 04/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12731/
Subject: LU-5912 libcfs: use vfs api for fsync calls
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a1f7493109cc995f43204d9a0f19a229ec5edef6

Comment by Karsten Weiss [ 12/Jan/15 ]

FWIW: We currently use Lustre 2.5.3 on CentOS 6.5 servers and see this WARNING during shutdown of our Lustre 2.5.3 clients running (officially unsupported) CentOS 6.6
(kernel-2.6.32_504.3.3). CentOS 6.6 support is rather important on the clients because of all the security fixes since CentOS 6.5.

Can we expect to see a fix for the Lustre 2.5.x maintenance branch? (I can't find a RHEL 6.6 kernel update notice in the preliminary release notes of either 2.5.4 or 2.5.5 - only in 2.7.0.)

Comment by Gerrit Updater [ 14/Jan/15 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: http://review.whamcloud.com/13404
Subject: LU-5912 libcfs: use vfs api for fsync calls
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: da52bfb0d279fbcd2df77832b0356533eef04797

Comment by Jodi Levi (Inactive) [ 16/Jan/15 ]

Patch landed to Master.

Comment by Gerrit Updater [ 27/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13404/
Subject: LU-5912 libcfs: use vfs api for fsync calls
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 88f2129504f0530106ebdbe7f5ff918d03bfcb42

Comment by Gerrit Updater [ 11/Feb/15 ]

Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/13730
Subject: LU-5912 build: Fix XeonPhi build
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a4b9ec2f73b0b6ab434ee1b624803fe1cf84a00

Comment by Gerrit Updater [ 27/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13730/
Subject: LU-5912 build: Fix XeonPhi build
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 88a0262fdea8eadd620e0adf593332203465673d

Comment by Gerrit Updater [ 05/Mar/15 ]

Dmitry Eremin (dmitry.eremin@intel.com) uploaded a new patch: http://review.whamcloud.com/13982
Subject: LU-5912 build: Fix XeonPhi build
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: 5c7cc5ebc873cb3a45f4f67bd7061125a4f457a2

Generated at Sat Feb 10 01:55:35 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.