[LU-12536] Processes stuck in unkillable sleep waiting on IO during Lustre re-export of NFS testing
Created: 11/Jul/19  Updated: 01/Oct/20  Resolved: 27/Jul/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.13.0

Type: Bug
Priority: Minor
Reporter: Ann Koehler (Inactive)
Assignee: Ann Koehler (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When a regression test suite is run on an NFS client against an NFS-exported Lustre file system, the NFS server (which is also the Lustre client) slows down. Many of the nfsd threads are stuck in osc_extent_wait:

PID: 5989, 6017, 6018, 6022, 6023, 6024, 6025, 6026, 6027, 6028, 6029, 6030, 6031, 6032, 6033, 6034, 6035, 6036, 6037, 6038, 6039, 6040, 6041, 6042, 6043
TASKS: 25
        schedule at ffffffff8161523e
        osc_extent_wait at ffffffffa0ec96b0 [osc]
        osc_cache_wait_range at ffffffffa0ecff5c [osc]
        osc_io_fsync_end at ffffffffa0ebc7c6 [osc]
        cl_io_end at ffffffffa09d6ac5 [obdclass]
        lov_io_end_wrapper at ffffffffa0ca3314 [lov]
        lov_io_fsync_end at ffffffffa0ca366e [lov]
        cl_io_end at ffffffffa09d6ac5 [obdclass]
        cl_io_loop at ffffffffa09da0dc [obdclass]
        cl_sync_file_range at ffffffffa0d9aea5 [lustre]
        ll_writepages at ffffffffa0dc1e83 [lustre]
        do_writepages at ffffffff811519ae
        __filemap_fdatawrite_range at ffffffff81146121
        filemap_write_and_wait_range at ffffffff8114623a
        ll_fsync at ffffffffa0d9b09a [lustre]
        vfs_fsync_range at ffffffff811d925b
        vvp_io_write_start at ffffffffa0df29f7 [lustre]
        cl_io_start at ffffffffa09d6d0e [obdclass]
        cl_io_loop at ffffffffa09da0ce [obdclass]
        ll_file_io_generic at ffffffffa0d91f88 [lustre]
        ll_file_write_iter at ffffffffa0d9257d [lustre]
        do_iter_readv_writev at ffffffff811a988a
        do_readv_writev at ffffffff811aa258
        vfs_writev at ffffffff811aa50c
        nfsd_vfs_write at ffffffff812e5e02
        nfsd_write at ffffffff812e84f8
        nfsd3_proc_write at ffffffff812ed523
        nfsd_dispatch at ffffffff812e14ae
        svc_process at ffffffff815ec536
        nfsd at ffffffff812e0ef0
        kthread at ffffffff81074376
        ret_from_fork at ffffffff8161983f

These threads are waiting for the extent's oe_state to change to OES_INV, but there is no I/O pending that would cause the state to change. The ptlrpcd queues are empty, and no threads are performing synchronous I/O.

The problem was traced to a kernel change in generic_write_sync(). The new version checks for IOCB_DSYNC in iocb->ki_flags instead of checking O_DSYNC in file->f_flags and IS_SYNC() on the inode. As a result, generic_write_sync() does not write anything, and the osc_extents are not released before the wait begins.

Old function:

int generic_write_sync(struct file *file, loff_t pos, loff_t count)
{
        if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
                return 0;
        return vfs_fsync_range(file, pos, pos + count - 1,
                               (file->f_flags & __O_SYNC) ? 0 : 1);
}

New function:

static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
{
        if (iocb->ki_flags & IOCB_DSYNC) {
                int ret = vfs_fsync_range(iocb->ki_filp,
                                iocb->ki_pos - count, iocb->ki_pos - 1,
                                (iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
                if (ret)
                        return ret;
        }

        return count;
}
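
The difference between the two checks can be modeled in user space. The following is only a minimal sketch, not kernel or Lustre code: the model_* structs, the MODEL_* flag values, and the two helper functions are hypothetical stand-ins for the fields the old and new functions consult. It assumes a caller that hands over a kiocb whose ki_flags does not carry IOCB_DSYNC even though the file itself is marked for synchronous writes.

```c
#include <stdbool.h>

/* Illustrative flag values only; the real kernel constants differ. */
#define MODEL_O_DSYNC    0x1u  /* stands in for O_DSYNC in file->f_flags */
#define MODEL_IOCB_DSYNC 0x1u  /* stands in for IOCB_DSYNC in iocb->ki_flags */

/* Hypothetical stand-in for the parts of struct file the old check reads. */
struct model_file {
        unsigned int f_flags;
        bool inode_is_sync;  /* stands in for IS_SYNC(file->f_mapping->host) */
};

/* Hypothetical stand-in for the parts of struct kiocb the new check reads. */
struct model_kiocb {
        const struct model_file *ki_filp;
        unsigned int ki_flags;
};

/* Old predicate: the decision is taken from the file and its inode. */
bool old_would_sync(const struct model_file *file)
{
        return (file->f_flags & MODEL_O_DSYNC) || file->inode_is_sync;
}

/* New predicate: the decision is taken from the kiocb flags only. */
bool new_would_sync(const struct model_kiocb *iocb)
{
        return (iocb->ki_flags & MODEL_IOCB_DSYNC) != 0;
}
```

In this model, a file with MODEL_O_DSYNC set makes old_would_sync() return true, while a model_kiocb for the same file with ki_flags left at 0 makes new_would_sync() return false. That matches the behavior described above, where the newer generic_write_sync() returns without calling vfs_fsync_range().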


 Comments   
Comment by Gerrit Updater [ 11/Jul/19 ]

Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/35472
Subject: LU-12536 llite: release active extent on sync write commit
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 05749198c9c7fc15e2aa3f29af98511da0d2963e

Comment by Gerrit Updater [ 27/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35472/
Subject: LU-12536 llite: release active extent on sync write commit
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a9af7100ce72ece9c7a37c4d2c28b54fcf68b562

Comment by Peter Jones [ 27/Jul/19 ]

Landed for 2.13
