[LU-3261] Threads stuck in osc_enter_cache when testing LU-1669 Created: 02/May/13  Updated: 26/Feb/15  Resolved: 14/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Critical
Reporter: Prakash Surya (Inactive) Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 1
Labels: llnl

Issue Links:
Blocker
is blocking LU-1669 lli->lli_write_mutex (single shared f... Resolved
Related
is related to LU-2576 Hangs in osc_enter_cache due to dirty... Resolved
Severity: 3
Rank (Obsolete): 8072

 Description   

I'm running a slightly modified version of master trying to test LU-1669. During my testing I've run into a situation that looks to have the exact same symptoms as LU-2576. While running a single-shared-file FIO job with 16 tasks, all of the tasks got hung under osc_enter_cache. The backtraces for these stuck threads look like this:

PID: 44453  TASK: ffff88043544f500  CPU: 13  COMMAND: "fio"
 #0 [ffff88022a1634d8] schedule at ffffffff8150dab2
 #1 [ffff88022a1635a0] cfs_waitq_wait+0xe at ffffffffa09f26fe [libcfs]
 #2 [ffff88022a1635b0] osc_enter_cache+0x85a at ffffffffa0efcd9a [osc]
 #3 [ffff88022a1636d0] osc_queue_async_io+0x1044 at ffffffffa0f04e54 [osc]
 #4 [ffff88022a163860] osc_page_cache_add+0xc9 at ffffffffa0ee97d9 [osc]
 #5 [ffff88022a1638a0] cl_page_cache_add+0x7f at ffffffffa0acf2bf [obdclass]
 #6 [ffff88022a163940] lov_page_cache_add+0x85 at ffffffffa0f7a305 [lov]
 #7 [ffff88022a163970] cl_page_cache_add+0x7f at ffffffffa0acf2bf [obdclass]
 #8 [ffff88022a163a10] vvp_io_commit_write+0x3e5 at ffffffffa074b335 [lustre]
 #9 [ffff88022a163a90] cl_io_commit_write+0xad at ffffffffa0ade6cd [obdclass]
#10 [ffff88022a163ae0] ll_commit_write+0xee at ffffffffa072046e [lustre]
#11 [ffff88022a163b40] ll_write_end+0x30 at ffffffffa0738760 [lustre]
#12 [ffff88022a163b70] generic_file_buffered_write+0x18a at ffffffff8111a21a
#13 [ffff88022a163c40] __generic_file_aio_write+0x260 at ffffffff8111bc20
#14 [ffff88022a163d00] vvp_io_write_start+0xcf at ffffffffa074d4ef [lustre]
#15 [ffff88022a163d50] cl_io_start+0x6a at ffffffffa0adb13a [obdclass]
#16 [ffff88022a163d80] cl_io_loop+0xb4 at ffffffffa0adf874 [obdclass]
#17 [ffff88022a163db0] ll_file_io_generic+0x29b at ffffffffa06f4f2b [lustre]
#18 [ffff88022a163e20] ll_file_aio_write+0x142 at ffffffffa06f5f92 [lustre]
#19 [ffff88022a163e80] ll_file_write+0x16c at ffffffffa06f627c [lustre]
#20 [ffff88022a163ef0] vfs_write+0xb8 at ffffffff81180b38
#21 [ffff88022a163f30] sys_write+0x51 at ffffffff81181431
#22 [ffff88022a163f80] system_call_fastpath+0x16 at ffffffff8100b072
    RIP: 00002aaaaaf5f4ed  RSP: 00007fffffff9ff0  RFLAGS: 00000206
    RAX: 0000000000000001  RBX: ffffffff8100b072  RCX: 000000000000012b
    RDX: 0000000000100000  RSI: 00002aaab73fe010  RDI: 0000000000000000
    RBP: 00002aaab4c5f000   R8: 0000000000000000   R9: 00007fffffffa098
    R10: 0000000000000000  R11: 0000000000000293  R12: 00002aaab4c63be0
    R13: 00000000006f9248  R14: 00000000006f9190  R15: 0000000000100000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

As a result, I haven't seen any IO from the FIO job in the past couple of hours. Looking at the dirty stats on the system, I have:

# hype202 /root > grep . /proc/fs/lustre/osc/*/*dirty*
/proc/fs/lustre/osc/lcz-OST0000-osc-ffff880839905800/cur_dirty_bytes:0
/proc/fs/lustre/osc/lcz-OST0000-osc-ffff880839905800/max_dirty_mb:32
/proc/fs/lustre/osc/lcz-OST0001-osc-ffff880839905800/cur_dirty_bytes:33554432
/proc/fs/lustre/osc/lcz-OST0001-osc-ffff880839905800/max_dirty_mb:32
# hype202 /root > grep . /proc/sys/vm/*dirty*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500
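
For reference, the cur_dirty_bytes value on OST0001 is exactly the 32 MB max_dirty_mb limit, i.e. that OSC is pinned at its per-OSC dirty limit, which is consistent with writers blocking for dirty space. A trivial sanity check of the arithmetic (nothing Lustre-specific assumed here):

#include <stdio.h>

int main(void)
{
        /* max_dirty_mb is 32; cur_dirty_bytes on OST0001 is 33554432 */
        printf("%d\n", 32 * 1024 * 1024);   /* prints 33554432 */
        return 0;
}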

The exact line number it's stuck at is here:

(gdb) l *osc_enter_cache+0x85a
0x2edca is in osc_enter_cache (/var/dumps/surya1/dev/lustre/lustre/osc/osc_cache.c:1541).
1536                    osc_io_unplug_async(env, cli, NULL);
1537
1538                    CDEBUG(D_CACHE, "%s: sleeping for cache space @ %p for %p\n",
1539                           cli->cl_import->imp_obd->obd_name, &ocw, oap);
1540
1541                    rc = l_wait_event(ocw.ocw_waitq, ocw_granted(cli, &ocw), &lwi);
1542
1543                    client_obd_list_lock(&cli->cl_loi_list_lock);
1544
1545                    /* l_wait_event is interrupted by signal */
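
For anyone not familiar with this code path: the writer is parked in a plain condition wait and relies on the writeback/grant path to wake it once dirty space is freed; if that wakeup never comes, the thread sleeps forever, which is what the backtraces show. A minimal, generic sketch of that shape using standard kernel waitqueue primitives (the names below are invented for illustration; this is not the actual osc_cache.c code):

#include <linux/wait.h>

/* Illustrative only; initialization of the waitqueue is elided. */
struct cache_waiter {
        wait_queue_head_t cw_waitq;
        int               cw_granted;   /* set by the writeback/grant path */
};

/* Writer side: sleep until the grant path reports free dirty space. */
static int wait_for_cache_space(struct cache_waiter *cw)
{
        return wait_event_interruptible(cw->cw_waitq, cw->cw_granted);
}

/* Writeback side: after flushing dirty pages and recovering grant,
 * wake the sleeping writer.  If this step never fires, the writer is
 * stuck exactly like the fio tasks above. */
static void grant_cache_space(struct cache_waiter *cw)
{
        cw->cw_granted = 1;
        wake_up(&cw->cw_waitq);
}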

In the interest of full disclosure, the client I'm running is this commit from master (i.e. not a "chaos" derivative):

commit 77aa3f2e38e94d1ac83afa2f7ef18a1ca9de92af
Author: Nathaniel Clark <nathaniel.l.clark@intel.com>
Date:   Wed Apr 24 10:10:25 2013 -0400

    LU-2831 tests: Sync MDS data also when doing sync_all_data

Plus this change:

diff --git i/lustre/llite/file.c w/lustre/llite/file.c
index e61b696..68fe7b9 100644
--- i/lustre/llite/file.c
+++ w/lustre/llite/file.c
@@ -868,7 +868,6 @@ restart:
         if (cl_io_rw_init(env, io, iot, *ppos, count) == 0) {
                 struct vvp_io *vio = vvp_env_io(env);
                 struct ccc_io *cio = ccc_env_io(env);
-                int write_mutex_locked = 0;
 
                 cio->cui_fd  = LUSTRE_FPRIVATE(file);
                 vio->cui_io_subtype = args->via_io_subtype;
@@ -883,10 +882,7 @@ restart:
 #endif
                         if ((iot == CIT_WRITE) &&
                             !(cio->cui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
-                               if (mutex_lock_interruptible(&lli->
-                                                               lli_write_mutex))
-                                        GOTO(out, result = -ERESTARTSYS);
-                                write_mutex_locked = 1;
+                               ;
                         } else if (iot == CIT_READ) {
                                down_read(&lli->lli_trunc_sem);
                         }
@@ -904,9 +900,7 @@ restart:
                         LBUG();
                 }
                 result = cl_io_loop(env, io);
-                if (write_mutex_locked)
-                       mutex_unlock(&lli->lli_write_mutex);
-                else if (args->via_io_subtype == IO_NORMAL && iot == CIT_READ)
+                if (args->via_io_subtype == IO_NORMAL && iot == CIT_READ)
                        up_read(&lli->lli_trunc_sem);
         } else {
                 /* cl_io_rw_init() handled IO */
diff --git i/lustre/llite/vvp_io.c w/lustre/llite/vvp_io.c
index db53842..746c161 100644
--- i/lustre/llite/vvp_io.c
+++ w/lustre/llite/vvp_io.c
@@ -459,8 +459,8 @@ static ssize_t lustre_generic_file_read(struct file *file,
 static ssize_t lustre_generic_file_write(struct file *file,
                                         struct ccc_io *vio, loff_t *ppos)
 {
-        return generic_file_aio_write(vio->cui_iocb, vio->cui_iov,
-                                      vio->cui_nrsegs, *ppos);
+        return __generic_file_aio_write(vio->cui_iocb, vio->cui_iov,
+                                      vio->cui_nrsegs, &vio->cui_iocb->ki_pos);
 }
 #endif

Since I am running a slightly modified version of master, I decided to open a completely new issue rather than reopening LU-2576. But I'm skeptical that this local modification would have caused the hang; I'm inclined to think this is a regression in master (though I would be more than happy to be proven wrong).
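
For context on what the patch above changes at the VFS level: in the 2.6.32-era kernel, the stock entry point takes i_mutex around the unlocked variant, so calling __generic_file_aio_write directly (together with dropping lli_write_mutex) removes the writer serialization at this layer. A simplified sketch of the stock kernel path (details such as generic_write_sync elided; not a verbatim copy of the kernel source):

ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
                               unsigned long nr_segs, loff_t pos)
{
        struct inode *inode = iocb->ki_filp->f_mapping->host;
        ssize_t ret;

        mutex_lock(&inode->i_mutex);      /* serializes writers on this inode */
        ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
        mutex_unlock(&inode->i_mutex);
        /* ... O_SYNC handling via generic_write_sync() elided ... */
        return ret;
}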

Also, the test I am running is this:

$ cat tmp.fio
; ssf.fio, single shared file
[global]
directory=/p/lcraterz/surya1/LU-1669/fio
filename=ssf
size=4g
offset_increment=4g
blocksize=1m
direct=0
ioengine=sync
numjobs=16
group_reporting=1
fallocate=none

[write]
rw=write
fsync_on_close=1

$ fio tmp.fio
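
If I read the fio options right, size=4g with offset_increment=4g and numjobs=16 means each job writes its own non-overlapping 4 GiB region of the shared file (64 GiB total). A quick check of the per-job offsets (my reading of the job file, not part of the original report):

#include <stdio.h>

int main(void)
{
        /* size=4g, offset_increment=4g, numjobs=16:
         * job N writes the byte range [N*4 GiB, (N+1)*4 GiB). */
        const long long region = 4LL * 1024 * 1024 * 1024;
        int job;

        for (job = 0; job < 16; job++)
                printf("job %2d: [%lld, %lld)\n",
                       job, job * region, (job + 1) * region);
        return 0;
}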


 Comments   
Comment by Prakash Surya (Inactive) [ 02/May/13 ]

Hmm, echo 3 > /proc/sys/vm/drop_caches doesn't clear it up like it did with LU-2576.

Comment by Jinshan Xiong (Inactive) [ 02/May/13 ]

Niu should take a look at this bug.

Comment by Niu Yawei (Inactive) [ 02/May/13 ]

Prakash, "echo 3 > /proc/sys/vm/drop_caches" didn't trigger write RPCs at all? Do you see anything abnormal in the log? Thanks.

Comment by Prakash Surya (Inactive) [ 02/May/13 ]

No, drop_caches did not trigger the write RPCs as far as I can tell. I don't see anything of interest in the lustre logs or console either. The only messages I see occurring now in the lustre log are:

00020000:00200000:9.0F:1367513100.155021:0:47294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 0 [34359742464, 18446744073709551615)
00020000:00200000:9.0:1367513100.155028:0:47294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 1 [34359738368, 18446744073709551615)
00000008:00000020:9.0:1367513100.155030:0:47294:0:(osc_cache.c:2913:osc_cache_writeback_range()) obj ffff88041fb0ade8 ready 1|-|+ wr 8192|-|- rd 0|- cache page out.
00000008:00000020:9.0:1367513100.155032:0:47294:0:(osc_cache.c:2913:osc_cache_writeback_range()) obj ffff88041fb0acc8 ready 0|-|- wr 0|-|- rd 0|- cache page out.
00000080:00200000:9.0:1367513100.155042:0:47294:0:(vvp_io.c:128:vvp_io_fini()) ignore/verify layout 1/0, layout version 0.
00020000:00200000:9.0:1367513105.155028:0:47294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 0 [34359742464, 18446744073709551615)
00020000:00200000:9.0:1367513105.155035:0:47294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 1 [34359738368, 18446744073709551615)
00000008:00000020:9.0:1367513105.155037:0:47294:0:(osc_cache.c:2913:osc_cache_writeback_range()) obj ffff88041fb0ade8 ready 1|-|+ wr 8192|-|- rd 0|- cache page out.
00000008:00000020:9.0:1367513105.155039:0:47294:0:(osc_cache.c:2913:osc_cache_writeback_range()) obj ffff88041fb0acc8 ready 0|-|- wr 0|-|- rd 0|- cache page out.
00000080:00200000:9.0:1367513105.155049:0:47294:0:(vvp_io.c:128:vvp_io_fini()) ignore/verify layout 1/0, layout version 0.
00000100:00100000:1.0F:1367513107.752047:0:44097:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:f743e3ca-f5f4-8139-6469-a6c36db7ecb7:44097:1433871909999208:10.1.1.211@o2ib9:400
00000100:00100000:11.0F:1367513107.752052:0:44098:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_4:f743e3ca-f5f4-8139-6469-a6c36db7ecb7:44098:1433871909999212:10.1.1.1@o2ib9:400
00000100:00100000:4.0F:1367513107.752071:0:44099:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_5:f743e3ca-f5f4-8139-6469-a6c36db7ecb7:44099:1433871909999216:10.1.1.2@o2ib9:400
00000100:00100000:12.0F:1367513107.752082:0:44096:0:(client.c:1441:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_2:265c93ca-bba0-ff90-7573-ba7d6f1e6717:44096:1433871909999204:10.1.1.211@o2ib9:400
00000100:00100000:1.0:1367513107.752689:0:44097:0:(client.c:1805:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_3:f743e3ca-f5f4-8139-6469-a6c36db7ecb7:44097:1433871909999208:10.1.1.211@o2ib9:400
00000100:00100000:0.0F:1367513107.752689:0:44099:0:(client.c:1805:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_5:f743e3ca-f5f4-8139-6469-a6c36db7ecb7:44099:1433871909999216:10.1.1.2@o2ib9:400
00000100:00100000:8.0F:1367513107.752733:0:44096:0:(client.c:1805:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_2:265c93ca-bba0-ff90-7573-ba7d6f1e6717:44096:1433871909999204:10.1.1.211@o2ib9:400
00000100:00100000:11.0:1367513107.752735:0:44098:0:(client.c:1805:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd_4:f743e3ca-f5f4-8139-6469-a6c36db7ecb7:44098:1433871909999212:10.1.1.1@o2ib9:400
00020000:00200000:9.0:1367513110.155030:0:47294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 0 [34359742464, 18446744073709551615)
00020000:00200000:9.0:1367513110.155037:0:47294:0:(lov_io.c:415:lov_io_iter_init()) shrink: 1 [34359738368, 18446744073709551615)
00000008:00000020:9.0:1367513110.155039:0:47294:0:(osc_cache.c:2913:osc_cache_writeback_range()) obj ffff88041fb0ade8 ready 1|-|+ wr 8192|-|- rd 0|- cache page out.
00000008:00000020:9.0:1367513110.155040:0:47294:0:(osc_cache.c:2913:osc_cache_writeback_range()) obj ffff88041fb0acc8 ready 0|-|- wr 0|-|- rd 0|- cache page out.
00000080:00200000:9.0:1367513110.155052:0:47294:0:(vvp_io.c:128:vvp_io_fini()) ignore/verify layout 1/0, layout version 0.
Comment by Oleg Drokin [ 02/May/13 ]

So just to be totally clear, can you try the same master snapshot without your patch?
Not only do you remove the semaphore; Andreas also thinks that the way you switch to __generic_file_aio_write might confuse the VFS, as it might not expect multiple threads on a single client to be able to write to the same file, since it depends on the mutex there.

Comment by Prakash Surya (Inactive) [ 02/May/13 ]

Sure, I'll try to reproduce without modifications to the master branch. FWIW, I did not hit this issue when running the stock master tag earlier, but I'll re-run that test a few times to see if I just got lucky or if it is in fact the VFS usage change causing this.

Comment by Lukasz Flis [ 14/Sep/13 ]

Dear All,

We just hit this issue in 2.4.0 and 2.4.1-RC2 on ~100 (of 1200) clients.
I just updated LU-2576 with the details.
Since we still have some clients hanging, we can provide more debugging information on this.

Best Regards

Lukasz Flis
ACC Cyfronet

Comment by Andreas Dilger [ 14/Nov/14 ]

This was fixed as part of the original http://review.whamcloud.com/6320 patch landing.
