Details
-
Bug
-
Resolution: Duplicate
-
Major
-
None
-
Lustre 2.6.0
-
None
-
The bug occurred during IOStress testing using code from master on SLES11 SP3. I assume the bug is in 2.6 because LU-3321 landed in that version.
-
3
-
12947
Description
Several application processes hang trying to get a write lock on ll_inode_info.lli_trunc_sem in ll_setattr_raw(). Each process appears to be deadlocked on itself: the call to ll_file_io_generic() earlier in the call stack acquires a read lock on the same semaphore, which prevents the write lock from being granted in ll_setattr_raw().
This bug was introduced by LU-3321, review.whamcloud.com/7893.
> crash> bt
> PID: 10475 TASK: ffff880837ae67f0 CPU: 0 COMMAND: "nsystst"
> #0 [ffff88083cf05698] schedule at ffffffff8144947f
> #1 [ffff88083cf057f0] rwsem_down_failed_common at ffffffff8144b6d5
> #2 [ffff88083cf05860] rwsem_down_write_failed at ffffffff8144b783
> #3 [ffff88083cf05870] call_rwsem_down_write_failed at ffffffff81219c43
> #4 [ffff88083cf058d0] ll_setattr_raw at ffffffffa07ed590 [lustre]
> #5 [ffff88083cf059b0] ll_setattr at ffffffffa07ee557 [lustre]
> #6 [ffff88083cf059c0] notify_change at ffffffff8116e1f0
> #7 [ffff88083cf05a30] file_remove_suid at ffffffff810fa3e1
> #8 [ffff88083cf05ab0] __generic_file_aio_write at ffffffff810fcd29
> #9 [ffff88083cf05b60] generic_file_aio_write at ffffffff810fcfc9
> #10 [ffff88083cf05ba0] vvp_io_write_start at ffffffffa0825cb0 [lustre]
> #11 [ffff88083cf05c00] cl_io_start at ffffffffa0365682 [obdclass]
> #12 [ffff88083cf05c30] cl_io_loop at ffffffffa0369204 [obdclass]
> #13 [ffff88083cf05c60] ll_file_io_generic at ffffffffa07c3062 [lustre]
> #14 [ffff88083cf05ce0] ll_file_aio_write at ffffffffa07c355e [lustre]
> #15 [ffff88083cf05d30] do_sync_readv_writev at ffffffff811539cb
> #16 [ffff88083cf05e40] do_readv_writev at ffffffff811548d4
> #17 [ffff88083cf05f30] vfs_writev at ffffffff81154a28
> #18 [ffff88083cf05f40] sys_writev at ffffffff81154b65
> #19 [ffff88083cf05f80] system_call_fastpath at ffffffff8145376b
> crash> files | egrep "PID|husk1"
> PID: 10475 TASK: ffff880837ae67f0 CPU: 0 COMMAND: "nsystst"
> 3 ffff880835e43bc0 ffff8808000206c0 ffff880837e05178 REG /dsl/lus/husk1/ostest.vers/CL_nsystst03.2672/nsys_base.2
lli_trunc_sem info:
> crash> eval 0xffff880837e05178 - 248 | grep hex
> hexadecimal: ffff880837e05080
> crash> ll_inode_info ffff880837e05080 | grep -A 15 trunc_sem
> f_trunc_sem = {
> count = -4294967295, = 0xffffffff00000001
> wait_lock = {
> {
> rlock = {
> raw_lock = {
> slock = 2313
> }
> }
> }
> },
> wait_list = {
> next = 0xffff88083cf057f8,
> prev = 0xffff88083cf057f8
> }
> },
> crash> semaphore_waiter 0xffff88083cf057f8
> struct semaphore_waiter {
> list = {
> next = 0xffff880837e05440,
> prev = 0xffff880837e05440
> },
> task = 0xffff880837ae67f0,
> up = 2
> }
> crash> ps | grep ffff880837ae67f0
> 10475 1 0 ffff880837ae67f0 UN 0.0 131484 5112 nsystst
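The count value in the dump above can be decoded by hand. The sketch below assumes the x86_64 rw_semaphore count layout used by kernels of that era (low 32 bits hold the number of active lockers, and a waiting bias of -0x100000000 is added while tasks are queued); the constants are restated here for illustration and the helper names are hypothetical. Under that assumption, -4294967295 is the waiting bias plus one active locker. Taken together with the non-empty wait_list and the backtrace, this is consistent with one reader holding the semaphore while a writer waits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed x86_64 rwsem count layout; restated for illustration only. */
#define RWSEM_ACTIVE_MASK  0xffffffffLL
#define RWSEM_ACTIVE_BIAS  1LL
#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK - 1)

/* Number of active lockers: the low 32 bits of the count. */
static int64_t rwsem_active_count(int64_t count)
{
    return count & RWSEM_ACTIVE_MASK;
}

/* Nonzero when the count is negative: a writer holds the lock or tasks are queued. */
static int rwsem_count_negative(int64_t count)
{
    return count < 0;
}

/* Print a one-line summary of a raw count value. */
static void rwsem_describe(int64_t count)
{
    /* Sanity check on the assumed bias value. */
    assert(RWSEM_WAITING_BIAS == -4294967296LL);
    printf("count=%lld active=%lld negative=%d\n",
           (long long)count,
           (long long)rwsem_active_count(count),
           rwsem_count_negative(count));
}
```

For example, rwsem_describe(-4294967295LL) reports one active locker and a negative count. Note that the same raw value would also describe an uncontended write lock, so the wait_list contents and the backtrace are what identify the holder here as a reader.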
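LU-3321/7893 changed the logic in ll_file_io_generic() to always acquire the lli_trunc_sem semaphore for reading in the IO_NORMAL case. Previously, the semaphore was acquired only in the read path, where ll_setattr() is never called. With the change, the write path also holds the read lock, and as the backtrace shows, file_remove_suid() can then call notify_change(), which reaches ll_setattr_raw() and its write-lock attempt while the read lock is still held.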
From lustre/llite/file.c:ll_file_io_generic:
> case IO_NORMAL:
> cio->cui_iov = args->u.normal.via_iov;
> cio->cui_nrsegs = args->u.normal.via_nrsegs;
> cio->cui_tot_nrsegs = cio->cui_nrsegs;
> cio->cui_iocb = args->u.normal.via_iocb;
> if ((iot == CIT_WRITE) &&
> !(cio->cui_fd->fd_flags & LL_FILE_GROUP_LOCKED)) {
> if (mutex_lock_interruptible(&lli->
> - lli_write_mutex))
> - GOTO(out, result = -ERESTARTSYS);
> - write_mutex_locked = 1;
> - } else if (iot == CIT_READ) {
> - down_read(&lli->lli_trunc_sem);
> - }
> + lli_write_mutex))
> + GOTO(out, result = -ERESTARTSYS);
> + write_mutex_locked = 1;
> + }
> + down_read(&lli->lli_trunc_sem);
> break;
> case IO_SENDFILE:
> vio->u.sendfile.cui_actor = args->u.sendfile.via_actor;
Issue Links
- duplicates LU-4627 Client deadlock on ll_setattr_raw (Resolved)