[LU-1735] sanityn 18 hung Created: 10/Aug/12 Updated: 20/Sep/12 Resolved: 20/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.3.0 |
| Fix Version/s: | Lustre 2.3.0, Lustre 2.4.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Li Wei (Inactive) | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | Single VM. |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 2210 |
| Description |
|
Observed this potential recursive locking of mm_sem on orion_head_sync:

Lustre: DEBUG MARKER: == sanityn test 18: mmap sanity check =================================== 10:53:49 (1344567229)
INFO: task mmap_sanity:29252 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mmap_sanity   D 0000000000000000     0 29252  29230 0x00000080
 ffff8800095116e0 0000000000000082 ffff880011a75e40 ffff880009511668
 ffff880022934940 000000000000001d ffff880011a75300 ffff880022934940
 ffff88002f077ab8 ffff880009511fd8 000000000000fb88 ffff88002f077ab8
Call Trace:
 [<ffffffff81500185>] rwsem_down_failed_common+0x95/0x1d0
 [<ffffffff81500316>] rwsem_down_read_failed+0x26/0x30
 [<ffffffff8127e924>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff814ff814>] ? down_read+0x24/0x30
 [<ffffffffa030e838>] cfs_get_environ+0x158/0x5f0 [libcfs]
 [<ffffffffa04dd317>] lustre_get_jobid+0x107/0x2c0 [obdclass]
 [<ffffffffa07c1f72>] ptlrpcd_add_req+0x52/0x350 [ptlrpc]
 [<ffffffffa07705db>] ? ldlm_cli_enqueue+0x1eb/0x790 [ptlrpc]
 [<ffffffffa0773aef>] ? ldlm_prep_elc_req+0x23f/0x530 [ptlrpc]
 [<ffffffffa08dd830>] ? osc_ldlm_glimpse_ast+0x0/0x170 [osc]
 [<ffffffffa08c6f96>] osc_enqueue_base+0x426/0x580 [osc]
 [<ffffffffa08dde64>] osc_lock_enqueue+0x204/0x850 [osc]
 [<ffffffffa08df100>] ? osc_lock_upcall+0x0/0x600 [osc]
 [<ffffffffa053491c>] cl_enqueue_try+0xfc/0x300 [obdclass]
 [<ffffffffa09599ea>] lov_lock_enqueue+0x23a/0x830 [lov]
 [<ffffffffa053491c>] cl_enqueue_try+0xfc/0x300 [obdclass]
 [<ffffffffa0535e1d>] cl_enqueue_locked+0x6d/0x210 [obdclass]
 [<ffffffffa0536ace>] cl_lock_request+0x7e/0x280 [obdclass]
 [<ffffffffa0c1765b>] cl_glimpse_lock+0x17b/0x4a0 [lustre]
 [<ffffffffa0c17ee7>] cl_glimpse_size0+0x187/0x190 [lustre]
 [<ffffffffa0c04040>] ll_file_mmap+0xe0/0x220 [lustre]
 [<ffffffff81145b00>] mmap_region+0x400/0x590
 [<ffffffff81145fca>] do_mmap_pgoff+0x33a/0x380
 [<ffffffff81135a00>] sys_mmap_pgoff+0x200/0x2d0
 [<ffffffff810104e9>] sys_mmap+0x29/0x30
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b |
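The trace shows the task deadlocking on itself: do_mmap_pgoff() already holds mm->mmap_sem for write when it calls ll_file_mmap(), and the glimpse/enqueue path eventually reaches lustre_get_jobid() -> cfs_get_environ(), which takes the same semaphore for read. A condensed, illustrative sketch of that lock pattern (not the actual Lustre code) looks like this:

#include <linux/mm.h>
#include <linux/rwsem.h>

/* Frame corresponding to cfs_get_environ()/cfs_access_process_vm(). */
static void read_environ_sketch(struct mm_struct *mm)
{
	down_read(&mm->mmap_sem);	/* never returns: the writer is this same task */
	/* ... copy environment strings out of the address space ... */
	up_read(&mm->mmap_sem);
}

/* Frame corresponding to do_mmap_pgoff() -> ll_file_mmap() -> glimpse. */
static void mmap_path_sketch(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);	/* held across the filesystem ->mmap() method */
	read_environ_sketch(mm);	/* lustre_get_jobid() ends up here */
	up_write(&mm->mmap_sem);
}

Since a Linux rw_semaphore is not recursive, down_read() on a semaphore the same task already holds for write blocks forever, which is exactly the hung task reported above.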
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 10/Aug/12 ] |
|
I saw this problem somewhere else. This problem was introduced by the jobid code. |
| Comment by Andreas Dilger [ 10/Aug/12 ] |
|
Moved this over to LU, since this is a bug on master, not only Orion. It looks like the problem is that ptlrpc_set_add_req() always calls lustre_get_jobid(), even though the jobid may have already been set in the request by the caller, and in other cases it isn't needed at all. That introduces more overhead than necessary and opens the code to this deadlock under mmap IO, where cfs_access_process_vm() locks mm->mmap_sem at the same time as the mmap code does. I've submitted http://review.whamcloud.com/3604 to hopefully fix this problem and also make the jobid code a bit more efficient. |
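In outline, the change described above amounts to filling in the jobid only when the caller has not already done so, so the environment walk (and its down_read() of mm->mmap_sem) is skipped on the mmap path. A rough sketch of that idea, not the actual patch in http://review.whamcloud.com/3604; the struct, field name rq_jobid, and buffer size here are placeholders:

/* Placeholder type and names; lustre_get_jobid() is the real helper seen in the trace. */
#define JOBID_SIZE_SKETCH 32			/* assumed jobid buffer size */

struct ptlrpc_request_sketch {
	char rq_jobid[JOBID_SIZE_SKETCH];	/* jobid carried with the RPC */
};

int lustre_get_jobid(char *jobid);		/* declared only so the sketch is self-contained */

static void set_add_req_sketch(struct ptlrpc_request_sketch *req)
{
	/*
	 * Look up the jobid only if the caller has not already set it.
	 * The lookup may call cfs_get_environ(), which takes
	 * mm->mmap_sem for read, so skipping it avoids the deadlock on
	 * paths (like mmap) where mmap_sem is already held for write.
	 */
	if (req->rq_jobid[0] == '\0')
		lustre_get_jobid(req->rq_jobid);

	/* ... rest of ptlrpc_set_add_req() ... */
}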
| Comment by Andreas Dilger [ 17/Aug/12 ] |
|
Updated patch submitted for testing. |
| Comment by Peter Jones [ 22/Aug/12 ] |
|
The blocking fix has landed for both 2.3 and 2.4. Lowering the priority but keeping this open to track the additional code cleanup. |
| Comment by Andreas Dilger [ 31/Aug/12 ] |
|
The rest of the cleanup is in http://review.whamcloud.com/3713 |
| Comment by Niu Yawei (Inactive) [ 18/Sep/12 ] |
|
Minor fix for the previous cleanup patch: http://review.whamcloud.com/4024 |
| Comment by Peter Jones [ 20/Sep/12 ] |
|
Landed for 2.3 and 2.4. |