[LU-6390] lru_size on the OSC is not honored Created: 19/Mar/15 Updated: 01/Nov/18 Resolved: 19/Jun/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Attachments: | |
| Issue Links: | |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Here are all the results with 200K files. Even though lru_size is explicitly set to 1000 on every namespace, each OSC namespace ends up holding ~50,000 locks after the recursive listing; only the MDC namespace stays near the configured limit:

# lctl set_param ldlm.namespaces.*.lru_size=1000
ldlm.namespaces.MGC10.0.10.153@o2ib.lru_size=1000
ldlm.namespaces.lustre-MDT0000-mdc-ffff881fccd75800.lru_size=1000
ldlm.namespaces.lustre-OST0000-osc-ffff881fccd75800.lru_size=1000
ldlm.namespaces.lustre-OST0001-osc-ffff881fccd75800.lru_size=1000
ldlm.namespaces.lustre-OST0002-osc-ffff881fccd75800.lru_size=1000
ldlm.namespaces.lustre-OST0003-osc-ffff881fccd75800.lru_size=1000
# ls -lR /lustre
# lctl get_param ldlm.namespaces.*.lock_count
ldlm.namespaces.MGC10.0.10.153@o2ib.lock_count=4
ldlm.namespaces.lustre-MDT0000-mdc-ffff881fccd75800.lock_count=1002
ldlm.namespaces.lustre-OST0000-osc-ffff881fccd75800.lock_count=50003
ldlm.namespaces.lustre-OST0001-osc-ffff881fccd75800.lock_count=50002
ldlm.namespaces.lustre-OST0002-osc-ffff881fccd75800.lock_count=50003
ldlm.namespaces.lustre-OST0003-osc-ffff881fccd75800.lock_count=50004 |
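To spell out the reproduction steps, here is a minimal sketch of the scenario above as a script. The /lustre mount point, the 200K file count, and the lru_size value come from the output above; the lru_test directory name and the creation loop are illustrative assumptions, not the reporter's actual test:

```
#!/bin/bash
# Sketch: reproduce the OSC lru_size overrun described above.
# Assumes a mounted Lustre client; lru_test is a hypothetical directory.
MNT=/lustre
NFILES=200000

# Cap every LDLM namespace LRU at 1000 locks (a non-zero value also
# disables dynamic LRU resizing).
lctl set_param ldlm.namespaces.*.lru_size=1000

# Create enough files that a recursive listing would need far more
# than 1000 locks per OSC namespace if the limit were not enforced.
mkdir -p "$MNT/lru_test"
for i in $(seq 1 "$NFILES"); do
    touch "$MNT/lru_test/f$i"
done

# Walk the tree so the client enqueues MDC and OSC DLM locks.
ls -lR "$MNT" > /dev/null

# With the bug, each osc namespace reports roughly NFILES/num_OSTs
# locks instead of ~1000; the mdc namespace honors the limit.
lctl get_param ldlm.namespaces.*.lock_count
```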
| Comments |
| Comment by Gerrit Updater [ 02/Apr/15 ] |
|
Vitaly Fertman (vitaly_fertman@xyratex.com) uploaded a new patch: http://review.whamcloud.com/14342 |
| Comment by Ann Koehler (Inactive) [ 28/May/15 ] |
|
Customer site installed http://review.whamcloud.com/14342 on their clients. After ~6 hours of running, a data mover client hung. The node is not configured to take memory dumps, so we captured stack traces from /proc. The stack traces are reminiscent of

PIDs 10730 (ll_agl_21508) and 11228 (ll_agl_21487):
[<ffffffffa0f4eb90>] osc_extent_wait+0x420/0x670 [osc]
[<ffffffffa0f4f0af>] osc_cache_wait_range+0x2cf/0x890 [osc]
[<ffffffffa0f50281>] osc_cache_writeback_range+0xc11/0xfb0 [osc]
[<ffffffffa0f3b6f4>] osc_lock_flush+0x84/0x280 [osc]
[<ffffffffa0f3b9d7>] osc_lock_cancel+0xe7/0x1c0 [osc]
[<ffffffffa0b4cbf5>] cl_lock_cancel0+0x75/0x160 [obdclass]
[<ffffffffa0b4d7ab>] cl_lock_cancel+0x13b/0x140 [obdclass]
[<ffffffffa0f3cf1a>] osc_ldlm_blocking_ast+0x13a/0x350 [osc]
[<ffffffffa0cf703c>] ldlm_cancel_callback+0x6c/0x1a0 [ptlrpc]
[<ffffffffa0d06eaa>] ldlm_cli_cancel_local+0x8a/0x470 [ptlrpc]
[<ffffffffa0d0a1ae>] ldlm_cli_cancel_list_local+0xee/0x290 [ptlrpc]
[<ffffffffa0d0b055>] ldlm_cancel_lru_local+0x35/0x40 [ptlrpc]
[<ffffffffa0d0c4cc>] ldlm_prep_elc_req+0x3ec/0x4b0 [ptlrpc]
[<ffffffffa0d0c5b8>] ldlm_prep_enqueue_req+0x28/0x30 [ptlrpc]
[<ffffffffa0f205d9>] osc_enqueue_base+0x109/0x5a0 [osc]
[<ffffffffa0f3c5cd>] osc_lock_enqueue+0x1ed/0x890 [osc]
[<ffffffffa0b50c2c>] cl_enqueue_try+0xfc/0x300 [obdclass]
[<ffffffffa0fce64a>] lov_lock_enqueue+0x21a/0xf10 [lov]
[<ffffffffa0b50c2c>] cl_enqueue_try+0xfc/0x300 [obdclass]
[<ffffffffa0b51b4f>] cl_enqueue_locked+0x6f/0x1f0 [obdclass]
[<ffffffffa0b5279e>] cl_lock_request+0x7e/0x270 [obdclass]
[<ffffffffa109d000>] cl_glimpse_lock+0x180/0x490 [lustre]
[<ffffffffa109d875>] cl_glimpse_size0+0x1a5/0x1d0 [lustre]
[<ffffffffa1095ffb>] ll_agl_trigger+0x1db/0x4b0 [lustre]
[<ffffffffa1096e6e>] ll_agl_thread+0x15e/0x490 [lustre]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

PID 6143 (ptlrpcd_0) + 10 more ptlrpcd threads (out of 32):
[<ffffffffa0b4e6df>] cl_lock_mutex_get+0x6f/0xd0 [obdclass]
[<ffffffffa0fd5b19>] lovsub_parent_lock+0x49/0x120 [lov]
[<ffffffffa0fd6c4f>] lovsub_lock_modify+0x7f/0x1e0 [lov]
[<ffffffffa0b4e108>] cl_lock_modify+0x98/0x310 [obdclass]
[<ffffffffa0f3de32>] osc_lock_granted+0x1e2/0x2b0 [osc]
[<ffffffffa0f3e308>] osc_lock_upcall+0x408/0x600 [osc]
[<ffffffffa0f1e7a6>] osc_enqueue_fini+0x106/0x240 [osc]
[<ffffffffa0f23272>] osc_enqueue_interpret+0xe2/0x1e0 [osc]
[<ffffffffa0d2487c>] ptlrpc_check_set+0x2bc/0x1b50 [ptlrpc]
[<ffffffffa0d500cb>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
[<ffffffffa0d5071b>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

PID 6209 (ldlm_bl_00) + 46 other ldlm_bl threads (out of 60):
[<ffffffffa0b4e6df>] cl_lock_mutex_get+0x6f/0xd0 [obdclass]
[<ffffffffa0f3ce5a>] osc_ldlm_blocking_ast+0x7a/0x350 [osc]
[<ffffffffa0d0f0c0>] ldlm_handle_bl_callback+0x130/0x400 [ptlrpc]
[<ffffffffa0d0f5f1>] ldlm_bl_thread_main+0x261/0x3c0 [ptlrpc]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

I'll attach the complete list of stack traces to this ticket. Let me know whether you need a dump and I'll see if we can reproduce the bug on a test system. |
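As an aside, on a node that cannot take a memory dump, per-thread kernel stacks like those above can be harvested from /proc; a minimal sketch, assuming root access, with the thread-name patterns taken from this hang and the bt.all file name matching the attachment mentioned above:

```
#!/bin/bash
# Sketch: collect kernel stacks for the Lustre threads implicated above.
# /proc/<pid>/stack requires root and a kernel built with CONFIG_STACKTRACE.
OUT=bt.all
: > "$OUT"
for pid in $(pgrep 'll_agl|ptlrpcd|ldlm_bl'); do
    comm=$(ps -p "$pid" -o comm=)
    echo "== PID $pid ($comm) ==" >> "$OUT"
    cat "/proc/$pid/stack" >> "$OUT" 2>/dev/null
done
```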
| Comment by Ann Koehler (Inactive) [ 28/May/15 ] |
|
ps output followed by |
| Comment by Ann Koehler (Inactive) [ 28/May/15 ] |
|
Unique stack traces for Lustre processes extracted from bt.all |
| Comment by Ann Koehler (Inactive) [ 28/May/15 ] |
|
dmesg file from the data mover node. Shows partial output from the forced stack trace dump. The dump was taken at least 30 minutes before the /proc/pid/stack output was captured, so comparing dmesg against bt.all shows that the threads are indeed hung. |
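For completeness, an all-task stack dump of the kind captured in this dmesg is typically forced via sysrq; a sketch under that assumption (the comment does not say which trigger was actually used):

```
# Sketch: force a stack dump of all tasks into the kernel ring buffer.
# The buffer can wrap, which would explain partial output in dmesg.
echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it is disabled
echo t > /proc/sysrq-trigger      # dump every task's state and stack
dmesg > dmesg.sysrq-t             # save whatever remains in the buffer
```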
| Comment by Vitaly Fertman [ 28/May/15 ] |
|
cl_lock_mutex_get does not exist in 2.8; it was removed by the CLIO simplification, so it was not 2.8 Lustre that you tested. The patch by itself is supposed to be correct; I think the problem is related to the issue raised in |
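A quick way to confirm which code base a trace came from is to look for the symbol itself; a sketch, assuming a Lustre source checkout and shell access to the running client (paths are illustrative):

```
# Sketch: check whether cl_lock_mutex_get exists in the code under test.
# The symbol is present in 2.5-era clients and gone from 2.8.
grep -rn 'cl_lock_mutex_get' lustre/obdclass/   # in a source checkout
grep 'cl_lock_mutex_get' /proc/kallsyms         # on the running client
```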
| Comment by Ann Koehler (Inactive) [ 28/May/15 ] |
|
Sorry, I forgot to mention: the client is running 2.5.1 on CentOS 2.6.32-431.20.3.el6.x86_64 (same system as in LELUS-294). So are you saying that the |
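For the record, the version actually loaded on a client can be confirmed on the node itself; a minimal sketch using standard interfaces:

```
# Sketch: report the Lustre and kernel versions running on the client.
lctl get_param version            # e.g. lustre: 2.5.1
uname -r                          # e.g. 2.6.32-431.20.3.el6.x86_64
modinfo lustre | grep -i version  # build/version info for the module
```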
| Comment by Gerrit Updater [ 19/Jun/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/14342/ |
| Comment by Peter Jones [ 19/Jun/15 ] |
|
Landed for 2.8 |