[LU-465] sanity.sh test_99e failed with timeout Created: 27/Jun/11  Updated: 06/Sep/19  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:
  • jenkins-ga832ab5-PRISTINE-2.6.18-238.12.1.el5 (i686)
  • jenkins-ga832ab5-PRISTINE-2.6.18-238.12.1.el5_lustre (i686)

Issue Links:
Duplicate
duplicates LU-4300 ptlrpcd threads deadlocked in cl_lock... Resolved
Severity: 3
Rank (Obsolete): 4273

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/6dd5828c-a049-11e0-aee5-52540025f9af.



 Comments   
Comment by Sarah Liu [ 27/Jun/11 ]

10:18:28:Lustre: DEBUG MARKER: == sanity test 99e: cvs update ======================================================================= 03:18:27 (1309083507)
10:18:29:Lustre: 8994:0:(sec.c:407:sptlrpc_req_get_ctx()) maximum lustre stack 3140
10:18:29: [<f916e52c>] sptlrpc_req_get_ctx+0x28c/0x580 [ptlrpc]
10:18:29: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:29:low stack detected by irq handler
10:18:29: [<c04074c4>] do_IRQ+0x87/0xc3
10:18:29: [<c040597a>] common_interrupt+0x1a/0x20
10:18:29: [<c0425e94>] vprintk+0x2b4/0x2e8
10:18:29: [<c0425ee0>] printk+0x18/0x8e
10:18:29: [<c0440d63>] __print_symbol+0x1a/0x23
10:18:29: [<c04258c3>] release_console_sem+0x17e/0x1b8
10:18:29: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:29: [<f8d8e139>] cfs_print_to_console+0xb9/0x100 [libcfs]
10:18:29: [<f8d9e4d2>] libcfs_debug_vmsg2+0x562/0x9e0 [libcfs]
10:18:29: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:29: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:29: [<c0425ee0>] printk+0x18/0x8e
10:18:29: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:29: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:29: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:30: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:30: [<c0405fe8>] print_trace_address+0x1b/0x24
10:18:30: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:30: [<c040603f>] dump_trace+0x4e/0x96
10:18:30: [<c0406097>] show_trace_log_lvl+0x10/0x20
10:18:30: [<c040656f>] show_trace+0xa/0xc
10:18:30: [<c040666c>] dump_stack+0x15/0x17
10:18:30: [<f916e52c>] sptlrpc_req_get_ctx+0x28c/0x580 [ptlrpc]
10:18:30: [<f91040e2>] __ptlrpc_request_bufs_pack+0x62/0x5d0 [ptlrpc]
10:18:30: [<f91688c3>] req_capsule_init+0x43/0xa0 [ptlrpc]
10:18:30: [<f90feedc>] ptlrpc_prep_req_from_pool+0x1c/0xe0 [ptlrpc]
10:18:30: [<f9168831>] req_capsule_set+0x21/0x70 [ptlrpc]
10:18:30: [<f9168e9b>] req_capsule_set_size+0x5b/0x2a0 [ptlrpc]
10:18:30: [<f9103b1a>] ptlrpc_request_alloc_internal+0x8a/0x510 [ptlrpc]
10:18:30: [<f9104c0a>] ptlrpc_request_bufs_pack+0x4a/0x60 [ptlrpc]
10:18:30: [<f9104c37>] ptlrpc_request_pack+0x17/0x20 [ptlrpc]
10:18:30: [<f94ced64>] osc_brw_prep_request+0x274/0x1d90 [osc]
10:18:30: [<f9af9c0a>] cl_lock_at_page+0x2ba/0x2d0 [obdclass]
10:18:30: [<f9b08add>] cl_req_slice_add+0x5d/0x230 [obdclass]
10:18:30: [<f9505bc0>] osc_req_attr_set+0x0/0x3d0 [osc]
10:18:31: [<f9505ca7>] osc_req_attr_set+0xe7/0x3d0 [osc]
10:18:31: [<f934b315>] ccc_req_attr_set+0x75/0x150 [lustre]
10:18:31: [<f9505bc0>] osc_req_attr_set+0x0/0x3d0 [osc]
10:18:31: [<f9b07de2>] cl_req_page_add+0x82/0x2d0 [obdclass]
10:18:31: [<f94da990>] osc_send_oap_rpc+0x1480/0x2d10 [osc]
10:18:31: [<f92e1388>] vvp_write_pending+0x58/0x230 [lustre]
10:18:31: [<f94dc3c0>] osc_check_rpcs+0x1a0/0x690 [osc]
10:18:31: [<f9b01cc9>] cl_page_list_move+0x69/0x240 [obdclass]
10:18:31: [<c04f1ff7>] vsnprintf+0x2e7/0x4db
10:18:31: [<f94d8636>] loi_list_maint+0x66/0xe0 [osc]
10:18:31: [<f950622c>] osc_io_submit+0x29c/0x590 [osc]
10:18:31: [<f926ef02>] lov_page_stripe+0x52/0x250 [lov]
10:18:31: [<f9505f90>] osc_io_submit+0x0/0x590 [osc]
10:18:31: [<f9b015df>] cl_io_submit_rw+0x8f/0x2b0 [obdclass]
10:18:31: [<f9272d69>] lov_io_submit+0x4d9/0x1150 [lov]
10:18:31: [<f9af37d1>] cl_page_get+0x51/0x220 [obdclass]
10:18:31: [<f935246c>] vvp_page_unmap+0x4c/0xa0 [lustre]
10:18:31: [<f9352420>] vvp_page_unmap+0x0/0xa0 [lustre]
10:18:31: [<f9aeee7a>] cl_page_invoke+0x8a/0x270 [obdclass]
10:18:31: [<f9272890>] lov_io_submit+0x0/0x1150 [lov]
10:18:31: [<f9b015df>] cl_io_submit_rw+0x8f/0x2b0 [obdclass]
10:18:32: [<f9af2631>] cl_page_gang_lookup+0x1e1/0x4c0 [obdclass]
10:18:32: [<f9b01a8b>] cl_sync_io_init+0x4b/0x200 [obdclass]
10:18:32: [<f9aef06f>] cl_page_unmap+0xf/0x20 [obdclass]
10:18:32: [<f9b05ad1>] cl_io_submit_sync+0xb1/0x1b0 [obdclass]
10:18:32: [<f9afeee9>] cl_lock_page_out+0x1e9/0x540 [obdclass]
10:18:32: [<f9501218>] osc_lock_flush+0x38/0x70 [osc]
10:18:32: [<f9502c04>] osc_lock_cancel+0x44/0x1e0 [osc]
10:18:32: [<f9af796d>] cl_lock_trace0+0x12d/0x1a0 [obdclass]
10:18:32: [<f9af6dd8>] cl_lock_cancel0+0x78/0x230 [obdclass]
10:18:32: [<f9af7a99>] cl_lock_cancel+0xb9/0x230 [obdclass]
10:18:32: [<f9af7caf>] cl_lock_mutex_tail+0x3f/0x50 [obdclass]
10:18:32: [<f95039dd>] osc_ldlm_blocking_ast+0x24d/0x320 [osc]
10:18:32: [<f9503790>] osc_ldlm_blocking_ast+0x0/0x320 [osc]
10:18:32: [<f90a647f>] ldlm_cancel_callback+0x6f/0x1a0 [ptlrpc]
10:18:32: [<f8da299e>] cfs_hash_bd_from_key+0x2e/0xa0 [libcfs]
10:18:32: [<f8da2756>] cfs_hash_bd_lookup_intent+0x16/0xf0 [libcfs]
10:18:32: [<f90d1940>] ldlm_cli_cancel_local+0xa0/0x690 [ptlrpc]
10:18:32: [<f90d184e>] ldlm_prepare_lru_list+0x53e/0x590 [ptlrpc]
10:18:32: [<f90d3294>] ldlm_cli_cancel_list_local+0xd4/0x310 [ptlrpc]
10:18:33: [<f90d6960>] ldlm_cancel_lrur_policy+0x0/0x100 [ptlrpc]
10:18:33: [<f90d5263>] ldlm_prep_elc_req+0x613/0x710 [ptlrpc]
10:18:33: [<f91688c3>] req_capsule_init+0x43/0xa0 [ptlrpc]
10:18:33: [<f9168831>] req_capsule_set+0x21/0x70 [ptlrpc]
10:18:33: [<f90d5389>] ldlm_prep_enqueue_req+0x29/0x30 [ptlrpc]
10:18:33: [<f94d7e3b>] osc_enqueue_base+0x21b/0x8f0 [osc]
10:18:33: [<c041ec40>] __wake_up+0x2a/0x3d
10:18:33: [<f9504079>] osc_lock_enqueue+0x259/0xdd0 [osc]
10:18:33: [<f95030b0>] osc_lock_upcall+0x0/0x6e0 [osc]
10:18:33: [<f9af7f93>] cl_lock_state_set+0x93/0x290 [obdclass]
10:18:33: [<f9503e20>] osc_lock_enqueue+0x0/0xdd0 [osc]
10:18:33: [<f9afbdd4>] cl_enqueue_try+0x124/0x5b0 [obdclass]
10:18:33: [<f9269cd4>] lov_sublock_hold+0x24/0x320 [lov]
10:18:33: [<c041ec40>] __wake_up+0x2a/0x3d
10:18:33: [<f926c630>] lov_lock_enqueue+0x180/0xb60 [lov]
10:18:33: [<f926c4b0>] lov_lock_enqueue+0x0/0xb60 [lov]
10:18:33: [<f9afbdd4>] cl_enqueue_try+0x124/0x5b0 [obdclass]
10:18:33: [<f9af7601>] cl_lock_user_add+0x51/0x200 [obdclass]
10:18:33: [<f9afd5d0>] cl_lock_hold_mutex+0x650/0xaf0 [obdclass]
10:18:33: [<f9afccf6>] cl_enqueue_locked+0x86/0x2c0 [obdclass]
10:18:34: [<f9aff81d>] cl_lock_request+0x8d/0x2d0 [obdclass]
10:18:34: [<f9347b60>] cl_glimpse_lock+0x170/0x570 [lustre]
10:18:34: [<f93481ea>] cl_glimpse_size+0x28a/0x2a0 [lustre]
10:18:34: [<f92beecd>] ll_inode_revalidate_it+0x7d/0x380 [lustre]
10:18:34: [<c048570c>] __link_path_walk+0xd89/0xdab
10:18:34: [<f92bf205>] ll_getattr_it+0x35/0x110 [lustre]
10:18:34: [<f92bf31f>] ll_getattr+0x3f/0x60 [lustre]
10:18:34: [<f92bf2e0>] ll_getattr+0x0/0x60 [lustre]
10:18:34: [<c047f797>] vfs_getattr+0x40/0x9b
10:18:34: [<c047f819>] vfs_lstat_fd+0x27/0x39
10:18:34: [<c0462a54>] vma_prio_tree_insert+0x17/0x2a
10:18:34: [<c047f870>] sys_lstat64+0xf/0x23
10:18:34: [<c044cb5b>] audit_syscall_entry+0x18f/0x1b9
10:18:34: [<c0407f5f>] do_syscall_trace+0xab/0xb1
10:18:34: [<c0404f4b>] syscall_call+0x7/0xb
10:18:34: =======================

Comment by Jinshan Xiong (Inactive) [ 29/Jun/11 ]

To fix this issue, we should avoid issuing new I/O from early lock cancellation. This can be done as follows:
1. implement a tagged cl_page radix tree;
2. call ->l_weigh_ast before a lock is about to be early cancelled; in ->l_weigh_ast, check whether there are dirty pages under the lock. If so, skip the lock (we may have a 'washing' process to clean pages).
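The proposed check could be sketched as below. This is a minimal illustration only: `struct demo_lock`, its page counters, and `demo_lock_cancellable` are hypothetical stand-ins, not the real Lustre types (`struct ldlm_lock`, the cl_page radix tree) or the actual `->l_weigh_ast` signature.

```c
#include <stdbool.h>

/* Hypothetical, simplified lock record for illustration; the real
 * Lustre lock carries a cl_page radix tree, not bare counters. */
struct demo_lock {
	int dirty_pages;     /* dirty pages still covered by this lock */
	int writeback_pages; /* covered pages currently in writeback */
};

/* Sketch of an ->l_weigh_ast-style check: a lock that still covers
 * dirty or in-flight pages is "heavy" and should be skipped by early
 * lock cancellation, so that cancelling it cannot trigger new I/O
 * (and deep recursion) from inside the blocking AST. */
static bool demo_lock_cancellable(const struct demo_lock *lock)
{
	return lock->dirty_pages == 0 && lock->writeback_pages == 0;
}
```

With a tagged radix tree, the dirty-page check becomes a cheap tag lookup rather than a page-by-page scan, which is why step 1 precedes step 2.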

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.

Comment by Andreas Dilger [ 06/Sep/19 ]

This was fixed by the patch:

LU-4300 ldlm: ELC picks locks in a safer policy

Change the policy of ELC to pick locks that have no dirty pages, no pages in writeback state, and no locked pages.

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-on: http://review.whamcloud.com/9175
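The ELC policy described above can be sketched as an LRU scan that selects only "clean" locks. This is a hedged illustration under assumed simplified types: `struct demo_lock`, its counters, and `demo_elc_pick` are hypothetical, not the real `ldlm_cancel_lru` / `ldlm_prepare_lru_list` code paths.

```c
#include <stddef.h>

/* Hypothetical, simplified lock record; real code uses struct ldlm_lock. */
struct demo_lock {
	int dirty, writeback, locked; /* covered-page counts */
};

/* Sketch of the safer LU-4300 ELC policy: while scanning the LRU,
 * pick only locks with no dirty, writeback, or locked pages, so that
 * cancelling the chosen locks never forces synchronous page-out
 * (and the deep call chain seen in this ticket's stack trace). */
static size_t demo_elc_pick(const struct demo_lock *lru, size_t n,
			    const struct demo_lock **out, size_t max)
{
	size_t picked = 0;

	for (size_t i = 0; i < n && picked < max; i++)
		if (lru[i].dirty == 0 && lru[i].writeback == 0 &&
		    lru[i].locked == 0)
			out[picked++] = &lru[i];
	return picked;
}
```

Locks that fail the check simply stay on the LRU for the regular cancellation path, which runs with a shallow stack.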
Generated at Sat Feb 10 01:07:18 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.