Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0
    • 3
    • 16220

    Description

      In our testing a Lustre client hangs forever after OST failover. The client had 64 GB of RAM and was running a parallel file test application over a set of 1 GB files. Right before the complete hang, the ptlrpcd_rcv thread was eating 100% CPU.

      A crashdump was taken. It shows the following stack trace for ptlrpcd_rcv:

      crash> foreach ptlrpcd_rcv bt
      PID: 17113  TASK: ffff8806323f8ae0  CPU: 22  COMMAND: "ptlrpcd_rcv"
       #0 [ffff880655547e90] crash_nmi_callback at ffffffff8102fee6
       #1 [ffff880655547ea0] notifier_call_chain at ffffffff8152a965
       #2 [ffff880655547ee0] atomic_notifier_call_chain at ffffffff8152a9ca
       #3 [ffff880655547ef0] notify_die at ffffffff810a12be
       #4 [ffff880655547f20] do_nmi at ffffffff8152862b
       #5 [ffff880655547f50] nmi at ffffffff81527ef0
          [exception RIP: cl_page_put+426]
          RIP: ffffffffa05f428a  RSP: ffff8805d85d9990  RFLAGS: 00000246
          RAX: 0000000000000000  RBX: ffff88047c921800  RCX: ffff8805d85d9ab8
          RDX: 0000000000000000  RSI: ffff88047c921800  RDI: ffff880bc515ee60
          RBP: ffff8805d85d99d0   R8: 0000000000000040   R9: ffff8805d85d99a0
          R10: 0000000000009500  R11: ffff880303aac8f0  R12: ffff880bc515ee60
          R13: 000000000000004a  R14: ffff880bc51d2e90  R15: ffff880bc515ee60
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
      --- <NMI exception stack> ---
       #6 [ffff8805d85d9990] cl_page_put at ffffffffa05f428a [obdclass]
       #7 [ffff8805d85d99d8] osc_page_gang_lookup at ffffffffa0ce3fac [osc]
       #8 [ffff8805d85d9a78] osc_ldlm_weigh_ast at ffffffffa0cdb1b7 [osc]
       #9 [ffff8805d85d9af8] osc_cancel_weight at ffffffffa0cc1a3b [osc]
      #10 [ffff8805d85d9b08] ldlm_cancel_no_wait_policy at ffffffffa07ecec1 [ptlrpc]
      #11 [ffff8805d85d9b28] ldlm_prepare_lru_list at ffffffffa07f0f0b [ptlrpc]
      #12 [ffff8805d85d9ba8] ldlm_cancel_lru_local at ffffffffa07f1324 [ptlrpc]
      #13 [ffff8805d85d9bc8] ldlm_replay_locks at ffffffffa07f14ac [ptlrpc]
      #14 [ffff8805d85d9c48] ptlrpc_import_recovery_state_machine at ffffffffa083cb67 [ptlrpc]
      #15 [ffff8805d85d9ca8] ptlrpc_replay_interpret at ffffffffa0811a84 [ptlrpc]
      #16 [ffff8805d85d9cd8] ptlrpc_check_set at ffffffffa08130ec [ptlrpc]
      #17 [ffff8805d85d9d78] ptlrpcd_check at ffffffffa08404db [ptlrpc]
      #18 [ffff8805d85d9dd8] ptlrpcd at ffffffffa0840b2b [ptlrpc]
      #19 [ffff8805d85d9ee8] kthread at ffffffff8109ac66
      #20 [ffff8805d85d9f48] kernel_thread at ffffffff8100c20a
      crash> 
      

      The hang is caused by an endless loop in osc_lock_weight(): when CLP_GANG_RESCHED is returned from osc_page_gang_lookup(), the page scan is restarted from the very beginning. On the next round CLP_GANG_RESCHED is likely to be returned again if the number of pages to scan is large enough, so the scan never completes.
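
      For illustration, the problematic control flow is roughly the sketch below. It is heavily simplified; scan_pages_in_extent() and the other helper names are hypothetical stand-ins, not the real osc/cl_page API:

      /* Simplified sketch of the failure mode, not actual Lustre source. */
      static unsigned long lock_weight_sketch(pgoff_t start, pgoff_t end)
      {
              unsigned long weight = 0;
              int rc;

              do {
                      /* Scan pages in [start, end] and accumulate the lock
                       * weight; return CLP_GANG_RESCHED when the scan must
                       * yield the CPU before reaching 'end'. */
                      rc = scan_pages_in_extent(start, end, &weight);

                      /* BUG: on reschedule the next pass starts at 'start'
                       * again.  With enough cached pages under the lock,
                       * every pass hits CLP_GANG_RESCHED and this loop
                       * never exits. */
              } while (rc == CLP_GANG_RESCHED);

              return weight;
      }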

      This is confirmed by setting ldlm.cancel_unused_locks_before_replay=0 on the client before OST failover, which cures the hang.
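
      For reference, the workaround is a client-side tunable; assuming the standard lctl interface for setting ldlm parameters, it amounts to:

      # on the client, before OST failover
      lctl set_param ldlm.cancel_unused_locks_before_replay=0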

      Moreover, I have a patch that makes osc_lock_weight() restart the scan at the last processed page index + 1, so an endless loop is impossible. The patch helps as well.
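
      A minimal sketch of that restart logic, reusing the hypothetical helper from the sketch above but with an extra out-parameter that reports the last processed page index:

      static unsigned long lock_weight_fixed_sketch(pgoff_t start, pgoff_t end)
      {
              unsigned long weight = 0;
              pgoff_t idx = start, last_idx;
              int rc;

              do {
                      /* 'last_idx' (hypothetical out-parameter) is the last
                       * page index the scan actually processed. */
                      rc = scan_pages_in_extent(idx, end, &weight, &last_idx);

                      /* Resume just past the last processed page, so forward
                       * progress is guaranteed and the loop must terminate. */
                      idx = last_idx + 1;
              } while (rc == CLP_GANG_RESCHED && idx <= end);

              return weight;
      }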

    Attachments

    Issue Links

    Activity

    [LU-5781] endless loop in osc_lock_weight()

      gerrit Gerrit Updater added a comment -
      Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/15800
      Subject: LU-5781 ldlm: Solve a race for LRU lock cancel
      Project: fs/lustre-release
      Branch: b2_5
      Current Patch Set: 1
      Commit: fa78e0ef004d4569fb5e5bc08ecbe3580b42f1e1

      jaylan Jay Lan (Inactive) added a comment -
      Is there a b2_5 port of http://review.whamcloud.com/12603 ?
      I have a conflict in lustre/ldlm/ldlm_request.c

    People

      Assignee: jay Jinshan Xiong (Inactive)
      Reporter: zam Alexander Zarochentsev
      Votes: 0
      Watchers: 22

    Dates

      Created:
      Updated:
      Resolved: