[LU-8211] LustreError: 9676:0:(nrs_tbf.c:89:nrs_tbf_rule_fini()) ASSERTION( list_empty(&rule->tr_cli_list) ) failed: Created: 27/May/16  Updated: 22/Sep/16  Resolved: 22/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Emoly Liu
Resolution: Incomplete Votes: 0
Labels: None
Environment:

2.7.1-fe


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Enabling TBF and then disabling it without creating any rules triggers an LBUG.

LustreError: 9676:0:(nrs_tbf.c:89:nrs_tbf_rule_fini()) ASSERTION( list_empty(&rule->tr_cli_list) ) failed:
LustreError: 9676:0:(nrs_tbf.c:89:nrs_tbf_rule_fini()) LBUG
Pid: 9676, comm: lctl

Call Trace:
 [<ffffffffa040e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa040ee97>] lbug_with_loc+0x47/0xb0 [libcfs]


 Comments   
Comment by Peter Jones [ 27/May/16 ]

Emoly

Could you please advise?

Thanks

Peter

Comment by Qian Yingjin (Inactive) [ 29/May/16 ]

I have tried the following commands:
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
ost.OSS.ost_io.nrs_policies=tbf nid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="fifo"
ost.OSS.ost_io.nrs_policies=fifo
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_policies="tbf nid"
ost.OSS.ost_io.nrs_policies=tbf nid
[root@QYJ tests]# lctl set_param ost.OSS.ost_io.nrs_tbf_rule="stop A"
but they didn't trigger any panic. Could you please post the exact commands you ran?

Thanks,
Qian

Comment by Li Xi (Inactive) [ 29/May/16 ]

Hi Mahmoud,

Is the following patch merged into your branch?

LU-6939 nrs: add lock to protect TBF rule linkage

Comment by Emoly Liu [ 30/May/16 ]

Mahmoud, as Qian and Li Xi said, could you please provide more information on how to reproduce this LBUG, and check your branch? Thanks.

Comment by Jay Lan (Inactive) [ 31/May/16 ]

Hi Li Xi, we do not have the LU-6939 patch cherry-picked to our tree. We rebase to b2_7_fe periodically, but LU-6939 has not landed to b2_7_fe yet.

Since b2_7_fe is private, our nas-2.7.1 repo is private as well:
https://github.com/NASAEarthExchange/lustre-nas-fe/commits/nas-2.7.1

Either give me your GitHub ID and I can add you to the member list, or you can use Peter Jones' ID to access the tree.

Comment by Mahmoud Hanafi [ 31/May/16 ]

Is there a way for us to get a complete list of landed NRS/TBF patches that are not in 2.7.1-fe? I don't want to keep opening LU tickets for already-landed patches. For example, I just hit a server lockup; the backtrace showed this:

PID: 32437  TASK: ffff881032c35520  CPU: 3   COMMAND: "ldlm_cn02_023"
 #0 [ffff88085c426e90] crash_nmi_callback at ffffffff81032256
 #1 [ffff88085c426ea0] notifier_call_chain at ffffffff81568515
 #2 [ffff88085c426ee0] atomic_notifier_call_chain at ffffffff8156857a
 #3 [ffff88085c426ef0] notify_die at ffffffff810a44fe
 #4 [ffff88085c426f20] do_nmi at ffffffff8156618f
 #5 [ffff88085c426f50] nmi at ffffffff815659f0
    [exception RIP: _spin_lock+30]
    RIP: ffffffff8156525e  RSP: ffff88085c423e78  RFLAGS: 00000097
    RAX: 0000000000004603  RBX: ffff880f5ea080e0  RCX: 0000000000000000
    RDX: 0000000000004602  RSI: ffff88085c430658  RDI: ffff880f5ea080e0
    RBP: ffff88085c423e78   R8: 0000000000000000   R9: 0000000000000001
    R10: 000000000000002c  R11: 00000000000000b4  R12: ffff880f5ea08000
    R13: ffff88085c430600  R14: ffff88085c423f28  R15: ffffffffa07e4b50
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88085c423e78] _spin_lock at ffffffff8156525e
 #7 [ffff88085c423e80] nrs_tbf_timer_cb at ffffffffa07e4b7a [ptlrpc]
 #8 [ffff88085c423ea0] __run_hrtimer at ffffffff810a29ae
 #9 [ffff88085c423ef0] hrtimer_interrupt at ffffffff810a2d46
#10 [ffff88085c423f70] local_apic_timer_interrupt at ffffffff8103433d
#11 [ffff88085c423f90] smp_apic_timer_interrupt at ffffffff8156ae25
#12 [ffff88085c423fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#13 [ffff880d87207c68] apic_timer_interrupt at ffffffff8100bc13
    [exception RIP: nrs_resource_get_safe+88]
    RIP: ffffffffa07d7358  RSP: ffff880d87207d10  RFLAGS: 00000206
    RAX: 0000000000004602  RBX: ffff880d87207d40  RCX: 0000000000000000
    RDX: 0000000000004602  RSI: ffff880efb387778  RDI: ffff880fd2033140
    RBP: ffffffff8100bc0e   R8: ffff880f5ea080e0   R9: 0000000000000000
    R10: 000000000000002c  R11: 00000000000000b4  R12: ffff880d87207c90
    R13: ffff880e4c2fc3d8  R14: ffffffffa086b640  R15: ffff881000000007
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#14 [ffff880d87207d48] ptlrpc_nrs_req_initialize at ffffffffa07d9dab [ptlrpc]
#15 [ffff880d87207d68] ptlrpc_server_handle_req_in at ffffffffa079c647 [ptlrpc]
#16 [ffff880d87207da8] ptlrpc_main at ffffffffa07a484c [ptlrpc]
#17 [ffff880d87207ee8] kthread at ffffffff8109dc8e
#18 [ffff880d87207f48] kernel_thread at ffffffff8100c28a



PID: 32405  TASK: ffff880f2c04eab0  CPU: 5   COMMAND: "ldlm_cn02_004"
 #0 [ffff88085c446e90] crash_nmi_callback at ffffffff81032256
 #1 [ffff88085c446ea0] notifier_call_chain at ffffffff81568515
 #2 [ffff88085c446ee0] atomic_notifier_call_chain at ffffffff8156857a
 #3 [ffff88085c446ef0] notify_die at ffffffff810a44fe
 #4 [ffff88085c446f20] do_nmi at ffffffff8156618f
 #5 [ffff88085c446f50] nmi at ffffffff815659f0
    [exception RIP: _spin_lock+33]
    RIP: ffffffff81565261  RSP: ffff880af029dd00  RFLAGS: 00000293
    RAX: 0000000000004605  RBX: ffff880efb387478  RCX: 0000000000000000
    RDX: 0000000000004602  RSI: ffff880efb387478  RDI: ffff880f5ea080e0
    RBP: ffff880af029dd00   R8: ffff880f5ea080e0   R9: 0000000000000000
    R10: 0000000000000031  R11: 00000000000000b4  R12: ffff880efb387478
    R13: ffff880eb3ae53c0  R14: ffff880f5ea080e0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff880af029dd00] _spin_lock at ffffffff81565261
 #7 [ffff880af029dd08] nrs_resource_get_safe at ffffffffa07d7341 [ptlrpc]
 #8 [ffff880af029dd48] ptlrpc_nrs_req_initialize at ffffffffa07d9dab [ptlrpc]
 #9 [ffff880af029dd68] ptlrpc_server_handle_req_in at ffffffffa079c647 [ptlrpc]
#10 [ffff880af029dda8] ptlrpc_main at ffffffffa07a484c [ptlrpc]
#11 [ffff880af029dee8] kthread at ffffffff8109dc8e
Comment by Li Xi (Inactive) [ 04/Jun/16 ]

Hi, Mahmoud

The stack dump looks like the following issue, which has a patch already merged into the master branch. Could you please check whether your branch includes it?
https://jira.hpdd.intel.com/browse/LU-5717

Also, please check whether the following patches are merged:

LU-6939 nrs: add lock to protect TBF rule linkage
LU-5717 ptlrpc: fix deadlock problem of nrs_tbf_timer_cb (this problem)
LU-6921 test: failed to operate on TBF rules (only test bug)
LU-5580 ptlrpc: policy switch directly in tbf (feature improvement)
LU-5379 ptlrpc: return 0 if buf in struct seq_file is overflow (when too many rules are defined, accessing nrs_tbf_rule would return an error)
LU-5320: fixes for errors found by coccinelle (a lock error in TBF, which might cause deadlock)

The following patches are still under review, but are worth merging as improvements:
LU-7470 nrs: extend TBF with NID/JobID expression
LU-8006 ptlrpc: specify ordering of TBF policy rules
LU-8006 ptlrpc: cleanup codes of TBF command

Also, if client-side QoS is useful for your use case, please check:
https://jira.hpdd.intel.com/browse/LU-7982

Comment by Li Xi (Inactive) [ 04/Jun/16 ]

Hi Jay,

Sorry for missing your message. My GitHub ID is ddn-lixi. Would you please add me to the group? Thanks!

Comment by Jay Lan (Inactive) [ 06/Jun/16 ]

Our production systems are running 2.7.1-4.1nasS. That build does not have most of the NRS TBF patches.

Our next image on deck to be installed is 2.7.1-5nasS. It has all the landed patches except LU-6939, but none of the unlanded ones.

Our 2.7.2-1nas test images also include LU-6939, but none of the unlanded ones.

Comment by Jay Lan (Inactive) [ 06/Jun/16 ]

Hi Li Xi,

I am sure you have access privileges to b2_7_fe, but I just want to double-check before adding you to our lustre-nas-fe repo access list.

Comment by Mahmoud Hanafi [ 22/Sep/16 ]

Can be closed
