[LU-77] cl_page.c::cl_page_own0() assertion in echoclient Created: 09/Feb/11  Updated: 11/May/11  Resolved: 05/May/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: Lustre 2.1.0

Type: Bug
Priority: Blocker
Reporter: Oleg Drokin
Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Bugzilla ID: 24361
Rank (Obsolete): 5052

 Description   

Oracle reports this assertion failure when running obdfilter-survey:
obdfilter-survey test 2a hung and hit the following LBUG on one of the client nodes:

Lustre: DEBUG MARKER: == obdfilter-survey test 2a: Stripe F/S over the Network ============================================= 08:40:49 (1292686849)
Lustre: 8086:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0000_osc->host_0_UUID netid 50000: select flavor null
Lustre: 8086:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 5 previous similar messages
LustreError: 8309:0:(osc_request.c:773:osc_announce_cached()) dirty 11296 - 11297 > system dirty_max 589824
LustreError: 8290:0:(osc_request.c:773:osc_announce_cached()) dirty 12006 - 12007 > system dirty_max 589824
LustreError: 8302:0:(osc_request.c:773:osc_announce_cached()) dirty 12051 - 12052 > system dirty_max 589824
LustreError: 8306:0:(osc_request.c:773:osc_announce_cached()) dirty 5853 - 5854 > system dirty_max 589824
LustreError: 8400:0:(osc_request.c:773:osc_announce_cached()) dirty 10889 - 10890 > system dirty_max 589824
LustreError: 8388:0:(osc_request.c:773:osc_announce_cached()) dirty 9779 - 9780 > system dirty_max 589824
LustreError: 8387:0:(osc_request.c:773:osc_announce_cached()) dirty 4950 - 4951 > system dirty_max 589824
LustreError: 8387:0:(osc_request.c:773:osc_announce_cached()) Skipped 1 previous similar message
LustreError: 8517:0:(osc_request.c:773:osc_announce_cached()) dirty 10796 - 10797 > system dirty_max 589824
LustreError: 8517:0:(osc_request.c:773:osc_announce_cached()) Skipped 1 previous similar message
LustreError: 8756:0:(cl_page.c:986:cl_page_own0()) page@ffff81010a07ccc0[2 ffff810063a3ecd0:0 ^0000000000000000_0000000000000000 1 0 2 ffff81010b699610 0000000000000000 0x0]
LustreError: 8756:0:(cl_page.c:986:cl_page_own0()) echo_client-page@ffff81010a24bf78 vm@ffff810101e03cc8
LustreError: 8756:0:(cl_page.c:986:cl_page_own0()) osc-page@ffff810109a339b8: 1< 0x845fed 258 0 - - - > 2< 0 0 0x0 0x308 | 0000000000000000 ffff8100614308e8 ffff810066e8f600 ffffffff889451c0 ffff810109a339b8 > 3< - ffff81011ff8e040 0 0 1 > 4< 0 7 8 39845888 - | + - + - > 5< + - + - | 0 - - 512 + +>
LustreError: 8756:0:(cl_page.c:986:cl_page_own0()) end page@ffff81010a07ccc0
LustreError: 8756:0:(cl_page.c:986:cl_page_own0()) pg->cp_owner == NULL
LustreError: 8756:0:(cl_page.c:986:cl_page_own0()) ASSERTION(0) failed
LustreError: 8756:0:(cl_page.c:986:cl_page_own0()) LBUG
Pid: 8756, comm: lctl

Call Trace:
[<ffffffff885b85f1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
[<ffffffff885b8b2a>] lbug_with_loc+0x7a/0xd0 [libcfs]
[<ffffffff885c3960>] cfs_tracefile_init+0x0/0x10a [libcfs]
[<ffffffff886ab720>] cl_page_own0+0x1a0/0x2f0 [obdclass]
[<ffffffff88ac7801>] echo_client_brw_ioctl+0x1531/0x1cd0 [obdecho]
[<ffffffff8000d47a>] dput+0x2c/0x114
[<ffffffff88066381>] nfs_lookup_revalidate+0x2be/0x443 [nfs]
[<ffffffff88acaf50>] echo_client_iocontrol+0x1360/0x1b00 [obdecho]
[<ffffffff800cc354>] zone_statistics+0x3e/0x6d
[<ffffffff800d1707>] __vmalloc_area_node+0x12e/0x156
[<ffffffff88654e17>] obd_ioctl_getdata+0x5b7/0xeb0 [obdclass]
[<ffffffff8002c9bc>] mntput_no_expire+0x19/0x89
[<ffffffff8866965c>] class_handle_ioctl+0x1dcc/0x2160 [obdclass]
[<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
[<ffffffff8000d9e4>] permission+0x8d/0xc8
[<ffffffff801aaaeb>] misc_open+0x16c/0x260
[<ffffffff8865457a>] obd_class_ioctl+0x19a/0x230 [obdclass]
[<ffffffff80064c7d>] lock_kernel+0x1b/0x32
[<ffffffff8004217f>] do_ioctl+0x55/0x6b
[<ffffffff800301de>] vfs_ioctl+0x457/0x4b9
[<ffffffff800b76a3>] audit_syscall_entry+0x180/0x1b3
[<ffffffff8004c607>] sys_ioctl+0x59/0x78
[<ffffffff8005d28d>] tracesys+0xd5/0xe0

Kernel panic - not syncing: LBUG

Eric Mei comments that obdecho threads are apparently sharing pages that they are not supposed to share.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 16/Feb/11 ]

I'm quite sure that obdfilter-survey was doing a rewrite when it hit this assertion.

The root cause of this problem is that the previous write has not yet finished when a rewrite to the same object arrives. The previous write is still pending because the osc is busy (it has 7 write RPCs in flight), but the reason why the osc is so busy is unknown.

Maybe we can fix this problem by introducing a page lock for echo_page. This way, an upcoming write to the same page will block until the previous write has finished with it.
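The idea above can be illustrated with a small user-space model. This is only a sketch in Python, not the patch that landed in lustre/obdecho/echo_client.c, and it makes several assumptions: `EchoPage`, `own()`, `disown()` and the `lock` field are hypothetical names invented for the example, the `owner` attribute merely stands in for `pg->cp_owner`, and the in-flight write RPC is simulated by simply holding the page. It shows a rewrite tripping the ownership check when nothing serializes access, and the same rewrite waiting its turn once a per-page lock is taken first.

```python
import threading


class EchoPage:
    """Toy stand-in for an echo client page (hypothetical, not the Lustre struct)."""

    def __init__(self):
        self.owner = None             # models pg->cp_owner
        self.lock = threading.Lock()  # hypothetical per-page lock

    def own(self, who):
        # Mirrors the failed check in the log: the page must be unowned.
        if self.owner is not None:
            raise AssertionError("pg->cp_owner == NULL")
        self.owner = who

    def disown(self):
        self.owner = None


def demo_unlocked():
    """A rewrite arrives while the first write still owns the page."""
    page = EchoPage()
    page.own("write-1")       # first write is still in flight
    try:
        page.own("write-2")   # rewrite to the same page
    except AssertionError:
        return "assertion"
    return "ok"


def demo_locked():
    """Same scenario, but every writer takes the page lock before owning."""
    page = EchoPage()
    order = []

    def rewrite():
        with page.lock:       # blocks while write-1 holds the lock
            page.own("write-2")
            order.append("write-2 owned")
            page.disown()

    page.lock.acquire()
    page.own("write-1")
    order.append("write-1 owned")

    t = threading.Thread(target=rewrite)
    t.start()
    t.join(timeout=0.2)       # the rewrite cannot make progress yet
    blocked = t.is_alive()

    page.disown()             # write-1 completes
    order.append("write-1 done")
    page.lock.release()
    t.join()
    return blocked, order


if __name__ == "__main__":
    print(demo_unlocked())
    print(demo_locked())
```

In the locked variant the second writer never reaches the ownership check while the first write holds the page, so the assertion cannot fire. The cost is that rewrites to a busy page now wait instead of failing fast.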

Comment by Jian Yu [ 01/Apr/11 ]

While running obdfilter-survey test 2a on the latest stable CentOS5/x86_64 master build (#139 for client, #178 for server), I also hit the same LBUG on the client node. Here is the syslog:

Apr  1 05:47:45 client-4 kernel: Lustre: DEBUG MARKER: == obdfilter-survey test 2a: Stripe F/S over the Network ============================================= 05:47:45 (1301662065)
Apr  1 05:47:45 client-4 xinetd[3129]: EXIT: shell status=0 pid=8025 duration=0(sec)
Apr  1 05:47:45 client-4 kernel: Lustre: 8139:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import lustre-OST0000_osc->host_0_UUID netid 50000: select flavor null
Apr  1 05:47:45 client-4 kernel: Lustre: 8139:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 5 previous similar messages

Message from syslogd@ at Fri Apr  1 05:48:14 2011 ...
client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) ASSERTION(0) failed

Message from syslogd@ at Fri Apr  1 05:48:14 2011 ...
client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) LBUG
Apr  1 05:48:13 client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) page@ffff8103084c97b8[2 ffff810318366bc8:119002 ^0000000000000000_0000000000000000 1 0 2 ffff81030bf9cb38 0000000000000000 0x0]
Apr  1 05:48:13 client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) echo_client-page@ffff8103080a3198 vm@ffff81010aa26e18
Apr  1 05:48:14 client-4 kernel: format at cl_page.c:986:cl_page_own0 doesn't end in newline
Apr  1 05:48:14 client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) osc-page@ffff810308783280: 1< 0x845fed 258 0 - - - > 2< 487432192 0 0x0 0x308 | 0000000000000000 ffff81030fe485e8 ffff81030bf89c00 ffffffff88a37be0 ffff810308783280 > 3< - ffff810331d85820 0 0 1 > 4< 0 7 8 28311552 - | - - - - > 5< - - - - | 0 - - | 0 - -<3>LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) end page@ffff8103084c97b8
Apr  1 05:48:14 client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) pg->cp_owner == NULL
Apr  1 05:48:14 client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) ASSERTION(0) failed
Apr  1 05:48:14 client-4 kernel: LustreError: 8354:0:(cl_page.c:986:cl_page_own0()) LBUG
Apr  1 05:48:14 client-4 kernel: Pid: 8354, comm: lctl
Comment by Jinshan Xiong (Inactive) [ 26/Apr/11 ]

I've verified the patch at http://review.whamcloud.com/#change,462 and it fixes the problem. Without this patch, obdfilter-survey.sh hits this problem often; with the patch applied, the problem has not been hit again.

Comment by Jinshan Xiong (Inactive) [ 04/May/11 ]

The AutoTest passed at: https://maloo.whamcloud.com/test_sessions/c8fd045c-75f4-11e0-a1b3-52540025f9af

There is a failure on ost-pools:test-18. I think this is a known issue, and I have filed a ticket at LU-276.

Please check it.

Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,client,el6,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,client,el5,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,server,el5,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » i686,client,el6,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,client,el5,ofa #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,server,el6,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,client,ubuntu1004,ofa #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » x86_64,server,el5,ofa #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » i686,server,el6,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » i686,client,el5,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » i686,client,el5,ofa #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » i686,server,el5,inkernel #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Build Master (Inactive) [ 05/May/11 ]

Integrated in lustre-master » i686,server,el5,ofa #107
LU-77 cl_page.c::cl_page_own0() assertion in echoclient

Oleg Drokin : 8861ce8829752d29ef6afd49b5e046f306d93b5e
Files :

  • lustre/obdecho/echo_client.c
Comment by Peter Jones [ 05/May/11 ]

Landed for 2.1. Please reopen if this issue recurs with this fix in place.

Comment by Sarah Liu [ 11/May/11 ]

Ran 8 tests, including obdfilter-survey; all passed.

https://maloo.whamcloud.com/test_sets/0e790810-7c34-11e0-b5bf-52540025f9af
