[LU-2094] LBUG in osc_req_attr_set along write path Created: 05/Oct/12  Updated: 20/Oct/12  Resolved: 20/Oct/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0, Lustre 2.1.1
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Ann Koehler (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Seen on SLES11-SP1 with Lustre 2.1.1 and Lustre 2.2
Also seen on SLES11-SP2 with Lustre 2.2


Issue Links:
Related
is related to LU-2172 osc_req_attr_set LBUG Resolved
Severity: 3
Rank (Obsolete): 4377

 Description   

LBUG reported in LU-1650 brings down client nodes. But the LBUG occurs on a write path rather than a read path. This is the occurrence reported by Michael McKay - 17/Sep/12 in LU-1650.

> 2012-04-20T22:00:48.036542-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff8803f4522358[3 2 0 1 1 00000006] P(0):[0, 18446744073709551615]@[0x100000000:0x4b1f5c0:0x0] {
> 2012-04-20T22:00:48.065914-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     lovsub@ffff8803f525e4e0: [0 ffff880404198470 P(0):[0, 18446744073709551615]@[0x23d979800:0x1:0x0]]
> 2012-04-20T22:00:48.065973-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     osc@ffff8803f4a3ab50: (null) 00001000 0x85ead42e54740750 1 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
> 2012-04-20T22:00:48.065987-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) } lock@ffff8803f4522358
> 2012-04-20T22:00:48.095709-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
> 2012-04-20T22:00:48.199435-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) lock@ffff8803f4522dd8[1 1 0 1 1 00000000] W(2):[0, 0]@[0x100000000:0x4b1f5c0:0x0] {
> 2012-04-20T22:00:48.199467-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     lovsub@ffff8803f7d0a4e0: [0 ffff8807e9ae1a88 W(2):[0, 18446744073709551615]@[0x23d979800:0x1:0x0]]
> 2012-04-20T22:00:48.199478-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set())     osc@ffff8803f5c37b50: (null) 00000000 0x0 0 (null) size: 0 mtime: 0 atime: 0 ctime: 0 blocks: 0
> 2012-04-20T22:00:48.199489-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) } lock@ffff8803f4522dd8
> 2012-04-20T22:00:48.199500-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
> 2012-04-20T22:00:48.199509-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) page@ffff8803f4901680[3 ffff8807eb639b08:0 ^(null)_ffff8803f49015c0 2 0 1 (null) ffff8807e9b9c380 0x0]
> 2012-04-20T22:00:48.199523-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) page@ffff8803f49015c0[1 ffff8807ea301b58:0 ^ffff8803f4901680_(null) 2 0 1 (null) (null) 0x0]
> 2012-04-20T22:00:48.199543-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff8803f7438f50(0:0:0) vm@ffffea000dc16b30 200000000005019 4:0 ffff8803f4901680 0 lru
> 2012-04-20T22:00:48.199556-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff8803f757d978
> 2012-04-20T22:00:48.199600-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff8803f4b317a8: 1< 0x845fed 2 0 - - + > 2< 0 0 2016 0x3 0x420 | (null) ffff880401df2628 ffff8807e847c340 ffffffffa0516a20 ffff8803f4b317a8 > 3< + ffff8807ffdfe0c0 1 39 0 > 4< 0 0 8 33759232 - | - - - - > 5< - - - - | 0 - - | 0 - ->
> 2012-04-20T22:00:48.226323-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff8803f49015c0
> 2012-04-20T22:00:48.226361-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
> 2012-04-20T22:00:48.226372-05:00 c0-0c0s5n1 Pid: 7196, comm: flush-lustre-1
> 2012-04-20T22:00:48.226381-05:00 c0-0c0s5n1 Call Trace:
> 2012-04-20T22:00:48.226418-05:00 c0-0c0s5n1 [<ffffffff81007b11>] try_stack_unwind+0x1b1/0x200
> 2012-04-20T22:00:48.226432-05:00 c0-0c0s5n1 [<ffffffff81006420>] dump_trace+0x90/0x300
> 2012-04-20T22:00:48.255723-05:00 c0-0c0s5n1 [<ffffffffa014b8d2>] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
> 2012-04-20T22:00:48.255753-05:00 c0-0c0s5n1 [<ffffffffa0514536>] osc_req_attr_set+0x206/0x2a0 [osc]
> 2012-04-20T22:00:48.255775-05:00 c0-0c0s5n1 [<ffffffffa024d0ae>] cl_req_attr_set+0xce/0x1f0 [obdclass]
> 2012-04-20T22:00:48.255786-05:00 c0-0c0s5n1 [<ffffffffa0501e77>] osc_send_oap_rpc+0xc67/0x1c50 [osc]
> 2012-04-20T22:00:48.255817-05:00 c0-0c0s5n1 [<ffffffffa0503136>] osc_check_rpcs+0x2d6/0x470 [osc]
> 2012-04-20T22:00:48.255832-05:00 c0-0c0s5n1 [<ffffffffa0514bf0>] osc_io_submit+0x220/0x4d0 [osc]
> 2012-04-20T22:00:48.285929-05:00 c0-0c0s5n1 [<ffffffffa024d3d8>] cl_io_submit_rw+0x78/0x1a0 [obdclass]
> 2012-04-20T22:00:48.285958-05:00 c0-0c0s5n1 [<ffffffffa0575924>] lov_io_submit+0x254/0xb40 [lov]
> 2012-04-20T22:00:48.311494-05:00 c0-0c0s5n1 [<ffffffffa024d3d8>] cl_io_submit_rw+0x78/0x1a0 [obdclass]
> 2012-04-20T22:00:48.311524-05:00 c0-0c0s5n1 [<ffffffffa05e8424>] ll_writepage+0x2c4/0x480 [lustre]
> 2012-04-20T22:00:48.311557-05:00 c0-0c0s5n1 [<ffffffff810d0b22>] __writepage+0x12/0x50
> 2012-04-20T22:00:48.311580-05:00 c0-0c0s5n1 [<ffffffff810d1213>] write_cache_pages+0x1d3/0x3b0
> 2012-04-20T22:00:48.311592-05:00 c0-0c0s5n1 [<ffffffff810d140f>] generic_writepages+0x1f/0x30
> 2012-04-20T22:00:48.337108-05:00 c0-0c0s5n1 [<ffffffff810d143c>] do_writepages+0x1c/0x40
> 2012-04-20T22:00:48.337139-05:00 c0-0c0s5n1 [<ffffffff8112c422>] writeback_single_inode+0x92/0x2c0
> 2012-04-20T22:00:48.337195-05:00 c0-0c0s5n1 [<ffffffff8112cfca>] writeback_inodes_wb+0x26a/0x3e0
> 2012-04-20T22:00:48.337213-05:00 c0-0c0s5n1 [<ffffffff8112d2e0>] wb_writeback+0x1a0/0x260
> 2012-04-20T22:00:48.337259-05:00 c0-0c0s5n1 [<ffffffff8112d67c>] wb_do_writeback+0x1fc/0x210
> 2012-04-20T22:00:48.337274-05:00 c0-0c0s5n1 [<ffffffff8112d6cb>] bdi_writeback_task+0x3b/0xc0
> 2012-04-20T22:00:48.337286-05:00 c0-0c0s5n1 [<ffffffff810e2321>] bdi_start_fn+0x81/0xf0
> 2012-04-20T22:00:48.362673-05:00 c0-0c0s5n1 [<ffffffff81063dae>] kthread+0x8e/0xa0
> 2012-04-20T22:00:48.362703-05:00 c0-0c0s5n1 [<ffffffff81003d3a>] child_rip+0xa/0x20
> 2012-04-20T22:00:48.362745-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:702:osc_req_attr_set()) LBUG

These messages are the result of executing this code path in osc_req_attr

This bug occurs frequently during system regression testing, but we have no explicit reproducer.

The dumps are too large to attach here, so I will put them on the Cray anonymous ftp server.

2.1 Dump: ftp.cray.com:/outbound/LU1650.21.osc_req_attr_set.lbug.tgz
2.2 Dump: ftp.cray.com:/outbound/LU1650.22.osc_req_attr_set.lbug.tgz

Let me know if you want other dumps or if you want me to look for other info in our collection of dumps. Note that dklogs are NOT available; performance issues require that debug tracing be kept to a minimum.



 Comments   
Comment by Michael McKay (Inactive) [ 09/Oct/12 ]

This is the description of what I have seen from one of the cores
> 2012-04-20T22:00:48.362745-05:00 c0-0c0s5n1 LustreError: 7196:0:(osc_io.c:702:osc_req_attr_set()) LBUG
In this case the object has 2 locks:
LOCK1: is a CLM_PHANTOM lock for the whole file that is being Cancelled
LOCK2: is a WRITE lock for the index that this write request wants, but is currently in the CLS_QUEUING state and is blocked on LOCK1 while its being cancelled.

From this information we can see that we are currently writing out an inode even though we do not have a write/read lock for this object. The CLM_PHANTOM lock is being cancelled at this point in time. This is most likely a case of trying to write something out without ever getting a write lock first, or we have an extra release the write lock.

Comment by Jinshan Xiong (Inactive) [ 20/Oct/12 ]

duplicate of LU-2170

Generated at Sat Feb 10 01:22:19 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.