[LU-5062] LBUG: osc_req_attr_set Created: 14/May/14  Updated: 20/Jan/15  Resolved: 24/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Patrick Farrell (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: osc, readahead
Environment:

SLES11SP3 clients, master (v2_5_58_0-105-gb7f1952), Cray stress testing.


Attachments: HTML File log     File log.sort     File lustre-ext.so    
Issue Links:
Related
is related to LU-5685 (cl_lock.c:1128:cl_use_try()) ASSERTI... Closed
Severity: 3
Rank (Obsolete): 13981

 Description   

Cray has recently begun stress testing master clients in preparation for 2.6.0.

We hit this bug on several nodes during a stress run last night:

2014-05-13T23:00:06.485598-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) page@ffff880221781c00[4 ffff880222618600 3 0 1           (null) ffff88022328d740 0x0]
2014-05-13T23:00:06.485630-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) vvp-page@ffff880221781ca0(0:0:0) vm@ffffea000e1502a0 200000000000801 4:0 ffff880221781c00 0 lru
2014-05-13T23:00:06.485637-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) lov-page@ffff880221781cf8, raid0
2014-05-13T23:00:06.485672-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) osc-page@ffff880221781d60 0: 1< 0x845fed 1 0 + + > 2< 0 0 4096 0x7 0x8 |           (null) ffff88044ad844f8 ffff88044d7a59e8 > 3< + ffff88044e055080 1 16 0 > 4< 0 0 8 35586048 - | - - - + > 5< - - - + | 0 - | 0 - ->
2014-05-13T23:00:06.485681-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) end page@ffff880221781c00
2014-05-13T23:00:06.485689-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) dump uncover page!
2014-05-13T23:00:06.485700-05:00 c1-0c0s2n1 Pid: 14298, comm: nsystst
2014-05-13T23:00:06.485710-05:00 c1-0c0s2n1 Call Trace:
2014-05-13T23:00:06.485724-05:00 c1-0c0s2n1 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
2014-05-13T23:00:06.485746-05:00 c1-0c0s2n1 [<ffffffff81004919>] dump_trace+0x89/0x450
2014-05-13T23:00:06.485757-05:00 c1-0c0s2n1 [<ffffffffa023b8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2014-05-13T23:00:06.485784-05:00 c1-0c0s2n1 [<ffffffffa092f348>] osc_req_attr_set+0x6b8/0x740 [osc]
2014-05-13T23:00:06.485796-05:00 c1-0c0s2n1 [<ffffffffa038dbf1>] cl_req_attr_set+0xd1/0x220 [obdclass]
2014-05-13T23:00:06.485807-05:00 c1-0c0s2n1 [<ffffffffa091a6da>] osc_build_rpc+0x4fa/0x15d0 [osc]
2014-05-13T23:00:06.485817-05:00 c1-0c0s2n1 [<ffffffffa093c199>] osc_io_unplug0+0x1419/0x2020 [osc]
2014-05-13T23:00:06.485837-05:00 c1-0c0s2n1 [<ffffffffa093ddb1>] osc_io_unplug+0x11/0x20 [osc]
2014-05-13T23:00:06.485844-05:00 c1-0c0s2n1 [<ffffffffa093dfa1>] osc_queue_sync_pages+0x1e1/0x380 [osc]
2014-05-13T23:00:06.485852-05:00 c1-0c0s2n1 [<ffffffffa092f90d>] osc_io_submit+0x2dd/0x4e0 [osc]
2014-05-13T23:00:06.485871-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
2014-05-13T23:00:06.485881-05:00 c1-0c0s2n1 [<ffffffffa076123e>] lov_io_submit+0x2ee/0x4d0 [lov]
2014-05-13T23:00:06.485894-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
2014-05-13T23:00:06.485900-05:00 c1-0c0s2n1 [<ffffffffa038f3fa>] cl_io_read_page+0x19a/0x1b0 [obdclass]
2014-05-13T23:00:06.485907-05:00 c1-0c0s2n1 [<ffffffffa081bf7c>] ll_readpage+0x16c/0x1b0 [lustre]
2014-05-13T23:00:06.485920-05:00 c1-0c0s2n1 [<ffffffff810fdc68>] generic_file_aio_read+0x268/0x740
2014-05-13T23:00:06.485932-05:00 c1-0c0s2n1 [<ffffffffa084e0b0>] vvp_io_read_start+0x2a0/0x430 [lustre]
2014-05-13T23:00:06.485938-05:00 c1-0c0s2n1 [<ffffffffa038d6b2>] cl_io_start+0x72/0x140 [obdclass]
2014-05-13T23:00:06.485947-05:00 c1-0c0s2n1 [<ffffffffa0391234>] cl_io_loop+0xb4/0x1b0 [obdclass]
2014-05-13T23:00:06.485959-05:00 c1-0c0s2n1 [<ffffffffa07e9ccc>] ll_file_io_generic+0x58c/0x8a0 [lustre]
2014-05-13T23:00:06.485967-05:00 c1-0c0s2n1 [<ffffffffa07ea49c>] ll_file_aio_read+0x22c/0x290 [lustre]
2014-05-13T23:00:06.485994-05:00 c1-0c0s2n1 [<ffffffff8115517b>] do_sync_readv_writev+0xdb/0x120
2014-05-13T23:00:06.486000-05:00 c1-0c0s2n1 [<ffffffff81156084>] do_readv_writev+0xd4/0x1e0
2014-05-13T23:00:06.486010-05:00 c1-0c0s2n1 [<ffffffff811563c5>] vfs_readv+0x45/0x60
2014-05-13T23:00:06.486016-05:00 c1-0c0s2n1 [<ffffffff81156505>] sys_readv+0x55/0xc0
2014-05-13T23:00:06.486023-05:00 c1-0c0s2n1 [<ffffffff8142286b>] system_call_fastpath+0x16/0x1b
2014-05-13T23:00:06.486031-05:00 c1-0c0s2n1 [<000000002006d41b>] 0x2006d41b
2014-05-13T23:00:06.510785-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:826:osc_req_attr_set()) LBUG

I'll make the dump+logs available shortly.

I note this bug was seen by Oleg during patch review for the readahead changes in LU-3321; see his comments here:

http://review.whamcloud.com/#/c/8523/

I also note a previous instance of this assertion was fixed in LU-1650:
https://jira.hpdd.intel.com/browse/LU-1650
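
For context, the LBUG here fires when osc_req_attr_set() cannot find any DLM lock covering the page being packed into the RPC. A rough sketch of that check (function names and signatures approximate to the osc_io.c code of this era, not the verbatim source):

static int osc_req_attr_set_sketch(const struct lu_env *env,
                                   struct cl_page *apage, __u64 flags)
{
        struct cl_lock *lock;

        if (flags & OBD_MD_FLHANDLE) {
                /* look up a cached lock covering this page's offset */
                lock = cl_lock_at_page(env, apage->cp_obj, apage,
                                       NULL, 1, 1);
                if (lock == NULL) {
                        /* no lock covers the page: dump its state and
                         * assert -- this produces the "dump uncover
                         * page!" message followed by the LBUG seen in
                         * the log above */
                        CL_PAGE_DEBUG(D_ERROR, env, apage,
                                      "dump uncover page!\n");
                        dump_stack();
                        LBUG();
                }
        }
        return 0;
}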



 Comments   
Comment by Peter Jones [ 14/May/14 ]

Patrick

How up to date is your master? Are any patches applied?

Peter

Comment by Patrick Farrell (Inactive) [ 14/May/14 ]

Peter -

Current as of mid-day yesterday, no patches. We're doing this stress testing to try to help improve release quality, so we're doing it unmodified.

Most recent commit is:

commit b7f1952882e9e133b43acb22d010767c664539b9
Author: Mikhail Pershin <mike.pershin@intel.com>
Date: Thu May 8 22:29:24 2014 +0400

LU-4916 lwp: don't add connection for missed LWP

The LWP isn't needed for some MDT-MDT connections,
e.g. for MDT0-MDT1, but lustre_lwp_add_conn() is called
for any 'add mdc' marker and cause error.

Patch add LWP connection only where it is needed.

Looking at commits since then, the only possibly related commit is one for LU-4977, but that's not really down this code path.

Comment by Patrick Farrell (Inactive) [ 14/May/14 ]

Dump is up at:
ftp.whamcloud.com:/uploads/LU-5062/LU_5062_140514.tar.gz

Comment by Jodi Levi (Inactive) [ 15/May/14 ]

Jinshan,
Can you please comment on this one?
Thank you!

Comment by Jinshan Xiong (Inactive) [ 15/May/14 ]

Do you have patch http://review.whamcloud.com/10220 applied on your tree? Probably this is a known problem.

Comment by Patrick Farrell (Inactive) [ 15/May/14 ]

Jinshan,

From the git history, it looks like we had that commit... But it does look like it would fix the problem, so it's possible I had older source than I thought I did. I will test again and if we continue to see this problem, I'll let you know.

Comment by Jinshan Xiong (Inactive) [ 15/May/14 ]

Patrick, yes, please try it. Thank you very much for helping us to improve our release quality.

Comment by Patrick Farrell (Inactive) [ 15/May/14 ]

OK, same software environment (SLES11SP3 client, Cray stress testing), current master - 2.5.59 tag.

Still hit it, and with the same test:

2014-05-15T17:18:35.970514-05:00 c0-0c2s0n3 LustreError: 28354:0:(osc_io.c:824:osc_req_attr_set()) page@ffff880b2372cc00[4 ffff880b2040eb10 3 0 1 (null) ffff881037c8ac00 0x0]
2014-05-15T17:18:35.997237-05:00 c0-0c2s0n3 LustreError: 28354:0:(osc_io.c:824:osc_req_attr_set()) vvp-page@ffff880b2372cca0(0:0:0) vm@ffffea0038eaadb8 600000000000801 4:0 ffff880b2372cc00 0 lru
2014-05-15T17:18:36.022514-05:00 c0-0c2s0n3 LustreError: 28354:0:(osc_io.c:824:osc_req_attr_set()) lov-page@ffff880b2372ccf8, raid0
2014-05-15T17:18:36.054138-05:00 c0-0c2s0n3 LustreError: 28354:0:(osc_io.c:824:osc_req_attr_set()) osc-page@ffff880b2372cd60 0: 1< 0x845fed 1 0 + + > 2< 0 0 4096 0x7 0x8 | (null) ffff881837716578 ffff880200914ea8 > 3< + ffff8810379f7800 1 16 0 > 4< 0 0 8 43520000 - | - - - + > 5< - - - + | 0 - | 0 - ->
2014-05-15T17:18:36.079394-05:00 c0-0c2s0n3 LustreError: 28354:0:(osc_io.c:824:osc_req_attr_set()) end page@ffff880b2372cc00
2014-05-15T17:18:36.104691-05:00 c0-0c2s0n3 LustreError: 28354:0:(osc_io.c:824:osc_req_attr_set()) dump uncover page!
2014-05-15T17:18:36.104732-05:00 c0-0c2s0n3 Pid: 28354, comm: nsystst
2014-05-15T17:18:36.104746-05:00 c0-0c2s0n3 Call Trace:
2014-05-15T17:18:36.104756-05:00 c0-0c2s0n3 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
2014-05-15T17:18:36.129940-05:00 c0-0c2s0n3 [<ffffffff81004939>] dump_trace+0x89/0x450
2014-05-15T17:18:36.129981-05:00 c0-0c2s0n3 [<ffffffffa02218c7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2014-05-15T17:18:36.129993-05:00 c0-0c2s0n3 [<ffffffffa0918348>] osc_req_attr_set+0x6b8/0x740 [osc]
2014-05-15T17:18:36.155180-05:00 c0-0c2s0n3 [<ffffffffa0374be1>] cl_req_attr_set+0xd1/0x220 [obdclass]
2014-05-15T17:18:36.155220-05:00 c0-0c2s0n3 [<ffffffffa09036da>] osc_build_rpc+0x4fa/0x15d0 [osc]
2014-05-15T17:18:36.180416-05:00 c0-0c2s0n3 [<ffffffffa0925199>] osc_io_unplug0+0x1419/0x2020 [osc]
2014-05-15T17:18:36.180456-05:00 c0-0c2s0n3 [<ffffffffa0926db1>] osc_io_unplug+0x11/0x20 [osc]
2014-05-15T17:18:36.180467-05:00 c0-0c2s0n3 [<ffffffffa0926fa1>] osc_queue_sync_pages+0x1e1/0x380 [osc]
2014-05-15T17:18:36.205619-05:00 c0-0c2s0n3 [<ffffffffa091890d>] osc_io_submit+0x2dd/0x4e0 [osc]
2014-05-15T17:18:36.205659-05:00 c0-0c2s0n3 [<ffffffffa03743fe>] cl_io_submit_rw+0x6e/0x170 [obdclass]
2014-05-15T17:18:36.230909-05:00 c0-0c2s0n3 [<ffffffffa074a23e>] lov_io_submit+0x2ee/0x4d0 [lov]
2014-05-15T17:18:36.230948-05:00 c0-0c2s0n3 [<ffffffffa03743fe>] cl_io_submit_rw+0x6e/0x170 [obdclass]
2014-05-15T17:18:36.256143-05:00 c0-0c2s0n3 [<ffffffffa03763ea>] cl_io_read_page+0x19a/0x1b0 [obdclass]
2014-05-15T17:18:36.256185-05:00 c0-0c2s0n3 [<ffffffffa0804f4c>] ll_readpage+0x16c/0x1b0 [lustre]
2014-05-15T17:18:36.256195-05:00 c0-0c2s0n3 [<ffffffff810ffec8>] generic_file_aio_read+0x268/0x740
2014-05-15T17:18:36.281326-05:00 c0-0c2s0n3 [<ffffffffa0837030>] vvp_io_read_start+0x2a0/0x430 [lustre]
2014-05-15T17:18:36.281367-05:00 c0-0c2s0n3 [<ffffffffa03746a2>] cl_io_start+0x72/0x140 [obdclass]
2014-05-15T17:18:36.306578-05:00 c0-0c2s0n3 [<ffffffffa0378224>] cl_io_loop+0xb4/0x1b0 [obdclass]
2014-05-15T17:18:36.306618-05:00 c0-0c2s0n3 [<ffffffffa07d2ccc>] ll_file_io_generic+0x58c/0x8a0 [lustre]
2014-05-15T17:18:36.306630-05:00 c0-0c2s0n3 [<ffffffffa07d349c>] ll_file_aio_read+0x22c/0x290 [lustre]
2014-05-15T17:18:36.331852-05:00 c0-0c2s0n3 [<ffffffffa07d3e45>] ll_file_read+0x1e5/0x270 [lustre]
2014-05-15T17:18:36.331893-05:00 c0-0c2s0n3 [<ffffffff811585f8>] vfs_read+0xc8/0x180
2014-05-15T17:18:36.331904-05:00 c0-0c2s0n3 [<ffffffff811587b5>] sys_read+0x55/0x90
2014-05-15T17:18:36.357172-05:00 c0-0c2s0n3 [<ffffffff81427a2b>] system_call_fastpath+0x16/0x1b
2014-05-15T17:18:36.357219-05:00 c0-0c2s0n3 [<000000002001d680>] 0x2001d680
2014-05-15T17:18:36.357233-05:00 c0-0c2s0n3 LustreError: 28354:0:(osc_io.c:826:osc_req_attr_set()) LBUG

Comment by Jinshan Xiong (Inactive) [ 16/May/14 ]

I will look at this issue.

Comment by Jinshan Xiong (Inactive) [ 16/May/14 ]

Hi Patrick, can you please upload one more crash dump for me? It looks like the previous upload has a problem and I can't extract the Lustre log from it. Please make sure the .ko files match the running code.

Comment by Patrick Farrell (Inactive) [ 16/May/14 ]

Ugh, sorry for the incorrect KO files... There's a small issue in our automated dump gathering.

I did this one by hand and checked that the KOs are right. This is the dump for the report I just made a few hours ago:

ftp.whamcloud.com
uploads/LU-5062/LU_5062_140515.tar.gz

The timer is saying 9 minutes to complete the upload.
Upload complete.

Thank you for taking a look.

Comment by Jinshan Xiong (Inactive) [ 16/May/14 ]

Hi Patrick,

I still can't extract the Lustre log from the crash dump file. This is what I got when I tried to extract it.

crash> extend /home/jay/crashdump/lustre.so
/home/jay/crashdump/lustre.so: shared object loaded
crash> lustre -l x
lustre_walk_cpus(0, 5, 1)
cmd:	p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages
	p (*cfs_trace_data[0])[0].tcd.tcd_pages.next
lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
crash> p cfs_trace_data
cfs_trace_data = $7 = 
 {0x1400000004, 0x554e4700000003, 0x5b9a0443e17b2baa, 0xcd0dd360156499dc, 0x5abd76e8, 0x0, 0xffffffffa022c9e0,
 0xffffffffa024f201}

cfs_trace_data should be an array of pointers. However, the memory cfs_trace_data points to seems to be mangled. Can you run crash on your side to see if you can get the log? Otherwise, please check whether the dump-gathering script is correct.
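
For reference, the expression the crash extension evaluates assumes a layout roughly along these lines (a sketch of the libcfs tracefile declarations; exact type names, macro names, and padding differ between versions):

/* Sketch of the libcfs trace-buffer globals the crash extension walks
 * (approximate; not the verbatim libcfs headers).  Each
 * cfs_trace_data[i] is a pointer to a per-CPU array of unions, so
 * "(*cfs_trace_data[0])[0].tcd" is CPU 0's trace data for buffer
 * type 0.  If the pointer values themselves look like random data,
 * the dump (or the symbols loaded against it) is not consistent. */
union cfs_trace_data_union {
        struct cfs_trace_cpu_data {
                struct list_head  tcd_pages;      /* trace pages       */
                unsigned long     tcd_cur_pages;  /* pages in the list */
                /* ... */
        } tcd;
        char __pad[/* cache-line aligned size */ 128];
};

extern union cfs_trace_data_union (*cfs_trace_data[/* nr buffer types */ 8])[NR_CPUS];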

Comment by Patrick Farrell (Inactive) [ 16/May/14 ]

Jinshan,

Hmm. The problem might be SLES11SP3... Xyratex modified lustre.so so it could get logs out. I'm not familiar with what they changed. (We also had to update our version of crash to something more recent than we were using.) I'll attach that file as well.

In any case, I was able to extract the logs. I'll note there did appear to be a few corrupt messages (I got some error messages when using lctl to process the binary logs), so I'll include the binary logs as well as the processed ones. Attaching in a moment.

Comment by Patrick Farrell (Inactive) [ 16/May/14 ]

log is the raw binary log, log.sort is the processed log.

Comment by Jinshan Xiong (Inactive) [ 16/May/14 ]

It turned out that this is a race that occurs when a file with an empty layout is turned into a raid0 layout.
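
The fix that eventually landed (see the b2_5 backport noted below) is titled "LU-5062 llite: Solve a race to access lli_has_smd in read case". As a hypothetical, simplified illustration of that kind of race (a reader samples a "has layout" flag without holding the lock that protects layout changes), using made-up names rather than the actual Lustre code:

/* Hypothetical sketch of the race pattern described above -- not the
 * actual Lustre fix.  A flag that says "this file has a striped
 * layout" is sampled without holding the lock that protects layout
 * changes, so a read started against an empty layout can keep using
 * stale per-file state after the layout becomes raid0. */
#include <stdbool.h>
#include <pthread.h>

struct file_state {
        pthread_mutex_t layout_lock;
        bool            has_layout;     /* analogous to lli_has_smd */
};

/* racy: decision made on an unprotected snapshot of has_layout */
static bool racy_check(struct file_state *f)
{
        return f->has_layout;
}

/* safer pattern: re-validate the layout under the lock before the
 * read path commits to per-stripe setup */
static bool checked_under_lock(struct file_state *f)
{
        bool ret;

        pthread_mutex_lock(&f->layout_lock);
        ret = f->has_layout;
        pthread_mutex_unlock(&f->layout_lock);
        return ret;
}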

Comment by Jodi Levi (Inactive) [ 12/Jun/14 ]

Jinshan,
Are you planning to make a patch for this?

Comment by Patrick Farrell (Inactive) [ 19/Jun/14 ]

Cray hit this again today [Update: Looking internally, we've hit it several times recently in testing.].

Also: it looks like LU-3729 may be a duplicate of this, though I believe several osc_req_attr_set cases have been fixed since that bug was opened.

Comment by Jinshan Xiong (Inactive) [ 19/Jun/14 ]

I'll start to work on this.

Comment by Jinshan Xiong (Inactive) [ 19/Jun/14 ]

Hi Patrick,

Can you try this patch: http://review.whamcloud.com/10760?

Comment by Patrick Farrell (Inactive) [ 20/Jun/14 ]

Jinshan - While testing the patch, I hit the assertion from LU-4900, same stack trace as there as well:
2014-06-19T22:02:47.307062-05:00 c0-0c2s6n0 LustreError: 13331:0:(cl_lock.c:1122:cl_use_try()) ASSERTION( result != -38 ) failed:
2014-06-19T22:02:47.307096-05:00 c0-0c2s6n0 LustreError: 13331:0:(cl_lock.c:1122:cl_use_try()) LBUG

It's probably just random, since that bug is known to exist, but I wanted to let you know in case there's a relationship. It's worth noting that I don't think we've ever hit LU-4900 before. (It's still likely to be just chance.)

Comment by Jinshan Xiong (Inactive) [ 20/Jun/14 ]

You used patch set 1, is that right?

Comment by Patrick Farrell (Inactive) [ 20/Jun/14 ]

Yes, I did, because I had it built and the system set up before I saw your response in Gerrit, and then I didn't realize the other code affected more than that sanity testing. Oops.

Having looked at LU-4900, I now understand why you did what you did in patch set 2. I see now that it's called from the macro in cl_use_try.

So it looks like your change to vvp_io_read_lock is getting this kind of cl_lock into the cache, and then into cl_use_try, where it asserts.
And since cl_use_try is only called for cached locks, we don't hit that assertion all the time...

I'll do another run later with patch set 2. Is the data from running only patch set 1 helpful in proving the vvp_io_read_lock part of the patch, or should I just ignore it?

Sorry about that.

Just out of curiosity: I'm still a little puzzled. If I'm reading cl_use_try correctly, as long as there's at least one slice with a valid clo_use, rc shouldn't be ENOSYS, so no assertion. Is the error case here a cl_lock with only a single slice, then?

Comment by Jinshan Xiong (Inactive) [ 20/Jun/14 ]

Just out of curiosity: I'm still a little puzzled. If I'm reading cl_use_try correctly, as long as there's at least one slice with a valid clo_use, rc shouldn't be ENOSYS, so no assertion. Is the error case here a cl_lock with only a single slice, then?

A cl_lock on an empty file will have two slices: ccc and lov. However, neither of them has ->clo_use defined, which is why cl_use_try() sees the ENOSYS error.
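
To make the slice walk concrete, here is a simplified sketch of the logic being discussed (approximating the shape of cl_use_try() in cl_lock.c of this era; not the verbatim source):

/* Each layer (ccc/vvp, lov, osc, ...) may provide a ->clo_use() method
 * on its slice of the lock; if no slice defines one, the result stays
 * -ENOSYS, which is exactly what the ASSERTION(result != -38) in
 * cl_use_try() trips over for a lock on a file with an empty layout. */
static int cl_use_try_sketch(const struct lu_env *env, struct cl_lock *lock)
{
        struct cl_lock_slice *slice;
        int result = -ENOSYS;

        list_for_each_entry(slice, &lock->cll_layers, cls_linkage) {
                if (slice->cls_ops->clo_use != NULL) {
                        result = slice->cls_ops->clo_use(env, slice);
                        if (result != 0)
                                break;
                }
        }
        /* the real code asserts result != -ENOSYS at this point */
        return result;
}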

I don't need the data from patch set 1, thanks.

Comment by Peter Jones [ 24/Jun/14 ]

Landed for 2.6

Comment by Gerrit Updater [ 01/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12139/
Subject: LU-5062 llite: Solve a race to access lli_has_smd in read case
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 078dee1448c2995257923dc712943eab2c78e48e

Generated at Sat Feb 10 01:48:12 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.