Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.6.0, Lustre 2.5.4
    • Affects Version/s: Lustre 2.6.0
    • Environment: SLES11SP3 clients, master (v2_5_58_0-105-gb7f1952), Cray stress testing.
    • Severity: 3
    • 13981

    Description

      Cray has recently begun stress testing master clients in preparation for 2.6.0.

      We hit this bug on several nodes during a stress run last night:

      2014-05-13T23:00:06.485598-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) page@ffff880221781c00[4 ffff880222618600 3 0 1           (null) ffff88022328d740 0x0]
      2014-05-13T23:00:06.485630-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) vvp-page@ffff880221781ca0(0:0:0) vm@ffffea000e1502a0 200000000000801 4:0 ffff880221781c00 0 lru
      2014-05-13T23:00:06.485637-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) lov-page@ffff880221781cf8, raid0
      2014-05-13T23:00:06.485672-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) osc-page@ffff880221781d60 0: 1< 0x845fed 1 0 + + > 2< 0 0 4096 0x7 0x8 |           (null) ffff88044ad844f8 ffff88044d7a59e8 > 3< + ffff88044e055080 1 16 0 > 4< 0 0 8 35586048 - | - - - + > 5< - - - + | 0 - | 0 - ->
      2014-05-13T23:00:06.485681-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) end page@ffff880221781c00
      2014-05-13T23:00:06.485689-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:824:osc_req_attr_set()) dump uncover page!
      2014-05-13T23:00:06.485700-05:00 c1-0c0s2n1 Pid: 14298, comm: nsystst
      2014-05-13T23:00:06.485710-05:00 c1-0c0s2n1 Call Trace:
      2014-05-13T23:00:06.485724-05:00 c1-0c0s2n1 [<ffffffff81005eb9>] try_stack_unwind+0x169/0x1b0
      2014-05-13T23:00:06.485746-05:00 c1-0c0s2n1 [<ffffffff81004919>] dump_trace+0x89/0x450
      2014-05-13T23:00:06.485757-05:00 c1-0c0s2n1 [<ffffffffa023b8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
      2014-05-13T23:00:06.485784-05:00 c1-0c0s2n1 [<ffffffffa092f348>] osc_req_attr_set+0x6b8/0x740 [osc]
      2014-05-13T23:00:06.485796-05:00 c1-0c0s2n1 [<ffffffffa038dbf1>] cl_req_attr_set+0xd1/0x220 [obdclass]
      2014-05-13T23:00:06.485807-05:00 c1-0c0s2n1 [<ffffffffa091a6da>] osc_build_rpc+0x4fa/0x15d0 [osc]
      2014-05-13T23:00:06.485817-05:00 c1-0c0s2n1 [<ffffffffa093c199>] osc_io_unplug0+0x1419/0x2020 [osc]
      2014-05-13T23:00:06.485837-05:00 c1-0c0s2n1 [<ffffffffa093ddb1>] osc_io_unplug+0x11/0x20 [osc]
      2014-05-13T23:00:06.485844-05:00 c1-0c0s2n1 [<ffffffffa093dfa1>] osc_queue_sync_pages+0x1e1/0x380 [osc]
      2014-05-13T23:00:06.485852-05:00 c1-0c0s2n1 [<ffffffffa092f90d>] osc_io_submit+0x2dd/0x4e0 [osc]
      2014-05-13T23:00:06.485871-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
      2014-05-13T23:00:06.485881-05:00 c1-0c0s2n1 [<ffffffffa076123e>] lov_io_submit+0x2ee/0x4d0 [lov]
      2014-05-13T23:00:06.485894-05:00 c1-0c0s2n1 [<ffffffffa038d40e>] cl_io_submit_rw+0x6e/0x170 [obdclass]
      2014-05-13T23:00:06.485900-05:00 c1-0c0s2n1 [<ffffffffa038f3fa>] cl_io_read_page+0x19a/0x1b0 [obdclass]
      2014-05-13T23:00:06.485907-05:00 c1-0c0s2n1 [<ffffffffa081bf7c>] ll_readpage+0x16c/0x1b0 [lustre]
      2014-05-13T23:00:06.485920-05:00 c1-0c0s2n1 [<ffffffff810fdc68>] generic_file_aio_read+0x268/0x740
      2014-05-13T23:00:06.485932-05:00 c1-0c0s2n1 [<ffffffffa084e0b0>] vvp_io_read_start+0x2a0/0x430 [lustre]
      2014-05-13T23:00:06.485938-05:00 c1-0c0s2n1 [<ffffffffa038d6b2>] cl_io_start+0x72/0x140 [obdclass]
      2014-05-13T23:00:06.485947-05:00 c1-0c0s2n1 [<ffffffffa0391234>] cl_io_loop+0xb4/0x1b0 [obdclass]
      2014-05-13T23:00:06.485959-05:00 c1-0c0s2n1 [<ffffffffa07e9ccc>] ll_file_io_generic+0x58c/0x8a0 [lustre]
      2014-05-13T23:00:06.485967-05:00 c1-0c0s2n1 [<ffffffffa07ea49c>] ll_file_aio_read+0x22c/0x290 [lustre]
      2014-05-13T23:00:06.485994-05:00 c1-0c0s2n1 [<ffffffff8115517b>] do_sync_readv_writev+0xdb/0x120
      2014-05-13T23:00:06.486000-05:00 c1-0c0s2n1 [<ffffffff81156084>] do_readv_writev+0xd4/0x1e0
      2014-05-13T23:00:06.486010-05:00 c1-0c0s2n1 [<ffffffff811563c5>] vfs_readv+0x45/0x60
      2014-05-13T23:00:06.486016-05:00 c1-0c0s2n1 [<ffffffff81156505>] sys_readv+0x55/0xc0
      2014-05-13T23:00:06.486023-05:00 c1-0c0s2n1 [<ffffffff8142286b>] system_call_fastpath+0x16/0x1b
      2014-05-13T23:00:06.486031-05:00 c1-0c0s2n1 [<000000002006d41b>] 0x2006d41b
      2014-05-13T23:00:06.510785-05:00 c1-0c0s2n1 LustreError: 14298:0:(osc_io.c:826:osc_req_attr_set()) LBUG
      

      I'll make the dump+logs available shortly.

      I note this bug was seen by Oleg during patch review for the readahead changes in LU-3321; see his comments here:

      http://review.whamcloud.com/#/c/8523/

      I also note a previous instance of this assertion was fixed in LU-1650:
      https://jira.hpdd.intel.com/browse/LU-1650

      Attachments

        1. log
          693 kB
        2. log.sort
          573 kB
        3. lustre-ext.so
          132 kB

        Issue Links

          Activity

            [LU-5062] LBUG: osc_req_attr_set

            Jinshan Xiong (Inactive) added a comment:

            Hi Patrick,

            Can you try this patch: http://review.whamcloud.com/10760?

            Jinshan Xiong (Inactive) added a comment:

            I'll start to work on this.

            Patrick Farrell (Inactive) added a comment (edited):

            Cray hit this again today. [Update: Looking internally, we've hit it several times recently in testing.]

            Also: it looks like LU-3729 may be a duplicate of this, though I believe several osc_req_attr_set cases have been fixed since that bug was opened.

            Jodi Levi (Inactive) added a comment:

            Jinshan,
            Are you planning to make a patch for this?

            Jinshan Xiong (Inactive) added a comment:

            It turned out that this is a race that occurs when a file with an empty layout is turned into a raid0 layout.

            Patrick Farrell (Inactive) added a comment:

            log is the raw binary log, log.sort is the processed log.
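
            (For reference, a raw binary Lustre debug log like the "log" attachment is normally decoded to text with lctl; a minimal sketch, where the file names simply mirror the attachment names and log.sort is presumably the same output sorted by timestamp:)

            # decode the raw binary debug log into readable text
            lctl debug_file log log.txt
            # quick sanity check that records decoded correctly
            head log.txt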

            Patrick Farrell (Inactive) added a comment:

            Jinshan,

            Hmm. The problem might be SLES11SP3... Xyratex modified lustre.so so it could get logs out; I'm not familiar with what they changed. (We also had to update our version of Crash to something more recent than we were using.) I'll attach that file as well.

            In any case, I was able to extract the logs. I'll note there did appear to be a few corrupt messages (I got some error messages when using lctl to process the binary logs), so I'll include the binary logs as well as the processed ones. Attaching in a moment.

            Jinshan Xiong (Inactive) added a comment:

            Hi Patrick,

            I still can't extract the Lustre log from the crash dump file. This is what I got when I tried to extract it:

            crash> extend /home/jay/crashdump/lustre.so
            /home/jay/crashdump/lustre.so: shared object loaded
            crash> lustre -l x
            lustre_walk_cpus(0, 5, 1)
            cmd:	p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages
            	p (*cfs_trace_data[0])[0].tcd.tcd_pages.next
            lustre: gdb request failed: "p (*cfs_trace_data[0])[0].tcd.tcd_cur_pages"
            crash> p cfs_trace_data
            cfs_trace_data = $7 =
             {0x1400000004, 0x554e4700000003, 0x5b9a0443e17b2baa, 0xcd0dd360156499dc, 0x5abd76e8, 0x0, 0xffffffffa022c9e0,
             0xffffffffa024f201}

            cfs_trace_data should be an array of pointers. However, the memory cfs_trace_data points to seems to be mangled. Can you run crash on your side to see if you can get the log? Otherwise, please check that the dump-gathering script is correct.
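
            (If the problem is stale or mismatched module symbols rather than a corrupt dump, one thing that sometimes helps is loading the matching module objects inside crash before retrying the extension. A rough sketch, assuming the .ko files from the node's exact running build are available locally; the paths are illustrative:)

            crash> mod -s libcfs /path/to/matching/build/libcfs.ko
            crash> extend /home/jay/crashdump/lustre.so
            crash> lustre -l x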

            Patrick Farrell (Inactive) added a comment (edited):

            Ugh, sorry for the incorrect KO files... There's a small issue in our automated dump gathering.

            I did this one by hand and checked that the KOs are right. This is the dump for the report I just made a few hours ago:

            ftp.whamcloud.com
            uploads/LU-5062/LU_5062_140515.tar.gz

            The timer is saying 9 minutes to complete the upload.
            Upload complete.

            Thank you for taking a look.

            Jinshan Xiong (Inactive) added a comment:

            Hi Patrick, can you please upload one more crash dump for me? It looks like the previous upload has a problem and I can't extract the Lustre log from it. Please make sure the ko files match the running code.
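
            (A quick way to confirm the .ko files match the running code is to compare module source checksums; a sketch assuming the modules were built with srcversion support and that a node running the same software image is still up, with illustrative paths and the osc module as an example:)

            # checksum of the module object being uploaded alongside the dump
            modinfo -F srcversion /path/to/upload/osc.ko
            # checksum of the module actually loaded on a node running the same image
            cat /sys/module/osc/srcversion
            # the two values should match for every Lustre module included (osc, lov, obdclass, libcfs, ...)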

            Jinshan Xiong (Inactive) added a comment:

            I will look at this issue.

            People

              Assignee: Jinshan Xiong (Inactive)
              Reporter: Patrick Farrell (Inactive)
              Votes: 0
              Watchers: 7
