
LU-1650: crash of lustre clients in osc_req_attr_set() routine

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Affects Version/s: Lustre 2.1.0, Lustre 2.3.0
    • Environment: lustre 2.1.2 + a few additional patches
      bullxlinux 6.1.1 (rhel6.1.1, kernel 2.6.32)
    • Severity: 3

    Description

      Our lustre release is made of lustre 2.1.2 plus the following patches:

      • ORNL-22 "general ptlrpcd threads pool support"
      • LU-1144 implement a NUMA aware ptlrpcd binding policy
      • LU-1164 add the ability to choose the number of ko2iblnd threads at module
      • LU-857 support SELinux
      • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
      • LU-1363 llite: Not held lock when calling security_d_instantiate
      • LU-948/LU-1059 recovery hang
      • LU-969/LU-1408 stack overflow
      • LU-645/BZ23978 getcwd failure
      • LU-1428 MDT service threads spinning in cfs_hash_for_each_relax()
      • LU-1299 loading large enough binary from lustre trigger OOM killer
      • LU-1493 assertion failed in dqacq_completion()
      • LU-1194 OSS LBUG in llog_recov_thread_stop() during umount

      At CEA, we have seen several Lustre client crashes/LBUGs with the same signature/stack, like the following:

      ........
      LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e180[2 ffff8806a0423448:2342912 ^(null)_ffff880b9f71e0c0 3 0 1 (null) ffff880af0057f80 0x0]
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e0c0[1 ffff8806e43a6aa8:780288 ^ffff880b9f71e180_(null) 3 0 1 (null) (null) 0x0]
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880ba2855640(1:0:0) vm@ffffea0028af1e40 1400000000000801 3:0 ffff880b9f71e180 2342912 lru
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880ba285bf48
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b9e38d070: 1< 0x845fed 1 0 - - + > 2< 3196059648 0 4096 0x7 0x8 | (null) ffff88102de0e8c8 ffff8806a08d4e40 ffffffffa08e6140 ffff880b9e38d070 > 3< + ffff880a8b481100 1 3668 0 > 4< 0 0 32 69689344 - | - - - - > 5< - - - - | 0 - - | 0 - ->
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b9f71e0c0
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
      Pid: 6614, comm: %%USR123_%%%A456
       
      Call Trace:
      [<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
      [<ffffffffa04de979>] cl_req_attr_set+0xc9/0x250 [obdclass]
      [<ffffffffa08d088b>] osc_send_oap_rpc+0xc2b/0x1b40 [osc]
      [<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
      [<ffffffffa08d1a4e>] osc_check_rpcs+0x2ae/0x4c0 [osc]
      [<ffffffffa08e4037>] osc_io_submit+0x1e7/0x540 [osc]
      [<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
      [<ffffffffa0962a4e>] lov_io_submit+0x4ee/0xc30 [lov]
      [<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
      [<ffffffffa04e0f40>] cl_io_read_page+0xb0/0x170 [obdclass]
      [<ffffffffa04d5349>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
      [<ffffffffa0a1a6b6>] ll_readpage+0x96/0x200 [lustre]
      [<ffffffff810fc9dc>] generic_file_aio_read+0x1fc/0x700
      [<ffffffffa0a4237b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
      [<ffffffffa04defca>] cl_io_start+0x6a/0x140 [obdclass]
      [<ffffffffa04e322c>] cl_io_loop+0xcc/0x190 [obdclass]
      [<ffffffffa09f1ef7>] ll_file_io_generic+0x3a7/0x560 [lustre]
      [<ffffffffa09f21e9>] ll_file_aio_read+0x139/0x2c0 [lustre]
      [<ffffffffa09f26a9>] ll_file_read+0x169/0x2a0 [lustre]
      [<ffffffff8115e355>] vfs_read+0xb5/0x1a0
      [<ffffffff8115e491>] sys_read+0x51/0x90
      [<ffffffff81003172>] system_call_fastpath+0x16/0x1b
       
      LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG
      Pid: 6614, comm: %%USR123_%%%A456
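
      To picture the check that fails here, the following is a minimal stand-alone model (it is not the osc_io.c code; the helper names and the remaining lock extent are made up, only the page index comes from the log above): before a cached page is attached to an OST read RPC, a lock whose extent covers the page index must be found, and the client LBUGs when none is found.

      /* Minimal stand-alone model of the failed check: a page may only be
       * attached to an OST RPC if some lock extent covers its index.
       * Hypothetical names and lock extent; the page index is from the log. */
      #include <stdio.h>
      #include <stdlib.h>

      struct lock_extent { unsigned long start, end; };  /* covered page indexes */

      /* return 1 if one of the locks covers page index idx */
      static int page_is_covered(const struct lock_extent *locks, int nlocks,
                                 unsigned long idx)
      {
              for (int i = 0; i < nlocks; i++)
                      if (idx >= locks[i].start && idx <= locks[i].end)
                              return 1;
              return 0;
      }

      int main(void)
      {
              /* hypothetical: one lock left on the object, stopping just
               * before the page that triggered the crash */
              struct lock_extent locks[] = { { 780288, 2342911 } };
              unsigned long page_index = 2342912;   /* index from the log above */

              if (!page_is_covered(locks, 1, page_index)) {
                      /* the real client prints "no cover page!" and LBUG()s here */
                      fprintf(stderr, "no cover page for index %lu\n", page_index);
                      abort();
              }
              return 0;
      }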
      

      and also, but much less frequently, with a stack like:

      LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD.
      LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG
      Pid: 23020, comm: %%USR789_%%A37_
      
      Call Trace:
        [<ffffffffa03dc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        [<ffffffffa03dce77>] lbug_with_loc+0x47/0xb0 [libcfs]
        [<ffffffffa04f95e2>] cl_lock_hold_release+0x2a2/0x2b0 [obdclass]
        [<ffffffffa04fadf2>] cl_lock_release+0x82/0x180 [obdclass]
        [<ffffffffa0502338>] cl_lock_link_fini+0x68/0x160 [obdclass]
        [<ffffffffa0502565>] cl_io_unlock+0x135/0x2e0 [obdclass]
        [<ffffffffa0503245>] cl_io_loop+0xe5/0x190 [obdclass]
        [<ffffffffa0a5ac13>] cl_setattr_ost+0x1c3/0x240 [lustre]
        [<ffffffffa0a2e59a>] ll_setattr_raw+0x96a/0xf20 [lustre]
        [<ffffffffa0a2ebaf>] ll_setattr+0x5f/0x100 [lustre]
        [<ffffffff811796d8>] notify_change+0x168/0x340
        [<ffffffff8115be54>] do_truncate+0x64/0xa0
        [<ffffffff8116dff1>] do_filp_open+0x821/0xd30
        [<ffffffff8112af80>] ? unmap_region+0x110/0x130
        [<ffffffff8117a6a2>] ? alloc_fd+0x92/0x160
        [<ffffffff8115ac29>] do_sys_open+0x69/0x140
        [<ffffffff8115ad40>] sys_open+0x20/0x30
        [<ffffffff81003172>] system_call_fastpath+0x16/0x1b
      

      The second LBUG() is an assertion introduced by LU-1299. So we reverted the patch for LU-1299 and delivered a new version of lustre, but the first LBUG() was still hit by several clients.

      I have asked the support team to provide additional debug data (a debug log with "dlmtrace" enabled, and a dump image). But unfortunately it will take some time to get this information.

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Thanks. This fix has been landed for 2.1.4 and 2.4


            apercher Antoine Percher added a comment -

            After 6 days on a big server that reproduces the issue every day,
            I am able to tell you that with the first fix (patch set 1) the LBUG did not reproduce,
            so we can say that the fix is OK and can be landed in an official release.
            Thanks for all.
            pjones Peter Jones added a comment -

            Alain

            We are testing and reviewing the patch in parallel with your testing. It sounds to me like we would want to see at least two weeks of the test running on your test cluster before we could confidently assert that the situation has been improved by the fix. We'll aim to get the fix landed to our tree and thus expose it to broader testing to raise confidence that there are no adverse side-effects for this fix.

            Peter

            alaingob@free.fr Alain GOB added a comment -

            Dear,
            The patch was applied and is currently running on the testing system. Unfortunately, even if this issue is now fixed, we will not see much on the testing system; for now we do not see any other issue happening on it. We will let you know about any problem. To definitively confirm that this issue is fixed, we will have to run on a production system with a significant number of nodes and relevant crash statistics. FYI, on the testing system we have 1 crash a week; on the production systems we see 5 crashes per day. Do you know when this fix will be landed so it can be included in an official Bull release and pushed to the production systems?
            Regards,
            Alain GOB.


            jay Jinshan Xiong (Inactive) added a comment -

            Can you please apply this patch http://review.whamcloud.com/4673 and test it again? Thanks.

            jay Jinshan Xiong (Inactive) added a comment -

            I've found the root cause, working on a patch...
            jay Jinshan Xiong (Inactive) added a comment - - edited

            Hi Antoine and Bruno, thanks for the analysis. It's normal that a conflicting lock is canceled during IO time. For this case, lock ffff88004e087cf8 is conflicting with a queueing write lock, and it's covering 121 readahead pages so the process didn't hold this lock. However, if everything is okay, those pages should be discarded when the lock is canceled but apparently that didn't happen.

            There must be some race window in cl_lock_page_out() and check_and_discard_cb(). I'm checking the code.
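
            To make the suspected window concrete, here is a toy stand-alone model (purely illustrative, not the cl_lock_page_out()/check_and_discard_cb() code): if the scan that discards cached pages when their covering lock is canceled misses one page, that page later reaches RPC assembly with no covering lock, which is exactly the check that LBUGs in osc_req_attr_set().

            /* Toy model: a lock covering pages [0, NPAGES-1] is canceled and the
             * pages under it must be discarded from the cache.  The "buggy" scan
             * below misses the last page (a stand-in for whatever race or boundary
             * condition is actually at play), leaving a cached page with no cover. */
            #include <stdbool.h>
            #include <stdio.h>

            #define NPAGES 16
            static bool cached[NPAGES];     /* page still sits in the client cache */
            static bool covered[NPAGES];    /* some lock still covers this page */

            static void cancel_lock(unsigned start, unsigned end, bool buggy)
            {
                unsigned last = buggy ? end - 1 : end;  /* buggy scan stops early */
                for (unsigned i = start; i <= last; i++)
                    cached[i] = false;                  /* discard the page */
                for (unsigned i = start; i <= end; i++)
                    covered[i] = false;                 /* the lock is gone anyway */
            }

            int main(void)
            {
                for (unsigned i = 0; i < NPAGES; i++)
                    cached[i] = covered[i] = true;      /* readahead filled the cache */

                cancel_lock(0, NPAGES - 1, true);

                for (unsigned i = 0; i < NPAGES; i++)
                    if (cached[i] && !covered[i])
                        printf("page %u would hit \"no cover page!\"\n", i);
                return 0;
            }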


            bfaccini Bruno Faccini (Inactive) added a comment -

            Ok, just as I expected: to see it "readable" you have to edit the comment, sorry.
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Since we worked on that together, to enhance what Antoine started to detail: it seems to us that the page/index causing the LBUG (because it is missing a covering cl_lock) is the first page of the stripe-size x stripe-count chunk immediately following the chunk that contains the page where the file position plus the read size/offset falls ...

            Hummm, not sure I am being clear while trying to be detailed here !! So let's try to show it another way :

            File linearly represented ("-" = pages, "|" = stripe boundary, chunk = stripe-size x stripe-count) :

            ...--|-...-|-...-|-...-|-...-|--...
                          ^   ^
                          |   |_ 1st page of next chunk
                          |
                          page of file->pos + read(fd,,size)

            now let's see how it looks in the Web interface ...


            apercher Antoine Percher added a comment -

            Did you see the LBUG message printed ? In the trace_debug_germain1174_LBUG_cl_lock_at_page_dup.txt file, you can
            also see the printed struct of the lock ffff88004e087cf8;
            the cll_state is CLS_FREEING for this lock.

            I also noticed that for each issue/LBUG, the page number that
            triggers the LBUG is a multiple of the stripe size in pages.

            For example: in my configuration the stripe size is one megabyte (1m = 256 pages)
            and on the germain1174 node, the page was page@ 263168 = 1028*256.

            Maybe the lock was canceled/deleted just one bit/page too early ?
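
            As a quick check of the arithmetic above (stand-alone, using only the 1m stripe size and page indexes quoted in this ticket; 1 MiB / 4 KiB page = 256 pages), each of the indexes seen so far sits exactly on a 256-page boundary, i.e. at the first page of a stripe-sized chunk:

            /* Stand-alone check: with a 1 MiB stripe (256 pages of 4 KiB), the
             * page indexes that appear in this ticket all have offset 0 within
             * their stripe chunk.  Purely illustrative. */
            #include <stdio.h>

            #define PAGES_PER_STRIPE 256            /* 1 MiB stripe / 4 KiB page */

            int main(void)
            {
                /* indexes taken from the crash logs and comments above */
                unsigned long idx[] = { 2342912, 780288, 263168 };

                for (unsigned i = 0; i < sizeof(idx) / sizeof(idx[0]); i++)
                    printf("page %lu: chunk %lu, offset in chunk %lu\n",
                           idx[i], idx[i] / PAGES_PER_STRIPE,
                           idx[i] % PAGES_PER_STRIPE);
                return 0;
            }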

            People

              laisiyao Lai Siyao
              pichong Gregoire Pichon
              Votes: 0
              Watchers: 15
