
LU-1650: crash of lustre clients in osc_req_attr_set() routine

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.4
    • Affects Version/s: Lustre 2.1.0, Lustre 2.3.0
    • Environment: lustre 2.1.2 + a few additional patches
      bullxlinux 6.1.1 (rhel6.1.1, kernel 2.6.32)
    • Severity: 3

    Description

      Our lustre release is made of lustre 2.1.2 plus the following patches:

      • ORNL-22 "general ptlrpcd threads pool support"
      • LU-1144 implement a NUMA aware ptlrpcd binding policy
      • LU-1164 add the ability to choose the number of ko2iblnd threads at module
      • LU-857 support SELinux
      • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
      • LU-1363 llite: Not held lock when calling security_d_instantiate
      • LU-948/LU-1059 recovery hang
      • LU-969/LU-1408 stack overflow
      • LU-645/BZ23978 getcwd failure
      • LU-1428 MDT service threads spinning in cfs_hash_for_each_relax()
      • LU-1299 loading large enough binary from lustre trigger OOM killer
      • LU-1493 assertion failed in dqacq_completion()
      • LU-1194 OSS LBUG in llog_recov_thread_stop() during umount

      At CEA, we have seen several Lustre client crashes/LBUGs with the same signature/stack, like the following:

      ........
      LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e180[2 ffff8806a0423448:2342912 ^(null)_ffff880b9f71e0c0 3 0 1 (null) ffff880af0057f80 0x0]
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e0c0[1 ffff8806e43a6aa8:780288 ^ffff880b9f71e180_(null) 3 0 1 (null) (null) 0x0]
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880ba2855640(1:0:0) vm@ffffea0028af1e40 1400000000000801 3:0 ffff880b9f71e180 2342912 lru
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880ba285bf48
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b9e38d070: 1< 0x845fed 1 0 - - + > 2< 3196059648 0 4096 0x7 0x8 | (null) ffff88102de0e8c8 ffff8806a08d4e40 ffffffffa08e6140 ffff880b9e38d070 > 3< + ffff880a8b481100 1 3668 0 > 4< 0 0 32 69689344 - | - - - - > 5< - - - - | 0 - - | 0 - ->
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b9f71e0c0
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
      Pid: 6614, comm: %%USR123_%%%A456
       
      Call Trace:
      [<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
      [<ffffffffa04de979>] cl_req_attr_set+0xc9/0x250 [obdclass]
      [<ffffffffa08d088b>] osc_send_oap_rpc+0xc2b/0x1b40 [osc]
      [<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
      [<ffffffffa08d1a4e>] osc_check_rpcs+0x2ae/0x4c0 [osc]
      [<ffffffffa08e4037>] osc_io_submit+0x1e7/0x540 [osc]
      [<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
      [<ffffffffa0962a4e>] lov_io_submit+0x4ee/0xc30 [lov]
      [<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
      [<ffffffffa04e0f40>] cl_io_read_page+0xb0/0x170 [obdclass]
      [<ffffffffa04d5349>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
      [<ffffffffa0a1a6b6>] ll_readpage+0x96/0x200 [lustre]
      [<ffffffff810fc9dc>] generic_file_aio_read+0x1fc/0x700
      [<ffffffffa0a4237b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
      [<ffffffffa04defca>] cl_io_start+0x6a/0x140 [obdclass]
      [<ffffffffa04e322c>] cl_io_loop+0xcc/0x190 [obdclass]
      [<ffffffffa09f1ef7>] ll_file_io_generic+0x3a7/0x560 [lustre]
      [<ffffffffa09f21e9>] ll_file_aio_read+0x139/0x2c0 [lustre]
      [<ffffffffa09f26a9>] ll_file_read+0x169/0x2a0 [lustre]
      [<ffffffff8115e355>] vfs_read+0xb5/0x1a0
      [<ffffffff8115e491>] sys_read+0x51/0x90
      [<ffffffff81003172>] system_call_fastpath+0x16/0x1b
       
      LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG
      Pid: 6614, comm: %%USR123_%%%A456
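
      To picture the check that fails here, the following is a minimal stand-alone model (it is not the osc_io.c code; the helper names and the remaining lock extent are made up, only the page index comes from the log above): before a cached page is attached to an OST read RPC, a lock whose extent covers the page index must be found, and the client LBUGs when none is found.

      /* Minimal stand-alone model of the failed check: a page may only be
       * attached to an OST RPC if some lock extent covers its index.
       * Hypothetical names and lock extent; the page index is from the log. */
      #include <stdio.h>
      #include <stdlib.h>

      struct lock_extent { unsigned long start, end; };  /* covered page indexes */

      /* return 1 if one of the locks covers page index idx */
      static int page_is_covered(const struct lock_extent *locks, int nlocks,
                                 unsigned long idx)
      {
              for (int i = 0; i < nlocks; i++)
                      if (idx >= locks[i].start && idx <= locks[i].end)
                              return 1;
              return 0;
      }

      int main(void)
      {
              /* hypothetical: one lock left on the object, stopping just
               * before the page that triggered the crash */
              struct lock_extent locks[] = { { 780288, 2342911 } };
              unsigned long page_index = 2342912;   /* index from the log above */

              if (!page_is_covered(locks, 1, page_index)) {
                      /* the real client prints "no cover page!" and LBUG()s here */
                      fprintf(stderr, "no cover page for index %lu\n", page_index);
                      abort();
              }
              return 0;
      }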
      

      and also, but much less frequently, with a stack like:

      LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD.
      LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG
      Pid: 23020, comm: %%USR789_%%A37_
      
      Call Trace:
        [<ffffffffa03dc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        [<ffffffffa03dce77>] lbug_with_loc+0x47/0xb0 [libcfs]
        [<ffffffffa04f95e2>] cl_lock_hold_release+0x2a2/0x2b0 [obdclass]
        [<ffffffffa04fadf2>] cl_lock_release+0x82/0x180 [obdclass]
        [<ffffffffa0502338>] cl_lock_link_fini+0x68/0x160 [obdclass]
        [<ffffffffa0502565>] cl_io_unlock+0x135/0x2e0 [obdclass]
        [<ffffffffa0503245>] cl_io_loop+0xe5/0x190 [obdclass]
        [<ffffffffa0a5ac13>] cl_setattr_ost+0x1c3/0x240 [lustre]
        [<ffffffffa0a2e59a>] ll_setattr_raw+0x96a/0xf20 [lustre]
        [<ffffffffa0a2ebaf>] ll_setattr+0x5f/0x100 [lustre]
        [<ffffffff811796d8>] notify_change+0x168/0x340
        [<ffffffff8115be54>] do_truncate+0x64/0xa0
        [<ffffffff8116dff1>] do_filp_open+0x821/0xd30
        [<ffffffff8112af80>] ? unmap_region+0x110/0x130
        [<ffffffff8117a6a2>] ? alloc_fd+0x92/0x160
        [<ffffffff8115ac29>] do_sys_open+0x69/0x140
        [<ffffffff8115ad40>] sys_open+0x20/0x30
        [<ffffffff81003172>] system_call_fastpath+0x16/0x1b
      

      The second LBUG() is an assertion introduced by LU-1299. So we reverted the patch for LU-1299 and delivered a new version of lustre, but the first LBUG() was still hit by several clients.

      I have asked the support team to provide additional debug data (a debug log with "dlmtrace" enabled, and a dump image). But unfortunately it will take some time to get this information.

      Attachments

        Issue Links

          Activity

            pjones Peter Jones added a comment -

            Thanks. This fix has been landed for 2.1.4 and 2.4


            apercher Antoine Percher added a comment -

            After 6 days on a big server that reproduces the issue every day,
            I am able to tell you that with the first fix (patch set 1) the LBUG did not reproduce,
            so we can say that the fix is OK and can be landed in an official release.
            Thanks for all.
            pjones Peter Jones added a comment -

            Alain

            We are testing and reviewing the patch in parallel with your testing. It sounds to me like we would want to see at least two weeks of the test running on your test cluster before we could confidently assert that the situation has been improved by the fix. We'll aim to get the fix landed to our tree and thus expose it to broader testing to raise confidence that there are no adverse side-effects for this fix.

            Peter

            alaingob@free.fr Alain GOB added a comment -

            Dear,
            The patch was applied and is currently running on the testing system. Unfortunately, even if this issue is now fixed, we will not see much on the testing system; for now we do not see any other issue happening on it. We will let you know about any problem. To definitively confirm that this issue is fixed, we will have to run on a production system with a significant number of nodes and relevant crash statistics. FYI, on the testing system we have 1 crash a week; on the production systems we see 5 crashes per day. Do you know when this fix will be landed so it can be included in an official Bull release and pushed to the production systems?
            Regards,
            Alain GOB.


            jay Jinshan Xiong (Inactive) added a comment -

            Can you please apply this patch http://review.whamcloud.com/4673 and test it again? Thanks.

            jay Jinshan Xiong (Inactive) added a comment -

            I've found the root cause, working on a patch...
            jay Jinshan Xiong (Inactive) added a comment - - edited

            Hi Antoine and Bruno, thanks for the analysis. It's normal that a conflicting lock is canceled during IO time. For this case, lock ffff88004e087cf8 is conflicting with a queueing write lock, and it's covering 121 readahead pages so the process didn't hold this lock. However, if everything is okay, those pages should be discarded when the lock is canceled but apparently that didn't happen.

            There must be some race window in cl_lock_page_out() and check_and_discard_cb(). I'm checking the code.
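
            To make the suspected window concrete, here is a toy stand-alone model (purely illustrative, not the cl_lock_page_out()/check_and_discard_cb() code): if the scan that discards cached pages when their covering lock is canceled misses one page, that page later reaches RPC assembly with no covering lock, which is exactly the check that LBUGs in osc_req_attr_set().

            /* Toy model: a lock covering pages [0, NPAGES-1] is canceled and the
             * pages under it must be discarded from the cache.  The "buggy" scan
             * below misses the last page (a stand-in for whatever race or boundary
             * condition is actually at play), leaving a cached page with no cover. */
            #include <stdbool.h>
            #include <stdio.h>

            #define NPAGES 16
            static bool cached[NPAGES];     /* page still sits in the client cache */
            static bool covered[NPAGES];    /* some lock still covers this page */

            static void cancel_lock(unsigned start, unsigned end, bool buggy)
            {
                unsigned last = buggy ? end - 1 : end;  /* buggy scan stops early */
                for (unsigned i = start; i <= last; i++)
                    cached[i] = false;                  /* discard the page */
                for (unsigned i = start; i <= end; i++)
                    covered[i] = false;                 /* the lock is gone anyway */
            }

            int main(void)
            {
                for (unsigned i = 0; i < NPAGES; i++)
                    cached[i] = covered[i] = true;      /* readahead filled the cache */

                cancel_lock(0, NPAGES - 1, true);

                for (unsigned i = 0; i < NPAGES; i++)
                    if (cached[i] && !covered[i])
                        printf("page %u would hit \"no cover page!\"\n", i);
                return 0;
            }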


            bfaccini Bruno Faccini (Inactive) added a comment -

            Ok, just as I expected: to see it "readable" you have to edit the comment, sorry.
            bfaccini Bruno Faccini (Inactive) added a comment - - edited

            Since we worked on that together, to enhance what Antoine started to detail: it seems to us that the page/index causing the LBUG (because it is missing a covering cl_lock) is the first page of the stripe-size x stripe-count chunk immediately following the chunk that contains the page where the file position plus the read size/offset falls ...

            Hummm, not sure I am being clear while trying to be detailed here !! So let's try to show it another way :

            File linearly represented ("-" = pages, "|" = stripe boundary, chunk = stripe-size x stripe-count) :

            ...--|-...-|-...-|-...-|-...-|--...
                          ^   ^
                          |   |_ 1st page of next chunk
                          |
                          page of file->pos + read(fd,,size)

            now let's see how it looks in the Web interface ...


            apercher Antoine Percher added a comment -

            Did you see the LBUG message printed ? In the trace_debug_germain1174_LBUG_cl_lock_at_page_dup.txt file, you can
            also see the printed struct of the lock ffff88004e087cf8;
            the cll_state is CLS_FREEING for this lock.

            I also noticed that for each issue/LBUG, the page number that
            triggers the LBUG is a multiple of the stripe size in pages.

            For example: in my configuration the stripe size is one megabyte (1m = 256 pages)
            and on the germain1174 node, the page was page@ 263168 = 1028*256.

            Maybe the lock was canceled/deleted just one bit/page too early ?
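
            As a quick check of the arithmetic above (stand-alone, using only the 1m stripe size and page indexes quoted in this ticket; 1 MiB / 4 KiB page = 256 pages), each of the indexes seen so far sits exactly on a 256-page boundary, i.e. at the first page of a stripe-sized chunk:

            /* Stand-alone check: with a 1 MiB stripe (256 pages of 4 KiB), the
             * page indexes that appear in this ticket all have offset 0 within
             * their stripe chunk.  Purely illustrative. */
            #include <stdio.h>

            #define PAGES_PER_STRIPE 256            /* 1 MiB stripe / 4 KiB page */

            int main(void)
            {
                /* indexes taken from the crash logs and comments above */
                unsigned long idx[] = { 2342912, 780288, 263168 };

                for (unsigned i = 0; i < sizeof(idx) / sizeof(idx[0]); i++)
                    printf("page %lu: chunk %lu, offset in chunk %lu\n",
                           idx[i], idx[i] / PAGES_PER_STRIPE,
                           idx[i] % PAGES_PER_STRIPE);
                return 0;
            }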

            People

              laisiyao Lai Siyao
              pichong Gregoire Pichon
              Votes: 0
              Watchers: 15
