Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-1650

crash of lustre clients in osc_req_attr_set() routine

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.4.0, Lustre 2.1.4
    • Lustre 2.1.0, Lustre 2.3.0
    • lustre 2.1.2 + a few additional patches
      bullxlinux 6.1.1 (rhel6.1.1, kernel 2.6.32)
    • 3
    • 4053

    Description

      Our lustre release is made of lustre 2.1.2 plus the following patches:

      • ORNL-22 "general ptlrpcd threads pool support"
      • LU-1144 implement a NUMA aware ptlrpcd binding policy
      • LU-1164 add the ability to choose the number of ko2iblnd threads at module
      • LU-857 support SELinux
      • LU-1110 MDS Oops in osd_xattr_get() during file open by FID
      • LU-1363 llite: Not held lock when calling security_d_instantiate
      • LU-948/LU-1059 recovery hang
      • LU-969/LU-1408 stack overflow
      • LU-645/BZ23978 getcwd failure
      • LU-1428 MDT service threads spinning in cfs_hash_for_each_relax()
      • LU-1299 loading large enough binary from lustre trigger OOM killer
      • LU-1493 assertion failed in dqacq_completion()
      • LU-1194 OSS LBUG in_llog_recov_thread_stop() during umount

      At CEA, there are several lustre client crashes/LBUGs with this same signature/stack like following :

      ........
      LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e180[2 ffff8806a0423448:2342912 ^(null)_ffff880b9f71e0c0 3 0 1 (null) ffff880af0057f80 0x0]
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e0c0[1 ffff8806e43a6aa8:780288 ^ffff880b9f71e180_(null) 3 0 1 (null) (null) 0x0]
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880ba2855640(1:0:0) vm@ffffea0028af1e40 1400000000000801 3:0 ffff880b9f71e180 2342912 lru
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880ba285bf48
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b9e38d070: 1< 0x845fed 1 0 - - + > 2< 3196059648 0 4096 0x7 0x8 | (null) ffff88102de0e8c8 ffff8806a08d4e40 ffffffffa08e6140 ffff880b9e38d070 > 3< + ffff880a8b481100 1 3668 0 > 4< 0 0 32 69689344 - | - - - - > 5< - - - - | 0 - - | 0 - ->
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b9f71e0c0
      LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
      Pid: 6614, comm: %%USR123_%%%A456
       
      Call Trace:
      [<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
      [<ffffffffa04de979>] cl_req_attr_set+0xc9/0x250 [obdclass]
      [<ffffffffa08d088b>] osc_send_oap_rpc+0xc2b/0x1b40 [osc]
      [<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
      [<ffffffffa08d1a4e>] osc_check_rpcs+0x2ae/0x4c0 [osc]
      [<ffffffffa08e4037>] osc_io_submit+0x1e7/0x540 [osc]
      [<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
      [<ffffffffa0962a4e>] lov_io_submit+0x4ee/0xc30 [lov]
      [<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
      [<ffffffffa04e0f40>] cl_io_read_page+0xb0/0x170 [obdclass]
      [<ffffffffa04d5349>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
      [<ffffffffa0a1a6b6>] ll_readpage+0x96/0x200 [lustre]
      [<ffffffff810fc9dc>] generic_file_aio_read+0x1fc/0x700
      [<ffffffffa0a4237b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
      [<ffffffffa04defca>] cl_io_start+0x6a/0x140 [obdclass]
      [<ffffffffa04e322c>] cl_io_loop+0xcc/0x190 [obdclass]
      [<ffffffffa09f1ef7>] ll_file_io_generic+0x3a7/0x560 [lustre]
      [<ffffffffa09f21e9>] ll_file_aio_read+0x139/0x2c0 [lustre]
      [<ffffffffa09f26a9>] ll_file_read+0x169/0x2a0 [lustre]
      [<ffffffff8115e355>] vfs_read+0xb5/0x1a0
      [<ffffffff8115e491>] sys_read+0x51/0x90
      [<ffffffff81003172>] system_call_fastpath+0x16/0x1b
       
      LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG
      Pid: 6614, comm: %%USR123_%%%A456
      

      and also, but much less frequently with a stack like :

      LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD.
      LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG
      Pid: 23020, comm: %%USR789_%%A37_
      
      Call Trace:
        [<ffffffffa03dc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
        [<ffffffffa03dce77>] lbug_with_loc+0x47/0xb0 [libcfs]
        [<ffffffffa04f95e2>] cl_lock_hold_release+0x2a2/0x2b0 [obdclass]
        [<ffffffffa04fadf2>] cl_lock_release+0x82/0x180 [obdclass]
        [<ffffffffa0502338>] cl_lock_link_fini+0x68/0x160 [obdclass]
        [<ffffffffa0502565>] cl_io_unlock+0x135/0x2e0 [obdclass]
        [<ffffffffa0503245>] cl_io_loop+0xe5/0x190 [obdclass]
        [<ffffffffa0a5ac13>] cl_setattr_ost+0x1c3/0x240 [lustre]
        [<ffffffffa0a2e59a>] ll_setattr_raw+0x96a/0xf20 [lustre]
        [<ffffffffa0a2ebaf>] ll_setattr+0x5f/0x100 [lustre]
        [<ffffffff811796d8>] notify_change+0x168/0x340
        [<ffffffff8115be54>] do_truncate+0x64/0xa0
        [<ffffffff8116dff1>] do_filp_open+0x821/0xd30
        [<ffffffff8112af80>] ? unmap_region+0x110/0x130
        [<ffffffff8117a6a2>] ? alloc_fd+0x92/0x160
        [<ffffffff8115ac29>] do_sys_open+0x69/0x140
        [<ffffffff8115ad40>] sys_open+0x20/0x30
        [<ffffffff81003172>] system_call_fastpath+0x16/0x1b
      

      The second LBUG() is an assertion introduced by LU-1299. So we reverted the patch for LU-1299 and delivered a new version of lustre, but the first LBUG() was still hit by several clients.

      I have asked the support team to provide additional debug data (debug log with "dlmtrace" enabled, and a dump image). But unfortunately it will take some time to get these information.

      Attachments

        1. cartan1139_crash_debug.txt
          449 kB
          Gregoire Pichon
        2. cartan1206_crash_debug.txt
          128 kB
          Gregoire Pichon
        3. console.hype205
          39 kB
          Christopher Morrone
        4. console-grove409.txt
          39 kB
          Prakash Surya
        5. LU-1650_debug.patch
          4 kB
          Bruno Faccini
        6. LU1650_trace_cl_lock_at_page_dup_VS_osc_req_attr_set_LBUG.tgz
          2.74 MB
          Bruno Faccini

        Issue Links

          Activity

            People

              laisiyao Lai Siyao
              pichong Gregoire Pichon
              Votes:
              0 Vote for this issue
              Watchers:
              15 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: