Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.1.0, Lustre 2.3.0
-
lustre 2.1.2 + a few additional patches
bullxlinux 6.1.1 (rhel6.1.1, kernel 2.6.32)
-
3
-
4053
Description
Our Lustre release consists of Lustre 2.1.2 plus the following patches:
- ORNL-22 "general ptlrpcd threads pool support"
- LU-1144 implement a NUMA aware ptlrpcd binding policy
- LU-1164 add the ability to choose the number of ko2iblnd threads at module
- LU-857 support SELinux
- LU-1110 MDS Oops in osd_xattr_get() during file open by FID
- LU-1363 llite: Not held lock when calling security_d_instantiate
- LU-948/LU-1059 recovery hang
- LU-969/LU-1408 stack overflow
- LU-645/BZ23978 getcwd failure
- LU-1428 MDT service threads spinning in cfs_hash_for_each_relax()
- LU-1299 loading large enough binary from lustre trigger OOM killer
- LU-1493 assertion failed in dqacq_completion()
- LU-1194 OSS LBUG in llog_recov_thread_stop() during umount
At CEA, several Lustre clients have crashed with an LBUG showing the same signature and stack, as follows:
........
LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e180[2 ffff8806a0423448:2342912 ^(null)_ffff880b9f71e0c0 3 0 1 (null) ffff880af0057f80 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e0c0[1 ffff8806e43a6aa8:780288 ^ffff880b9f71e180_(null) 3 0 1 (null) (null) 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880ba2855640(1:0:0) vm@ffffea0028af1e40 1400000000000801 3:0 ffff880b9f71e180 2342912 lru
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880ba285bf48
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b9e38d070: 1< 0x845fed 1 0 - - + > 2< 3196059648 0 4096 0x7 0x8 | (null) ffff88102de0e8c8 ffff8806a08d4e40 ffffffffa08e6140 ffff880b9e38d070 > 3< + ffff880a8b481100 1 3668 0 > 4< 0 0 32 69689344 - | - - - - > 5< - - - - | 0 - - | 0 - ->
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b9f71e0c0
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
Pid: 6614, comm: %%USR123_%%%A456
Call Trace:
[<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
[<ffffffffa04de979>] cl_req_attr_set+0xc9/0x250 [obdclass]
[<ffffffffa08d088b>] osc_send_oap_rpc+0xc2b/0x1b40 [osc]
[<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa08d1a4e>] osc_check_rpcs+0x2ae/0x4c0 [osc]
[<ffffffffa08e4037>] osc_io_submit+0x1e7/0x540 [osc]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa0962a4e>] lov_io_submit+0x4ee/0xc30 [lov]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa04e0f40>] cl_io_read_page+0xb0/0x170 [obdclass]
[<ffffffffa04d5349>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[<ffffffffa0a1a6b6>] ll_readpage+0x96/0x200 [lustre]
[<ffffffff810fc9dc>] generic_file_aio_read+0x1fc/0x700
[<ffffffffa0a4237b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
[<ffffffffa04defca>] cl_io_start+0x6a/0x140 [obdclass]
[<ffffffffa04e322c>] cl_io_loop+0xcc/0x190 [obdclass]
[<ffffffffa09f1ef7>] ll_file_io_generic+0x3a7/0x560 [lustre]
[<ffffffffa09f21e9>] ll_file_aio_read+0x139/0x2c0 [lustre]
[<ffffffffa09f26a9>] ll_file_read+0x169/0x2a0 [lustre]
[<ffffffff8115e355>] vfs_read+0xb5/0x1a0
[<ffffffff8115e491>] sys_read+0x51/0x90
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b
LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG
Pid: 6614, comm: %%USR123_%%%A456
A second LBUG is also seen, but much less frequently, with a stack like this:
LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD.
LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG
Pid: 23020, comm: %%USR789_%%A37_
Call Trace:
[<ffffffffa03dc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa03dce77>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa04f95e2>] cl_lock_hold_release+0x2a2/0x2b0 [obdclass]
[<ffffffffa04fadf2>] cl_lock_release+0x82/0x180 [obdclass]
[<ffffffffa0502338>] cl_lock_link_fini+0x68/0x160 [obdclass]
[<ffffffffa0502565>] cl_io_unlock+0x135/0x2e0 [obdclass]
[<ffffffffa0503245>] cl_io_loop+0xe5/0x190 [obdclass]
[<ffffffffa0a5ac13>] cl_setattr_ost+0x1c3/0x240 [lustre]
[<ffffffffa0a2e59a>] ll_setattr_raw+0x96a/0xf20 [lustre]
[<ffffffffa0a2ebaf>] ll_setattr+0x5f/0x100 [lustre]
[<ffffffff811796d8>] notify_change+0x168/0x340
[<ffffffff8115be54>] do_truncate+0x64/0xa0
[<ffffffff8116dff1>] do_filp_open+0x821/0xd30
[<ffffffff8112af80>] ? unmap_region+0x110/0x130
[<ffffffff8117a6a2>] ? alloc_fd+0x92/0x160
[<ffffffff8115ac29>] do_sys_open+0x69/0x140
[<ffffffff8115ad40>] sys_open+0x20/0x30
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b
The second LBUG() is an assertion introduced by LU-1299. So we reverted the patch for LU-1299 and delivered a new version of lustre, but the first LBUG() was still hit by several clients.
I have asked the support team to provide additional debug data (a debug log with "dlmtrace" enabled, and a dump image). Unfortunately, it will take some time to get this information.
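In the meantime, console logs from the crashed clients can be compared by reducing each call trace to its sequence of function names. The following is a minimal Python sketch, assuming the console output is available as plain text; `FRAME_RE` and `trace_signature` are hypothetical names used for illustration and are not part of any Lustre tool. Frames marked with `?` in the trace are skipped by default.

```python
import re

# Matches one stack frame, e.g.
#   [<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
# Group 1: optional "? " marker printed in front of some frames.
# Group 2: function name. Group 3: optional module name in brackets.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def trace_signature(log_text, include_marked=False):
    """Return the call trace as a list of (function, module) pairs.

    Frames without a module suffix are reported with module "kernel".
    Frames marked with "?" are dropped unless include_marked is True.
    """
    frames = []
    for marked, func, module in FRAME_RE.findall(log_text):
        if marked and not include_marked:
            continue
        frames.append((func, module or "kernel"))
    return frames


# A few frames copied from the first trace in this report.
sample = (
    "[<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] "
    "[<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc] "
    "[<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs] "
    "[<ffffffff8115e355>] vfs_read+0xb5/0x1a0 "
    "[<ffffffff8115e491>] sys_read+0x51/0x90"
)

sig = trace_signature(sample)
sig_all = trace_signature(sample, include_marked=True)
```

Two crashes can then be treated as the same signature when their lists compare equal; offsets and addresses are deliberately ignored because they differ between module loads.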
Issue Links
- is related to
-
LU-2172 osc_req_attr_set LBUG
- Resolved