Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.1.0, Lustre 2.3.0
- lustre 2.1.2 + a few additional patches, bullxlinux 6.1.1 (rhel6.1.1, kernel 2.6.32)
- 3
- 4053
Description
Our lustre release is made of lustre 2.1.2 plus the following patches:
- ORNL-22 "general ptlrpcd threads pool support"
- LU-1144 implement a NUMA aware ptlrpcd binding policy
- LU-1164 add the ability to choose the number of ko2iblnd threads at module
- LU-857 support SELinux
- LU-1110 MDS Oops in osd_xattr_get() during file open by FID
- LU-1363 llite: Not held lock when calling security_d_instantiate
- LU-948/LU-1059 recovery hang
- LU-969/LU-1408 stack overflow
- LU-645/BZ23978 getcwd failure
- LU-1428 MDT service threads spinning in cfs_hash_for_each_relax()
- LU-1299 loading large enough binary from lustre trigger OOM killer
- LU-1493 assertion failed in dqacq_completion()
- LU-1194 OSS LBUG in llog_recov_thread_stop() during umount
At CEA, several Lustre clients have crashed with LBUGs that share the following signature/stack:
........
LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e180[2 ffff8806a0423448:2342912 ^(null)_ffff880b9f71e0c0 3 0 1 (null) ffff880af0057f80 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) page@ffff880b9f71e0c0[1 ffff8806e43a6aa8:780288 ^ffff880b9f71e180_(null) 3 0 1 (null) (null) 0x0]
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) vvp-page@ffff880ba2855640(1:0:0) vm@ffffea0028af1e40 1400000000000801 3:0 ffff880b9f71e180 2342912 lru
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) lov-page@ffff880ba285bf48
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) osc-page@ffff880b9e38d070: 1< 0x845fed 1 0 - - + > 2< 3196059648 0 4096 0x7 0x8 | (null) ffff88102de0e8c8 ffff8806a08d4e40 ffffffffa08e6140 ffff880b9e38d070 > 3< + ffff880a8b481100 1 3668 0 > 4< 0 0 32 69689344 - | - - - - > 5< - - - - | 0 - - | 0 - ->
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) end page@ffff880b9f71e0c0
LustreError: 6614:0:(osc_io.c:700:osc_req_attr_set()) dump uncover page!
Pid: 6614, comm: %%USR123_%%%A456
Call Trace:
[<ffffffffa03bc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa08e3979>] osc_req_attr_set+0x2f9/0x310 [osc]
[<ffffffffa04de979>] cl_req_attr_set+0xc9/0x250 [obdclass]
[<ffffffffa08d088b>] osc_send_oap_rpc+0xc2b/0x1b40 [osc]
[<ffffffffa03bd86e>] ? cfs_mem_cache_free+0xe/0x10 [libcfs]
[<ffffffffa08d1a4e>] osc_check_rpcs+0x2ae/0x4c0 [osc]
[<ffffffffa08e4037>] osc_io_submit+0x1e7/0x540 [osc]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa0962a4e>] lov_io_submit+0x4ee/0xc30 [lov]
[<ffffffffa04ded00>] cl_io_submit_rw+0x70/0x180 [obdclass]
[<ffffffffa04e0f40>] cl_io_read_page+0xb0/0x170 [obdclass]
[<ffffffffa04d5349>] ? cl_page_assume+0xf9/0x2d0 [obdclass]
[<ffffffffa0a1a6b6>] ll_readpage+0x96/0x200 [lustre]
[<ffffffff810fc9dc>] generic_file_aio_read+0x1fc/0x700
[<ffffffffa0a4237b>] vvp_io_read_start+0x13b/0x3e0 [lustre]
[<ffffffffa04defca>] cl_io_start+0x6a/0x140 [obdclass]
[<ffffffffa04e322c>] cl_io_loop+0xcc/0x190 [obdclass]
[<ffffffffa09f1ef7>] ll_file_io_generic+0x3a7/0x560 [lustre]
[<ffffffffa09f21e9>] ll_file_aio_read+0x139/0x2c0 [lustre]
[<ffffffffa09f26a9>] ll_file_read+0x169/0x2a0 [lustre]
[<ffffffff8115e355>] vfs_read+0xb5/0x1a0
[<ffffffff8115e491>] sys_read+0x51/0x90
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b
LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG
Pid: 6614, comm: %%USR123_%%%A456
A second LBUG is also hit, but much less frequently, with a stack like:
LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) failed at lock->cll_state != CLS_HELD.
LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG
Pid: 23020, comm: %%USR789_%%A37_
Call Trace:
[<ffffffffa03dc865>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa03dce77>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa04f95e2>] cl_lock_hold_release+0x2a2/0x2b0 [obdclass]
[<ffffffffa04fadf2>] cl_lock_release+0x82/0x180 [obdclass]
[<ffffffffa0502338>] cl_lock_link_fini+0x68/0x160 [obdclass]
[<ffffffffa0502565>] cl_io_unlock+0x135/0x2e0 [obdclass]
[<ffffffffa0503245>] cl_io_loop+0xe5/0x190 [obdclass]
[<ffffffffa0a5ac13>] cl_setattr_ost+0x1c3/0x240 [lustre]
[<ffffffffa0a2e59a>] ll_setattr_raw+0x96a/0xf20 [lustre]
[<ffffffffa0a2ebaf>] ll_setattr+0x5f/0x100 [lustre]
[<ffffffff811796d8>] notify_change+0x168/0x340
[<ffffffff8115be54>] do_truncate+0x64/0xa0
[<ffffffff8116dff1>] do_filp_open+0x821/0xd30
[<ffffffff8112af80>] ? unmap_region+0x110/0x130
[<ffffffff8117a6a2>] ? alloc_fd+0x92/0x160
[<ffffffff8115ac29>] do_sys_open+0x69/0x140
[<ffffffff8115ad40>] sys_open+0x20/0x30
[<ffffffff81003172>] system_call_fastpath+0x16/0x1b
The second LBUG() is an assertion introduced by LU-1299. So we reverted the patch for LU-1299 and delivered a new version of lustre, but the first LBUG() was still hit by several clients.
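When sorting through console logs from many clients, it can help to count which LBUG site each crash hit. The following is a minimal Python sketch, not part of any Lustre tooling: it assumes one log entry per line in the `LustreError: pid:0:(file:line:function()) message` shape seen in the two traces in this ticket, and the function name `lbug_signatures` and the regex group names are hypothetical labels chosen for illustration.

```python
import re
from collections import Counter

# Header shape taken from the traces in this ticket:
#   LustreError: <pid>:0:(<file>:<line>:<function>()) <message>
# The group names are our own labels, not Lustre terminology.
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)"
)


def lbug_signatures(log_lines):
    """Count LBUG sites as (source file, function) pairs.

    Assumes one log entry per line; entries that are not LBUG
    lines (page dumps, assertion text) are ignored.
    """
    counts = Counter()
    for line in log_lines:
        m = LUSTRE_ERROR.search(line)
        if m and m.group("msg").startswith("LBUG"):
            counts[(m.group("file"), m.group("func"))] += 1
    return counts


# Sample lines copied from the two traces in this ticket.
sample = [
    "LustreError: 6614:0:(osc_io.c:698:osc_req_attr_set()) no cover page!",
    "LustreError: 6614:0:(osc_io.c:702:osc_req_attr_set()) LBUG",
    "LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) "
    "failed at lock->cll_state != CLS_HELD.",
    "LustreError: 23020:0:(cl_lock.c:906:cl_lock_hold_release()) LBUG",
]

for (src, func), n in lbug_signatures(sample).items():
    print(f"{src}:{func} hit {n} time(s)")
```

Grouping by file and function rather than by line number keeps the two signatures distinct while tolerating small line shifts between patched builds.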
I have asked the support team to provide additional debug data (a debug log with "dlmtrace" enabled, and a dump image). Unfortunately, it will take some time to get this information.
Issue Links
- is related to LU-2172 osc_req_attr_set LBUG (Resolved)
After 6 days on a big server that used to reproduce the issue every day,
I can tell you that with the first fix (patch set 1) the LBUG did not reproduce.
So we can say that the fix is OK and can be landed in an official release.
Thanks to all.