Details
- Bug
- Resolution: Duplicate
- Blocker
- None
- Lustre 2.8.0
- 3
- 9223372036854775807
Description
The error occurred during soak testing of master using build '20151209' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20151209). DNE is enabled. MDTs were formatted using ldiskfs, OSTs using zfs. MDSes are configured in an active-active HA configuration.
During normal operations (no fault injected) two Lustre client nodes hit the LBUG listed below:
- lola-26 – 192.168.1.126 – Dec 9 21:41:59
- lola-27 – 192.168.1.127 – Dec 9 21:41:40
Dec 9 21:41:40 lola-27 kernel: LustreError: 3786:0:(file.c:3891:ll_layout_lock_set()) ASSERTION( ldlm_has_layout(lock) ) failed:
Dec 9 21:41:40 lola-27 kernel: LustreError: 3786:0:(file.c:3891:ll_layout_lock_set()) LBUG
Dec 9 21:41:40 lola-27 kernel: Pid: 3786, comm: flush-lustre-1
Dec 9 21:41:40 lola-27 kernel:
Dec 9 21:41:40 lola-27 kernel: Call Trace:
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa045f875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa045fe77>] lbug_with_loc+0x47/0xb0 [libcfs]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a04b89>] ll_layout_lock_set+0xa9/0x1360 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a03b5a>] ? ll_take_md_lock+0xfa/0x4b0 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a08fc1>] ll_layout_refresh_locked+0xe1/0xe00 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa058b7f1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a37c20>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa072f350>] ? ldlm_completion_ast+0x0/0x9b0 [ptlrpc]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0470aa7>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a09e79>] ll_layout_refresh+0x199/0x300 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa058b7f1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a56c8f>] vvp_io_init+0x39f/0x480 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa047377a>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa058a3a8>] cl_io_init0+0x88/0x150 [obdclass]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa058d4a4>] cl_io_init+0x64/0xe0 [obdclass]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a04022>] cl_sync_file_range+0x112/0x2f0 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffffa0a2cd7c>] ll_writepages+0x9c/0x220 [lustre]
Dec 9 21:41:40 lola-27 kernel: [<ffffffff81139871>] do_writepages+0x21/0x40
Dec 9 21:41:40 lola-27 kernel: [<ffffffff811bb19d>] writeback_single_inode+0xdd/0x290
Dec 9 21:41:40 lola-27 kernel: [<ffffffff811bb59d>] writeback_sb_inodes+0xbd/0x170
Dec 9 21:41:40 lola-27 kernel: [<ffffffff811bb6fb>] writeback_inodes_wb+0xab/0x1b0
Dec 9 21:41:40 lola-27 kernel: [<ffffffff811bbaf3>] wb_writeback+0x2f3/0x410
Dec 9 21:41:40 lola-27 kernel: [<ffffffff810880b2>] ? del_timer_sync+0x22/0x30
Dec 9 21:41:40 lola-27 kernel: [<ffffffff811bbdb5>] wb_do_writeback+0x1a5/0x240
Dec 9 21:41:40 lola-27 kernel: [<ffffffff811bbeb3>] bdi_writeback_task+0x63/0x1b0
Dec 9 21:41:40 lola-27 kernel: [<ffffffff8109eaa7>] ? bit_waitqueue+0x17/0xd0
Dec 9 21:41:40 lola-27 kernel: [<ffffffff81148620>] ? bdi_start_fn+0x0/0x100
Dec 9 21:41:40 lola-27 kernel: [<ffffffff811486a6>] bdi_start_fn+0x86/0x100
Dec 9 21:41:40 lola-27 kernel: [<ffffffff81148620>] ? bdi_start_fn+0x0/0x100
Dec 9 21:41:40 lola-27 kernel: [<ffffffff8109e78e>] kthread+0x9e/0xc0
Dec 9 21:41:40 lola-27 kernel: [<ffffffff8100c28a>] child_rip+0xa/0x20
Dec 9 21:41:40 lola-27 kernel: [<ffffffff8109e6f0>] ? kthread+0x0/0xc0
Dec 9 21:41:40 lola-27 kernel: [<ffffffff8100c280>] ? child_rip+0x0/0x20
Dec 9 21:41:40 lola-27 kernel:
Dec 9 21:41:40 lola-27 kernel: Kernel panic - not syncing: LBUG
The errors correlate in time with errors on the OSS nodes (lola-2,3) of the form:
lola-2.log:Dec 9 21:41:48 lola-2 kernel: LustreError: 28806:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client (nid 192.168.1.126@o2ib100) failed to reply to blocking AST (req status 0 rc -11), evict it ns: filter-soaked-OST0004_UUID lock: ffff880377d872c0/0xef6ba6a3129d2917 lrc: 4/0,0 mode: PR/PR res: [0x500000406:0xfa062d:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000010020 nid: 192.168.1.126@o2ib100 remote: 0x6044879e61dc398 expref: 33311 pid: 27253 timeout: 4297781214 lvb_type: 1
Several such messages can be found in the attached messages files for both OSSes.
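The temporal correlation described above can be checked mechanically against the attached messages files. The following is only a minimal sketch, not part of the original report: the helper names (`parse_line`, `evictions_near`), the 60-second window, and the assumed year 2015 are all illustrative choices, and month-name parsing assumes an English locale. It looks for OSS "failed to reply to blocking AST" messages whose client NID matches a client that hit the LBUG, and reports the time offset.

```python
import re
from datetime import datetime, timedelta

# Syslog prefix as seen in this ticket, e.g. "Dec 9 21:41:48 lola-2 kernel: ...".
# An optional "<file>.log:" prefix (grep output) is tolerated.
SYSLOG_RE = re.compile(
    r"^(?:\S+\.log:)?(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: (?P<msg>.*)$"
)
# Client NID as printed by ldlm_handle_ast_error(), e.g. "nid 192.168.1.126@o2ib100".
NID_RE = re.compile(r"nid (\d+\.\d+\.\d+\.\d+)@")


def parse_line(line, year=2015):
    """Return (timestamp, host, message), or None if the line does not match.

    Syslog timestamps carry no year, so one is assumed (hypothetical default)."""
    m = SYSLOG_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["host"], m["msg"]


def evictions_near(lbug_times, oss_lines, window=timedelta(seconds=60)):
    """Find OSS eviction messages close in time to a client LBUG.

    lbug_times maps client IP -> datetime of its LBUG.
    Returns a list of (client_ip, oss_host, seconds_relative_to_lbug)."""
    hits = []
    for line in oss_lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, host, msg = parsed
        if "failed to reply to blocking AST" not in msg:
            continue
        m = NID_RE.search(msg)
        if m is None:
            continue
        ip = m.group(1)
        lbug = lbug_times.get(ip)
        if lbug is not None and abs(ts - lbug) <= window:
            hits.append((ip, host, (ts - lbug).total_seconds()))
    return hits


if __name__ == "__main__":
    # LBUG times taken from the client list in this ticket.
    lbugs = {
        "192.168.1.126": datetime(2015, 12, 9, 21, 41, 59),
        "192.168.1.127": datetime(2015, 12, 9, 21, 41, 40),
    }
    sample = [
        "lola-2.log:Dec 9 21:41:48 lola-2 kernel: LustreError: "
        "28806:0:(ldlm_lockd.c:689:ldlm_handle_ast_error()) ### client "
        "(nid 192.168.1.126@o2ib100) failed to reply to blocking AST "
        "(req status 0 rc -11), evict it"
    ]
    for ip, host, delta in evictions_near(lbugs, sample):
        print(f"{host}: eviction of {ip} at {delta:+.0f}s relative to its LBUG")
```

For the sample line, the eviction logged on lola-2 falls 11 seconds before the time listed for the LBUG on lola-26. In practice the `sample` list would be replaced by the lines of the attached OSS messages files.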
Attached files:
- lola-26,27 - messages, console, vmcore-dmesg.txt files
- lola-2,3 - messages, console files
Issue Links
- is duplicated by: LU-14780 LustreError: 4936:0:(file.c:4985:ll_layout_lock_set()) LBUG (Resolved)