[LU-1596] mds crash after recovery finished Created: 03/Jul/12 Updated: 20/Sep/12 Resolved: 20/Sep/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Mahmoud Hanafi | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2.1.2 server |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6368 |
| Description |
|
See attached console logs. The MDS went through recovery but didn't finish it and restarted recovery. At the end of the second recovery cycle it crashed; the crash is at line 1252. |
| Comments |
| Comment by Peter Jones [ 05/Jul/12 ] |
|
Lai Could you please look into this one? Thanks Peter |
| Comment by Lai Siyao [ 05/Jul/12 ] |
|
The log shows a BUG on pid 5450:

BUG: scheduling while atomic: mdt_33/5450/0xffff8800

The backtrace for 5450 is as below:

mdt_33 D 0000000000000246 0 5450 2 0x3005ff40
 ffff8804432ca750 0000000000000018 ffff88061feaa378 0000000000000246
 ffff8804432ca780 ffffffffa00023ac ffff880621185400 ffff88042f278690
 ffff8804432ca7e0 ffff88061ffe8f20 ffff8804432ca7b0 ffffffff81254f74
Call Trace:
 [<ffffffffa00023ac>] ? dm_dispatch_request+0x3c/0x70 [dm_mod]
 [<ffffffff81254f74>] ? blk_unplug+0x34/0x70
 [<ffffffff811a7c90>] ? sync_buffer+0x0/0x50
 [<ffffffff81520d83>] ? printk+0x41/0x46
 [<ffffffff81056681>] ? __schedule_bug+0x41/0x70
 [<ffffffff81521978>] ? thread_return+0x638/0x760
 [<ffffffffa000420c>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
 [<ffffffff8109ac19>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff811a7c90>] ? sync_buffer+0x0/0x50
 [<ffffffff81521b13>] ? io_schedule+0x73/0xc0
 [<ffffffff811a7cd0>] ? sync_buffer+0x40/0x50
 [<ffffffff815224cf>] ? __wait_on_bit+0x5f/0x90
 [<ffffffff811a7c90>] ? sync_buffer+0x0/0x50
 [<ffffffff81522578>] ? out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81090030>] ? wake_bit_function+0x0/0x50
 [<ffffffff811a7c86>] ? __wait_on_buffer+0x26/0x30
 [<ffffffffa0abb63c>] ? ldiskfs_mb_init_cache+0x24c/0xa30 [ldiskfs]
 [<ffffffffa0abbf3e>] ? ldiskfs_mb_init_group+0x11e/0x210 [ldiskfs]
 [<ffffffffa0abc0fd>] ? ldiskfs_mb_good_group+0xcd/0x110 [ldiskfs]
 [<ffffffffa0abf0bb>] ? ldiskfs_mb_regular_allocator+0x19b/0x410 [ldiskfs]
 [<ffffffff8152286e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0abf762>] ? ldiskfs_mb_new_blocks+0x432/0x660 [ldiskfs]
 [<ffffffffa0aa61fe>] ? ldiskfs_ext_find_extent+0x2ce/0x330 [ldiskfs]
 [<ffffffffa0aa93b4>] ? ldiskfs_ext_get_blocks+0x1114/0x1a10 [ldiskfs]
 [<ffffffffa0a7f541>] ? __jbd2_journal_file_buffer+0xd1/0x220 [jbd2]
 [<ffffffffa0a8075f>] ? jbd2_journal_dirty_metadata+0xff/0x150 [jbd2]
 [<ffffffffa0aa3f3b>] ? __ldiskfs_handle_dirty_metadata+0x7b/0x100 [ldiskfs]
 [<ffffffffa0ab1665>] ? ldiskfs_get_blocks+0xf5/0x2a0 [ldiskfs]
 [<ffffffffa0ad8048>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0ab62b9>] ? ldiskfs_getblk+0x79/0x1f0 [ldiskfs]
 [<ffffffffa0ab6448>] ? ldiskfs_bread+0x18/0x80 [ldiskfs]
 [<ffffffffa0b5e2a7>] ? fsfilt_ldiskfs_write_handle+0x147/0x340 [fsfilt_ldiskfs]
 [<ffffffffa0b5e55c>] ? fsfilt_ldiskfs_write_record+0xbc/0x1d0 [fsfilt_ldiskfs]
 [<ffffffffa06382ec>] ? llog_lvfs_write_blob+0x2bc/0x460 [obdclass]
 [<ffffffffa06338b8>] ? llog_init_handle+0xa18/0xa70 [obdclass]
 [<ffffffffa0639b5a>] ? llog_lvfs_write_rec+0x40a/0xf00 [obdclass]
 [<ffffffffa0636a65>] ? llog_cat_add_rec+0xf5/0x840 [obdclass]
 [<ffffffffa063d306>] ? llog_obd_origin_add+0x56/0x190 [obdclass]
 [<ffffffffa063d4e1>] ? llog_add+0xa1/0x3c0 [obdclass]
 [<ffffffff8115c66a>] ? kmem_getpages+0xba/0x170
 [<ffffffffa09610cc>] ? lov_llog_origin_add+0xcc/0x5d0 [lov]
 [<ffffffffa063d4e1>] ? llog_add+0xa1/0x3c0 [obdclass]
 [<ffffffffa0b7449e>] ? mds_llog_origin_add+0xae/0x2e0 [mds]
 [<ffffffffa0ab0651>] ? __ldiskfs_get_inode_loc+0xf1/0x3b0 [ldiskfs]
 [<ffffffffa0995d0b>] ? lov_tgt_maxbytes+0x5b/0xb0 [lov]
 [<ffffffffa063d4e1>] ? llog_add+0xa1/0x3c0 [obdclass]
 [<ffffffffa0b74b12>] ? mds_llog_add_unlink+0x162/0x520 [mds]
 [<ffffffffa0b75206>] ? mds_log_op_unlink+0x196/0x9a0 [mds]
 [<ffffffffa0b9d7de>] ? mdd_unlink_log+0x4e/0x100 [mdd]
 [<ffffffffa0b973ab>] ? mdd_attr_get_internal+0x7ab/0xb10 [mdd]
 [<ffffffffa0b91a8e>] ? mdd_object_kill+0x14e/0x1b0 [mdd]
 [<ffffffffa0bab8ce>] ? mdd_finish_unlink+0x20e/0x2c0 [mdd]
 [<ffffffffa0baa780>] ? __mdd_ref_del+0x40/0xc0 [mdd]
 [<ffffffffa0bb704c>] ? mdd_rename+0x1ffc/0x2240 [mdd]
 [<ffffffffa0b96dcb>] ? mdd_attr_get_internal+0x1cb/0xb10 [mdd]
 [<ffffffffa0589caf>] ? cfs_hash_bd_from_key+0x3f/0xc0 [libcfs]
 [<ffffffffa0c7f219>] ? cmm_mode_get+0x109/0x320 [cmm]
 [<ffffffffa0c7fd3a>] ? cml_rename+0x33a/0xbb0 [cmm]
 [<ffffffffa058a337>] ? cfs_hash_bd_get+0x37/0x90 [libcfs]
 [<ffffffffa0c7f49d>] ? cmm_is_subdir+0x6d/0x2f0 [cmm]
 [<ffffffffa067d8e6>] ? lu_object_put+0x86/0x210 [obdclass]
 [<ffffffffa0c09426>] ? mdt_reint_rename+0x1fa6/0x2400 [mdt]
 [<ffffffffa05909db>] ? upcall_cache_get_entry+0x28b/0xa14 [libcfs]
 [<ffffffffa0c0197f>] ? mdt_rename_unpack+0x46f/0x6c0 [mdt]
 [<ffffffffa0bb96c6>] ? md_ucred+0x26/0x60 [mdd]
 [<ffffffffa0c01c0f>] ? mdt_reint_rec+0x3f/0x100 [mdt]
 [<ffffffffa07726e4>] ? lustre_msg_get_flags+0x34/0xa0 [ptlrpc]
 [<ffffffffa0bfa004>] ? mdt_reint_internal+0x6d4/0x9f0 [mdt]
 [<ffffffffa0bef9f6>] ? mdt_reint_opcode+0x96/0x160 [mdt]
 [<ffffffffa0bfa36c>] ? mdt_reint+0x4c/0x120 [mdt]
 [<ffffffffa07721b8>] ? lustre_msg_check_version+0xc8/0xe0 [ptlrpc]
 [<ffffffffa0bedc65>] ? mdt_handle_common+0x8d5/0x1810 [mdt]
 [<ffffffffa076fe44>] ? lustre_msg_get_opc+0x94/0x100 [ptlrpc]
 [<ffffffffa0beec75>] ? mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa0780b89>] ? ptlrpc_main+0xbb9/0x1990 [ptlrpc]
 [<ffffffffa077ffd0>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
 [<ffffffff8100c14a>] ? child_rip+0xa/0x20
 [<ffffffffa077ffd0>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
 [<ffffffffa077ffd0>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
 [<ffffffff8100c140>] ? child_rip+0x0/0x20

The backtrace looks normal (though the stack is a bit deep), but it reported scheduling while atomic. Alex, could you give some hint? |
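For context on the message itself: "scheduling while atomic" is printed when the scheduler runs while the task's preempt count is non-zero, and the third field of the message is that preempt count. Below is a tiny userspace model of the check (illustrative only, not kernel or Lustre code; the struct and function names are made up), fed with the values from the log above:

    /* Illustrative userspace model only; not kernel or Lustre code.
     * The preempt count lives in struct thread_info at the bottom of the
     * task's kernel stack; if it is non-zero (or garbage) when the
     * scheduler runs, the kernel prints the message seen in the log. */
    #include <stdio.h>

    struct thread_info_model {      /* stand-in for the kernel's thread_info */
            unsigned int flags;
            int preempt_count;      /* 0 means the task is allowed to sleep */
    };

    static void schedule_model(const char *comm, int pid,
                               const struct thread_info_model *ti)
    {
            if (ti->preempt_count != 0)     /* the in_atomic()-style check */
                    printf("BUG: scheduling while atomic: %s/%d/0x%08x\n",
                           comm, pid, (unsigned int)ti->preempt_count);
    }

    int main(void)
    {
            /* Values taken from the log above. */
            struct thread_info_model ti = { .flags = 0,
                                            .preempt_count = (int)0xffff8800 };
            schedule_model("mdt_33", 5450, &ti);
            return 0;
    }

A value like 0xffff8800 does not look like a plausible preempt nesting depth, which is consistent with the later conclusion in this ticket that the field was overwritten.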
| Comment by Jay Lan (Inactive) [ 08/Aug/12 ] |
|
We hit this bug again last night. Here is what was shown on the console:

Lustre: MGS: haven't heard from client af2ee3e4-cf7b-2e45-4650-a3f788979f14 (at 10.151.6.3@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff8811f27af000, cur 1344393811 expire 1344393661 last 1344393584

The crash analysis showed the stack trace of the task that caused the crash: — <NMI exception stack> — |
| Comment by Jay Lan (Inactive) [ 08/Aug/12 ] |
|
Our lustre git source is at ... The original report was built on tag 2.1.2-1nasS; the crash last night was built on tag 2.1.2-2nasS. |
| Comment by Jay Lan (Inactive) [ 08/Aug/12 ] |
|
Can we modify the summary to "BUG: scheduling while atomic"? The current summary does not describe last night's crash. |
| Comment by Mahmoud Hanafi [ 09/Aug/12 ] |
|
We hit this again. Can we increase the priority of this to critical? |
| Comment by Lai Siyao [ 09/Aug/12 ] |
|
There is a log message reporting -ENOSPC:

LustreError: 9602:0:(llog_cat.c:298:llog_cat_add_rec()) llog_write_rec -28: lh=ffff881032b9d8c0

Could you verify whether this is actually the case on the MDS? |
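As a side note, the -28 in that message is the negative errno; a trivial userspace check (not Lustre code) confirms it maps to ENOSPC:

    /* Not Lustre code: just confirms that errno 28 is ENOSPC on Linux. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            printf("ENOSPC = %d (%s)\n", ENOSPC, strerror(ENOSPC));
            /* Prints: ENOSPC = 28 (No space left on device) */
            return 0;
    }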
| Comment by Jay Lan (Inactive) [ 09/Aug/12 ] |
|
I saw the -ENOSPC LustreError immediately before the crash. However, when I analyzed the vmcore with crash, "kmem -i" showed the MDS had plenty of memory for the system:

                 PAGES        TOTAL      PERCENTAGE
   TOTAL SWAP   500013        1.9 GB        ----

Did we run out of some preallocated memory? Hmmm... |
| Comment by Jay Lan (Inactive) [ 09/Aug/12 ] |
|
On the other hand, -ENOSPC (or -28) was not reported in last night's crash (the second crash in two consecutive nights), so it should not be the reason. |
| Comment by Oleg Drokin [ 09/Aug/12 ] |
|
The problem at hand seems to be a stack overflow, so the patches from LU-969 should help. |
| Comment by Oleg Drokin [ 09/Aug/12 ] |
|
so the three patches you need are: |
| Comment by Peter Jones [ 09/Aug/12 ] |
|
Following up on the action item from the meeting earlier: I spoke to Oleg about whether it was possible to predict when stack overflows would occur, and he confirmed that it is a combination of independent factors that results in a stack overflow, so it is not possible to anticipate one ahead of time. |
| Comment by Jay Lan (Inactive) [ 09/Aug/12 ] |
|
Oleg, how do I decide it was a stack overflow?

PID: 9115 TASK: ffff8811da902a80 CPU: 9 COMMAND: "mdt_268"

The "bt" command showed the stack started at ..., and struct task_struct { ... }. The RSP was ffff8811da8aa750 when interrupted, so I assumed it did not overflow. And, when the PageFault EXCEPTION hit the IRQ stack, ...

How do I determine I have had a stack overflow? Is there a command in "crash"? Thanks! |
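One rough way to turn the quoted RSP into a stack-usage estimate is sketched below. It assumes an 8 KiB THREAD_SIZE and a THREAD_SIZE-aligned stack region, as on x86_64 kernels of this generation, and it only shows the headroom at the sampled RSP, not the deepest point the call chain reached:

    /* Sketch only: estimates kernel-stack headroom from the RSP quoted above.
     * Assumes THREAD_SIZE is 8 KiB and the stack region is THREAD_SIZE-aligned;
     * struct thread_info occupies the lowest bytes of that region, so the
     * headroom below RSP is all that protects it from being overwritten. */
    #include <stdint.h>
    #include <stdio.h>

    #define THREAD_SIZE 0x2000UL    /* 8 KiB, assumed for this kernel */

    int main(void)
    {
            uint64_t rsp  = 0xffff8811da8aa750UL;       /* RSP from the crash dump above */
            uint64_t base = rsp & ~(THREAD_SIZE - 1);   /* assumed bottom of the stack region */

            printf("stack base ~ 0x%llx, headroom below RSP ~ %llu bytes\n",
                   (unsigned long long)base, (unsigned long long)(rsp - base));
            return 0;
    }

With those assumptions the headroom at the sampled RSP is under 2 KiB, so a call chain only slightly deeper than the one sampled could plausibly reach thread_info.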
| Comment by Oleg Drokin [ 09/Aug/12 ] |
|
The stack trace is pretty long, as you can see, and that's not all of it; there was more as we dipped into the device driver, and that's what stepped on the struct thread_info enough to just overwrite the flags but not spill over too much into other spaces. Also note the crash happened in task_rq_lock in the interrupt, after the warning was already printed by the interrupt that happened to fire; the task rq lock happened to be garbage already, which further reinforces this theory. There is no command in crash to catch a stack overflow that managed not to crash while in overflow and returned back inside of the stack, merely corrupting thread_info before retracting, but you can dump the struct thread_info content and see that a lot of it is garbage, I guess. |
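To make the geometry of that explanation concrete, here is a small userspace model (not kernel code; the size and field names are illustrative) of a downward-growing stack colliding with the thread_info at the bottom of the same region:

    /* Illustrative model, not kernel code: thread_info sits at the lowest
     * address of the stack region and the stack grows down toward it, so a
     * near-overflow corrupts thread_info fields (flags, preempt_count) while
     * the task may still return without crashing outright. */
    #include <stdio.h>
    #include <string.h>

    #define STACK_BYTES 8192        /* models an 8 KiB kernel stack region */

    union stack_region {
            struct {
                    unsigned int flags;
                    int preempt_count;
            } thread_info;                   /* occupies the lowest bytes */
            unsigned char bytes[STACK_BYTES];
    };

    int main(void)
    {
            union stack_region s;
            memset(&s, 0, sizeof(s));

            /* Model a call chain deep enough to consume the whole region:
             * the thread_info fields turn into garbage, which is what
             * dumping struct thread_info in crash would show afterwards. */
            size_t used = STACK_BYTES;                  /* hypothetical depth */
            memset(s.bytes + (STACK_BYTES - used), 0xff, used);

            printf("flags=0x%x preempt_count=0x%x\n",
                   s.thread_info.flags,
                   (unsigned int)s.thread_info.preempt_count);
            return 0;
    }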
| Comment by Jay Lan (Inactive) [ 13/Aug/12 ] |
|
Since the three patches Oleg suggested on Aug 9 were quite extensive, would it be OK if we just apply the patches to the MDS only? Or would the patches require that the OSSes also run the same set of patches? |
| Comment by Oleg Drokin [ 16/Aug/12 ] |
|
You can apply them to the MDS only if that's where you experience the problem, but they would be beneficial everywhere; we have seen stack overflows on clients too, usually when re-exporting NFS. BTW, I now have a combined patch instead of the three I referenced, if that makes your life easier: http://review.whamcloud.com/3623 |
| Comment by Jay Lan (Inactive) [ 20/Sep/12 ] |
|
We have run with patch of |
| Comment by Peter Jones [ 20/Sep/12 ] |
|
duplicate of |