[LU-2102] changelog_user_init_cb()) ASSERTION( rec->cur_hdr.lrh_type == CHANGELOG_USER_REC ) Created: 07/Oct/12 Updated: 19/Apr/13 Resolved: 19/Apr/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oleg Drokin | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
single node with 4 or 8 cores |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4390 |
| Description |
|
Two of my nodes running sanity in a loop crashed with this overnight. [ 4343.790166] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: [ 4343.832370] Lustre: MGC192.168.10.210@tcp: Reactivating import [ 4343.834877] Lustre: Found index 0 for lustre-MDT0000, updating log [ 4343.841331] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000 [ 4343.847226] LustreError: 20504:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2 [ 4343.874879] Lustre: lustre-MDT0000: used disk, loading [ 4343.875534] LustreError: 20552:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context [ 4344.061652] Lustre: 20552:0:(mdt_lproc.c:418:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /home/green/git/lustre-release/lustre/utils/l_getidentity [ 4344.072933] Lustre: lustre-MDT0000: Temporarily refusing client connection from 0@lo [ 4344.073894] LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11 [ 4344.080580] LustreError: 20504:0:(mdd_device.c:219:changelog_user_init_cb()) ASSERTION( rec->cur_hdr.lrh_type == CHANGELOG_USER_REC ) failed: [ 4344.081572] LustreError: 20504:0:(mdd_device.c:219:changelog_user_init_cb()) LBUG [ 4344.082363] Pid: 20504, comm: mount.lustre [ 4344.082776] [ 4344.082776] Call Trace: [ 4344.083461] [<ffffffffa0b24915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [ 4344.083980] [<ffffffffa0b24f27>] lbug_with_loc+0x47/0xb0 [libcfs] [ 4344.084564] [<ffffffffa05a9d57>] changelog_user_init_cb+0x127/0x170 [mdd] [ 4344.086120] [<ffffffffa044b568>] llog_reverse_process+0x5d8/0x9c0 [obdclass] [ 4344.086707] [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd] [ 4344.087253] [<ffffffffa044e18e>] llog_cat_reverse_process_cb+0x17e/0x260 [obdclass] [ 4344.088137] [<ffffffffa044b568>] llog_reverse_process+0x5d8/0x9c0 [obdclass] [ 4344.088699] [<ffffffffa044e010>] ? llog_cat_reverse_process_cb+0x0/0x260 [obdclass] [ 4344.089540] [<ffffffffa044da30>] ? cat_cancel_cb+0x0/0x5e0 [obdclass] [ 4344.090117] [<ffffffffa044cdd8>] llog_cat_reverse_process+0x78/0x260 [obdclass] [ 4344.090802] [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd] [ 4344.091060] [<ffffffffa044ca54>] ? llog_process+0x14/0x20 [obdclass] [ 4344.091301] [<ffffffffa05af69a>] mdd_prepare+0xe2a/0x1140 [mdd] [ 4344.091948] [<ffffffffa0c147da>] mdt_prepare+0x5a/0x14a0 [mdt] [ 4344.092209] [<ffffffffa04a2ade>] server_start_targets+0x147e/0x1d90 [obdclass] [ 4344.092599] [<ffffffffa048e050>] ? class_config_llog_handler+0x0/0x1800 [obdclass] [ 4344.092967] [<ffffffffa04a4798>] lustre_fill_super+0x13a8/0x1af0 [obdclass] [ 4344.093195] [<ffffffff8117d060>] ? set_anon_super+0x0/0x110 [ 4344.093411] [<ffffffffa04a33f0>] ? lustre_fill_super+0x0/0x1af0 [obdclass] [ 4344.093660] [<ffffffff8117e4cf>] get_sb_nodev+0x5f/0xa0 [ 4344.093874] [<ffffffffa048f955>] lustre_get_sb+0x25/0x30 [obdclass] [ 4344.094090] [<ffffffff8117e12b>] vfs_kern_mount+0x7b/0x1b0 [ 4344.094293] [<ffffffff8117e2d2>] do_kern_mount+0x52/0x130 [ 4344.094504] [<ffffffff8119c992>] do_mount+0x2d2/0x8c0 [ 4344.094714] [<ffffffff8119d010>] sys_mount+0x90/0xe0 [ 4344.094911] [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b [ 4344.095118] [ 4344.096049] Kernel panic - not syncing: LBUG [ 4344.096051] Pid: 20504, comm: mount.lustre Not tainted 2.6.32-debug #6 [ 4344.096053] Call Trace: [ 4344.096059] [<ffffffff814f75e4>] ? panic+0xa0/0x168 [ 4344.096069] [<ffffffffa0b24f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [ 4344.096076] [<ffffffffa05a9d57>] ? changelog_user_init_cb+0x127/0x170 [mdd] [ 4344.096092] [<ffffffffa044b568>] ? llog_reverse_process+0x5d8/0x9c0 [obdclass] [ 4344.096097] [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd] [ 4344.096112] [<ffffffffa044e18e>] ? llog_cat_reverse_process_cb+0x17e/0x260 [obdclass] [ 4344.096127] [<ffffffffa044b568>] ? llog_reverse_process+0x5d8/0x9c0 [obdclass] [ 4344.096142] [<ffffffffa044e010>] ? llog_cat_reverse_process_cb+0x0/0x260 [obdclass] [ 4344.096157] [<ffffffffa044da30>] ? cat_cancel_cb+0x0/0x5e0 [obdclass] [ 4344.096171] [<ffffffffa044cdd8>] ? llog_cat_reverse_process+0x78/0x260 [obdclass] [ 4344.096176] [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd] [ 4344.096191] [<ffffffffa044ca54>] ? llog_process+0x14/0x20 [obdclass] [ 4344.096196] [<ffffffffa05af69a>] ? mdd_prepare+0xe2a/0x1140 [mdd] [ 4344.096205] [<ffffffffa0c147da>] ? mdt_prepare+0x5a/0x14a0 [mdt] [ 4344.096223] [<ffffffffa04a2ade>] ? server_start_targets+0x147e/0x1d90 [obdclass] [ 4344.096241] [<ffffffffa048e050>] ? class_config_llog_handler+0x0/0x1800 [obdclass] [ 4344.096258] [<ffffffffa04a4798>] ? lustre_fill_super+0x13a8/0x1af0 [obdclass] [ 4344.096260] [<ffffffff8117d060>] ? set_anon_super+0x0/0x110 [ 4344.096276] [<ffffffffa04a33f0>] ? lustre_fill_super+0x0/0x1af0 [obdclass] [ 4344.096278] [<ffffffff8117e4cf>] ? get_sb_nodev+0x5f/0xa0 [ 4344.096295] [<ffffffffa048f955>] ? lustre_get_sb+0x25/0x30 [obdclass] [ 4344.096297] [<ffffffff8117e12b>] ? vfs_kern_mount+0x7b/0x1b0 [ 4344.096299] [<ffffffff8117e2d2>] ? do_kern_mount+0x52/0x130 [ 4344.096301] [<ffffffff8119c992>] ? do_mount+0x2d2/0x8c0 [ 4344.096303] [<ffffffff8119d010>] ? sys_mount+0x90/0xe0 [ 4344.096306] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b |
| Comments |
| Comment by Mikhail Pershin [ 08/Oct/12 ] |
|
first though was this is padding record but such records are not setting bits in llog header and not processed as result. So I am going to replace ASSERT with ASSERTF() and put more debug info there at first |
| Comment by Oleg Drokin [ 08/Oct/12 ] |
|
Ok, just had another crash that Mike advises is related to this one: [21857.010264] Lustre: DEBUG MARKER: == replay-single test 52: time out lock replay (3764) == 09:25:53 (1349702753) [21867.957058] LDISKFS-fs (loop0): recovery complete [21867.958976] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: [21867.985231] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88005ce79bf0 x1415264023371217/t0(0) o101->MGC192.168.10.210@tcp@0@lo:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1 [21867.986762] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) Skipped 5 previous similar messages [21867.993177] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed: [21867.994035] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) LBUG [21867.994525] Pid: 18458, comm: ll_mgs_0001 [21867.994842] [21867.994843] Call Trace: [21867.995536] [<ffffffffa0c6d915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [21867.996108] [<ffffffffa0c6df27>] lbug_with_loc+0x47/0xb0 [libcfs] [21867.996653] [<ffffffffa0507ac5>] llog_osd_next_block+0xa55/0xa60 [obdclass] [21867.996708] [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc] [21867.996754] [<ffffffffa068845c>] llog_origin_handle_next_block+0x55c/0x780 [ptlrpc] [21867.996766] [<ffffffffa0a6df73>] mgs_handle+0xb13/0x11e0 [mgs] [21867.996782] [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [21867.996826] [<ffffffffa067a883>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc] [21867.996840] [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [21867.996882] [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [21867.996887] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [21867.996932] [<ffffffffa067d41a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc] [21867.996965] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.996968] [<ffffffff8100c14a>] child_rip+0xa/0x20 [21867.997000] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.997031] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.997034] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [21867.997035] [21867.998239] Kernel panic - not syncing: LBUG [21867.998242] Pid: 18458, comm: ll_mgs_0001 Not tainted 2.6.32-debug #6 [21867.998244] Call Trace: [21867.998252] [<ffffffff814f75e4>] ? panic+0xa0/0x168 [21867.998267] [<ffffffffa0c6df7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [21867.998305] [<ffffffffa0507ac5>] ? llog_osd_next_block+0xa55/0xa60 [obdclass] [21867.998349] [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc] [21867.998393] [<ffffffffa068845c>] ? llog_origin_handle_next_block+0x55c/0x780 [ptlrpc] [21867.998404] [<ffffffffa0a6df73>] ? mgs_handle+0xb13/0x11e0 [mgs] [21867.998420] [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [21867.998467] [<ffffffffa067a883>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc] [21867.998478] [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [21867.998511] [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [21867.998514] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [21867.998546] [<ffffffffa067d41a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc] [21867.998577] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.998580] [<ffffffff8100c14a>] ? child_rip+0xa/0x20 [21867.998611] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.998642] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.998645] [<ffffffff8100c140>] ? child_rip+0x0/0x20 |
| Comment by Oleg Drokin [ 08/Oct/12 ] |
|
Ok, I think I have a solid reproducer for the original issue. It seems it has to do with a full OST. |
| Comment by Oleg Drokin [ 08/Oct/12 ] |
|
Ok, here's even easier case for you. |
| Comment by Oleg Drokin [ 08/Oct/12 ] |
|
I just confirmed that when this bad MDT image was produced, OST0 had 0 space: /tmp/lustre-ost1 184M 184M 0 100% /mnt/ost1 /tmp/lustre-ost2 184M 55M 119M 32% /mnt/ost2 This is mounted as ldiskfs. Apparently there's some bed interaction between test 64 and test 65j or something that I file a separate bug for, but importane of this info here is that full ost seems to cause MDS to enter this state. |
| Comment by Mikhail Pershin [ 09/Oct/12 ] |
|
Fix for changelog issue: |
| Comment by Mikhail Pershin [ 10/Oct/12 ] |
|
merged |