[LU-2102] changelog_user_init_cb()) ASSERTION( rec->cur_hdr.lrh_type == CHANGELOG_USER_REC ) Created: 07/Oct/12  Updated: 19/Apr/13  Resolved: 19/Apr/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

single node with 4 or 8 cores


Attachments: File lustre-mdt1-bad.bz2    
Severity: 3
Rank (Obsolete): 4390

 Description   

Two of my nodes running sanity in a loop crashed with this overnight.
I have a crashdump from one of the occurrences if somebody needs something from there, but tell me soon while I still have the modules and vmlinux for it.

[ 4343.790166] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[ 4343.832370] Lustre: MGC192.168.10.210@tcp: Reactivating import
[ 4343.834877] Lustre: Found index 0 for lustre-MDT0000, updating log
[ 4343.841331] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
[ 4343.847226] LustreError: 20504:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
[ 4343.874879] Lustre: lustre-MDT0000: used disk, loading
[ 4343.875534] LustreError: 20552:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
[ 4344.061652] Lustre: 20552:0:(mdt_lproc.c:418:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /home/green/git/lustre-release/lustre/utils/l_getidentity
[ 4344.072933] Lustre: lustre-MDT0000: Temporarily refusing client connection from 0@lo
[ 4344.073894] LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
[ 4344.080580] LustreError: 20504:0:(mdd_device.c:219:changelog_user_init_cb()) ASSERTION( rec->cur_hdr.lrh_type == CHANGELOG_USER_REC ) failed: 
[ 4344.081572] LustreError: 20504:0:(mdd_device.c:219:changelog_user_init_cb()) LBUG
[ 4344.082363] Pid: 20504, comm: mount.lustre
[ 4344.082776] 
[ 4344.082776] Call Trace:
[ 4344.083461]  [<ffffffffa0b24915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[ 4344.083980]  [<ffffffffa0b24f27>] lbug_with_loc+0x47/0xb0 [libcfs]
[ 4344.084564]  [<ffffffffa05a9d57>] changelog_user_init_cb+0x127/0x170 [mdd]
[ 4344.086120]  [<ffffffffa044b568>] llog_reverse_process+0x5d8/0x9c0 [obdclass]
[ 4344.086707]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
[ 4344.087253]  [<ffffffffa044e18e>] llog_cat_reverse_process_cb+0x17e/0x260 [obdclass]
[ 4344.088137]  [<ffffffffa044b568>] llog_reverse_process+0x5d8/0x9c0 [obdclass]
[ 4344.088699]  [<ffffffffa044e010>] ? llog_cat_reverse_process_cb+0x0/0x260 [obdclass]
[ 4344.089540]  [<ffffffffa044da30>] ? cat_cancel_cb+0x0/0x5e0 [obdclass]
[ 4344.090117]  [<ffffffffa044cdd8>] llog_cat_reverse_process+0x78/0x260 [obdclass]
[ 4344.090802]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
[ 4344.091060]  [<ffffffffa044ca54>] ? llog_process+0x14/0x20 [obdclass]
[ 4344.091301]  [<ffffffffa05af69a>] mdd_prepare+0xe2a/0x1140 [mdd]
[ 4344.091948]  [<ffffffffa0c147da>] mdt_prepare+0x5a/0x14a0 [mdt]
[ 4344.092209]  [<ffffffffa04a2ade>] server_start_targets+0x147e/0x1d90 [obdclass]
[ 4344.092599]  [<ffffffffa048e050>] ? class_config_llog_handler+0x0/0x1800 [obdclass]
[ 4344.092967]  [<ffffffffa04a4798>] lustre_fill_super+0x13a8/0x1af0 [obdclass]
[ 4344.093195]  [<ffffffff8117d060>] ? set_anon_super+0x0/0x110
[ 4344.093411]  [<ffffffffa04a33f0>] ? lustre_fill_super+0x0/0x1af0 [obdclass]
[ 4344.093660]  [<ffffffff8117e4cf>] get_sb_nodev+0x5f/0xa0
[ 4344.093874]  [<ffffffffa048f955>] lustre_get_sb+0x25/0x30 [obdclass]
[ 4344.094090]  [<ffffffff8117e12b>] vfs_kern_mount+0x7b/0x1b0
[ 4344.094293]  [<ffffffff8117e2d2>] do_kern_mount+0x52/0x130
[ 4344.094504]  [<ffffffff8119c992>] do_mount+0x2d2/0x8c0
[ 4344.094714]  [<ffffffff8119d010>] sys_mount+0x90/0xe0
[ 4344.094911]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
[ 4344.095118] 
[ 4344.096049] Kernel panic - not syncing: LBUG
[ 4344.096051] Pid: 20504, comm: mount.lustre Not tainted 2.6.32-debug #6
[ 4344.096053] Call Trace:
[ 4344.096059]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[ 4344.096069]  [<ffffffffa0b24f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[ 4344.096076]  [<ffffffffa05a9d57>] ? changelog_user_init_cb+0x127/0x170 [mdd]
[ 4344.096092]  [<ffffffffa044b568>] ? llog_reverse_process+0x5d8/0x9c0 [obdclass]
[ 4344.096097]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
[ 4344.096112]  [<ffffffffa044e18e>] ? llog_cat_reverse_process_cb+0x17e/0x260 [obdclass]
[ 4344.096127]  [<ffffffffa044b568>] ? llog_reverse_process+0x5d8/0x9c0 [obdclass]
[ 4344.096142]  [<ffffffffa044e010>] ? llog_cat_reverse_process_cb+0x0/0x260 [obdclass]
[ 4344.096157]  [<ffffffffa044da30>] ? cat_cancel_cb+0x0/0x5e0 [obdclass]
[ 4344.096171]  [<ffffffffa044cdd8>] ? llog_cat_reverse_process+0x78/0x260 [obdclass]
[ 4344.096176]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
[ 4344.096191]  [<ffffffffa044ca54>] ? llog_process+0x14/0x20 [obdclass]
[ 4344.096196]  [<ffffffffa05af69a>] ? mdd_prepare+0xe2a/0x1140 [mdd]
[ 4344.096205]  [<ffffffffa0c147da>] ? mdt_prepare+0x5a/0x14a0 [mdt]
[ 4344.096223]  [<ffffffffa04a2ade>] ? server_start_targets+0x147e/0x1d90 [obdclass]
[ 4344.096241]  [<ffffffffa048e050>] ? class_config_llog_handler+0x0/0x1800 [obdclass]
[ 4344.096258]  [<ffffffffa04a4798>] ? lustre_fill_super+0x13a8/0x1af0 [obdclass]
[ 4344.096260]  [<ffffffff8117d060>] ? set_anon_super+0x0/0x110
[ 4344.096276]  [<ffffffffa04a33f0>] ? lustre_fill_super+0x0/0x1af0 [obdclass]
[ 4344.096278]  [<ffffffff8117e4cf>] ? get_sb_nodev+0x5f/0xa0
[ 4344.096295]  [<ffffffffa048f955>] ? lustre_get_sb+0x25/0x30 [obdclass]
[ 4344.096297]  [<ffffffff8117e12b>] ? vfs_kern_mount+0x7b/0x1b0
[ 4344.096299]  [<ffffffff8117e2d2>] ? do_kern_mount+0x52/0x130
[ 4344.096301]  [<ffffffff8119c992>] ? do_mount+0x2d2/0x8c0
[ 4344.096303]  [<ffffffff8119d010>] ? sys_mount+0x90/0xe0
[ 4344.096306]  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b


 Comments   
Comment by Mikhail Pershin [ 08/Oct/12 ]

My first thought was that this is a padding record, but such records do not set bits in the llog header bitmap and as a result are not processed. So I am going to replace the ASSERT with ASSERTF() and put more debug info there first.

Comment by Oleg Drokin [ 08/Oct/12 ]

Ok, just had another crash that Mike advises is related to this one:

[21857.010264] Lustre: DEBUG MARKER: == replay-single test 52: time out lock replay (3764) == 09:25:53 (1349702753)
[21867.957058] LDISKFS-fs (loop0): recovery complete
[21867.958976] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
[21867.985231] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff88005ce79bf0 x1415264023371217/t0(0) o101->MGC192.168.10.210@tcp@0@lo:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[21867.986762] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
[21867.993177] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed: 
[21867.994035] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) LBUG
[21867.994525] Pid: 18458, comm: ll_mgs_0001
[21867.994842] 
[21867.994843] Call Trace:
[21867.995536]  [<ffffffffa0c6d915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[21867.996108]  [<ffffffffa0c6df27>] lbug_with_loc+0x47/0xb0 [libcfs]
[21867.996653]  [<ffffffffa0507ac5>] llog_osd_next_block+0xa55/0xa60 [obdclass]
[21867.996708]  [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc]
[21867.996754]  [<ffffffffa068845c>] llog_origin_handle_next_block+0x55c/0x780 [ptlrpc]
[21867.996766]  [<ffffffffa0a6df73>] mgs_handle+0xb13/0x11e0 [mgs]
[21867.996782]  [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[21867.996826]  [<ffffffffa067a883>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[21867.996840]  [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[21867.996882]  [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[21867.996887]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[21867.996932]  [<ffffffffa067d41a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[21867.996965]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[21867.996968]  [<ffffffff8100c14a>] child_rip+0xa/0x20
[21867.997000]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[21867.997031]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[21867.997034]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
[21867.997035] 
[21867.998239] Kernel panic - not syncing: LBUG
[21867.998242] Pid: 18458, comm: ll_mgs_0001 Not tainted 2.6.32-debug #6
[21867.998244] Call Trace:
[21867.998252]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
[21867.998267]  [<ffffffffa0c6df7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[21867.998305]  [<ffffffffa0507ac5>] ? llog_osd_next_block+0xa55/0xa60 [obdclass]
[21867.998349]  [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc]
[21867.998393]  [<ffffffffa068845c>] ? llog_origin_handle_next_block+0x55c/0x780 [ptlrpc]
[21867.998404]  [<ffffffffa0a6df73>] ? mgs_handle+0xb13/0x11e0 [mgs]
[21867.998420]  [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
[21867.998467]  [<ffffffffa067a883>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
[21867.998478]  [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
[21867.998511]  [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
[21867.998514]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
[21867.998546]  [<ffffffffa067d41a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
[21867.998577]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[21867.998580]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
[21867.998611]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[21867.998642]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
[21867.998645]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
Comment by Oleg Drokin [ 08/Oct/12 ]

Ok, I think I have a solid reproducer for the original issue.

It seems to have to do with a full OST.
Some test (still tracking down which one) seems not to remove a big file after itself on error, which leads to a bunch of tests failing along the way (from around test 65) until the final test 900 remount, where it finally dies.

Comment by Oleg Drokin [ 08/Oct/12 ]

Ok, here's an even easier case for you.
bunzip2 this file and try to mount it as Lustre (it's an MDS fs image); you'll get the assertion failure right away.

Comment by Oleg Drokin [ 08/Oct/12 ]

I just confirmed that when this bad MDT image was produced, OST0 had no free space:

/tmp/lustre-ost1      184M  184M     0 100% /mnt/ost1
/tmp/lustre-ost2      184M   55M  119M  32% /mnt/ost2

This is mounted as ldiskfs.

Apparently there's some bad interaction between test 64 and test 65j or something, which I'll file a separate bug for, but the important point here is that a full OST seems to cause the MDS to enter this state.

Comment by Mikhail Pershin [ 09/Oct/12 ]

Fix for changelog issue:

http://review.whamcloud.com/4229

Comment by Mikhail Pershin [ 10/Oct/12 ]

merged

Generated at Sat Feb 10 01:22:23 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.