Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-2102

changelog_user_init_cb()) ASSERTION( rec->cur_hdr.lrh_type == CHANGELOG_USER_REC )

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.4.0
    • Lustre 2.4.0
    • None
    • single node with 4 or 8 cores
    • 3
    • 4390

    Description

      Two of my nodes running sanity in a loop crashed with this overnight.
      I have a crashdump from one of the occurrences if somebody needs something from there, but tell me soon while I still have the modules and vmlinux for it.

      [ 4343.790166] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
      [ 4343.832370] Lustre: MGC192.168.10.210@tcp: Reactivating import
      [ 4343.834877] Lustre: Found index 0 for lustre-MDT0000, updating log
      [ 4343.841331] Lustre: Modifying parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
      [ 4343.847226] LustreError: 20504:0:(mgc_request.c:248:do_config_log_add()) failed processing sptlrpc log: -2
      [ 4343.874879] Lustre: lustre-MDT0000: used disk, loading
      [ 4343.875534] LustreError: 20552:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
      [ 4344.061652] Lustre: 20552:0:(mdt_lproc.c:418:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /home/green/git/lustre-release/lustre/utils/l_getidentity
      [ 4344.072933] Lustre: lustre-MDT0000: Temporarily refusing client connection from 0@lo
      [ 4344.073894] LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -11
      [ 4344.080580] LustreError: 20504:0:(mdd_device.c:219:changelog_user_init_cb()) ASSERTION( rec->cur_hdr.lrh_type == CHANGELOG_USER_REC ) failed: 
      [ 4344.081572] LustreError: 20504:0:(mdd_device.c:219:changelog_user_init_cb()) LBUG
      [ 4344.082363] Pid: 20504, comm: mount.lustre
      [ 4344.082776] 
      [ 4344.082776] Call Trace:
      [ 4344.083461]  [<ffffffffa0b24915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [ 4344.083980]  [<ffffffffa0b24f27>] lbug_with_loc+0x47/0xb0 [libcfs]
      [ 4344.084564]  [<ffffffffa05a9d57>] changelog_user_init_cb+0x127/0x170 [mdd]
      [ 4344.086120]  [<ffffffffa044b568>] llog_reverse_process+0x5d8/0x9c0 [obdclass]
      [ 4344.086707]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
      [ 4344.087253]  [<ffffffffa044e18e>] llog_cat_reverse_process_cb+0x17e/0x260 [obdclass]
      [ 4344.088137]  [<ffffffffa044b568>] llog_reverse_process+0x5d8/0x9c0 [obdclass]
      [ 4344.088699]  [<ffffffffa044e010>] ? llog_cat_reverse_process_cb+0x0/0x260 [obdclass]
      [ 4344.089540]  [<ffffffffa044da30>] ? cat_cancel_cb+0x0/0x5e0 [obdclass]
      [ 4344.090117]  [<ffffffffa044cdd8>] llog_cat_reverse_process+0x78/0x260 [obdclass]
      [ 4344.090802]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
      [ 4344.091060]  [<ffffffffa044ca54>] ? llog_process+0x14/0x20 [obdclass]
      [ 4344.091301]  [<ffffffffa05af69a>] mdd_prepare+0xe2a/0x1140 [mdd]
      [ 4344.091948]  [<ffffffffa0c147da>] mdt_prepare+0x5a/0x14a0 [mdt]
      [ 4344.092209]  [<ffffffffa04a2ade>] server_start_targets+0x147e/0x1d90 [obdclass]
      [ 4344.092599]  [<ffffffffa048e050>] ? class_config_llog_handler+0x0/0x1800 [obdclass]
      [ 4344.092967]  [<ffffffffa04a4798>] lustre_fill_super+0x13a8/0x1af0 [obdclass]
      [ 4344.093195]  [<ffffffff8117d060>] ? set_anon_super+0x0/0x110
      [ 4344.093411]  [<ffffffffa04a33f0>] ? lustre_fill_super+0x0/0x1af0 [obdclass]
      [ 4344.093660]  [<ffffffff8117e4cf>] get_sb_nodev+0x5f/0xa0
      [ 4344.093874]  [<ffffffffa048f955>] lustre_get_sb+0x25/0x30 [obdclass]
      [ 4344.094090]  [<ffffffff8117e12b>] vfs_kern_mount+0x7b/0x1b0
      [ 4344.094293]  [<ffffffff8117e2d2>] do_kern_mount+0x52/0x130
      [ 4344.094504]  [<ffffffff8119c992>] do_mount+0x2d2/0x8c0
      [ 4344.094714]  [<ffffffff8119d010>] sys_mount+0x90/0xe0
      [ 4344.094911]  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      [ 4344.095118] 
      [ 4344.096049] Kernel panic - not syncing: LBUG
      [ 4344.096051] Pid: 20504, comm: mount.lustre Not tainted 2.6.32-debug #6
      [ 4344.096053] Call Trace:
      [ 4344.096059]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
      [ 4344.096069]  [<ffffffffa0b24f7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
      [ 4344.096076]  [<ffffffffa05a9d57>] ? changelog_user_init_cb+0x127/0x170 [mdd]
      [ 4344.096092]  [<ffffffffa044b568>] ? llog_reverse_process+0x5d8/0x9c0 [obdclass]
      [ 4344.096097]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
      [ 4344.096112]  [<ffffffffa044e18e>] ? llog_cat_reverse_process_cb+0x17e/0x260 [obdclass]
      [ 4344.096127]  [<ffffffffa044b568>] ? llog_reverse_process+0x5d8/0x9c0 [obdclass]
      [ 4344.096142]  [<ffffffffa044e010>] ? llog_cat_reverse_process_cb+0x0/0x260 [obdclass]
      [ 4344.096157]  [<ffffffffa044da30>] ? cat_cancel_cb+0x0/0x5e0 [obdclass]
      [ 4344.096171]  [<ffffffffa044cdd8>] ? llog_cat_reverse_process+0x78/0x260 [obdclass]
      [ 4344.096176]  [<ffffffffa05a9c30>] ? changelog_user_init_cb+0x0/0x170 [mdd]
      [ 4344.096191]  [<ffffffffa044ca54>] ? llog_process+0x14/0x20 [obdclass]
      [ 4344.096196]  [<ffffffffa05af69a>] ? mdd_prepare+0xe2a/0x1140 [mdd]
      [ 4344.096205]  [<ffffffffa0c147da>] ? mdt_prepare+0x5a/0x14a0 [mdt]
      [ 4344.096223]  [<ffffffffa04a2ade>] ? server_start_targets+0x147e/0x1d90 [obdclass]
      [ 4344.096241]  [<ffffffffa048e050>] ? class_config_llog_handler+0x0/0x1800 [obdclass]
      [ 4344.096258]  [<ffffffffa04a4798>] ? lustre_fill_super+0x13a8/0x1af0 [obdclass]
      [ 4344.096260]  [<ffffffff8117d060>] ? set_anon_super+0x0/0x110
      [ 4344.096276]  [<ffffffffa04a33f0>] ? lustre_fill_super+0x0/0x1af0 [obdclass]
      [ 4344.096278]  [<ffffffff8117e4cf>] ? get_sb_nodev+0x5f/0xa0
      [ 4344.096295]  [<ffffffffa048f955>] ? lustre_get_sb+0x25/0x30 [obdclass]
      [ 4344.096297]  [<ffffffff8117e12b>] ? vfs_kern_mount+0x7b/0x1b0
      [ 4344.096299]  [<ffffffff8117e2d2>] ? do_kern_mount+0x52/0x130
      [ 4344.096301]  [<ffffffff8119c992>] ? do_mount+0x2d2/0x8c0
      [ 4344.096303]  [<ffffffff8119d010>] ? sys_mount+0x90/0xe0
      [ 4344.096306]  [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
      

      Attachments

        Activity

          [LU-2102] changelog_user_init_cb()) ASSERTION( rec->cur_hdr.lrh_type == CHANGELOG_USER_REC )

          merged

          tappro Mikhail Pershin added a comment - merged

          Fix for changelog issue:

          http://review.whamcloud.com/4229

          tappro Mikhail Pershin added a comment - Fix for changelog issue: http://review.whamcloud.com/4229
          green Oleg Drokin added a comment -

          I just confirmed that when this bad MDT image was produced, OST0 had 0 space:

          /tmp/lustre-ost1      184M  184M     0 100% /mnt/ost1
          /tmp/lustre-ost2      184M   55M  119M  32% /mnt/ost2
          

          This is mounted as ldiskfs.

          Apparently there's some bed interaction between test 64 and test 65j or something that I file a separate bug for, but importane of this info here is that full ost seems to cause MDS to enter this state.

          green Oleg Drokin added a comment - I just confirmed that when this bad MDT image was produced, OST0 had 0 space: /tmp/lustre-ost1 184M 184M 0 100% /mnt/ost1 /tmp/lustre-ost2 184M 55M 119M 32% /mnt/ost2 This is mounted as ldiskfs. Apparently there's some bed interaction between test 64 and test 65j or something that I file a separate bug for, but importane of this info here is that full ost seems to cause MDS to enter this state.
          green Oleg Drokin added a comment -

          Ok, here's even easier case for you.
          unbzip this file and try to mount it as lustre (it's mds fs image), you'll get the assertion failure right away.

          green Oleg Drokin added a comment - Ok, here's even easier case for you. unbzip this file and try to mount it as lustre (it's mds fs image), you'll get the assertion failure right away.
          green Oleg Drokin added a comment -

          Ok, I think I have a solid reproducer for the original issue.

          It seems it has to do with a full OST.
          Some test (Still tracking) seems to not remove a big file after itself on error which leads to a bunch of tests to fail on the way (from around test 65) to the final test 900 remount where it finally dies.

          green Oleg Drokin added a comment - Ok, I think I have a solid reproducer for the original issue. It seems it has to do with a full OST. Some test (Still tracking) seems to not remove a big file after itself on error which leads to a bunch of tests to fail on the way (from around test 65) to the final test 900 remount where it finally dies.
          green Oleg Drokin added a comment -

          Ok, just had another crash that Mike advises is related to this one:

          [21857.010264] Lustre: DEBUG MARKER: == replay-single test 52: time out lock replay (3764) == 09:25:53 (1349702753)
          [21867.957058] LDISKFS-fs (loop0): recovery complete
          [21867.958976] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: 
          [21867.985231] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff88005ce79bf0 x1415264023371217/t0(0) o101->MGC192.168.10.210@tcp@0@lo:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
          [21867.986762] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) Skipped 5 previous similar messages
          [21867.993177] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed: 
          [21867.994035] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) LBUG
          [21867.994525] Pid: 18458, comm: ll_mgs_0001
          [21867.994842] 
          [21867.994843] Call Trace:
          [21867.995536]  [<ffffffffa0c6d915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          [21867.996108]  [<ffffffffa0c6df27>] lbug_with_loc+0x47/0xb0 [libcfs]
          [21867.996653]  [<ffffffffa0507ac5>] llog_osd_next_block+0xa55/0xa60 [obdclass]
          [21867.996708]  [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc]
          [21867.996754]  [<ffffffffa068845c>] llog_origin_handle_next_block+0x55c/0x780 [ptlrpc]
          [21867.996766]  [<ffffffffa0a6df73>] mgs_handle+0xb13/0x11e0 [mgs]
          [21867.996782]  [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          [21867.996826]  [<ffffffffa067a883>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
          [21867.996840]  [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          [21867.996882]  [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
          [21867.996887]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
          [21867.996932]  [<ffffffffa067d41a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc]
          [21867.996965]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
          [21867.996968]  [<ffffffff8100c14a>] child_rip+0xa/0x20
          [21867.997000]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
          [21867.997031]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
          [21867.997034]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          [21867.997035] 
          [21867.998239] Kernel panic - not syncing: LBUG
          [21867.998242] Pid: 18458, comm: ll_mgs_0001 Not tainted 2.6.32-debug #6
          [21867.998244] Call Trace:
          [21867.998252]  [<ffffffff814f75e4>] ? panic+0xa0/0x168
          [21867.998267]  [<ffffffffa0c6df7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
          [21867.998305]  [<ffffffffa0507ac5>] ? llog_osd_next_block+0xa55/0xa60 [obdclass]
          [21867.998349]  [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc]
          [21867.998393]  [<ffffffffa068845c>] ? llog_origin_handle_next_block+0x55c/0x780 [ptlrpc]
          [21867.998404]  [<ffffffffa0a6df73>] ? mgs_handle+0xb13/0x11e0 [mgs]
          [21867.998420]  [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          [21867.998467]  [<ffffffffa067a883>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc]
          [21867.998478]  [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          [21867.998511]  [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc]
          [21867.998514]  [<ffffffff81051f73>] ? __wake_up+0x53/0x70
          [21867.998546]  [<ffffffffa067d41a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc]
          [21867.998577]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
          [21867.998580]  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          [21867.998611]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
          [21867.998642]  [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
          [21867.998645]  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          
          green Oleg Drokin added a comment - Ok, just had another crash that Mike advises is related to this one: [21857.010264] Lustre: DEBUG MARKER: == replay-single test 52: time out lock replay (3764) == 09:25:53 (1349702753) [21867.957058] LDISKFS-fs (loop0): recovery complete [21867.958976] LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=on. Opts: [21867.985231] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88005ce79bf0 x1415264023371217/t0(0) o101->MGC192.168.10.210@tcp@0@lo:26/25 lens 328/384 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1 [21867.986762] LustreError: 29518:0:(client.c:1125:ptlrpc_import_delay_req()) Skipped 5 previous similar messages [21867.993177] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) ASSERTION( last_rec->lrh_index == tail->lrt_index ) failed: [21867.994035] LustreError: 18458:0:(llog_osd.c:579:llog_osd_next_block()) LBUG [21867.994525] Pid: 18458, comm: ll_mgs_0001 [21867.994842] [21867.994843] Call Trace: [21867.995536] [<ffffffffa0c6d915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] [21867.996108] [<ffffffffa0c6df27>] lbug_with_loc+0x47/0xb0 [libcfs] [21867.996653] [<ffffffffa0507ac5>] llog_osd_next_block+0xa55/0xa60 [obdclass] [21867.996708] [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc] [21867.996754] [<ffffffffa068845c>] llog_origin_handle_next_block+0x55c/0x780 [ptlrpc] [21867.996766] [<ffffffffa0a6df73>] mgs_handle+0xb13/0x11e0 [mgs] [21867.996782] [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [21867.996826] [<ffffffffa067a883>] ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc] [21867.996840] [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [21867.996882] [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [21867.996887] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [21867.996932] [<ffffffffa067d41a>] ptlrpc_main+0xb9a/0x1960 [ptlrpc] [21867.996965] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.996968] [<ffffffff8100c14a>] child_rip+0xa/0x20 [21867.997000] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.997031] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.997034] [<ffffffff8100c140>] ? child_rip+0x0/0x20 [21867.997035] [21867.998239] Kernel panic - not syncing: LBUG [21867.998242] Pid: 18458, comm: ll_mgs_0001 Not tainted 2.6.32-debug #6 [21867.998244] Call Trace: [21867.998252] [<ffffffff814f75e4>] ? panic+0xa0/0x168 [21867.998267] [<ffffffffa0c6df7b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] [21867.998305] [<ffffffffa0507ac5>] ? llog_osd_next_block+0xa55/0xa60 [obdclass] [21867.998349] [<ffffffffa066bdf6>] ? lustre_pack_reply_flags+0xb6/0x210 [ptlrpc] [21867.998393] [<ffffffffa068845c>] ? llog_origin_handle_next_block+0x55c/0x780 [ptlrpc] [21867.998404] [<ffffffffa0a6df73>] ? mgs_handle+0xb13/0x11e0 [mgs] [21867.998420] [<ffffffffa0c7e6d1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] [21867.998467] [<ffffffffa067a883>] ? ptlrpc_server_handle_request+0x463/0xe70 [ptlrpc] [21867.998478] [<ffffffffa0c6e66e>] ? cfs_timer_arm+0xe/0x10 [libcfs] [21867.998511] [<ffffffffa0673571>] ? ptlrpc_wait_event+0xb1/0x2a0 [ptlrpc] [21867.998514] [<ffffffff81051f73>] ? __wake_up+0x53/0x70 [21867.998546] [<ffffffffa067d41a>] ? ptlrpc_main+0xb9a/0x1960 [ptlrpc] [21867.998577] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.998580] [<ffffffff8100c14a>] ? child_rip+0xa/0x20 [21867.998611] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.998642] [<ffffffffa067c880>] ? ptlrpc_main+0x0/0x1960 [ptlrpc] [21867.998645] [<ffffffff8100c140>] ? child_rip+0x0/0x20

          first though was this is padding record but such records are not setting bits in llog header and not processed as result. So I am going to replace ASSERT with ASSERTF() and put more debug info there at first

          tappro Mikhail Pershin added a comment - first though was this is padding record but such records are not setting bits in llog header and not processed as result. So I am going to replace ASSERT with ASSERTF() and put more debug info there at first

          People

            tappro Mikhail Pershin
            green Oleg Drokin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: