Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.3.0
    • None
    • SWL - Hyperion/LLNL RHEL6 servers and clients
    • 3
    • 6323

    Description

      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #88606751: rec_len is smaller than minimal - block=44369457offset=536(536), inode=88627711, rec_len=0, name_len=4
      Sep 16 01:28:12 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_new_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_delete_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4489:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4186:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30

      Disk appeared to be quite messed up with fsck -fy. Ran the data capture script from lu-1015, results attached.

      Attachments

        Activity

          [LU-1948] ldiskfs - MDS goes read-only (SWL)
          liang Liang Zhen (Inactive) added a comment - - edited

          sigh, I would say this is a different bug, I just found it, it's LU-1951

          liang Liang Zhen (Inactive) added a comment - - edited sigh, I would say this is a different bug, I just found it, it's LU-1951

          System crashed again w/liang's patch - dump taken
          stack is a bit messed up

          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
          2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          2012-09-27 21:56:35 rnel: LustreErro [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
          2012-09-27 21:56:35 r: 5611:0:(osd_h [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 andler.c:2343:os [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35 d_object_ref_del [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35 ()) ASSERTION( i [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35 node->i_nlink >  [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35 0 ) failed:
          2012-09-27 21:56:35 Sep [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35  27 21:56:35 hyp [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35 erion-rst6 kerne [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35 l: LustreError:  [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35 5611:0:(osd_hand [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35 ler.c:2343:osd_o [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:35 bject_ref_del()) [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:35  LBUG
          2012-09-27 21:56:35  [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:35  [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:35  [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:35  [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.x86_64 #1
          2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
          2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffff814fd58a>] ? panic+0xa0/0x168
          2012-09-27 21:56:35 rnel: Kernel pan [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
          2012-09-27 21:56:35 ic - not syncing [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 : LBUG
          2012-09-27 21:56:35  [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35  [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35  [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35  [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:36  [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:36  [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:36  [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:36  [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:36  [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:36 Initializing cgroup subsys cpuset
          2012-09-27 21:56:36 Initializing cgroup subsys cpu
          
          cliffw Cliff White (Inactive) added a comment - System crashed again w/liang's patch - dump taken stack is a bit messed up 2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed: 2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG 2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 2012-09-27 21:56:35 2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace: 2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs] 2012-09-27 21:56:35 rnel: LustreErro [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs] 2012-09-27 21:56:35 r: 5611:0:(osd_h [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs] 2012-09-27 21:56:35 andler.c:2343:os [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd] 2012-09-27 21:56:35 d_object_ref_del [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd] 2012-09-27 21:56:35 ()) ASSERTION( i [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc] 2012-09-27 21:56:35 node->i_nlink > [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm] 2012-09-27 21:56:35 0 ) failed: 2012-09-27 21:56:35 Sep [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt] 2012-09-27 21:56:35 27 21:56:35 hyp [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt] 2012-09-27 21:56:35 erion-rst6 kerne [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt] 2012-09-27 21:56:35 l: LustreError: [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt] 2012-09-27 21:56:35 5611:0:(osd_hand [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt] 2012-09-27 21:56:35 ler.c:2343:osd_o [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt] 2012-09-27 21:56:35 bject_ref_del()) [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt] 2012-09-27 21:56:35 LBUG 2012-09-27 21:56:35 [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc] 2012-09-27 21:56:35 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs] 2012-09-27 21:56:35 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs] 2012-09-27 21:56:35 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] 2012-09-27 21:56:35 [<ffffffff810533f3>] ? __wake_up+0x53/0x70 2012-09-27 21:56:35 [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc] 2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 2012-09-27 21:56:35 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 2012-09-27 21:56:35 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 2012-09-27 21:56:35 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-09-27 21:56:35 2012-09-27 21:56:35 Kernel panic - not syncing: LBUG 2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.x86_64 #1 2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace: 2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffff814fd58a>] ? panic+0xa0/0x168 2012-09-27 21:56:35 rnel: Kernel pan [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs] 2012-09-27 21:56:35 ic - not syncing [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs] 2012-09-27 21:56:35 : LBUG 2012-09-27 21:56:35 [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd] 2012-09-27 21:56:35 [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd] 2012-09-27 21:56:35 [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc] 2012-09-27 21:56:35 [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm] 2012-09-27 21:56:35 [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt] 2012-09-27 21:56:35 [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt] 2012-09-27 21:56:35 [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt] 2012-09-27 21:56:35 [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt] 2012-09-27 21:56:35 [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt] 2012-09-27 21:56:35 [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt] 2012-09-27 21:56:36 [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt] 2012-09-27 21:56:36 [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc] 2012-09-27 21:56:36 [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs] 2012-09-27 21:56:36 [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs] 2012-09-27 21:56:36 [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] 2012-09-27 21:56:36 [<ffffffff810533f3>] ? __wake_up+0x53/0x70 2012-09-27 21:56:36 [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc] 2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 2012-09-27 21:56:36 [<ffffffff8100c14a>] ? child_rip+0xa/0x20 2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 2012-09-27 21:56:36 [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc] 2012-09-27 21:56:36 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-09-27 21:56:36 Initializing cgroup subsys cpuset 2012-09-27 21:56:36 Initializing cgroup subsys cpu

          This problem should be a duplicate with 1976, and Fang yong already provide a fix there, close this one.

          di.wang Di Wang (Inactive) added a comment - This problem should be a duplicate with 1976, and Fang yong already provide a fix there, close this one.

          oh, if you mean LU-1540, which has been landed on 2_3, and already included in our test rpm here.

          di.wang Di Wang (Inactive) added a comment - oh, if you mean LU-1540 , which has been landed on 2_3, and already included in our test rpm here.

          fdtree does not create symlinks, which only includes mkdir, create, dd, unlink, rmdir. But SWL includes 5 tests, fdtree, simul, IOR, mirIO, mdtest. Simul definitely include create symlinks here.

          di.wang Di Wang (Inactive) added a comment - fdtree does not create symlinks, which only includes mkdir, create, dd, unlink, rmdir. But SWL includes 5 tests, fdtree, simul, IOR, mirIO, mdtest. Simul definitely include create symlinks here.

          Isn't osd_ldiskfs_write_record() writing one-byte off the buffer limit if write_NUL is true?

          liwei Li Wei (Inactive) added a comment - Isn't osd_ldiskfs_write_record() writing one-byte off the buffer limit if write_NUL is true?

          There haven't been changes to mballoc, but there was a change to the symlink NUL termination recently. Does this workload create symlinks?

          adilger Andreas Dilger added a comment - There haven't been changes to mballoc, but there was a change to the symlink NUL termination recently. Does this workload create symlinks?

          Add andreas to the ticket, in case there are some mballoc changes recently for ext4.

          di.wang Di Wang (Inactive) added a comment - Add andreas to the ticket, in case there are some mballoc changes recently for ext4.

          According to test result the corrupt directory comes from fdtree test. " 78644697: name L0F7, lck 0000000000000004, name_len=568931904". Hmm during fdtree test, each thread will only create files in its own directory, and only 10 files for each dir.

           
           \\srun -N $NODES -n $PROCS --wait=7200 -o fdtree.%t.out \$SWL/IO/fdtree.bash -l 3 -d 10 >& tmp_file
          

          So this problem should be unrelated with ldiskfs pdir patch. Now I suspect it might be related with mballoc, because we saw this during the test

           
          "2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16]
          2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 "
          

          though it happened after the fs turned to read-only.

          di.wang Di Wang (Inactive) added a comment - According to test result the corrupt directory comes from fdtree test. " 78644697: name L0F7, lck 0000000000000004, name_len=568931904". Hmm during fdtree test, each thread will only create files in its own directory, and only 10 files for each dir. \\srun -N $NODES -n $PROCS --wait=7200 -o fdtree.%t.out \$SWL/IO/fdtree.bash -l 3 -d 10 >& tmp_file So this problem should be unrelated with ldiskfs pdir patch. Now I suspect it might be related with mballoc, because we saw this during the test "2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: pa free mismatch: [pa ffff88031438e748] [phy 190752] [logic 608] [len 16] [free 13] [error 0] [inode 7340126] [freed 16] 2012-09-17 17:49:57 LDISKFS-fs error (device md1): ldiskfs_mb_release_inode_pa: free 16, pa_free 13 " though it happened after the fs turned to read-only.

          Okay, the MDS just 'sploded.

          2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent:
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4
          2012-09-18 14:13:37 Aborting journal on device md1-8.
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
          2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only
          2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only
          2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 78644697: lck ffff8801304ad000 off=152(78655820),inode=32, rec_len=4, name_len=0
          2012-09-18 14:13:37
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30
          2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30
          2012-09-18 14:13:37 78644697: name L0F7, lck 0000000000000004, name_len=568931904
          2012-09-18 14:13:37
          2012-09-18 14:13:37 ------------[ cut here ]------------
          2012-09-18 14:13:37 kernel BUG at /var/lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757!
          2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP
          2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port
          2012-09-18 14:13:37 CPU 7
          2012-09-18 14:13:37 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan]
          2012-09-18 14:13:37
          2012-09-18 14:13:37 Pid: 4898, comm: mdt03_008 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H
          
          2012-09-18 14:13:37 RIP: 0010:[<ffffffffa0db21d0>]  [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs]
          2012-09-18 14:13:37 RSP: 0018:ffff880121e935f0  EFLAGS: 00010246
          2012-09-18 14:13:37 RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5
          2012-09-18 14:13:37 RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00
          2012-09-18 14:13:37 RBP: ffff880121e936b0 R08: 0000000000000000 R09: 0000000000000000
          2012-09-18 14:13:37 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0
          2012-09-18 14:13:37 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020
          2012-09-18 14:13:37 FS:  00002aaaab47e700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000
          2012-09-18 14:13:37 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          2012-09-18 14:13:37 CR2: 00000000006d3a80 CR3: 0000000001a85000 CR4: 00000000000006e0
          2012-09-18 14:13:37 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          2012-09-18 14:13:37 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          2012-09-18 14:13:37 Process mdt03_008 (pid: 4898, threadinfo ffff880121e92000, task ffff88012df62080)
          2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
          2012-09-18 14:13:37 Stack:
          2012-09-18 14:13:37  ffff880121e93640 0000000002583bf7 0000000000001000 0000000000000000
          2012-09-18 14:13:37 <d> ffff8801abdad5c0 ffff880200000004 ffff880104b03141 0000000000000004
          2012-09-18 14:13:37 <d> 00000020e2f30800 ffff8801258d9e00 ffff8801064c3288 ffff880268654fe0
          2012-09-18 14:13:37 Call Trace:
          2012-09-18 14:13:37  [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs]
          2012-09-18 14:13:37  [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd]
          2012-09-18 14:13:37  [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd]
          2012-09-18 14:13:37  [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd]
          2012-09-18 14:13:37  [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          2012-09-18 14:13:37  [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm]
          2012-09-18 14:13:37  [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6e9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6ef7d>] mdt_intent_reint+0x1ed/0x500 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f6b191>] mdt_intent_policy+0x371/0x6a0 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0859881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08819bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa0f6b506>] mdt_enqueue+0x46/0x130 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f62802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-18 14:13:37  [<ffffffffa0f636f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-18 14:13:37  [<ffffffffa08b199d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08a8f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
          2012-09-18 14:13:37  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-18 14:13:37  [<ffffffffa08b2f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc]
          2012-09-18 14:13:37  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-18 14:13:37 Code: 00 00 48 c7 c2 20 81 dd a0 4c 8b 45 a8 45 89 f1 48 c7 c6 e5 d2 dd a0 e8 9f 56 01 00 48 8b 55 c8 66 83 7a 04 00 0f 85 a9 fd ff ff <0f> 0b eb fe 49 8b
          2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30
          2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30
          2012-09-18 14:13:37 4c 24 40 49 8b bc 24 08 01 00 00 31 c0 4c 8b
          2012-09-18 14:13:37 RIP  [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs]
          2012-09-18 14:13:37  RSP <ffff880121e935f0>
          2012-09-18 14:13:38 Initializing cgroup subsys cpuset
          2012-09-18 14:13:38 Initializing cgroup subsys cpu
          2012-09-18 14:13:38 Linux version 2.6.32-279.5.1.el6_lustre.gb4cc145.x86
          

          vmcore has been captured, is on brent.whamcloud.com ~/cliffw/lu1948/vmcore-rst6-c2.gz

          cliffw Cliff White (Inactive) added a comment - Okay, the MDS just 'sploded. 2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #78644697: rec_len is smaller than minimal - block=39336951offset=152(152), inode=78655820, rec_len=0, name_len=4 2012-09-18 14:13:37 Aborting journal on device md1-8. 2012-09-18 14:13:37 LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs (md1): Remounting filesystem read-only 2012-09-18 14:13:37 LDISKFS-fs error (device md1): add_dirent: 78644697: lck ffff8801304ad000 off=152(78655820),inode=32, rec_len=4, name_len=0 2012-09-18 14:13:37 2012-09-18 14:13:37 LustreError: 8052:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30 2012-09-18 14:13:37 LustreError: 8052:0:(osd_handler.c:899:osd_trans_stop()) Failure to stop transaction: -30 2012-09-18 14:13:37 78644697: name L0F7, lck 0000000000000004, name_len=568931904 2012-09-18 14:13:37 2012-09-18 14:13:37 ------------[ cut here ]------------ 2012-09-18 14:13:37 kernel BUG at / var /lib/jenkins/workspace/lustre-reviews/arch/x86_64/build_type/server/distro/el6/ib_stack/inkernel/BUILD/BUILD/lustre-ldiskfs-3.3.0/ldiskfs/namei.c:1757! 2012-09-18 14:13:37 invalid opcode: 0000 [#1] SMP 2012-09-18 14:13:37 last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/0000:05:00.0/infiniband_mad/umad0/port 2012-09-18 14:13:37 CPU 7 2012-09-18 14:13:37 Modules linked in: cmm(U) osd_ldiskfs(U) mdt(U) mdd(U) mds(U) fsfilt_ldiskfs(U) exportfs mgs(U) mgc(U) ldiskfs(U) mbcache jbd2 lustre(U) lquota(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate ko2iblnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) cpufreq_ondemand acpi_cpufreq freq_table mperf ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa mlx4_ib ib_mad ib_core dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun kvm raid0 sg sr_mod cdrom sd_mod crc_t10dif dcdbas serio_raw ata_generic pata_acpi ata_piix iTCO_wdt iTCO_vendor_support mptsas mptscsih mptbase scsi_transport_sas i7core_edac edac_core ipv6 nfs lockd fscache nfs_acl auth_rpcgss sunrpc mlx4_en mlx4_core bnx2 [last unloaded: scsi_wait_scan] 2012-09-18 14:13:37 2012-09-18 14:13:37 Pid: 4898, comm: mdt03_008 Tainted: P --------------- 2.6.32-279.5.1.el6_lustre.gb4cc145.x86_64 #1 Dell Inc. PowerEdge R610/0K399H 2012-09-18 14:13:37 RIP: 0010:[<ffffffffa0db21d0>] [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP: 0018:ffff880121e935f0 EFLAGS: 00010246 2012-09-18 14:13:37 RAX: ffff880181248000 RBX: ffff880268654078 RCX: 00000000000014b5 2012-09-18 14:13:37 RDX: ffff880268654098 RSI: 0000000000000046 RDI: ffff88016ff93c00 2012-09-18 14:13:37 RBP: ffff880121e936b0 R08: 0000000000000000 R09: 0000000000000000 2012-09-18 14:13:37 R10: 0000000000000001 R11: 0000000000000000 R12: ffff880130e16ad0 2012-09-18 14:13:37 R13: 0000000000000000 R14: 0000000000000004 R15: 0000000000000020 2012-09-18 14:13:37 FS: 00002aaaab47e700(0000) GS:ffff880028260000(0000) knlGS:0000000000000000 2012-09-18 14:13:37 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b 2012-09-18 14:13:37 CR2: 00000000006d3a80 CR3: 0000000001a85000 CR4: 00000000000006e0 2012-09-18 14:13:37 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2012-09-18 14:13:37 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 2012-09-18 14:13:37 Process mdt03_008 (pid: 4898, threadinfo ffff880121e92000, task ffff88012df62080) 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 Stack: 2012-09-18 14:13:37 ffff880121e93640 0000000002583bf7 0000000000001000 0000000000000000 2012-09-18 14:13:37 <d> ffff8801abdad5c0 ffff880200000004 ffff880104b03141 0000000000000004 2012-09-18 14:13:37 <d> 00000020e2f30800 ffff8801258d9e00 ffff8801064c3288 ffff880268654fe0 2012-09-18 14:13:37 Call Trace: 2012-09-18 14:13:37 [<ffffffffa0db5c5d>] ldiskfs_add_entry+0xcd/0x500 [ldiskfs] 2012-09-18 14:13:37 [<ffffffffa1001689>] __osd_ea_add_rec+0xb9/0x190 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa100dbeb>] osd_index_ea_insert+0x21b/0x5e0 [osd_ldiskfs] 2012-09-18 14:13:37 [<ffffffffa0eee977>] __mdd_index_insert_only+0x147/0x150 [mdd] 2012-09-18 14:13:37 [<ffffffffa0eef9b1>] __mdd_index_insert+0x51/0x1f0 [mdd] 2012-09-18 14:13:37 [<ffffffffa0ef55e3>] mdd_create+0x19a3/0x20c0 [mdd] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffffa06a4637>] cml_create+0x97/0x250 [cmm] 2012-09-18 14:13:37 [<ffffffffa0f8bb9f>] mdt_reint_open+0x108f/0x18a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f75151>] mdt_reint_rec+0x41/0xe0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6e9aa>] mdt_reint_internal+0x50a/0x810 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6ef7d>] mdt_intent_reint+0x1ed/0x500 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f6b191>] mdt_intent_policy+0x371/0x6a0 [mdt] 2012-09-18 14:13:37 [<ffffffffa0859881>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08819bf>] ldlm_handle_enqueue0+0x48f/0xf70 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa0f6b506>] mdt_enqueue+0x46/0x130 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f62802>] mdt_handle_common+0x922/0x1740 [mdt] 2012-09-18 14:13:37 [<ffffffffa0f636f5>] mdt_regular_handle+0x15/0x20 [mdt] 2012-09-18 14:13:37 [<ffffffffa08b199d>] ptlrpc_server_handle_request+0x40d/0xea0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08a8f37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa03aa5b1>] ? libcfs_debug_msg+0x41/0x50 [libcfs] 2012-09-18 14:13:37 [<ffffffff810533f3>] ? __wake_up+0x53/0x70 2012-09-18 14:13:37 [<ffffffffa08b2f89>] ptlrpc_main+0xb59/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c14a>] child_rip+0xa/0x20 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffffa08b2430>] ? ptlrpc_main+0x0/0x1860 [ptlrpc] 2012-09-18 14:13:37 [<ffffffff8100c140>] ? child_rip+0x0/0x20 2012-09-18 14:13:37 Code: 00 00 48 c7 c2 20 81 dd a0 4c 8b 45 a8 45 89 f1 48 c7 c6 e5 d2 dd a0 e8 9f 56 01 00 48 8b 55 c8 66 83 7a 04 00 0f 85 a9 fd ff ff <0f> 0b eb fe 49 8b 2012-09-18 14:13:37 LustreError: 4099:0:(fsfilt-ldiskfs.c:332:fsfilt_ldiskfs_start()) error starting handle for op 10 (114 credits): rc -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:414:llog_origin_handle_cancel()) fsfilt_start_log() failed: -30 2012-09-18 14:13:37 LustreError: 4099:0:(llog_server.c:453:llog_origin_handle_cancel()) Cancel 0 of 121 llog-records failed: -30 2012-09-18 14:13:37 4c 24 40 49 8b bc 24 08 01 00 00 31 c0 4c 8b 2012-09-18 14:13:37 RIP [<ffffffffa0db21d0>] add_dirent_to_buf+0x4c0/0x530 [ldiskfs] 2012-09-18 14:13:37 RSP <ffff880121e935f0> 2012-09-18 14:13:38 Initializing cgroup subsys cpuset 2012-09-18 14:13:38 Initializing cgroup subsys cpu 2012-09-18 14:13:38 Linux version 2.6.32-279.5.1.el6_lustre.gb4cc145.x86 vmcore has been captured, is on brent.whamcloud.com ~/cliffw/lu1948/vmcore-rst6-c2.gz

          I add some debug patch based on b2_3 (http://review.whamcloud.com/#change,4020 only patch on el6 ldiskfs series). build http://build.whamcloud.com/job/lustre-reviews/9250/ Cliff, could you please test with this rpm. And I suspect this problem might be related with ldiskfs htree, so Liang is looking at this problem as well.

          di.wang Di Wang (Inactive) added a comment - I add some debug patch based on b2_3 ( http://review.whamcloud.com/#change,4020 only patch on el6 ldiskfs series). build http://build.whamcloud.com/job/lustre-reviews/9250/ Cliff, could you please test with this rpm. And I suspect this problem might be related with ldiskfs htree, so Liang is looking at this problem as well.

          People

            di.wang Di Wang (Inactive)
            cliffw Cliff White (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: