Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.3.0
    • None
    • SWL - Hyperion/LLNL RHEL6 servers and clients
    • 3
    • 6323

    Description

      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #88606751: rec_len is smaller than minimal - block=44369457offset=536(536), inode=88627711, rec_len=0, name_len=4
      Sep 16 01:28:12 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_new_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_delete_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4489:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4186:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30

      fsck -fy showed the disk to be quite badly messed up. Ran the data capture script from LU-1015; results attached.
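For readers unfamiliar with the "rec_len is smaller than minimal" message: the entry in the log has rec_len=0 and name_len=4, while a directory entry needs an 8-byte header plus the name, rounded up to 4 bytes. The following is a simplified sketch of that sanity check, modeled on the ext4-style directory entry layout. The struct and helper names (`dirent_hdr`, `min_rec_len`, `check_dirent`) are made up for illustration and this is not the ldiskfs source; the real check also validates block boundaries and the inode number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the 8-byte header of an ext4-style directory
 * entry; the name bytes follow it on disk. Names are hypothetical. */
struct dirent_hdr {
    uint32_t inode;     /* inode number the entry points to */
    uint16_t rec_len;   /* distance to the next entry, in bytes */
    uint8_t  name_len;  /* length of the name that follows */
    uint8_t  file_type;
};

/* Smallest legal rec_len for a name: header + name, rounded up to 4. */
static unsigned min_rec_len(unsigned name_len)
{
    return (name_len + 8 + 3) & ~3u;
}

/* Returns NULL if the entry passes these checks, else a reason string.
 * Only the length checks are sketched here. */
static const char *check_dirent(const struct dirent_hdr *de)
{
    assert(de != NULL);
    if (de->rec_len < min_rec_len(1))
        return "rec_len is smaller than minimal";
    if (de->rec_len % 4 != 0)
        return "rec_len % 4 != 0";
    if (de->rec_len < min_rec_len(de->name_len))
        return "rec_len is too small for name_len";
    return NULL;
}
```

With the values from the log (rec_len=0, name_len=4) the first check fires, since even a one-character name needs 12 bytes.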

      Attachments

        Activity

          [LU-1948] ldiskfs - MDS goes read-only (SWL)

          Hit again in most recent SWL test

          Oct  8 20:55:58 hyperion-rst6 kernel: Lustre: 4204:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff88012a6c9000 x1415301393993750.t4460647409 o0 NID 192.168.116.125@o2ib1
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #40370421: rec_len is smaller than minimal - block=20251025offset=504(504), inode=40380706, rec_len=0, name_len=4
          Oct  8 20:58:24 hyperion-rst6 kernel: Aborting journal on device md1-8.
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb:
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
          Oct  8 20:58:24 hyperion-rst6 kernel: LustreError: 4885:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
          
          cliffw Cliff White (Inactive) added a comment - Hit again in most recent SWL test
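The "returned error -30" lines in both logs are consistent with the read-only remount that precedes them: kernel code returns negated errno values, and errno 30 on Linux is EROFS. A minimal user-space sketch for decoding such a code follows; `kernel_rc_to_text` is a hypothetical helper name, not a Lustre or kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Kernel-style return codes are zero or a negated errno value.
 * Flip the sign and let libc describe it. */
static const char *kernel_rc_to_text(int rc)
{
    assert(rc <= 0);
    return strerror(-rc);
}
```

Calling `kernel_rc_to_text(-30)` on a Linux system should describe a read-only file system, which matches the "Remounting filesystem read-only" messages logged just before the write failures.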
          pjones Peter Jones added a comment -

          duplicate of LU-2041

          liang Liang Zhen (Inactive) added a comment - - edited

          sigh, I would say this is a different bug, I just found it, it's LU-1951


          System crashed again with Liang's patch; dump taken.
          The stack trace is a bit messed up.

          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
          2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          2012-09-27 21:56:35 rnel: LustreErro [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
          2012-09-27 21:56:35 r: 5611:0:(osd_h [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 andler.c:2343:os [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35 d_object_ref_del [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35 ()) ASSERTION( i [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35 node->i_nlink >  [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35 0 ) failed:
          2012-09-27 21:56:35 Sep [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35  27 21:56:35 hyp [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35 erion-rst6 kerne [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35 l: LustreError:  [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35 5611:0:(osd_hand [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35 ler.c:2343:osd_o [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:35 bject_ref_del()) [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:35  LBUG
          2012-09-27 21:56:35  [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:35  [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:35  [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:35  [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.x86_64 #1
          2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
          2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffff814fd58a>] ? panic+0xa0/0x168
          2012-09-27 21:56:35 rnel: Kernel pan [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
          2012-09-27 21:56:35 ic - not syncing [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 : LBUG
          2012-09-27 21:56:35  [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35  [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35  [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35  [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:36  [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:36  [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:36  [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:36  [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:36  [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:36 Initializing cgroup subsys cpuset
          2012-09-27 21:56:36 Initializing cgroup subsys cpu
          
          cliffw Cliff White (Inactive) added a comment - System crashed again with Liang's patch; dump taken. The stack trace is a bit messed up.
          di.wang Di Wang added a comment -

          This problem should be a duplicate of 1976, and Fang yong has already provided a fix there, so close this one.

          di.wang Di Wang added a comment -

          Oh, if you mean LU-1540, that has landed on 2_3 and is already included in our test rpm here.

          di.wang Di Wang added a comment -

          fdtree does not create symlinks; it only does mkdir, create, dd, unlink, rmdir. But SWL includes 5 tests: fdtree, simul, IOR, mirIO, mdtest. Simul definitely creates symlinks here.


          liwei Li Wei (Inactive) added a comment -

          Isn't osd_ldiskfs_write_record() writing one byte past the buffer limit if write_NUL is true?
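The concern raised here is the classic off-by-one when a terminator is appended after a length-bounded copy. The sketch below is a hypothetical stand-in and not the body of osd_ldiskfs_write_record(); the function name, parameters and the choice of -EOVERFLOW are assumptions made for illustration. It shows a bounds check that accounts for the extra NUL byte.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy len bytes of src into dst (capacity cap)
 * and optionally append a NUL. Returns the bytes written, or
 * -EOVERFLOW if the data plus the optional NUL would not fit. */
static int write_record_checked(char *dst, size_t cap,
                                const char *src, size_t len, int write_nul)
{
    assert(dst != NULL && src != NULL);

    /* Checking only len <= cap is the off-by-one: with write_nul set,
     * dst[len] would land one byte past the buffer when len == cap. */
    if (len > cap || (write_nul && len == cap))
        return -EOVERFLOW;

    memcpy(dst, src, len);
    if (write_nul)
        dst[len] = '\0';
    return (int)(len + (write_nul ? 1 : 0));
}
```

A record that fills the buffer exactly is accepted without the terminator and rejected with it, which is the case the question above is asking about.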

          adilger Andreas Dilger added a comment -

          There haven't been changes to mballoc, but there was a change to the symlink NUL termination recently. Does this workload create symlinks?
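As background on why symlink NUL termination needs care: from user space, readlink(2) returns the target length and does not append a terminating NUL, so whoever stores or returns the target has to handle the extra byte explicitly. The following is a small user-space sketch only; `read_link_target` is a hypothetical helper and says nothing about how ldiskfs or the OSD layer store the target on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a symlink target into buf and NUL-terminate it.
 * One byte of buf is reserved for the terminator, so at most
 * cap - 1 bytes of the target are read. Returns the target
 * length, or -1 on error. */
static ssize_t read_link_target(const char *path, char *buf, size_t cap)
{
    ssize_t n;

    assert(path != NULL && buf != NULL);
    if (cap == 0)
        return -1;
    n = readlink(path, buf, cap - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* readlink() itself never writes this byte */
    return n;
}
```

A quick check against a test file system is to create a link with symlink(2), read it back through this helper, and compare the returned length with the length of the target string.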
          di.wang Di Wang added a comment -

          Adding Andreas to the ticket, in case there have been recent mballoc changes for ext4.


          People

            di.wang Di Wang
            cliffw Cliff White (Inactive)