Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.3.0
    • None
    • SWL - Hyperion/LLNL RHEL6 servers and clients
    • 3
    • 6323

    Description

      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry:
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #88606751: rec_len is smaller than minimal - block=44369457offset=536(536), inode=88627711, rec_len=0, name_len=4
      Sep 16 01:28:12 hyperion-rst6 kernel: Aborting journal on device md1-8.
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_reserve_inode_write: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_new_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in ldiskfs_delete_inode: Journal has aborted
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4489:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
      Sep 16 01:28:12 hyperion-rst6 kernel: LustreError: 4186:0:(osd_handler.c:894:osd_trans_stop()) Failure in transaction hook: -30

      The disk appeared to be quite messed up according to fsck -fy. Ran the data capture script from LU-1015; results attached.
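
      To make the first error concrete, here is a minimal sketch (not part of the original report) that pulls the fields out of the ldiskfs_add_entry message and compares rec_len with the smallest length an ext4-style directory entry can have: an 8-byte header plus the name, rounded up to a multiple of 4. The helper names parse_bad_entry, min_rec_len and errno_name are made up for illustration, and the regular expression is fitted only to the message format shown in this ticket.

```python
import errno
import re

# Pattern fitted to the "bad entry in directory" message quoted in this ticket.
# The message prints no separator between the block number and "offset".
BAD_ENTRY = re.compile(
    r"bad entry in directory #(?P<dir>\d+): (?P<reason>.+?) - "
    r"block=(?P<block>\d+)offset=(?P<offset>\d+)\(\d+\), "
    r"inode=(?P<inode>\d+), rec_len=(?P<rec_len>\d+), name_len=(?P<name_len>\d+)"
)


def min_rec_len(name_len):
    """Smallest valid entry length: 8-byte header + name, rounded up to 4."""
    return (name_len + 8 + 3) & ~3


def parse_bad_entry(line):
    """Return the numeric fields of a bad-entry message, or None if no match."""
    m = BAD_ENTRY.search(line)
    if m is None:
        return None
    fields = {k: int(v) for k, v in m.groupdict().items() if k != "reason"}
    fields["reason"] = m.group("reason")
    fields["min_rec_len"] = min_rec_len(fields["name_len"])
    return fields


def errno_name(ret):
    """Map a negative kernel-style return code to its symbolic errno name."""
    return errno.errorcode.get(-ret, "unknown")


line = (
    "Sep 16 01:28:12 hyperion-rst6 kernel: LDISKFS-fs error (device md1): "
    "ldiskfs_add_entry: bad entry in directory #88606751: rec_len is smaller "
    "than minimal - block=44369457offset=536(536), inode=88627711, "
    "rec_len=0, name_len=4"
)
info = parse_bad_entry(line)
print(info["dir"], info["rec_len"], info["min_rec_len"], errno_name(-30))
```

      For the entry above this gives a minimum of 12 bytes against a recorded rec_len of 0, and errno_name(-30) returns EROFS, which matches the read-only remount seen in the log.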

      Attachments

        Activity

          [LU-1948] ldiskfs - MDS goes read-only (SWL)

          Hit again in most recent SWL test

          Oct  8 20:55:58 hyperion-rst6 kernel: Lustre: 4204:0:(service.c:2105:ptlrpc_handle_rs()) All locks stolen from rs ffff88012a6c9000 x1415301393993750.t4460647409 o0 NID 192.168.116.125@o2ib1
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_add_entry: bad entry in directory #40370421: rec_len is smaller than minimal - block=20251025offset=504(504), inode=40380706, rec_len=0, name_len=4
          Oct  8 20:58:24 hyperion-rst6 kernel: Aborting journal on device md1-8.
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb:
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1): ldiskfs_journal_start_sb: Detected aborted journal
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs (md1): Remounting filesystem read-only
          Oct  8 20:58:24 hyperion-rst6 kernel: LDISKFS-fs error (device md1) in iam_txn_add: Journal has aborted
          Oct  8 20:58:24 hyperion-rst6 kernel: LustreError: 4885:0:(osd_io.c:1014:osd_ldiskfs_write_record()) journal_get_write_access() returned error -30
          
          cliffw Cliff White (Inactive) added a comment -
          pjones Peter Jones added a comment -

          duplicate of LU-2041

          liang Liang Zhen (Inactive) added a comment - - edited

          sigh, I would say this is a different bug, I just found it, it's LU-1951


          System crashed again w/liang's patch - dump taken
          stack is a bit messed up

          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) ASSERTION( inode->i_nlink > 0 ) failed:
          2012-09-27 21:56:35 LustreError: 5611:0:(osd_handler.c:2343:osd_object_ref_del()) LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
          2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffffa0392905>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          2012-09-27 21:56:35 rnel: LustreErro [<ffffffffa0392f17>] lbug_with_loc+0x47/0xb0 [libcfs]
          2012-09-27 21:56:35 r: 5611:0:(osd_h [<ffffffffa0a946a1>] osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 andler.c:2343:os [<ffffffffa0efa09d>] mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35 d_object_ref_del [<ffffffffa0eff715>] mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35 ()) ASSERTION( i [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35 node->i_nlink >  [<ffffffffa08bd037>] cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35 0 ) failed:
          2012-09-27 21:56:35 Sep [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35  27 21:56:35 hyp [<ffffffffa0f84454>] mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35 erion-rst6 kerne [<ffffffffa0f81151>] mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35 l: LustreError:  [<ffffffffa0f7a9aa>] mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35 5611:0:(osd_hand [<ffffffffa0f7acf4>] mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35 ler.c:2343:osd_o [<ffffffffa0f6e802>] mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:35 bject_ref_del()) [<ffffffffa0f6f6f5>] mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:35  LBUG
          2012-09-27 21:56:35  [<ffffffffa0966b3c>] ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:35  [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:35  [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:35  [<ffffffffa0968111>] ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff8100c14a>] child_rip+0xa/0x20
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:35
          2012-09-27 21:56:35 Kernel panic - not syncing: LBUG
          2012-09-27 21:56:35 Pid: 5611, comm: mdt00_015 Tainted: P           ---------------    2.6.32-279.5.1.el6_lustre.x86_64 #1
          2012-09-27 21:56:35 Sep 27 21:56:35 Call Trace:
          2012-09-27 21:56:35 hyperion-rst6 ke [<ffffffff814fd58a>] ? panic+0xa0/0x168
          2012-09-27 21:56:35 rnel: Kernel pan [<ffffffffa0392f6b>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
          2012-09-27 21:56:35 ic - not syncing [<ffffffffa0a946a1>] ? osd_object_ref_del+0x1d1/0x210 [osd_ldiskfs]
          2012-09-27 21:56:35 : LBUG
          2012-09-27 21:56:35  [<ffffffffa0efa09d>] ? mdo_ref_del+0xad/0xb0 [mdd]
          2012-09-27 21:56:35  [<ffffffffa0eff715>] ? mdd_unlink+0x815/0xdb0 [mdd]
          2012-09-27 21:56:35  [<ffffffffa09581e4>] ? lustre_msg_get_versions+0xa4/0x120 [ptlrpc]
          2012-09-27 21:56:35  [<ffffffffa08bd037>] ? cml_unlink+0x97/0x200 [cmm]
          2012-09-27 21:56:35  [<ffffffffa0f83ddf>] ? mdt_version_get_save+0x8f/0xd0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f84454>] ? mdt_reint_unlink+0x634/0x9e0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f81151>] ? mdt_reint_rec+0x41/0xe0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f7a9aa>] ? mdt_reint_internal+0x50a/0x810 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f7acf4>] ? mdt_reint+0x44/0xe0 [mdt]
          2012-09-27 21:56:35  [<ffffffffa0f6e802>] ? mdt_handle_common+0x922/0x1740 [mdt]
          2012-09-27 21:56:36  [<ffffffffa0f6f6f5>] ? mdt_regular_handle+0x15/0x20 [mdt]
          2012-09-27 21:56:36  [<ffffffffa0966b3c>] ? ptlrpc_server_handle_request+0x41c/0xe00 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa039365e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2012-09-27 21:56:36  [<ffffffffa03a513f>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
          2012-09-27 21:56:36  [<ffffffffa095df37>] ? ptlrpc_wait_event+0xa7/0x2a0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff810533f3>] ? __wake_up+0x53/0x70
          2012-09-27 21:56:36  [<ffffffffa0968111>] ? ptlrpc_main+0xbf1/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff8100c14a>] ? child_rip+0xa/0x20
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffffa0967520>] ? ptlrpc_main+0x0/0x19e0 [ptlrpc]
          2012-09-27 21:56:36  [<ffffffff8100c140>] ? child_rip+0x0/0x20
          2012-09-27 21:56:36 Initializing cgroup subsys cpuset
          2012-09-27 21:56:36 Initializing cgroup subsys cpu
          
          cliffw Cliff White (Inactive) added a comment -

          This problem should be a duplicate of 1976, and Fang yong has already provided a fix there, so close this one.

          di.wang Di Wang (Inactive) added a comment -

          Oh, if you mean LU-1540, that has already landed on 2_3 and is included in our test rpm here.

          di.wang Di Wang (Inactive) added a comment -

          fdtree does not create symlinks; it only performs mkdir, create, dd, unlink, and rmdir. But SWL includes 5 tests: fdtree, simul, IOR, mirIO, and mdtest. simul definitely creates symlinks here.

          di.wang Di Wang (Inactive) added a comment -

          Isn't osd_ldiskfs_write_record() writing one byte past the buffer limit if write_NUL is true?

          liwei Li Wei (Inactive) added a comment -

          There haven't been changes to mballoc, but there was a change to the symlink NUL termination recently. Does this workload create symlinks?

          adilger Andreas Dilger added a comment -

          Adding Andreas to the ticket, in case there have been recent mballoc changes for ext4.

          di.wang Di Wang (Inactive) added a comment -

          People

            di.wang Di Wang (Inactive)
            cliffw Cliff White (Inactive)
            Votes: 0
            Watchers: 8
