Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-12827

sanity: test 900 read-only filesystem during cleanup: ldiskfs_xattr_block_get() bad block

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2161d6cc-de2e-11e9-add9-52540065bddc

      Seen in the test log a large number of errors during cleanup:

      rm: cannot remove '/mnt/lustre/d56w.sanity/dir1/link164': Read-only file system
      

      Seen on the MDS console log:

      LDISKFS-fs error (device dm-3): ldiskfs_xattr_block_get:405: inode #1267: comm mdt00_002: bad block 152837
      Aborting journal on device dm-3-8.
      LustreError: 993:0:(osd_handler.c:1708:osd_trans_commit_cb()) transaction @0xffff8ab195065900 commit error: 2
      LDISKFS-fs (dm-3): Remounting filesystem read-only
      LustreError: 1012:0:(mdt_handler.c:1224:mdt_getattr_internal()) lustre-MDT0000: getattr error for [0x2000013a1:0xd880:0x0]: rc = -5
      LustreError: 1042:0:(llog_cat.c:762:llog_cat_cancel_records()) lustre-OST0005-osc-MDT0000: fail to cancel 1 of 1 llog-records: rc = -30
      LustreError: 1042:0:(osp_sync.c:1065:osp_sync_process_committed()) lustre-OST0005-osc-MDT0000: can't cancel record: -30
      

      Attachments

        Issue Links

          Activity

            [LU-12827] sanity: test 900 read-only filesystem during cleanup: ldiskfs_xattr_block_get() bad block

            Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/36484
            Subject: LU-12827 ldiskfs: fix ldiskfs_xattr_inode_iget error checking
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 11667154eee1ea60c2179fa5ba152a23825cc2a1

            gerrit Gerrit Updater added a comment - Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/36484 Subject: LU-12827 ldiskfs: fix ldiskfs_xattr_inode_iget error checking Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 11667154eee1ea60c2179fa5ba152a23825cc2a1

            Dongyang, could you please add a backported patch for the problem you found.

            adilger Andreas Dilger added a comment - Dongyang, could you please add a backported patch for the problem you found.
            dongyang Dongyang Li added a comment -

            The debug patch triggered another error with metadata_csum:

            https://testing.whamcloud.com/test_sessions/2985fd8c-c1cb-4c8a-887b-bbb051056eb3

            and it crashed this time:

            [ 3755.461295] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
            [ 3755.793142] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
            [ 3756.134861] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
            [ 3756.474540] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
            [ 3756.820665] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
            [ 3757.166268] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a
            [ 3757.504059] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
            [ 3757.826509] Lustre: DEBUG MARKER: /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -A -C -t namespace
            [ 3758.000591] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm OI_scrub: checksum invalid
            [ 3758.002430] Aborting journal on device dm-3-8.
            [ 3758.003354] LDISKFS-fs (dm-3): Remounting filesystem read-only
            [ 3758.003511] LustreError: 7527:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff8e8c9c06c200 commit error: 2
            [ 3758.015734] LustreError: 18534:0:(scrub.c:236:scrub_file_store()) lustre-MDT0000: store scrub file: rc = -30
            [ 3758.175022] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n 			mdd.lustre-MDT0000.lfsck_namespace |
            [ 3758.175022] 			awk '/^status/ { print $2 }'
            [ 3758.210580] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck: checksum invalid
            [ 3758.213788] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck_namespace: checksum invalid
            [ 3758.216336] BUG: unable to handle kernel NULL pointer dereference at 000000000000001b
            [ 3758.217805] IP: [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
            [ 3758.218855] PGD 0 
            [ 3758.219258] Oops: 0000 [#1] SMP 
            [ 3758.219944] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel parport_pc lrw gf128mul glue_helper virtio_balloon ablk_helper parport cryptd joydev pcspkr i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too crct10dif_pclmul crct10dif_common ata_piix crc32c_intel serio_raw virtio_pci libata virtio_ring virtio 8139cp mii floppy [last unloaded: llog_test]
            [ 3758.236589] CPU: 1 PID: 18535 Comm: lfsck_namespace Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.x86_64 #1
            [ 3758.238786] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
            [ 3758.239778] task: ffff8e8c9d5be180 ti: ffff8e8c9ac64000 task.ti: ffff8e8c9ac64000
            [ 3758.241087] RIP: 0010:[<ffffffffb6e61d48>]  [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
            [ 3758.242505] RSP: 0018:ffff8e8c9ac679e8  EFLAGS: 00010246
            [ 3758.243450] RAX: 0000000000000000 RBX: fffffffffffffffb RCX: 0000000000000000
            [ 3758.244686] RDX: 0000000000000002 RSI: ffff8e8ca04cdec0 RDI: fffffffffffffffb
            [ 3758.245919] RBP: ffff8e8c9ac67a20 R08: ffff8e8c9accb898 R09: ffff8e8c9d5be180
            [ 3758.247150] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e8ca066d1a8
            [ 3758.248392] R13: 000000000000055a R14: ffff8e8c9ac67a48 R15: ffff8e8ca066d1a8
            [ 3758.249626] FS:  0000000000000000(0000) GS:ffff8e8cbfd00000(0000) knlGS:0000000000000000
            [ 3758.251037] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
            [ 3758.252024] CR2: 000000000000001b CR3: 00000000780a8000 CR4: 00000000000606e0
            [ 3758.253294] Call Trace:
            [ 3758.253823]  [<ffffffffc09a8ec3>] ? ldiskfs_xattr_inode_iget+0x143/0x180 [ldiskfs]
            [ 3758.255145]  [<ffffffffc09a8f35>] ldiskfs_xattr_inode_get+0x35/0x130 [ldiskfs]
            [ 3758.256424]  [<ffffffffc09a92a0>] ldiskfs_xattr_ibody_get+0x160/0x1a0 [ldiskfs]
            [ 3758.257699]  [<ffffffffc09a934d>] ldiskfs_xattr_get+0x6d/0x360 [ldiskfs]
            [ 3758.258896]  [<ffffffffb6ccfeb4>] ? __wake_up+0x44/0x50
            [ 3758.259854]  [<ffffffffc09e805a>] ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs]
            [ 3758.261127]  [<ffffffffb6e6a522>] generic_getxattr+0x52/0x70
            [ 3758.262199]  [<ffffffffc1075bae>] osd_xattr_get+0x17e/0x7e0 [osd_ldiskfs]
            [ 3758.263410]  [<ffffffffc097be00>] ? fld_server_lookup+0x20/0x320 [fld]
            [ 3758.264562]  [<ffffffffb6cddd9e>] ? account_entity_dequeue+0xae/0xd0
            [ 3758.265695]  [<ffffffffc116aa8f>] __lfsck_links_read+0x7f/0x2d0 [lfsck]
            [ 3758.266859]  [<ffffffffc117707b>] lfsck_namespace_assistant_handler_p1+0x3fb/0x2090 [lfsck]
            [ 3758.268348]  [<ffffffffb6c2a59e>] ? __switch_to+0xce/0x580
            [ 3758.269326]  [<ffffffffb6cd1a00>] ? rq_attach_root+0x90/0x110
            [ 3758.270322]  [<ffffffffc115a0bf>] lfsck_assistant_engine+0x3cf/0x20c0 [lfsck]
            [ 3758.271576]  [<ffffffffb6cd7c40>] ? wake_up_state+0x20/0x20
            [ 3758.272572]  [<ffffffffc1159cf0>] ? lfsck_master_engine+0x1370/0x1370 [lfsck]
            [ 3758.273818]  [<ffffffffb6cc2e81>] kthread+0xd1/0xe0
            [ 3758.274695]  [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40
            [ 3758.275787]  [<ffffffffb7377c37>] ret_from_fork_nospec_begin+0x21/0x21
            [ 3758.276929]  [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40
            [ 3758.277994] Code: 00 00 00 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 31 c0 <48> 81 7f 20 40 e4 42 b7 48 89 e5 5d 0f 94 c0 c3 0f 1f 84 00 00 
            [ 3758.283591] RIP  [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20
            [ 3758.284617]  RSP <ffff8e8c9ac679e8>
            [ 3758.285244] CR2: 000000000000001b
            

            Looks like this time it's an xattr inode failed checksum verify, and it crashed as a result:

            struct inode *ext4_xattr_inode_iget(struct inode *parent,
                                                unsigned long ea_ino, int *err)
            {
                    struct inode *ea_inode = NULL;        ea_inode = ext4_iget(parent->i_sb, ea_ino);
                    if (IS_ERR(ea_inode) || is_bad_inode(ea_inode)) {
                            int rc = IS_ERR(ea_inode) ? PTR_ERR(ea_inode) : 0;
                            ext4_error(parent->i_sb, "error while reading EA inode %lu "
                                       "/ %d %d", ea_ino, rc, is_bad_inode(ea_inode));
                            *err = rc != 0 ? rc : -EIO;
                            return NULL;
                    }
            

            the ea_inode is an error value, the second call to is_bad_inode will crash.

            Looks like we need upstream commit bab79b04999ccbbf59f1693d0783cd6ae27e4278 for rhel7 kernels to stop it crashing after the checksum verify failure from ext4_iget...

            dongyang Dongyang Li added a comment - The debug patch triggered another error with metadata_csum: https://testing.whamcloud.com/test_sessions/2985fd8c-c1cb-4c8a-887b-bbb051056eb3 and it crashed this time: [ 3755.461295] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a [ 3755.793142] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a [ 3756.134861] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a [ 3756.474540] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a [ 3756.820665] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a [ 3757.166268] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x8000019a [ 3757.504059] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0 [ 3757.826509] Lustre: DEBUG MARKER: /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -A -C -t namespace [ 3758.000591] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm OI_scrub: checksum invalid [ 3758.002430] Aborting journal on device dm-3-8. [ 3758.003354] LDISKFS-fs (dm-3): Remounting filesystem read-only [ 3758.003511] LustreError: 7527:0:(osd_handler.c:1727:osd_trans_commit_cb()) transaction @0xffff8e8c9c06c200 commit error: 2 [ 3758.015734] LustreError: 18534:0:(scrub.c:236:scrub_file_store()) lustre-MDT0000: store scrub file: rc = -30 [ 3758.175022] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace | [ 3758.175022] awk '/^status/ { print $2 }' [ 3758.210580] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck: checksum invalid [ 3758.213788] LDISKFS-fs error (device dm-3): ldiskfs_iget:4459: inode #1370: comm lfsck_namespace: checksum invalid [ 3758.216336] BUG: unable to handle kernel NULL pointer dereference at 000000000000001b [ 3758.217805] IP: [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20 [ 3758.218855] PGD 0 [ 3758.219258] Oops: 0000 [#1] SMP [ 3758.219944] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel parport_pc lrw gf128mul glue_helper virtio_balloon ablk_helper parport cryptd joydev pcspkr i2c_piix4 ip_tables ext4 mbcache jbd2 ata_generic pata_acpi virtio_blk 8139too crct10dif_pclmul crct10dif_common ata_piix crc32c_intel serio_raw virtio_pci libata virtio_ring virtio 8139cp mii floppy [last unloaded: llog_test] [ 3758.236589] CPU: 1 PID: 18535 Comm: lfsck_namespace Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.x86_64 #1 [ 3758.238786] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 3758.239778] task: ffff8e8c9d5be180 ti: ffff8e8c9ac64000 task.ti: ffff8e8c9ac64000 [ 3758.241087] RIP: 0010:[<ffffffffb6e61d48>] [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20 [ 3758.242505] RSP: 0018:ffff8e8c9ac679e8 EFLAGS: 00010246 [ 3758.243450] RAX: 0000000000000000 RBX: fffffffffffffffb RCX: 0000000000000000 [ 3758.244686] RDX: 0000000000000002 RSI: ffff8e8ca04cdec0 RDI: fffffffffffffffb [ 3758.245919] RBP: ffff8e8c9ac67a20 R08: ffff8e8c9accb898 R09: ffff8e8c9d5be180 [ 3758.247150] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e8ca066d1a8 [ 3758.248392] R13: 000000000000055a R14: ffff8e8c9ac67a48 R15: ffff8e8ca066d1a8 [ 3758.249626] FS: 0000000000000000(0000) GS:ffff8e8cbfd00000(0000) knlGS:0000000000000000 [ 3758.251037] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3758.252024] CR2: 000000000000001b CR3: 00000000780a8000 CR4: 00000000000606e0 [ 3758.253294] Call Trace: [ 3758.253823] [<ffffffffc09a8ec3>] ? ldiskfs_xattr_inode_iget+0x143/0x180 [ldiskfs] [ 3758.255145] [<ffffffffc09a8f35>] ldiskfs_xattr_inode_get+0x35/0x130 [ldiskfs] [ 3758.256424] [<ffffffffc09a92a0>] ldiskfs_xattr_ibody_get+0x160/0x1a0 [ldiskfs] [ 3758.257699] [<ffffffffc09a934d>] ldiskfs_xattr_get+0x6d/0x360 [ldiskfs] [ 3758.258896] [<ffffffffb6ccfeb4>] ? __wake_up+0x44/0x50 [ 3758.259854] [<ffffffffc09e805a>] ldiskfs_xattr_trusted_get+0x2a/0x30 [ldiskfs] [ 3758.261127] [<ffffffffb6e6a522>] generic_getxattr+0x52/0x70 [ 3758.262199] [<ffffffffc1075bae>] osd_xattr_get+0x17e/0x7e0 [osd_ldiskfs] [ 3758.263410] [<ffffffffc097be00>] ? fld_server_lookup+0x20/0x320 [fld] [ 3758.264562] [<ffffffffb6cddd9e>] ? account_entity_dequeue+0xae/0xd0 [ 3758.265695] [<ffffffffc116aa8f>] __lfsck_links_read+0x7f/0x2d0 [lfsck] [ 3758.266859] [<ffffffffc117707b>] lfsck_namespace_assistant_handler_p1+0x3fb/0x2090 [lfsck] [ 3758.268348] [<ffffffffb6c2a59e>] ? __switch_to+0xce/0x580 [ 3758.269326] [<ffffffffb6cd1a00>] ? rq_attach_root+0x90/0x110 [ 3758.270322] [<ffffffffc115a0bf>] lfsck_assistant_engine+0x3cf/0x20c0 [lfsck] [ 3758.271576] [<ffffffffb6cd7c40>] ? wake_up_state+0x20/0x20 [ 3758.272572] [<ffffffffc1159cf0>] ? lfsck_master_engine+0x1370/0x1370 [lfsck] [ 3758.273818] [<ffffffffb6cc2e81>] kthread+0xd1/0xe0 [ 3758.274695] [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40 [ 3758.275787] [<ffffffffb7377c37>] ret_from_fork_nospec_begin+0x21/0x21 [ 3758.276929] [<ffffffffb6cc2db0>] ? insert_kthread_work+0x40/0x40 [ 3758.277994] Code: 00 00 00 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 b8 fb ff ff ff 48 89 e5 5d c3 66 66 66 66 90 55 31 c0 <48> 81 7f 20 40 e4 42 b7 48 89 e5 5d 0f 94 c0 c3 0f 1f 84 00 00 [ 3758.283591] RIP [<ffffffffb6e61d48>] is_bad_inode+0x8/0x20 [ 3758.284617] RSP <ffff8e8c9ac679e8> [ 3758.285244] CR2: 000000000000001b Looks like this time it's an xattr inode failed checksum verify, and it crashed as a result: struct inode *ext4_xattr_inode_iget(struct inode *parent, unsigned long ea_ino, int *err) { struct inode *ea_inode = NULL; ea_inode = ext4_iget(parent->i_sb, ea_ino); if (IS_ERR(ea_inode) || is_bad_inode(ea_inode)) { int rc = IS_ERR(ea_inode) ? PTR_ERR(ea_inode) : 0; ext4_error(parent->i_sb, "error while reading EA inode %lu " "/ %d %d" , ea_ino, rc, is_bad_inode(ea_inode)); *err = rc != 0 ? rc : -EIO; return NULL; } the ea_inode is an error value, the second call to is_bad_inode will crash. Looks like we need upstream commit bab79b04999ccbbf59f1693d0783cd6ae27e4278 for rhel7 kernels to stop it crashing after the checksum verify failure from ext4_iget...

            Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/36402
            Subject: LU-12827 ldiskfs: debug patch
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e07f42eb8e830b99a530da23630ed2bce6c288f7

            gerrit Gerrit Updater added a comment - Li Dongyang (dongyangli@ddn.com) uploaded a new patch: https://review.whamcloud.com/36402 Subject: LU-12827 ldiskfs: debug patch Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e07f42eb8e830b99a530da23630ed2bce6c288f7

            This appears to have been related to the patch being tested.

            adilger Andreas Dilger added a comment - This appears to have been related to the patch being tested.

            People

              dongyang Dongyang Li
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated: