Lustre / LU-11280

sanity: test_56w: unable to handle kernel paging request in lod_qos_prep_create

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      This issue was created by maloo for S Buisson <sbuisson@ddn.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/299d644c-a6de-11e8-a5f2-52540065bddc

      test_56w failed because the MDS crashed with the following stack trace:

      [ 2736.998005] BUG: unable to handle kernel paging request at ffff9ed8c35f3714
      [ 2736.998890] IP: [<ffffffffc10884e3>] lod_qos_prep_create+0x5d3/0x17a0 [lod]
      [ 2736.999731] PGD 5442e067 PUD 0 
      [ 2737.000131] Oops: 0000 [#1] SMP 
      [ 2737.000531] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core sunrpc dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw ppdev gf128mul glue_helper ablk_helper cryptd i2c_piix4 parport_pc joydev i2c_core pcspkr virtio_balloon parport ip_tables ext4 ata_generic mbcache pata_acpi jbd2 ata_piix virtio_blk libata 8139too crct10dif_pclmul crct10dif_common
      [ 2737.009634]  crc32c_intel floppy serio_raw virtio_pci 8139cp virtio_ring virtio mii
      [ 2737.010477] CPU: 0 PID: 4463 Comm: mdt00_000 Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [ 2737.011730] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 2737.012337] task: ffff9ed89ec00000 ti: ffff9ed8a5aec000 task.ti: ffff9ed8a5aec000
      [ 2737.013118] RIP: 0010:[<ffffffffc10884e3>]  [<ffffffffc10884e3>] lod_qos_prep_create+0x5d3/0x17a0 [lod]
      [ 2737.014122] RSP: 0018:ffff9ed8a5aef5f0  EFLAGS: 00010293
      [ 2737.014671] RAX: ffff9ed8b84bfdb0 RBX: 000000005899cae8 RCX: ffff9ed85899cad8
      [ 2737.015404] RDX: 0000000000000002 RSI: 0000000000000004 RDI: ffff9ed89c1fd200
      [ 2737.016147] RBP: ffff9ed8a5aef6e0 R08: ffff9ed89e2bc550 R09: 0000000000000000
      [ 2737.016876] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000000
      [ 2737.017620] R13: 0000000000000002 R14: ffff9ed8b7e00000 R15: ffff9ed8b97a1fc0
      [ 2737.018378] FS:  0000000000000000(0000) GS:ffff9ed8bfc00000(0000) knlGS:0000000000000000
      [ 2737.019224] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 2737.019832] CR2: ffff9ed8c35f3714 CR3: 0000000036372000 CR4: 00000000000606f0
      [ 2737.020590] Call Trace:
      [ 2737.020908]  [<ffffffffc0d74d70>] ? qsd_op_begin+0xb0/0x4d0 [lquota]
      [ 2737.021581]  [<ffffffffc10898c5>] lod_prepare_create+0x215/0x2e0 [lod]
      [ 2737.022273]  [<ffffffffc107b7ee>] lod_declare_striped_create+0x1ee/0x980 [lod]
      [ 2737.023039]  [<ffffffffc1089e8f>] ? lod_sub_declare_create+0xdf/0x210 [lod]
      [ 2737.023764]  [<ffffffffc107fec4>] lod_declare_create+0x204/0x590 [lod]
      [ 2737.024456]  [<ffffffffc106f412>] ? lod_striping_from_default+0x492/0x5b0 [lod]
      [ 2737.025335]  [<ffffffffc094e9d9>] ? lu_context_refill+0x19/0x50 [obdclass]
      [ 2737.026101]  [<ffffffffc10f2892>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
      [ 2737.026909]  [<ffffffffc10e21c8>] mdd_declare_create+0x48/0xc10 [mdd]
      [ 2737.027590]  [<ffffffffc10e65e9>] mdd_create+0x929/0x13f0 [mdd]
      [ 2737.028284]  [<ffffffffc0f91e37>] mdt_reint_open+0x2117/0x3160 [mdt]
      [ 2737.028973]  [<ffffffffc09634af>] ? upcall_cache_get_entry+0x3df/0x8b0 [obdclass]
      [ 2737.029767]  [<ffffffffc0f85ce3>] mdt_reint_rec+0x83/0x210 [mdt]
      [ 2737.030409]  [<ffffffffc0f651d2>] mdt_reint_internal+0x6b2/0xa80 [mdt]
      [ 2737.031100]  [<ffffffffc0f716c2>] mdt_intent_open+0x82/0x350 [mdt]
      [ 2737.031759]  [<ffffffffc092d6f9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
      [ 2737.032527]  [<ffffffffc0f6f768>] mdt_intent_policy+0x2f8/0xd10 [mdt]
      [ 2737.033224]  [<ffffffffc0f71640>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
      [ 2737.034120]  [<ffffffffc0b39e9e>] ldlm_lock_enqueue+0x34e/0xa50 [ptlrpc]
      [ 2737.034863]  [<ffffffffc07516ee>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
      [ 2737.035592]  [<ffffffffc0b62483>] ldlm_handle_enqueue0+0x903/0x1520 [ptlrpc]
      [ 2737.036373]  [<ffffffffc0b8a2d0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      [ 2737.037224]  [<ffffffffc0be8932>] tgt_enqueue+0x62/0x210 [ptlrpc]
      [ 2737.037901]  [<ffffffffc0bf127a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [ 2737.038639]  [<ffffffffc0748ee7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [ 2737.039364]  [<ffffffffc0b9440b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [ 2737.040188]  [<ffffffff972c52ab>] ? __wake_up_common+0x5b/0x90
      [ 2737.040823]  [<ffffffffc0b97c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [ 2737.041488]  [<ffffffff972c9e50>] ? finish_task_switch+0x50/0x170
      [ 2737.042152]  [<ffffffffc0b97130>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
      [ 2737.042933]  [<ffffffff972bb621>] kthread+0xd1/0xe0
      [ 2737.043459]  [<ffffffff972bb550>] ? insert_kthread_work+0x40/0x40
      [ 2737.044119]  [<ffffffff979205f7>] ret_from_fork_nospec_begin+0x21/0x21
      [ 2737.044804]  [<ffffffff972bb550>] ? insert_kthread_work+0x40/0x40
      [ 2737.045452] Code: ff 89 1c 90 8b 45 98 31 d2 0f b7 4f 24 83 c0 01 f7 f1 41 39 cd 89 55 98 7d 25 48 8b 4f 30 8b 45 98 8b 1c 81 49 8b 86 88 09 00 00 <0f> a3 58 08 19 c0 85 c0 0f 85 17 ff ff ff 41 bc ed ff ff ff f6 
      [ 2737.048776] RIP  [<ffffffffc10884e3>] lod_qos_prep_create+0x5d3/0x17a0 [lod]
      [ 2737.049546]  RSP <ffff9ed8a5aef5f0>
      [ 2737.049919] CR2: ffff9ed8c35f3714
      

          Activity

            pjones Peter Jones added a comment -

            Great news! Then let's close this issue out as a duplicate of LU-11279

            bobijam Zhenyu Xu added a comment - - edited

            Yes, I can reproduce it by running only test 27H followed by 56w after removing the LU-11279 patch, and I do not hit it with the LU-11279 patch in place. So I believe, and have verified on my VM, that LU-11279 fixes this issue.

            wshilong Wang Shilong (Inactive) added a comment -

            I just retriggered https://review.whamcloud.com/#/c/33069/6 to check
            whether that patch fixes the problem.

            It looks strange that 56w takes the path to lod_alloc_ost_list(), and
            what the above patch tries to address might be what makes that happen.

            wshilong Wang Shilong (Inactive) added a comment -

            I am wondering whether the problem addressed by https://review.whamcloud.com/#/c/33069/ might be the cause here, in which case that patch would fix it.

            It is worth running sanity test_27H together with test_56w to try to reproduce this bug.

            gerrit Gerrit Updater added a comment -

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33105
            Subject: LU-11280 revert: fix setstripe for specific osts upon dir"
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 799514054df0f2c71d98e1480e31d2ccfc3b6713

            adilger Andreas Dilger added a comment -

            It looks like patch https://review.whamcloud.com/32814 "LU-11146 lustre: fix setstripe for specific osts upon dir" is responsible for this failure. That patch failed once with this stack on 2018-08-07 before it was landed, and no other tests failed with the same stack until 2018-08-23, after it was landed.
            bobijam Zhenyu Xu added a comment -

            I tried but failed to reproduce it. The memory access fault seems to be located in lod_alloc_ost_list():

            ...
                            /*
                             * We've successfully declared (reserved) an object
                             */
                            lod_qos_ost_in_use(env, stripe_count, ost_idx);
                            stripe[stripe_count] = o;
                            ost_indices[stripe_count] = ost_idx;              // memory access fault here it seems.
                            stripe_count++;
                     }
            
                    RETURN(rc);
            }
            

            I haven't found the code defect here yet. The ost_indices and stripe arrays have been allocated enough space, and stripe_count should not go beyond the arrays' boundaries.
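
The register dump in the description allows a rough cross-check of the faulting access. This is a sketch under stated assumptions, not a confirmed diagnosis: it assumes the marked bytes `<0f> a3 58 08` on the Code: line decode as `bt %ebx,0x8(%rax)`, and that a register-offset `bt` on a 32-bit memory operand touches the word at base + displacement + 4 * (bit offset / 32). The helper names below are hypothetical and exist only for this illustration; the constants are the RAX, RBX and CR2 values from the oops.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 ModRM byte. */
struct modrm {
	unsigned mod; /* 1 means base register plus 8-bit displacement */
	unsigned reg; /* register operand; 3 is EBX/RBX */
	unsigned rm;  /* base register; 0 is EAX/RAX */
};

/* Split a ModRM byte into its three bit fields (2 + 3 + 3 bits). */
static struct modrm decode_modrm(uint8_t byte)
{
	struct modrm m = {
		.mod = (unsigned)(byte >> 6) & 0x3,
		.reg = (unsigned)(byte >> 3) & 0x7,
		.rm  = (unsigned)byte & 0x7,
	};
	return m;
}

/*
 * Address touched by "bt %reg32, disp8(%base)": the 32-bit word that
 * holds the requested bit. The bit offset is signed, so round the word
 * index toward negative infinity.
 */
static uint64_t bt_mem_address(uint64_t base, int8_t disp, int32_t bit_offset)
{
	int64_t word = bit_offset / 32;

	if (bit_offset % 32 < 0)
		word--;
	return base + (uint64_t)(int64_t)disp + (uint64_t)(word * 4);
}

/* Compare the computed address with CR2 from the oops (assumed decode). */
static void check_against_oops(void)
{
	struct modrm m = decode_modrm(0x58);

	assert(m.mod == 1 && m.reg == 3 && m.rm == 0);
	/* RAX = ffff9ed8b84bfdb0, displacement 8, RBX = 000000005899cae8 */
	assert(bt_mem_address(0xffff9ed8b84bfdb0ULL, 0x08, 0x5899cae8) ==
	       0xffff9ed8c35f3714ULL);
}
```

Under these assumptions the computed address equals CR2 (ffff9ed8c35f3714). That would suggest the fault is a bit test using the value 0x5899cae8 as a bit index, which is far larger than any plausible OST index, rather than the store to ost_indices. This is an inference from the register dump alone and has not been checked against the source.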

            adilger Andreas Dilger added a comment -

            About 11% of review-ldiskfs sessions are currently failing because of this.

            adilger Andreas Dilger added a comment -

            Hitting this fairly often in testing.

            People

              Assignee: bobijam Zhenyu Xu
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 7