Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9994

sanity-scrub test_5: test failed to respond and timed out

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.11.0
    • None
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/35ecbbdc-99c2-11e7-b778-5254006e85c2.

      The sub-test test_5 failed with the following error:

      test failed to respond and timed out
      

      This failure is a panic on MDS. It has a different stack trace than LU-5449 so I'm creating a new ticket.
      Panic seen:

      20:49:39:[15096.775030] LustreError: 1321:0:(llog_cat.c:269:llog_cat_id2handle()) lustre-MDT0000-osp-MDT0002: error opening log id [0x2:0x402:0x2]:0: rc = -2
      20:49:39:[15096.776495] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
      20:49:39:[15096.777310] IP: [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
      20:49:39:[15096.778071] PGD 78ab5067 PUD 78ab4067 PMD 0 
      20:49:39:[15096.778551] Oops: 0000 [#1] SMP 
      20:49:39:[15096.778921] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel nfsd aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd i2c_piix4 nfs_acl lockd joydev pcspkr i2c_core virtio_balloon auth_rpcgss grace parport_pc parport sunrpc ip_tables ata_generic pata_acpi ext4 mbcache jbd2 ata_piix virtio_blk libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci 8139cp virtio_ring virtio mii floppy
      20:49:39:[15096.788032] CPU: 1 PID: 1321 Comm: lod0002_rec0000 Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.x86_64 #1
      20:49:39:[15096.789136] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      20:49:39:[15096.789694] task: ffff880067595ee0 ti: ffff8800606bc000 task.ti: ffff8800606bc000
      20:49:39:[15096.790413] RIP: 0010:[<ffffffffc07d002a>]  [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
      20:49:39:[15096.791385] RSP: 0018:ffff8800606bfb28  EFLAGS: 00010246
      20:49:39:[15096.791910] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800606bffd8
      20:49:39:[15096.792588] RDX: ffff8800606bfbb8 RSI: 0000000000000000 RDI: ffff88006797f180
      20:49:39:[15096.793279] RBP: ffff8800606bfbe0 R08: 0000000000019be0 R09: ffff88007d001a00
      20:49:39:[15096.793977] R10: ffffffffc07d14a6 R11: 000000000000000f R12: 0000000000000000
      20:49:39:[15096.794662] R13: ffff8800667ebc00 R14: ffff88006797f180 R15: 0000000000000000
      20:49:39:[15096.795352] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      20:49:39:[15096.796149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      20:49:39:[15096.796704] CR2: 0000000000000060 CR3: 0000000077c09000 CR4: 00000000000406e0
      20:49:39:[15096.797387] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      20:49:39:[15096.798084] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      20:49:39:[15096.798778] Stack:
      20:49:39:[15096.798987]  ffff8800606bfe58 ffff8800606bfb90 ffffffffc05eeba7 ffff880000000010
      20:49:39:[15096.799777]  ffff8800606bfba0 ffff8800606bfb60 0000000075f08900 ffff88007a318000
      20:49:39:[15096.800574]  ffff88006797f180 0000000000000001 0000000000000001 0000000000000402
      20:49:39:[15096.801374] Call Trace:
      20:49:39:[15096.801634]  [<ffffffffc05eeba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      20:49:39:[15096.802295]  [<ffffffffc07d5ccf>] ? llog_cat_cleanup+0x15f/0x380 [obdclass]
      20:49:39:[15096.802990]  [<ffffffffc0eea6b0>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
      20:49:39:[15096.803639]  [<ffffffffc07d150c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      20:49:39:[15096.804337]  [<ffffffffc07d6a3a>] llog_cat_process_cb+0x20a/0x220 [obdclass]
      20:49:39:[15096.805040]  [<ffffffffc07d0865>] llog_process_thread+0x875/0x1460 [obdclass]
      20:49:39:[15096.805752]  [<ffffffffc07d6830>] ? llog_cat_process_common+0x440/0x440 [obdclass]
      20:49:39:[15096.806488]  [<ffffffffc07d150c>] llog_process_or_fork+0xbc/0x450 [obdclass]
      20:49:39:[15096.807184]  [<ffffffffc07d6830>] ? llog_cat_process_common+0x440/0x440 [obdclass]
      20:49:39:[15096.807936]  [<ffffffffc07d59b9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
      20:49:39:[15096.808673]  [<ffffffffc0f1ae2a>] ? lod_sub_prep_llog+0x24a/0x783 [lod]
      20:49:39:[15096.809322]  [<ffffffffc0eea6b0>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
      20:49:39:[15096.809976]  [<ffffffffc07d5aee>] llog_cat_process+0x2e/0x30 [obdclass]
      20:49:39:[15096.810613]  [<ffffffffc0ee6a89>] lod_sub_recovery_thread+0x439/0xc80 [lod]
      20:49:39:[15096.811300]  [<ffffffffc0ee6650>] ? lod_trans_stop+0x340/0x340 [lod]
      20:49:39:[15096.811920]  [<ffffffff810b098f>] kthread+0xcf/0xe0
      20:49:39:[15096.812393]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      20:49:39:[15096.812987]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
      20:49:39:[15096.813516]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      20:49:39:[15096.814113] Code: 41 54 53 48 81 ec 90 00 00 00 4c 8b 27 48 8b 47 18 65 48 8b 34 25 28 00 00 00 48 89 75 d0 31 f6 f6 05 d2 d5 e3 ff 01 48 89 7d 88 <4d> 8b 6c 24 60 48 89 45 80 c7 45 c4 00 00 00 00 74 0d f6 05 b9 
      20:49:39:[15096.817178] RIP  [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
      20:49:39:[15096.817925]  RSP <ffff8800606bfb28>
      20:49:39:[15096.818270] CR2: 0000000000000060
      

      Info required for matching: sanity-scrub 5

      Attachments

        Issue Links

          Activity

            [LU-9994] sanity-scrub test_5: test failed to respond and timed out

            Only one of the two patches needed to land.

            adilger Andreas Dilger added a comment - Only one of the two patches needed to land.

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29370/
            Subject: LU-9994 obdclass: fix llog_cat_id2handle() error handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 97fe84143da81d42770ae29298403aad80103f94

            gerrit Gerrit Updater added a comment - Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29370/ Subject: LU-9994 obdclass: fix llog_cat_id2handle() error handling Project: fs/lustre-release Branch: master Current Patch Set: Commit: 97fe84143da81d42770ae29298403aad80103f94

            I see two proposed fixes here:
            https://review.whamcloud.com/29370
            https://review.whamcloud.com/29616

            Are both needed? If not is one preferred over the other?

            bogl Bob Glossman (Inactive) added a comment - I see two proposed fixes here: https://review.whamcloud.com/29370 https://review.whamcloud.com/29616 Are both needed? If not is one preferred over the other?

            I made the patch because most of the LU-7585 patches failed on Maloo. I am not sure whether this patch can resolve the issue or not. But let's try.

            yong.fan nasf (Inactive) added a comment - I made the patch because most of the LU-7585 patches failed on Maloo. I am not sure whether this patch can resolve the issue or not. But let's try.

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/29616
            Subject: LU-9994 obdclass: skip llog_process for empty handle
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: bed20ef1b81873a2d7991f84c7fc7a165ecd8b8c

            gerrit Gerrit Updater added a comment - Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/29616 Subject: LU-9994 obdclass: skip llog_process for empty handle Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: bed20ef1b81873a2d7991f84c7fc7a165ecd8b8c

            Upgrading priority to Major as it appears to have some impact on auto-tests.

            bfaccini Bruno Faccini (Inactive) added a comment - Upgrading priority to Major as it appears to have some impact on auto-tests.

            BTW, at the moment, there is a running serie of 57 identical failures/timeouts of sanity-scrub/test_5 being reported in Maloo and starting on July 19th (date when "LU-9153 llog: consolidate common error checking" has landed on master !!).

            bfaccini Bruno Faccini (Inactive) added a comment - BTW, at the moment, there is a running serie of 57 identical failures/timeouts of sanity-scrub/test_5 being reported in Maloo and starting on July 19th (date when " LU-9153 llog: consolidate common error checking" has landed on master !!).

            Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29370
            Subject: LU-9994 obdclass: fix llog_cat_id2handle() error handling
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e9c7d9a668f01b4c8db741a6a92f037e5bfa1982

            gerrit Gerrit Updater added a comment - Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29370 Subject: LU-9994 obdclass: fix llog_cat_id2handle() error handling Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: e9c7d9a668f01b4c8db741a6a92f037e5bfa1982

            Hmm, looks like patch from LU-9153 ("llog: consolidate common error checking") has introduced a regression in llog_cat_id2handle() error handling path (see error/msg just preceding the crash) by adding llog_cat_process_common() common routine additional call in sequence :

            static int llog_cat_process_common(const struct lu_env *env,
                                               struct llog_handle *cat_llh,
                                               struct llog_rec_hdr *rec,
                                               struct llog_handle **llhp)
            {
                    struct llog_logid_rec *lir = container_of(rec, typeof(*lir), lid_hdr);
                    struct llog_log_hdr *hdr;
                    int rc;
                     
                    ENTRY;
                    if (rec->lrh_type != le32_to_cpu(LLOG_LOGID_MAGIC)) {
                            rc = -EINVAL;
                            CWARN("%s: invalid record in catalog "DFID":%x: rc = %d\n",
                                  cat_llh->lgh_ctxt->loc_obd->obd_name,
                                  PFID(&cat_llh->lgh_id.lgl_oi.oi_fid),
                                  cat_llh->lgh_id.lgl_ogen, rc);
                            RETURN(rc);
                    } 
                    CDEBUG(D_HA, "processing log "DFID":%x at index %u of catalog "DFID"\n",
                           PFID(&lir->lid_id.lgl_oi.oi_fid), lir->lid_id.lgl_ogen,
                           le32_to_cpu(rec->lrh_index),
                           PFID(&cat_llh->lgh_id.lgl_oi.oi_fid));
                
                    rc = llog_cat_id2handle(env, cat_llh, llhp, &lir->lid_id);
                    if (rc) {
                            /* After a server crash, a stub of index record in catlog could
                             * be kept, because plain log destroy + catlog index record
                             * deletion are not atomic. So we end up with an index but no
                             * actual record. Destroy the index and move on. */
                            if (rc == -ENOENT || rc == -ESTALE)
                                    rc = llog_cat_cleanup(env, cat_llh, NULL,  <<<<<<<<< rc can be cleared!!!
                                                          rec->lrh_index);
                            if (rc)
                                    CWARN("%s: can't find llog handle "DFID":%x: rc = %d\n",
                                          cat_llh->lgh_ctxt->loc_obd->obd_name,
                                          PFID(&lir->lid_id.lgl_oi.oi_fid),
                                          lir->lid_id.lgl_ogen, rc);
            
                            RETURN(rc);
                    }
            .......................
            
            static int llog_cat_process_cb(const struct lu_env *env,
                                           struct llog_handle *cat_llh,
                                           struct llog_rec_hdr *rec, void *data)
            {
                    struct llog_process_data *d = data;
                    struct llog_handle *llh = NULL;
                    int rc;
            
                    ENTRY;
                    rc = llog_cat_process_common(env, cat_llh, rec, &llh);
                    if (rc) <<<<<<<<<<<<<<< no error handling??
                            GOTO(out, rc);
            
                    if (rec->lrh_index < d->lpd_startcat) {
                            /* Skip processing of the logs until startcat */
                            rc = 0;
                    } else if (d->lpd_startidx > 0) {
                            struct llog_process_cat_data cd;
                    
                            cd.lpcd_first_idx = d->lpd_startidx;
                            cd.lpcd_last_idx = 0;
                            rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data,  <<< can be called with llh==NULL 
                                                      &cd, false);
                            /* Continue processing the next log from idx 0 */
                            d->lpd_startidx = 0;
                    } else {
                            rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data, <<< can be called with llh==NULL
                                                      NULL, false);
                    }
                
            ............................
            
            

            Will try to cook a patch to fix this asap.

            bfaccini Bruno Faccini (Inactive) added a comment - Hmm, looks like patch from LU-9153 ("llog: consolidate common error checking") has introduced a regression in llog_cat_id2handle() error handling path (see error/msg just preceding the crash) by adding llog_cat_process_common() common routine additional call in sequence : static int llog_cat_process_common(const struct lu_env *env, struct llog_handle *cat_llh, struct llog_rec_hdr *rec, struct llog_handle **llhp) { struct llog_logid_rec *lir = container_of(rec, typeof(*lir), lid_hdr); struct llog_log_hdr *hdr; int rc; ENTRY; if (rec->lrh_type != le32_to_cpu(LLOG_LOGID_MAGIC)) { rc = -EINVAL; CWARN("%s: invalid record in catalog "DFID":%x: rc = %d\n", cat_llh->lgh_ctxt->loc_obd->obd_name, PFID(&cat_llh->lgh_id.lgl_oi.oi_fid), cat_llh->lgh_id.lgl_ogen, rc); RETURN(rc); } CDEBUG(D_HA, "processing log "DFID":%x at index %u of catalog "DFID"\n", PFID(&lir->lid_id.lgl_oi.oi_fid), lir->lid_id.lgl_ogen, le32_to_cpu(rec->lrh_index), PFID(&cat_llh->lgh_id.lgl_oi.oi_fid)); rc = llog_cat_id2handle(env, cat_llh, llhp, &lir->lid_id); if (rc) { /* After a server crash, a stub of index record in catlog could * be kept, because plain log destroy + catlog index record * deletion are not atomic. So we end up with an index but no * actual record. Destroy the index and move on. */ if (rc == -ENOENT || rc == -ESTALE) rc = llog_cat_cleanup(env, cat_llh, NULL, <<<<<<<<< rc can be cleared!!! rec->lrh_index); if (rc) CWARN("%s: can't find llog handle "DFID":%x: rc = %d\n", cat_llh->lgh_ctxt->loc_obd->obd_name, PFID(&lir->lid_id.lgl_oi.oi_fid), lir->lid_id.lgl_ogen, rc); RETURN(rc); } ....................... static int llog_cat_process_cb(const struct lu_env *env, struct llog_handle *cat_llh, struct llog_rec_hdr *rec, void *data) { struct llog_process_data *d = data; struct llog_handle *llh = NULL; int rc; ENTRY; rc = llog_cat_process_common(env, cat_llh, rec, &llh); if (rc) <<<<<<<<<<<<<<< no error handling?? GOTO(out, rc); if (rec->lrh_index < d->lpd_startcat) { /* Skip processing of the logs until startcat */ rc = 0; } else if (d->lpd_startidx > 0) { struct llog_process_cat_data cd; cd.lpcd_first_idx = d->lpd_startidx; cd.lpcd_last_idx = 0; rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data, <<< can be called with llh==NULL &cd, false); /* Continue processing the next log from idx 0 */ d->lpd_startidx = 0; } else { rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data, <<< can be called with llh==NULL NULL, false); } ............................ Will try to cook a patch to fix this asap.
            aboyko Alexander Boyko added a comment - - edited

            I guess that the last line failed, lloghandle is NULL

            static int llog_process_thread(void *arg)
            {
                    struct llog_process_info        *lpi = arg;
                    struct llog_handle              *loghandle = lpi->lpi_loghandle;
                    struct llog_log_hdr             *llh = loghandle->lgh_hdr;   <====
            
            

             

            aboyko Alexander Boyko added a comment - - edited I guess that the last line failed, lloghandle is NULL static int llog_process_thread(void *arg) {         struct llog_process_info        *lpi = arg;         struct llog_handle              *loghandle = lpi->lpi_loghandle;         struct llog_log_hdr             *llh = loghandle->lgh_hdr; <====  

            People

              bfaccini Bruno Faccini (Inactive)
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: