[LU-9994] sanity-scrub test_5: test failed to respond and timed out Created: 15/Sep/17  Updated: 07/Nov/17  Resolved: 07/Nov/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Major
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10113 recovery-small test_111: MDS panic af... Open
is related to LU-9153 improve FID/config name specification... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/35ecbbdc-99c2-11e7-b778-5254006e85c2.

The sub-test test_5 failed with the following error:

test failed to respond and timed out

This failure is a panic on MDS. It has a different stack trace than LU-5449 so I'm creating a new ticket.
Panic seen:

20:49:39:[15096.775030] LustreError: 1321:0:(llog_cat.c:269:llog_cat_id2handle()) lustre-MDT0000-osp-MDT0002: error opening log id [0x2:0x402:0x2]:0: rc = -2
20:49:39:[15096.776495] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
20:49:39:[15096.777310] IP: [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
20:49:39:[15096.778071] PGD 78ab5067 PUD 78ab4067 PMD 0 
20:49:39:[15096.778551] Oops: 0000 [#1] SMP 
20:49:39:[15096.778921] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel nfsd aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd i2c_piix4 nfs_acl lockd joydev pcspkr i2c_core virtio_balloon auth_rpcgss grace parport_pc parport sunrpc ip_tables ata_generic pata_acpi ext4 mbcache jbd2 ata_piix virtio_blk libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci 8139cp virtio_ring virtio mii floppy
20:49:39:[15096.788032] CPU: 1 PID: 1321 Comm: lod0002_rec0000 Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.x86_64 #1
20:49:39:[15096.789136] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
20:49:39:[15096.789694] task: ffff880067595ee0 ti: ffff8800606bc000 task.ti: ffff8800606bc000
20:49:39:[15096.790413] RIP: 0010:[<ffffffffc07d002a>]  [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
20:49:39:[15096.791385] RSP: 0018:ffff8800606bfb28  EFLAGS: 00010246
20:49:39:[15096.791910] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800606bffd8
20:49:39:[15096.792588] RDX: ffff8800606bfbb8 RSI: 0000000000000000 RDI: ffff88006797f180
20:49:39:[15096.793279] RBP: ffff8800606bfbe0 R08: 0000000000019be0 R09: ffff88007d001a00
20:49:39:[15096.793977] R10: ffffffffc07d14a6 R11: 000000000000000f R12: 0000000000000000
20:49:39:[15096.794662] R13: ffff8800667ebc00 R14: ffff88006797f180 R15: 0000000000000000
20:49:39:[15096.795352] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
20:49:39:[15096.796149] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
20:49:39:[15096.796704] CR2: 0000000000000060 CR3: 0000000077c09000 CR4: 00000000000406e0
20:49:39:[15096.797387] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
20:49:39:[15096.798084] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
20:49:39:[15096.798778] Stack:
20:49:39:[15096.798987]  ffff8800606bfe58 ffff8800606bfb90 ffffffffc05eeba7 ffff880000000010
20:49:39:[15096.799777]  ffff8800606bfba0 ffff8800606bfb60 0000000075f08900 ffff88007a318000
20:49:39:[15096.800574]  ffff88006797f180 0000000000000001 0000000000000001 0000000000000402
20:49:39:[15096.801374] Call Trace:
20:49:39:[15096.801634]  [<ffffffffc05eeba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
20:49:39:[15096.802295]  [<ffffffffc07d5ccf>] ? llog_cat_cleanup+0x15f/0x380 [obdclass]
20:49:39:[15096.802990]  [<ffffffffc0eea6b0>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
20:49:39:[15096.803639]  [<ffffffffc07d150c>] llog_process_or_fork+0xbc/0x450 [obdclass]
20:49:39:[15096.804337]  [<ffffffffc07d6a3a>] llog_cat_process_cb+0x20a/0x220 [obdclass]
20:49:39:[15096.805040]  [<ffffffffc07d0865>] llog_process_thread+0x875/0x1460 [obdclass]
20:49:39:[15096.805752]  [<ffffffffc07d6830>] ? llog_cat_process_common+0x440/0x440 [obdclass]
20:49:39:[15096.806488]  [<ffffffffc07d150c>] llog_process_or_fork+0xbc/0x450 [obdclass]
20:49:39:[15096.807184]  [<ffffffffc07d6830>] ? llog_cat_process_common+0x440/0x440 [obdclass]
20:49:39:[15096.807936]  [<ffffffffc07d59b9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
20:49:39:[15096.808673]  [<ffffffffc0f1ae2a>] ? lod_sub_prep_llog+0x24a/0x783 [lod]
20:49:39:[15096.809322]  [<ffffffffc0eea6b0>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
20:49:39:[15096.809976]  [<ffffffffc07d5aee>] llog_cat_process+0x2e/0x30 [obdclass]
20:49:39:[15096.810613]  [<ffffffffc0ee6a89>] lod_sub_recovery_thread+0x439/0xc80 [lod]
20:49:39:[15096.811300]  [<ffffffffc0ee6650>] ? lod_trans_stop+0x340/0x340 [lod]
20:49:39:[15096.811920]  [<ffffffff810b098f>] kthread+0xcf/0xe0
20:49:39:[15096.812393]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
20:49:39:[15096.812987]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
20:49:39:[15096.813516]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
20:49:39:[15096.814113] Code: 41 54 53 48 81 ec 90 00 00 00 4c 8b 27 48 8b 47 18 65 48 8b 34 25 28 00 00 00 48 89 75 d0 31 f6 f6 05 d2 d5 e3 ff 01 48 89 7d 88 <4d> 8b 6c 24 60 48 89 45 80 c7 45 c4 00 00 00 00 74 0d f6 05 b9 
20:49:39:[15096.817178] RIP  [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
20:49:39:[15096.817925]  RSP <ffff8800606bfb28>
20:49:39:[15096.818270] CR2: 0000000000000060

Info required for matching: sanity-scrub 5



 Comments   
Comment by Alexander Boyko [ 15/Sep/17 ]

I can reproduce this issue with the next test, which I created for LU-9714

[ 1557.907952] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 1557.907976] IP: [<ffffffffa07ceff7>] llog_process_thread+0x37/0x1180 [obdclass]
[ 1557.908039] PGD 53d17067 PUD 54b44067 PMD 0 
[ 1557.908054] Oops: 0000 [#1] SMP 
[ 1557.908065] Modules linked in: loop lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) mbcache lquota(OE) lfsck(OE) jbd2 obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs fscache snd_seq_midi snd_seq_midi_event coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel snd_ens1371 snd_rawmidi snd_ac97_codec aesni_intel ppdev ac97_bus lrw snd_seq gf128mul glue_helper ablk_helper cryptd snd_seq_device snd_pcm vmw_balloon pcspkr snd_timer uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev snd soundcore sg shpchp vmw_vmci i2c_piix4 parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif cdrom crct10dif_generic
[ 1557.908340]  ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel vmwgfx serio_raw drm_kms_helper ttm drm mptspi e1000 scsi_transport_spi ata_piix mptscsih mptbase libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
[ 1557.910750] CPU: 2 PID: 13785 Comm: lctl Tainted: G           OE  ------------   3.10.0-327.10.1.el7_lustreb_neo_stable_us_MRP_3474.x86_64 #1
[ 1557.913468] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1557.915940] task: ffff8801143fe780 ti: ffff88011566c000 task.ti: ffff88011566c000
[ 1557.916708] RIP: 0010:[<ffffffffa07ceff7>]  [<ffffffffa07ceff7>] llog_process_thread+0x37/0x1180 [obdclass]
[ 1557.918992] RSP: 0018:ffff88011566f968  EFLAGS: 00010246
[ 1557.920400] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011566ffd8
[ 1557.921723] RDX: ffff88011566f9e8 RSI: 0000000000000000 RDI: ffff8800549e2a20
[ 1557.922901] RBP: ffff88011566fa10 R08: 0000000000017540 R09: ffff88013a007a00
[ 1557.924062] R10: ffff8800549e2a20 R11: 207865646e692074 R12: 0000000000000000
[ 1557.925257] R13: ffff88011566fcc0 R14: ffff8800549e2a20 R15: 0000000000000000
[ 1557.926446] FS:  00007f80aaab8740(0000) GS:ffff88013a640000(0000) knlGS:0000000000000000
[ 1557.927565] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1557.928445] CR2: 0000000000000060 CR3: 0000000050b6a000 CR4: 00000000001407e0
[ 1557.929127] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1557.929833] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1557.930501] Stack:
[ 1557.931120]  ffffffffa0657b57 ffff880100000010 ffff88011566f9d0 ffff88011566f990
[ 1557.931794]  000000005700bf62 ffff880118562000 ffff8800bb7e2c00 ffff8800549e2a20
[ 1557.932417]  0000000000000005 000000000000000a 0000000000000000 ffff88011566fa00
[ 1557.933021] Call Trace:
[ 1557.933638]  [<ffffffffa0657b57>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 1557.934245]  [<ffffffffa07d49bf>] ? llog_cat_cleanup+0x15f/0x380 [obdclass]
[ 1557.934880]  [<ffffffffa0fd8820>] ? mdd_changelog_fini+0x1c0/0x1c0 [mdd]
[ 1557.935544]  [<ffffffffa07d01fc>] llog_process_or_fork+0xbc/0x450 [obdclass]
[ 1557.936105]  [<ffffffffa07d572a>] llog_cat_process_cb+0x20a/0x220 [obdclass]
[ 1557.936718]  [<ffffffffa07cf565>] llog_process_thread+0x5a5/0x1180 [obdclass]
[ 1557.937342]  [<ffffffffa07d5520>] ? llog_cat_process_common+0x440/0x440 [obdclass]
[ 1557.938130]  [<ffffffffa07d01fc>] llog_process_or_fork+0xbc/0x450 [obdclass]
[ 1557.938726]  [<ffffffffa07d5520>] ? llog_cat_process_common+0x440/0x440 [obdclass]
[ 1557.939324]  [<ffffffffa07d46a9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
[ 1557.939894]  [<ffffffffa07d02c5>] ? llog_process_or_fork+0x185/0x450 [obdclass]
[ 1557.940414]  [<ffffffffa0fd8820>] ? mdd_changelog_fini+0x1c0/0x1c0 [mdd]
[ 1557.941010]  [<ffffffffa07d47de>] llog_cat_process+0x2e/0x30 [obdclass]
[ 1557.941984]  [<ffffffffa0fd7a18>] llog_changelog_cancel+0x58/0x1d0 [mdd]
[ 1557.942541]  [<ffffffffa0fd94b7>] ? mdd_changelog_write_header+0x67/0x470 [mdd]
[ 1557.943062]  [<ffffffffa07d6334>] llog_cancel+0x54/0x240 [obdclass]
[ 1557.943574]  [<ffffffffa0fd9a9d>] mdd_changelog_llog_cancel+0xbd/0x250 [mdd]
[ 1557.944058]  [<ffffffffa0fdd286>] mdd_iocontrol+0x12c6/0x16b0 [mdd]
[ 1557.944517]  [<ffffffffa08147f4>] ? keys_fill+0xf4/0x170 [obdclass]
[ 1557.944992]  [<ffffffffa10486be>] mdt_iocontrol+0x5de/0xaf0 [mdt]
[ 1557.945448]  [<ffffffffa07f8439>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 1557.945892]  [<ffffffffa07e3c13>] class_handle_ioctl+0x1933/0x1dc0 [obdclass]
[ 1557.946376]  [<ffffffff811970c8>] ? handle_mm_fault+0x5b8/0xf50
[ 1557.946890]  [<ffffffff812856b8>] ? security_capable+0x18/0x20
[ 1557.947420]  [<ffffffffa07c95d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[ 1557.947974]  [<ffffffff811f1ec5>] do_vfs_ioctl+0x2e5/0x4c0
[ 1557.948400]  [<ffffffff8164119d>] ? __do_page_fault+0x16d/0x450
[ 1557.948841]  [<ffffffff811f2141>] SyS_ioctl+0xa1/0xc0
[ 1557.949276]  [<ffffffff81645c89>] system_call_fastpath+0x16/0x1b

test

#!/bin/bash

NUM=100

dmesg -c > /dev/null
#register users
for (( i = 1; i < $NUM; i++)); do
        cl_users[$i]=`lctl --device lustre-MDT0000 changelog_register -n`
done

first=`lctl get_param  mdd.lustre-MDT0000.changelog_users | tail -n1 | cut -d" " -f2`
./lustre/tests/createmany -m /mnt/lustre/test_file_ 10000
./lustre/tests/unlinkmany /mnt/lustre/test_file_ 10000

for (( i = 1; i < $NUM; i++)); do
        lfs changelog_clear  lustre-MDT0000 ${cl_users[$i]} $((first+20000)) &
        lfs changelog_clear  lustre-MDT0000 ${cl_users[1]} $((first+20000)) &

done


#unregister users
for (( i = 1; i < $NUM; i++)); do
        lctl --device lustre-MDT0000 changelog_deregister ${cl_users[$i]} > /dev/null &
        lctl get_param  mdd.lustre-MDT0000.changelog_users > /dev/null &
done

sleep 5

race_str=`dmesg | tail -n50 | grep 'modify index mismatch'`

echo $race_str
[[ !  -z  $race_str  ]] && echo 'Race happened'

it requires a loop, something like

while true; do sh ../mrp-4455/test_changelog.sh; done

Comment by Alexander Boyko [ 15/Sep/17 ]

I guess that the last line failed, lloghandle is NULL

static int llog_process_thread(void *arg)
{
        struct llog_process_info        *lpi = arg;
        struct llog_handle              *loghandle = lpi->lpi_loghandle;
        struct llog_log_hdr             *llh = loghandle->lgh_hdr;   <====

 

Comment by Bruno Faccini (Inactive) [ 09/Oct/17 ]

Hmm, looks like patch from LU-9153 ("llog: consolidate common error checking") has introduced a regression in llog_cat_id2handle() error handling path (see error/msg just preceding the crash) by adding llog_cat_process_common() common routine additional call in sequence :

static int llog_cat_process_common(const struct lu_env *env,
                                   struct llog_handle *cat_llh,
                                   struct llog_rec_hdr *rec,
                                   struct llog_handle **llhp)
{
        struct llog_logid_rec *lir = container_of(rec, typeof(*lir), lid_hdr);
        struct llog_log_hdr *hdr;
        int rc;
         
        ENTRY;
        if (rec->lrh_type != le32_to_cpu(LLOG_LOGID_MAGIC)) {
                rc = -EINVAL;
                CWARN("%s: invalid record in catalog "DFID":%x: rc = %d\n",
                      cat_llh->lgh_ctxt->loc_obd->obd_name,
                      PFID(&cat_llh->lgh_id.lgl_oi.oi_fid),
                      cat_llh->lgh_id.lgl_ogen, rc);
                RETURN(rc);
        } 
        CDEBUG(D_HA, "processing log "DFID":%x at index %u of catalog "DFID"\n",
               PFID(&lir->lid_id.lgl_oi.oi_fid), lir->lid_id.lgl_ogen,
               le32_to_cpu(rec->lrh_index),
               PFID(&cat_llh->lgh_id.lgl_oi.oi_fid));
    
        rc = llog_cat_id2handle(env, cat_llh, llhp, &lir->lid_id);
        if (rc) {
                /* After a server crash, a stub of index record in catlog could
                 * be kept, because plain log destroy + catlog index record
                 * deletion are not atomic. So we end up with an index but no
                 * actual record. Destroy the index and move on. */
                if (rc == -ENOENT || rc == -ESTALE)
                        rc = llog_cat_cleanup(env, cat_llh, NULL,  <<<<<<<<< rc can be cleared!!!
                                              rec->lrh_index);
                if (rc)
                        CWARN("%s: can't find llog handle "DFID":%x: rc = %d\n",
                              cat_llh->lgh_ctxt->loc_obd->obd_name,
                              PFID(&lir->lid_id.lgl_oi.oi_fid),
                              lir->lid_id.lgl_ogen, rc);

                RETURN(rc);
        }
.......................

static int llog_cat_process_cb(const struct lu_env *env,
                               struct llog_handle *cat_llh,
                               struct llog_rec_hdr *rec, void *data)
{
        struct llog_process_data *d = data;
        struct llog_handle *llh = NULL;
        int rc;

        ENTRY;
        rc = llog_cat_process_common(env, cat_llh, rec, &llh);
        if (rc) <<<<<<<<<<<<<<< no error handling??
                GOTO(out, rc);

        if (rec->lrh_index < d->lpd_startcat) {
                /* Skip processing of the logs until startcat */
                rc = 0;
        } else if (d->lpd_startidx > 0) {
                struct llog_process_cat_data cd;
        
                cd.lpcd_first_idx = d->lpd_startidx;
                cd.lpcd_last_idx = 0;
                rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data,  <<< can be called with llh==NULL 
                                          &cd, false);
                /* Continue processing the next log from idx 0 */
                d->lpd_startidx = 0;
        } else {
                rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data, <<< can be called with llh==NULL
                                          NULL, false);
        }
    
............................

Will try to cook a patch to fix this asap.

Comment by Gerrit Updater [ 09/Oct/17 ]

Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29370
Subject: LU-9994 obdclass: fix llog_cat_id2handle() error handling
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e9c7d9a668f01b4c8db741a6a92f037e5bfa1982

Comment by Bruno Faccini (Inactive) [ 12/Oct/17 ]

BTW, at the moment, there is a running serie of 57 identical failures/timeouts of sanity-scrub/test_5 being reported in Maloo and starting on July 19th (date when "LU-9153 llog: consolidate common error checking" has landed on master !!).

Comment by Bruno Faccini (Inactive) [ 12/Oct/17 ]

Upgrading priority to Major as it appears to have some impact on auto-tests.

Comment by Gerrit Updater [ 14/Oct/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/29616
Subject: LU-9994 obdclass: skip llog_process for empty handle
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bed20ef1b81873a2d7991f84c7fc7a165ecd8b8c

Comment by nasf (Inactive) [ 14/Oct/17 ]

I made the patch because most of the LU-7585 patches failed on Maloo. I am not sure whether this patch can resolve the issue or not. But let's try.

Comment by Bob Glossman (Inactive) [ 02/Nov/17 ]

I see two proposed fixes here:
https://review.whamcloud.com/29370
https://review.whamcloud.com/29616

Are both needed? If not is one preferred over the other?

Comment by Gerrit Updater [ 03/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29370/
Subject: LU-9994 obdclass: fix llog_cat_id2handle() error handling
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 97fe84143da81d42770ae29298403aad80103f94

Comment by Andreas Dilger [ 07/Nov/17 ]

Only one of the two patches needed to land.

Generated at Sat Feb 10 02:31:07 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.