[LU-9994] sanity-scrub test_5: test failed to respond and timed out Created: 15/Sep/17 Updated: 07/Nov/17 Resolved: 07/Nov/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Bruno Faccini (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Bob Glossman <bob.glossman@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/35ecbbdc-99c2-11e7-b778-5254006e85c2.

The sub-test test_5 failed with the following error:

test failed to respond and timed out

This failure is a panic on MDS. It has a different stack trace than

20:49:39:[15096.775030] LustreError: 1321:0:(llog_cat.c:269:llog_cat_id2handle()) lustre-MDT0000-osp-MDT0002: error opening log id [0x2:0x402:0x2]:0: rc = -2
20:49:39:[15096.776495] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
20:49:39:[15096.777310] IP: [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
20:49:39:[15096.778071] PGD 78ab5067 PUD 78ab4067 PMD 0
20:49:39:[15096.778551] Oops: 0000 [#1] SMP
20:49:39:[15096.778921] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel nfsd aesni_intel lrw gf128mul ppdev glue_helper ablk_helper cryptd i2c_piix4 nfs_acl lockd joydev pcspkr i2c_core virtio_balloon auth_rpcgss grace parport_pc parport sunrpc ip_tables ata_generic pata_acpi ext4 mbcache jbd2 ata_piix virtio_blk libata 8139too crct10dif_pclmul crct10dif_common crc32c_intel serio_raw virtio_pci 8139cp virtio_ring virtio mii floppy
20:49:39:[15096.788032] CPU: 1 PID: 1321 Comm: lod0002_rec0000 Tainted: G OE ------------ 3.10.0-693.2.2.el7_lustre.x86_64 #1
20:49:39:[15096.789136] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
20:49:39:[15096.789694] task: ffff880067595ee0 ti: ffff8800606bc000 task.ti: ffff8800606bc000
20:49:39:[15096.790413] RIP: 0010:[<ffffffffc07d002a>] [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
20:49:39:[15096.791385] RSP: 0018:ffff8800606bfb28 EFLAGS: 00010246
20:49:39:[15096.791910] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800606bffd8
20:49:39:[15096.792588] RDX: ffff8800606bfbb8 RSI: 0000000000000000 RDI: ffff88006797f180
20:49:39:[15096.793279] RBP: ffff8800606bfbe0 R08: 0000000000019be0 R09: ffff88007d001a00
20:49:39:[15096.793977] R10: ffffffffc07d14a6 R11: 000000000000000f R12: 0000000000000000
20:49:39:[15096.794662] R13: ffff8800667ebc00 R14: ffff88006797f180 R15: 0000000000000000
20:49:39:[15096.795352] FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
20:49:39:[15096.796149] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
20:49:39:[15096.796704] CR2: 0000000000000060 CR3: 0000000077c09000 CR4: 00000000000406e0
20:49:39:[15096.797387] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
20:49:39:[15096.798084] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
20:49:39:[15096.798778] Stack:
20:49:39:[15096.798987] ffff8800606bfe58 ffff8800606bfb90 ffffffffc05eeba7 ffff880000000010
20:49:39:[15096.799777] ffff8800606bfba0 ffff8800606bfb60 0000000075f08900 ffff88007a318000
20:49:39:[15096.800574] ffff88006797f180 0000000000000001 0000000000000001 0000000000000402
20:49:39:[15096.801374] Call Trace:
20:49:39:[15096.801634] [<ffffffffc05eeba7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
20:49:39:[15096.802295] [<ffffffffc07d5ccf>] ? llog_cat_cleanup+0x15f/0x380 [obdclass]
20:49:39:[15096.802990] [<ffffffffc0eea6b0>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
20:49:39:[15096.803639] [<ffffffffc07d150c>] llog_process_or_fork+0xbc/0x450 [obdclass]
20:49:39:[15096.804337] [<ffffffffc07d6a3a>] llog_cat_process_cb+0x20a/0x220 [obdclass]
20:49:39:[15096.805040] [<ffffffffc07d0865>] llog_process_thread+0x875/0x1460 [obdclass]
20:49:39:[15096.805752] [<ffffffffc07d6830>] ? llog_cat_process_common+0x440/0x440 [obdclass]
20:49:39:[15096.806488] [<ffffffffc07d150c>] llog_process_or_fork+0xbc/0x450 [obdclass]
20:49:39:[15096.807184] [<ffffffffc07d6830>] ? llog_cat_process_common+0x440/0x440 [obdclass]
20:49:39:[15096.807936] [<ffffffffc07d59b9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
20:49:39:[15096.808673] [<ffffffffc0f1ae2a>] ? lod_sub_prep_llog+0x24a/0x783 [lod]
20:49:39:[15096.809322] [<ffffffffc0eea6b0>] ? lodname2mdt_index+0x2f0/0x2f0 [lod]
20:49:39:[15096.809976] [<ffffffffc07d5aee>] llog_cat_process+0x2e/0x30 [obdclass]
20:49:39:[15096.810613] [<ffffffffc0ee6a89>] lod_sub_recovery_thread+0x439/0xc80 [lod]
20:49:39:[15096.811300] [<ffffffffc0ee6650>] ? lod_trans_stop+0x340/0x340 [lod]
20:49:39:[15096.811920] [<ffffffff810b098f>] kthread+0xcf/0xe0
20:49:39:[15096.812393] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
20:49:39:[15096.812987] [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
20:49:39:[15096.813516] [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
20:49:39:[15096.814113] Code: 41 54 53 48 81 ec 90 00 00 00 4c 8b 27 48 8b 47 18 65 48 8b 34 25 28 00 00 00 48 89 75 d0 31 f6 f6 05 d2 d5 e3 ff 01 48 89 7d 88 <4d> 8b 6c 24 60 48 89 45 80 c7 45 c4 00 00 00 00 74 0d f6 05 b9
20:49:39:[15096.817178] RIP [<ffffffffc07d002a>] llog_process_thread+0x3a/0x1460 [obdclass]
20:49:39:[15096.817925] RSP <ffff8800606bfb28>
20:49:39:[15096.818270] CR2: 0000000000000060

Info required for matching: sanity-scrub 5 |
| Comments |
| Comment by Alexander Boyko [ 15/Sep/17 ] |
|
I can reproduce this issue with the following test, which I created for

[ 1557.907952] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
[ 1557.907976] IP: [<ffffffffa07ceff7>] llog_process_thread+0x37/0x1180 [obdclass]
[ 1557.908039] PGD 53d17067 PUD 54b44067 PMD 0
[ 1557.908054] Oops: 0000 [#1] SMP
[ 1557.908065] Modules linked in: loop lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) mbcache lquota(OE) lfsck(OE) jbd2 obdecho(OE) mgc(OE) lov(OE) osc(OE) mdc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) nfsv3 nfs fscache snd_seq_midi snd_seq_midi_event coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel snd_ens1371 snd_rawmidi snd_ac97_codec aesni_intel ppdev ac97_bus lrw snd_seq gf128mul glue_helper ablk_helper cryptd snd_seq_device snd_pcm vmw_balloon pcspkr snd_timer uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev snd soundcore sg shpchp vmw_vmci i2c_piix4 parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod sr_mod crc_t10dif cdrom crct10dif_generic
[ 1557.908340] ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel vmwgfx serio_raw drm_kms_helper ttm drm mptspi e1000 scsi_transport_spi ata_piix mptscsih mptbase libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
[ 1557.910750] CPU: 2 PID: 13785 Comm: lctl Tainted: G OE ------------ 3.10.0-327.10.1.el7_lustreb_neo_stable_us_MRP_3474.x86_64 #1
[ 1557.913468] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 1557.915940] task: ffff8801143fe780 ti: ffff88011566c000 task.ti: ffff88011566c000
[ 1557.916708] RIP: 0010:[<ffffffffa07ceff7>] [<ffffffffa07ceff7>] llog_process_thread+0x37/0x1180 [obdclass]
[ 1557.918992] RSP: 0018:ffff88011566f968 EFLAGS: 00010246
[ 1557.920400] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88011566ffd8
[ 1557.921723] RDX: ffff88011566f9e8 RSI: 0000000000000000 RDI: ffff8800549e2a20
[ 1557.922901] RBP: ffff88011566fa10 R08: 0000000000017540 R09: ffff88013a007a00
[ 1557.924062] R10: ffff8800549e2a20 R11: 207865646e692074 R12: 0000000000000000
[ 1557.925257] R13: ffff88011566fcc0 R14: ffff8800549e2a20 R15: 0000000000000000
[ 1557.926446] FS: 00007f80aaab8740(0000) GS:ffff88013a640000(0000) knlGS:0000000000000000
[ 1557.927565] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1557.928445] CR2: 0000000000000060 CR3: 0000000050b6a000 CR4: 00000000001407e0
[ 1557.929127] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1557.929833] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1557.930501] Stack:
[ 1557.931120] ffffffffa0657b57 ffff880100000010 ffff88011566f9d0 ffff88011566f990
[ 1557.931794] 000000005700bf62 ffff880118562000 ffff8800bb7e2c00 ffff8800549e2a20
[ 1557.932417] 0000000000000005 000000000000000a 0000000000000000 ffff88011566fa00
[ 1557.933021] Call Trace:
[ 1557.933638] [<ffffffffa0657b57>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 1557.934245] [<ffffffffa07d49bf>] ? llog_cat_cleanup+0x15f/0x380 [obdclass]
[ 1557.934880] [<ffffffffa0fd8820>] ? mdd_changelog_fini+0x1c0/0x1c0 [mdd]
[ 1557.935544] [<ffffffffa07d01fc>] llog_process_or_fork+0xbc/0x450 [obdclass]
[ 1557.936105] [<ffffffffa07d572a>] llog_cat_process_cb+0x20a/0x220 [obdclass]
[ 1557.936718] [<ffffffffa07cf565>] llog_process_thread+0x5a5/0x1180 [obdclass]
[ 1557.937342] [<ffffffffa07d5520>] ? llog_cat_process_common+0x440/0x440 [obdclass]
[ 1557.938130] [<ffffffffa07d01fc>] llog_process_or_fork+0xbc/0x450 [obdclass]
[ 1557.938726] [<ffffffffa07d5520>] ? llog_cat_process_common+0x440/0x440 [obdclass]
[ 1557.939324] [<ffffffffa07d46a9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
[ 1557.939894] [<ffffffffa07d02c5>] ? llog_process_or_fork+0x185/0x450 [obdclass]
[ 1557.940414] [<ffffffffa0fd8820>] ? mdd_changelog_fini+0x1c0/0x1c0 [mdd]
[ 1557.941010] [<ffffffffa07d47de>] llog_cat_process+0x2e/0x30 [obdclass]
[ 1557.941984] [<ffffffffa0fd7a18>] llog_changelog_cancel+0x58/0x1d0 [mdd]
[ 1557.942541] [<ffffffffa0fd94b7>] ? mdd_changelog_write_header+0x67/0x470 [mdd]
[ 1557.943062] [<ffffffffa07d6334>] llog_cancel+0x54/0x240 [obdclass]
[ 1557.943574] [<ffffffffa0fd9a9d>] mdd_changelog_llog_cancel+0xbd/0x250 [mdd]
[ 1557.944058] [<ffffffffa0fdd286>] mdd_iocontrol+0x12c6/0x16b0 [mdd]
[ 1557.944517] [<ffffffffa08147f4>] ? keys_fill+0xf4/0x170 [obdclass]
[ 1557.944992] [<ffffffffa10486be>] mdt_iocontrol+0x5de/0xaf0 [mdt]
[ 1557.945448] [<ffffffffa07f8439>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[ 1557.945892] [<ffffffffa07e3c13>] class_handle_ioctl+0x1933/0x1dc0 [obdclass]
[ 1557.946376] [<ffffffff811970c8>] ? handle_mm_fault+0x5b8/0xf50
[ 1557.946890] [<ffffffff812856b8>] ? security_capable+0x18/0x20
[ 1557.947420] [<ffffffffa07c95d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[ 1557.947974] [<ffffffff811f1ec5>] do_vfs_ioctl+0x2e5/0x4c0
[ 1557.948400] [<ffffffff8164119d>] ? __do_page_fault+0x16d/0x450
[ 1557.948841] [<ffffffff811f2141>] SyS_ioctl+0xa1/0xc0
[ 1557.949276] [<ffffffff81645c89>] system_call_fastpath+0x16/0x1b

The test:

#!/bin/bash
NUM=100
dmesg -c > /dev/null
#register users
for (( i = 1; i < $NUM; i++)); do
    cl_users[$i]=`lctl --device lustre-MDT0000 changelog_register -n`
done
first=`lctl get_param mdd.lustre-MDT0000.changelog_users | tail -n1 | cut -d" " -f2`
./lustre/tests/createmany -m /mnt/lustre/test_file_ 10000
./lustre/tests/unlinkmany /mnt/lustre/test_file_ 10000
#clear changelog records for all users concurrently
for (( i = 1; i < $NUM; i++)); do
    lfs changelog_clear lustre-MDT0000 ${cl_users[$i]} $((first+20000)) &
    lfs changelog_clear lustre-MDT0000 ${cl_users[1]} $((first+20000)) &
done
#unregister users
for (( i = 1; i < $NUM; i++)); do
    lctl --device lustre-MDT0000 changelog_deregister ${cl_users[$i]} > /dev/null &
    lctl get_param mdd.lustre-MDT0000.changelog_users > /dev/null &
done
sleep 5
race_str=`dmesg | tail -n50 | grep 'modify index mismatch'`
echo $race_str
[[ ! -z $race_str ]] && echo 'Race happened'
Reproducing the crash requires running the test in a loop, something like: while true; do sh ../mrp-4455/test_changelog.sh; done |
| Comment by Alexander Boyko [ 15/Sep/17 ] |
|
I guess that the last line failed; loghandle is NULL:

static int llog_process_thread(void *arg)
{
        struct llog_process_info *lpi = arg;
        struct llog_handle *loghandle = lpi->lpi_loghandle;
        struct llog_log_hdr *llh = loghandle->lgh_hdr; <====
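For illustration only, a minimal sketch (an assumption, not the fix that actually landed) of guarding the dereference inside llog_process_thread(); per the analysis in the next comment, the real problem is that the catalog callback can hand a NULL handle down:

static int llog_process_thread(void *arg)
{
        struct llog_process_info *lpi = arg;
        struct llog_handle *loghandle = lpi->lpi_loghandle;
        struct llog_log_hdr *llh;

        ENTRY;
        /* hypothetical guard: lpi_loghandle may arrive as NULL when the
         * catalog callback lost the plain log, so fail early instead of
         * crashing on loghandle->lgh_hdr */
        if (loghandle == NULL)
                RETURN(-EINVAL);

        llh = loghandle->lgh_hdr;
        /* ... rest of the function unchanged ... */
}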
|
| Comment by Bruno Faccini (Inactive) [ 09/Oct/17 ] |
|
Hmm, looks like patch from

static int llog_cat_process_common(const struct lu_env *env,
                                   struct llog_handle *cat_llh,
                                   struct llog_rec_hdr *rec,
                                   struct llog_handle **llhp)
{
        struct llog_logid_rec *lir = container_of(rec, typeof(*lir), lid_hdr);
        struct llog_log_hdr *hdr;
        int rc;

        ENTRY;
        if (rec->lrh_type != le32_to_cpu(LLOG_LOGID_MAGIC)) {
                rc = -EINVAL;
                CWARN("%s: invalid record in catalog "DFID":%x: rc = %d\n",
                      cat_llh->lgh_ctxt->loc_obd->obd_name,
                      PFID(&cat_llh->lgh_id.lgl_oi.oi_fid),
                      cat_llh->lgh_id.lgl_ogen, rc);
                RETURN(rc);
        }
        CDEBUG(D_HA, "processing log "DFID":%x at index %u of catalog "DFID"\n",
               PFID(&lir->lid_id.lgl_oi.oi_fid), lir->lid_id.lgl_ogen,
               le32_to_cpu(rec->lrh_index),
               PFID(&cat_llh->lgh_id.lgl_oi.oi_fid));

        rc = llog_cat_id2handle(env, cat_llh, llhp, &lir->lid_id);
        if (rc) {
                /* After a server crash, a stub of index record in catlog could
                 * be kept, because plain log destroy + catlog index record
                 * deletion are not atomic. So we end up with an index but no
                 * actual record. Destroy the index and move on. */
                if (rc == -ENOENT || rc == -ESTALE)
                        rc = llog_cat_cleanup(env, cat_llh, NULL,      <<<<<<<<< rc can be cleared!!!
                                              rec->lrh_index);
                if (rc)
                        CWARN("%s: can't find llog handle "DFID":%x: rc = %d\n",
                              cat_llh->lgh_ctxt->loc_obd->obd_name,
                              PFID(&lir->lid_id.lgl_oi.oi_fid),
                              lir->lid_id.lgl_ogen, rc);
                RETURN(rc);
        }
.......................
static int llog_cat_process_cb(const struct lu_env *env,
                               struct llog_handle *cat_llh,
                               struct llog_rec_hdr *rec, void *data)
{
        struct llog_process_data *d = data;
        struct llog_handle *llh = NULL;
        int rc;

        ENTRY;
        rc = llog_cat_process_common(env, cat_llh, rec, &llh);
        if (rc)                                    <<<<<<<<<<<<<<< no error handling??
                GOTO(out, rc);

        if (rec->lrh_index < d->lpd_startcat) {
                /* Skip processing of the logs until startcat */
                rc = 0;
        } else if (d->lpd_startidx > 0) {
                struct llog_process_cat_data cd;

                cd.lpcd_first_idx = d->lpd_startidx;
                cd.lpcd_last_idx = 0;
                rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data,    <<< can be called with llh==NULL
                                          &cd, false);
                /* Continue processing the next log from idx 0 */
                d->lpd_startidx = 0;
        } else {
                rc = llog_process_or_fork(env, llh, d->lpd_cb, d->lpd_data,    <<< can be called with llh==NULL
                                          NULL, false);
        }
............................
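A minimal sketch of one way the callback could tolerate this (an assumption for discussion, not the actual change in https://review.whamcloud.com/29370): when llog_cat_process_common() has cleaned up a stale index and returned 0 with no handle, skip the record instead of passing a NULL llh to llog_process_or_fork():

        rc = llog_cat_process_common(env, cat_llh, rec, &llh);
        if (rc)
                GOTO(out, rc);

        /* hypothetical check: llog_cat_cleanup() may have destroyed a stale
         * catalog index and reset rc to 0, leaving llh == NULL; no plain log
         * was opened, so there is nothing to process or release here */
        if (llh == NULL)
                RETURN(0);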
Will try to cook a patch to fix this asap. |
| Comment by Gerrit Updater [ 09/Oct/17 ] |
|
Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: https://review.whamcloud.com/29370 |
| Comment by Bruno Faccini (Inactive) [ 12/Oct/17 ] |
|
BTW, at the moment, there is a running series of 57 identical failures/timeouts of sanity-scrub/test_5 being reported in Maloo and starting on July 19th (date when " |
| Comment by Bruno Faccini (Inactive) [ 12/Oct/17 ] |
|
Upgrading priority to Major as it appears to have some impact on auto-tests. |
| Comment by Gerrit Updater [ 14/Oct/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/29616 |
| Comment by nasf (Inactive) [ 14/Oct/17 ] |
|
I made the patch because most of the |
| Comment by Bob Glossman (Inactive) [ 02/Nov/17 ] |
|
I see two proposed fixes here: are both needed? If not, is one preferred over the other? |
| Comment by Gerrit Updater [ 03/Nov/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/29370/ |
| Comment by Andreas Dilger [ 07/Nov/17 ] |
|
Only one of the two patches needed to land. |