[LU-10740] replay-single test_2d: FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed Created: 28/Feb/18 Updated: 26/Feb/19 Resolved: 26/Feb/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Gryaznova | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | dne, zfs |
| Severity: | 3 |
| Description |
|
2.7 (IEEL) based server, 2.10.58 client

== replay-single test 2d: setdirstripe replay ======================================================== 00:25:51 (1519604751)
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       344152        2752      317200   1% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       344152        2492      317460   1% /mnt/lustre[MDT:1]
lustre-OST0000_UUID       664400       17712      597688   3% /mnt/lustre[OST:0]
lustre-OST0001_UUID       664400       17712      597688   3% /mnt/lustre[OST:1]

filesystem_summary:      1328800       35424     1195376   3% /mnt/lustre

Failing mds1 on fre1209
Stopping /mnt/lustre-mds1 (opts:) on fre1209
pdsh@fre1211: fre1209: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1209
00:26:02 (1519604762) waiting for fre1209 network 900 secs ...
00:26:02 (1519604762) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr /dev/vdb /mnt/lustre-mds1
pdsh@fre1211: fre1209: ssh exited with exit code 1
pdsh@fre1211: fre1209: ssh exited with exit code 1
Started lustre-MDT0000
fre1212: fre1212: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
fre1211: fre1211: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
fre1212: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
fre1211: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
Can't lstat /mnt/lustre/d2d.replay-single: No such file or directory
 replay-single test_2d: @@@@@@ FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed

100% reproducible with review-dne-zfs-part-4 |
| Comments |
| Comment by Sarah Liu [ 15/Mar/18 ] |
|
Also hit on master branch with zfs DNE: https://testing.hpdd.intel.com/test_sets/d05367b0-2799-11e8-9e0e-52540065bddc

mds dmesg:
[64823.580533] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 2d: setdirstripe replay ======================================================== 17:06:03 \(1520960763\)
[64823.808722] Lustre: DEBUG MARKER: == replay-single test 2d: setdirstripe replay ======================================================== 17:06:03 (1520960763)
[64824.017795] Lustre: DEBUG MARKER: sync; sync; sync
[64825.224138] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[64825.565253] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[64826.042689] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[64826.212687] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[64826.695274] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[64827.035062] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[64827.574756] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
[64827.930760] Lustre: DEBUG MARKER: ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
grep -q ^lustre-mdt1/ /proc/mounts ||
zpool export lustre-mdt1
[64838.339263] Lustre: DEBUG MARKER: hostname
[64838.749896] Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs;
zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
zpool import -f -o cachefile=none -o failmode=panic -d /dev/lvm-Role_MDS lustre-mdt1
[64839.417169] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1
[64839.751101] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre lustre-mdt1/mdt1 /mnt/lustre-mds1
[64841.981159] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[64842.324523] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
[64842.971945] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64842.974147] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64843.192600] Lustre: DEBUG MARKER: onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64843.192672] Lustre: DEBUG MARKER: onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64843.395032] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[64843.733809] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null | grep -E ':[a-zA-Z]\{3}[0-9]\{4}'
[64844.079675] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null | grep -E ':[a-zA-Z]\{3}[0-9]\{4}'
[64844.425889] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null
[64845.202422] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
[64845.389052] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm5.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64845.389056] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm6.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64845.655584] Lustre: DEBUG MARKER: onyx-43vm5.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64845.687723] Lustre: DEBUG MARKER: onyx-43vm6.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64846.599842] LustreError: 29449:0:(llog.c:1012:llog_write_rec()) lustre-MDT0000-osd: loghandle ffff88004c471e00 with no header
[64846.601307] LustreError: 29449:0:(llog_cat.c:500:llog_cat_add_rec()) llog_write_rec -71: lh=ffff88004c471e00
[64846.602541] LustreError: 29449:0:(update_trans.c:994:top_trans_stop()) lustre-MDT0000-mdtlov: write updates failed: rc = -71
[64846.603937] LustreError: 29449:0:(tgt_lastrcvd.c:1226:tgt_last_rcvd_update()) lustre-MDT0000: replay transno 38654705669 failed: rc = -71
[64846.613367] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[64846.614170] LustreError: 29449:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 4194304
[64847.016802] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 1 sec
[64847.244576] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 1 sec
[64854.015289] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 8 sec
[64854.220999] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 8 sec
[64855.647689] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_2d: @@@@@@ FAIL: checkstat -v \/mnt\/lustre\/d2d.replay-single check failed
[64855.876070] Lustre: DEBUG MARKER: replay-single test_2d: @@@@@@ FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed
|
| Comment by Andreas Dilger [ 08/Aug/18 ] |
|
The error being returned is -71 = -EREMOTE, which suggests there is some problem with the RPC being sent to the wrong MDT. However, it may also be some kind of fallout from "loghandle XXX with no header", which implies that the llog file is zero length and wasn't written correctly? It's hard to see why the -EREMOTE problem would only affect DNE+ZFS, so the second explanation seems more plausible. |
| Comment by Andreas Dilger [ 08/Aug/18 ] |
|
Lai, could you please take a look? James reports that this is failing 100% of the time with review-dne-zfs-part-4, which is currently an optional test run, but we want to make it a required test. |
| Comment by Gerrit Updater [ 10/Aug/18 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32976 |
| Comment by Lai Siyao [ 22/Aug/18 ] |
|
The failed call trace is: llog_add()
-> llog_cat_add_rec()
-> llog_cat_new_log()
-> llog_osd_create()
-> osd_create()
-> osd_zap_add()
return -EEXIST
but loghandle->lgh_hdr is NULL, which causes llog_write_rec() to return -EPROTO (-71). But I don't understand why it fails on ZFS only.
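For reference, a hedged sketch (not the verbatim Lustre source; llog_check_header() is an illustrative name) of the check that produces the -EPROTO above: llog_write_rec() refuses to append a record when the handle has no in-memory header, which is the case here because the new plain log was never successfully created.

#include <obd_support.h>
#include <lustre_log.h>

/* Simplified sketch: when llog_cat_new_log() fails (here with -EEXIST from
 * osd_zap_add()), lgh_hdr is never initialized, so any later write through
 * this handle is rejected. */
static int llog_check_header(struct llog_handle *loghandle)
{
	if (loghandle->lgh_hdr == NULL) {
		CERROR("loghandle %p with no header\n", loghandle);
		return -EPROTO;	/* the -71 seen in the console log */
	}
	return 0;
}
|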
| Comment by Lai Siyao [ 24/Aug/18 ] |
|
I found something strange in the log:

40000000:00000040:1.0:1535051095.957588:0:12953:0:(fid_handler.c:348:__seq_server_alloc_meta()) srv-lustre-MDT0000: Allocated meta-sequence [0x0000000200001b70-0x0000000200001b71]:0:mdt
40000000:00000040:1.0:1535051095.957598:0:12953:0:(fid_request.c:400:seq_client_alloc_fid()) cli-lustre-MDT0000: Allocated FID [0x200001b70:0x1:0x0]
00000004:00000040:1.0:1535051095.957601:0:12953:0:(lod_object.c:1984:lod_prep_md_striped_create()) Get idx 0, for stripe 0 [0x200001b70:0x1:0x0]
00000004:00000040:1.0:1535051095.957607:0:12953:0:(mdt_handler.c:5324:mdt_object_init()) object init, fid = [0x200001b70:0x1:0x0]
00000001:00000002:1.0:1535051095.958937:0:12953:0:(linkea.c:173:linkea_add_buf()) New link_ea name '[0x2000013a0:0xb:0x0]:[0x200001b70:0x1:0x0]:0' is added
00000020:00000040:1.0:1535051095.960470:0:12953:0:(update_records.c:240:update_records_update_pack()) 4th [0x200001b70:0x1:0x0] create param_count = 7
00000040:00000001:0.0:1535051123.894990:0:16387:0:(llog_osd.c:1225:llog_osd_open()) Process entered
00000040:00000001:0.0:1535051123.894991:0:16387:0:(obd_class.h:911:obd_fid_alloc()) Process entered
40000000:00000040:0.0:1535051123.941029:0:16387:0:(fid_request.c:400:seq_client_alloc_fid()) cli-lustre-MDT0000: Allocated FID [0x200001b70:0x1:0x0]

Note that the same FID [0x200001b70:0x1:0x0] is allocated a second time by a different thread about 28 seconds later. The scenario is as below:

I'm not sure whether it's related to ZFS, because seq_store_update() has updated the FLDB with th_sync set. |
| Comment by Andreas Dilger [ 24/Aug/18 ] |
|
Is it possible the MDT is running with osd_object_sync_delay_us=0? This was proposed in a recent patch, but I don't think it has landed yet, and it was initially limited to a single test. |
| Comment by Lai Siyao [ 26/Aug/18 ] |
|
It doesn't look to be related to osd_object_sync_delay_us. I added some debug logging, and the result suggests th_sync doesn't work on ZFS: the debug log shows the seq file was written synchronously (osd_trans_commit_cb() was called before osd_trans_stop() returned), but after the replay barrier the server read a stale seq file on startup. I also tweaked the code to do the create in sync mode, and the result likewise shows the file does not exist after restart. Is there a simple way to verify this?
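For context, a minimal sketch of the pattern under discussion, assuming the generic dt_* transaction API; write_record_sync() is an invented wrapper, but th_sync is the flag that is supposed to keep dt_trans_stop() from returning before the update is durable (on osd-zfs this maps to waiting for TXG sync):

#include <dt_object.h>

/* Hedged sketch, not verbatim Lustre code: write one record and request a
 * synchronous commit via th_sync. */
static int write_record_sync(const struct lu_env *env, struct dt_device *dev,
                             struct dt_object *obj, const struct lu_buf *buf,
                             loff_t pos)
{
        struct thandle *th;
        int rc, rc2;

        th = dt_trans_create(env, dev);
        if (IS_ERR(th))
                return PTR_ERR(th);

        rc = dt_declare_record_write(env, obj, buf, pos, th);
        if (rc == 0) {
                th->th_sync = 1;        /* ask the OSD for a durable commit */
                rc = dt_trans_start_local(env, dev, th);
                if (rc == 0)
                        rc = dt_record_write(env, obj, buf, &pos, th);
        }

        rc2 = dt_trans_stop(env, dev, th);
        return rc ? rc : rc2;
}
|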
| Comment by Andreas Dilger [ 26/Aug/18 ] |
|
One problem we have with ZFS TXG sync is that it doesn't force a commit, it only blocks the thread until the TXG commit happens. That should be equivalent from the point of view of the client, but it is possible the test doesn't expect this? Until we have ZIL support, it would be nice to have a way to force a TXG commit, especially with flash devices that have no seek latency.
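To illustrate the two durability paths being contrasted (a hedged sketch using in-kernel ZFS entry points; the wrapper names are made up, and osd-zfs today only has the first path):

#include <sys/dmu_objset.h>
#include <sys/txg.h>
#include <sys/zil.h>

/* Block the caller until the given TXG has been written out (txg == 0 waits
 * for everything currently open); this is what osd_trans_stop() relies on
 * today for th_sync. */
static void wait_txg_durable(objset_t *os, uint64_t txg)
{
        txg_wait_synced(dmu_objset_pool(os), txg);
}

/* With ZIL support, a single object's pending changes could instead be made
 * durable via an intent-log write, without waiting for a full TXG sync. */
static void commit_via_zil(objset_t *os, uint64_t object)
{
        zil_commit(dmu_objset_zil(os), object);
}
|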
| Comment by Lai Siyao [ 27/Aug/18 ] |
|
The FID seq file is updated in sync mode, so upon MDS failure it won't allocate a duplicate FID. replay-single.sh test_2d fails because it happens to allocate a new super sequence right before the failure; if it is run separately it passes. |
| Comment by Alex Zhuravlev [ 28/Aug/18 ] |
|
Andreas, I added some simple debugging (print the time it takes to commit a TXG upon a forced commit) and triggered it many times during sanity.sh:

sync took 8276 usec |
| Comment by Andreas Dilger [ 28/Aug/18 ] |
|
Alex, what are you using to force a TXG commit in your tests, and are you sure that the commit is actually making it to disk? It seems very fast in some cases, 0.1 ms for a commit...

static int osd_trans_stop(const struct lu_env *env, struct dt_device *dt,
                          struct thandle *th)
{
        if (sync) {
                if (osd_txg_sync_delay_us < 0)
                        txg_wait_synced(dmu_objset_pool(osd->od_os), txg);
                else
                        udelay(osd_txg_sync_delay_us);
        }

My understanding is that txg_wait_synced() will just wait for the TXG commit to happen; it can't necessarily force a TXG commit like jbd2_journal_stop() can. While we don't want every thread forcing a TXG commit, IMHO the jbd2 code has it right: wait a short time for more threads to join the TXG and commit, then force the commit. The length of time to wait should be based on the current storage commit time, as jbd2_journal_stop() does, so that it is fast on flash and slower on disks to be most efficient.
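A rough illustration of that jbd2-style batching idea (hedged sketch, not landed code; osd_txg_commit_batched() and avg_commit_us are invented names, only usleep_range(), txg_wait_synced() and dmu_objset_pool() are real calls):

#include <linux/delay.h>

/* Wait a short, storage-dependent window so other sync writers can join the
 * same TXG, then block until that TXG is on disk. */
static void osd_txg_commit_batched(struct osd_device *osd, uint64_t txg,
                                   unsigned long avg_commit_us)
{
        /* batching window scaled by the last observed commit time */
        usleep_range(avg_commit_us / 4, avg_commit_us / 2);

        /* wait for the TXG carrying this update (and anything piggy-backed) */
        txg_wait_synced(dmu_objset_pool(osd->od_os), txg);
}
|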
| Comment by Alex Zhuravlev [ 29/Aug/18 ] |
|
the patch I used:

@@ -276,6 +276,17 @@ static int zfs_osd_mntdev_seq_show(struct seq_file *m, void *data)
 }
 LPROC_SEQ_FOPS_RO(zfs_osd_mntdev);
+
+struct osd_sync_cb_data {
+	ktime_t time;
+};
+
+static void osd_sync_cb(void *cb_data, int error)
+{
+	struct osd_sync_cb_data *cb = cb_data;
+	printk("sync took %llu usec\n", ktime_us_delta(ktime_get(), cb->time));
+	OBD_FREE_PTR(cb);
+}
+
 static ssize_t
 lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
 				 size_t count, loff_t *off)
@@ -288,6 +299,19 @@ lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
 	rc = lu_env_init(&env, LCT_LOCAL);
 	if (rc)
 		return rc;
+	{
+		struct osd_device *osd = osd_dt_dev(dt);
+		struct osd_sync_cb_data *cb;
+		dmu_tx_t *tx;
+
+		OBD_ALLOC_PTR(cb);
+		cb->time = ktime_get();
+		tx = dmu_tx_create(osd->od_os);
+		dmu_tx_assign(tx, TXG_WAIT);
+		dmu_tx_callback_register(tx, osd_sync_cb, cb);
+		dmu_tx_commit(tx);
+	}

i.e. I registered a commit callback and just calculate/printk the elapsed time when it is called.

txg_wait_synced() asks the sync thread to initiate a TXG commit:
if (tx->tx_sync_txg_waiting < txg)
tx->tx_sync_txg_waiting = txg;
...
cv_broadcast(&tx->tx_sync_more_cv);
|
| Comment by Andreas Dilger [ 03/Sep/18 ] |
|
Alex, is your testing running on an HDD, SSD, or ramdisk? On an HDD I can't see how a TXG commit could ever complete in less than 4x the seek time (just for the überblock writes, 2x each at the start and end of the device), which is about 32-40 ms.
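As a sanity check on that estimate: 4 überblock seeks x ~8-10 ms per seek on a typical HDD = 32-40 ms, so sub-millisecond "commits" would only be plausible on flash or a ramdisk. |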
| Comment by Gerrit Updater [ 03/Sep/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33106 |
| Comment by Gerrit Updater [ 08/Sep/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33106/ |
| Comment by Alex Zhuravlev [ 10/Oct/18 ] |
|
adilger, sorry, I missed your question.. I'm using a ramdisk. |
| Comment by Alex Zhuravlev [ 24/Jan/19 ] |
|
I think this is a dup of |