[LU-10740] replay-single test_2d: FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed Created: 28/Feb/18  Updated: 26/Feb/19  Resolved: 26/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Elena Gryaznova Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: dne, zfs

Attachments: Zip Archive 5a935964f72e62d872670430.zip    
Issue Links:
Duplicate
duplicates LU-10143 LBUG dt_object.h:2166:dt_declare_reco... Resolved
is duplicated by LU-9157 replay-single test_80c: rmdir failed Resolved
Related
is related to LU-11336 replay-single test 80d hangs on MDT u... Open
is related to LU-11366 replay-single timeout test 80f: rm: c... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

2.7 (IEEL)-based server, 2.10.58 client

== replay-single test 2d: setdirstripe replay ======================================================== 00:25:51 (1519604751)
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       344152        2752      317200   1% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       344152        2492      317460   1% /mnt/lustre[MDT:1]
lustre-OST0000_UUID       664400       17712      597688   3% /mnt/lustre[OST:0]
lustre-OST0001_UUID       664400       17712      597688   3% /mnt/lustre[OST:1]

filesystem_summary:      1328800       35424     1195376   3% /mnt/lustre

Failing mds1 on fre1209
Stopping /mnt/lustre-mds1 (opts:) on fre1209
pdsh@fre1211: fre1209: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1209
00:26:02 (1519604762) waiting for fre1209 network 900 secs ...
00:26:02 (1519604762) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/lustre-mds1
pdsh@fre1211: fre1209: ssh exited with exit code 1
pdsh@fre1211: fre1209: ssh exited with exit code 1
Started lustre-MDT0000
fre1212: fre1212: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
fre1211: fre1211: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
fre1212: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
fre1211: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
Can't lstat /mnt/lustre/d2d.replay-single: No such file or directory
 replay-single test_2d: @@@@@@ FAIL: checkstat -v  /mnt/lustre/d2d.replay-single check failed 
 

100% reproducible with review-dne-zfs-part-4



 Comments   
Comment by Sarah Liu [ 15/Mar/18 ]

also hit on master branch, zfs DNE

https://testing.hpdd.intel.com/test_sets/d05367b0-2799-11e8-9e0e-52540065bddc

mds dmesg

[64823.580533] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 2d: setdirstripe replay ======================================================== 17:06:03 \(1520960763\)
[64823.808722] Lustre: DEBUG MARKER: == replay-single test 2d: setdirstripe replay ======================================================== 17:06:03 (1520960763)
[64824.017795] Lustre: DEBUG MARKER: sync; sync; sync
[64825.224138] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[64825.565253] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[64826.042689] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[64826.212687] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[64826.695274] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[64827.035062] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[64827.574756] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
lctl dl | grep ' ST ' || true
[64827.930760] Lustre: DEBUG MARKER: ! zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
 grep -q ^lustre-mdt1/ /proc/mounts ||
 zpool export lustre-mdt1
[64838.339263] Lustre: DEBUG MARKER: hostname
[64838.749896] Lustre: DEBUG MARKER: lsmod | grep zfs >&/dev/null || modprobe zfs;
 zpool list -H lustre-mdt1 >/dev/null 2>&1 ||
 zpool import -f -o cachefile=none -o failmode=panic -d /dev/lvm-Role_MDS lustre-mdt1
[64839.417169] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1
[64839.751101] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre lustre-mdt1/mdt1 /mnt/lustre-mds1
[64841.981159] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
[64842.324523] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
[64842.971945] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64842.974147] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64843.192600] Lustre: DEBUG MARKER: onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64843.192672] Lustre: DEBUG MARKER: onyx-43vm8.onyx.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[64843.395032] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
[64843.733809] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null | grep -E ':[a-zA-Z]\{3}[0-9]\{4}'
[64844.079675] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null | grep -E ':[a-zA-Z]\{3}[0-9]\{4}'
[64844.425889] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname lustre-mdt1/mdt1 2>/dev/null
[64845.202422] Lustre: lustre-MDT0000: Will be in recovery for at least 1:00, or until 5 clients reconnect
[64845.389052] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm5.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64845.389056] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-43vm6.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64845.655584] Lustre: DEBUG MARKER: onyx-43vm5.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64845.687723] Lustre: DEBUG MARKER: onyx-43vm6.onyx.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[64846.599842] LustreError: 29449:0:(llog.c:1012:llog_write_rec()) lustre-MDT0000-osd: loghandle ffff88004c471e00 with no header
[64846.601307] LustreError: 29449:0:(llog_cat.c:500:llog_cat_add_rec()) llog_write_rec -71: lh=ffff88004c471e00
[64846.602541] LustreError: 29449:0:(update_trans.c:994:top_trans_stop()) lustre-MDT0000-mdtlov: write updates failed: rc = -71
[64846.603937] LustreError: 29449:0:(tgt_lastrcvd.c:1226:tgt_last_rcvd_update()) lustre-MDT0000: replay transno 38654705669 failed: rc = -71
[64846.613367] Lustre: lustre-MDT0000: disconnecting 1 stale clients
[64846.614170] LustreError: 29449:0:(tgt_grant.c:248:tgt_grant_sanity_check()) mdt_obd_disconnect: tot_granted 2097152 != fo_tot_granted 4194304
[64847.016802] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 1 sec
[64847.244576] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 1 sec
[64854.015289] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 8 sec
[64854.220999] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 8 sec
[64855.647689] Lustre: DEBUG MARKER: /usr/sbin/lctl mark replay-single test_2d: @@@@@@ FAIL: checkstat -v \/mnt\/lustre\/d2d.replay-single check failed 
[64855.876070] Lustre: DEBUG MARKER: replay-single test_2d: @@@@@@ FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed

 

Comment by Andreas Dilger [ 08/Aug/18 ]

The error being returned is -71 = -EREMOTE, which suggests the RPC is somehow being sent to the wrong MDT. However, it may also be fallout from the "loghandle XXX with no header" message, which implies the llog file is zero length and wasn't written correctly. It's hard to see why an -EREMOTE problem would relate only to DNE+ZFS, so the second explanation seems more plausible.

Comment by Andreas Dilger [ 08/Aug/18 ]

Lai, could you please take a look? James reports that this is failing 100% of the time with review-dne-zfs-part-4, which is currently an optional test run, but we want to make it a required test.

Comment by Gerrit Updater [ 10/Aug/18 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/32976
Subject: LU-10740 test: test replay-single 2d with full debug
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 958da17e9abf0504a802963ba8a7fa0a9afc075b

Comment by Lai Siyao [ 22/Aug/18 ]

The failed call trace is:

llog_add()
 -> llog_cat_add_rec()
  -> llog_cat_new_log()
   -> llog_osd_create()
    -> osd_create()
     -> osd_zap_add()
        returns -EEXIST

but loghandle->lgh_hdr is NULL, which causes llog_write_rec() to return -EPROTO (-71). But I don't understand why it fails only on ZFS.
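
For reference, below is a minimal sketch (inferred from the console message, not copied from lustre/obdclass/llog.c) of the kind of guard in the llog write path that prints "loghandle ... with no header" and returns -EPROTO when lgh_hdr was never initialized:

/* Sketch only: simplified entry check, assuming llog_write_rec() validates
 * lgh_hdr before packing the record; field names follow struct llog_handle. */
static int llog_write_rec_sketch(const struct lu_env *env,
                                 struct llog_handle *loghandle,
                                 struct llog_rec_hdr *rec, int idx)
{
        if (loghandle->lgh_hdr == NULL) {
                CERROR("%s: loghandle %p with no header\n",
                       loghandle->lgh_ctxt->loc_obd->obd_name, loghandle);
                return -EPROTO; /* -71, the rc later seen in tgt_last_rcvd_update() */
        }
        /* ... normal record packing and write-out would continue here ... */
        return 0;
}

This matches the dmesg sequence above: llog_write_rec -71, then top_trans_stop() and tgt_last_rcvd_update() propagating rc = -71 through the replay path.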

Comment by Lai Siyao [ 24/Aug/18 ]

I found something strange in the log:

40000000:00000040:1.0:1535051095.957588:0:12953:0:(fid_handler.c:348:__seq_server_alloc_meta()) srv-lustre-MDT0000: Allocated meta-sequence [0x0000000200001b70-0x0000000200001b71]:0:mdt
40000000:00000040:1.0:1535051095.957598:0:12953:0:(fid_request.c:400:seq_client_alloc_fid()) cli-lustre-MDT0000: Allocated FID [0x200001b70:0x1:0x0]
00000004:00000040:1.0:1535051095.957601:0:12953:0:(lod_object.c:1984:lod_prep_md_striped_create()) Get idx 0, for stripe 0 [0x200001b70:0x1:0x0]
00000004:00000040:1.0:1535051095.957607:0:12953:0:(mdt_handler.c:5324:mdt_object_init()) object init, fid = [0x200001b70:0x1:0x0]
00000001:00000002:1.0:1535051095.958937:0:12953:0:(linkea.c:173:linkea_add_buf()) New link_ea name '[0x2000013a0:0xb:0x0]:[0x200001b70:0x1:0x0]:0' is added
00000020:00000040:1.0:1535051095.960470:0:12953:0:(update_records.c:240:update_records_update_pack()) 4th [0x200001b70:0x1:0x0] create param_count = 7
00000040:00000001:0.0:1535051123.894990:0:16387:0:(llog_osd.c:1225:llog_osd_open()) Process entered
00000040:00000001:0.0:1535051123.894991:0:16387:0:(obd_class.h:911:obd_fid_alloc()) Process entered
40000000:00000040:0.0:1535051123.941029:0:16387:0:(fid_request.c:400:seq_client_alloc_fid()) cli-lustre-MDT0000: Allocated FID [0x200001b70:0x1:0x0]

The scenario is as follows:
1. setdirstripe allocated [0x200001b70:0x1:0x0] for its first stripe on MDT0, and it also allocated a new meta sequence.
2. Replay barrier.
3. On replay of setdirstripe, MDT0 got update logs from MDT1 and needed to create an update log. When allocating a FID for this update log it allocated the same meta sequence as in step 1, and therefore the same FID as the first stripe.
4. llog_osd_create() -> osd_create() -> osd_zap_add() found the FID already in the OI, so llog_init_handle() was skipped, which left loghandle->lgh_hdr unset, and finally llog_write_rec() returned -EPROTO.

I'm not sure whether it's related to ZFS, because seq_store_update() updated the FLDB with th_sync set.
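
To make the reuse concrete, here is a small self-contained illustration of the suspected failure mode (user-space C with hypothetical names, not Lustre code): if the sequence allocator is re-seeded on restart from an on-disk value that never actually became durable before the failover, it hands out the same identifier twice:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-in for the on-disk sequence state: 'durable' models what
 * actually reached stable storage before the replay barrier / failover. */
struct seq_store {
        uint64_t in_memory;     /* next sequence the running server hands out */
        uint64_t durable;       /* last value really committed to disk */
};

static uint64_t seq_alloc(struct seq_store *s, bool synced)
{
        uint64_t seq = s->in_memory++;
        if (synced)
                s->durable = s->in_memory;  /* th_sync honoured: update on disk */
        return seq;
}

static void restart(struct seq_store *s)
{
        s->in_memory = s->durable;          /* server restarts from on-disk view */
}

int main(void)
{
        struct seq_store s = { .in_memory = 0x200001b70ULL,
                               .durable   = 0x200001b70ULL };

        uint64_t stripe = seq_alloc(&s, false); /* step 1: allocation not on disk */
        restart(&s);                            /* step 2: replay barrier + restart */
        uint64_t llog = seq_alloc(&s, false);   /* step 3: update-llog allocation */

        printf("stripe seq 0x%llx, llog seq 0x%llx -> %s\n",
               (unsigned long long)stripe, (unsigned long long)llog,
               stripe == llog ? "duplicate" : "distinct");
        return 0;
}

In the failing run the second allocation is the FID chosen for the update llog, so osd_zap_add() finds the replayed stripe's FID already present in the OI, returns -EEXIST, and llog_init_handle() is skipped as described in step 4.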

Comment by Andreas Dilger [ 24/Aug/18 ]

Is it possible the MDT is running with osd_object_sync_delay_us=0? This was proposed in a recent patch, but I don't think it has landed yet, and it was initially limited to a single test.

Comment by Lai Siyao [ 26/Aug/18 ]

It doesn't look to be related to osd_object_sync_delay_us. I added some debug logging, and the result shows that th_sync doesn't work on ZFS: the debug log shows the seq file was written synchronously (osd_trans_commit_cb() was called before osd_trans_stop() returned), but after the replay barrier the server read a stale seq file on startup. I also tweaked the code to do the create in sync mode, and the result likewise shows the file does not exist after restart.

Is there a simple way to verify this?

Comment by Andreas Dilger [ 26/Aug/18 ]

One problem we have with ZFS TXG sync is that it doesn't force a commit, it only blocks the thread until the TXG commit happens. That should be equivalent from the client's point of view, but it is possible the test doesn't expect this?

Until we have ZIL support, it would be nice to have a way to force a TXG commit, especially with flash devices that have no seek latency.

Comment by Lai Siyao [ 27/Aug/18 ]

The FID seq file is updated in sync mode, so upon MDS failure it won't allocate a duplicate FID.

replay-single.sh test 2d fails because it happens to allocate a new super sequence before the failure; if it is run separately it passes.

Comment by Alex Zhuravlev [ 28/Aug/18 ]

Andreas, I added some simple debugging that prints the time it takes to commit a TXG upon a forced commit, and triggered it many times while running sanity.sh:

sync took 8276 usec
sync took 26489 usec
sync took 15471 usec
sync took 8526 usec
sync took 36187 usec
sync took 16634 usec
sync took 283 usec
sync took 8142 usec
sync took 7335 usec
sync took 609 usec
sync took 641 usec
sync took 660 usec
sync took 9278 usec
sync took 396 usec
sync took 17179 usec
sync took 401 usec
sync took 455 usec
sync took 403 usec
sync took 532 usec
sync took 424 usec
sync took 9057 usec
sync took 444 usec
sync took 11116 usec
sync took 16007 usec
sync took 16611 usec
sync took 680 usec
sync took 240 usec
sync took 289 usec
sync took 256 usec
sync took 289 usec
sync took 311 usec
sync took 7353 usec
sync took 285 usec
sync took 9369 usec
sync took 13867 usec
sync took 275 usec
sync took 6440 usec
sync took 327 usec
sync took 14827 usec
sync took 15640 usec
sync took 17490 usec
sync took 113 usec
sync took 524 usec
sync took 268 usec
sync took 375 usec
sync took 15034 usec
sync took 297 usec
sync took 247 usec
sync took 401 usec
sync took 9492 usec

Comment by Andreas Dilger [ 28/Aug/18 ]

Alex, what are you using to force a TXG commit in your tests, and are you sure that the commit is actually making it to disk? It seems very fast in some cases, 0.1ms for a commit...

static int osd_trans_stop(const struct lu_env *env, struct dt_device *dt,
                          struct thandle *th)
{
        if (sync) {
                if (osd_txg_sync_delay_us < 0)
                        txg_wait_synced(dmu_objset_pool(osd->od_os), txg);
                else
                        udelay(osd_txg_sync_delay_us);
        }

My understanding is that txg_wait_synced() will just wait for the TXG commit to happen; it can't necessarily force a TXG commit the way jbd2_journal_stop() can. While we don't want every thread forcing a TXG commit, IMHO the jbd2 code has it right: wait a short time for more threads to join the TXG and commit, then force the commit. The length of the wait should be based on the current storage commit time, as jbd2_journal_stop() does, so that it is fast on flash and slower on disks, to be most efficient.
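
As a rough illustration of that jbd2-style policy (wait briefly so other stopping threads can batch into the same TXG, then actively force the commit, sizing the wait from the measured commit latency), here is a sketch; od_avg_commit_us, txg_is_synced() and force_txg_commit() are hypothetical placeholders, not existing osd-zfs or ZFS interfaces:

/* Hypothetical sketch, not a proposed patch: adaptive batch-then-force commit,
 * modelled on jbd2_journal_stop()'s average-commit-time heuristic. */
static void osd_txg_commit_adaptive(struct osd_device *osd, uint64_t txg)
{
        /* sleep roughly one average commit time so other threads can join the TXG */
        usleep_range(osd->od_avg_commit_us / 2, osd->od_avg_commit_us);

        if (!txg_is_synced(osd, txg)) {         /* hypothetical check */
                ktime_t start = ktime_get();

                force_txg_commit(osd, txg);     /* hypothetical: kick the sync thread */
                txg_wait_synced(dmu_objset_pool(osd->od_os), txg);

                /* fold the observed latency back into the running estimate */
                osd->od_avg_commit_us =
                        (3 * osd->od_avg_commit_us +
                         ktime_us_delta(ktime_get(), start)) / 4;
        }
}

Basing the wait on the measured commit time keeps such a policy fast on flash and ramdisk while avoiding needless forced commits on spinning disks, which is the behaviour described above.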

Comment by Alex Zhuravlev [ 29/Aug/18 ]

the patch I used:

@@ -276,6 +276,17 @@ static int zfs_osd_mntdev_seq_show(struct seq_file *m, void *data)
 }
 LPROC_SEQ_FOPS_RO(zfs_osd_mntdev);
 
+struct osd_sync_cb_data {
+	ktime_t time;
+};
+
+static void osd_sync_cb(void *cb_data, int error)
+{
+	struct osd_sync_cb_data *cb = cb_data;
+	printk("sync took %llu usec\n", ktime_us_delta(ktime_get(), cb->time));
+	OBD_FREE_PTR(cb);
+}
+
 static ssize_t
 lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
 				size_t count, loff_t *off)
@@ -288,6 +299,19 @@ lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
 	rc = lu_env_init(&env, LCT_LOCAL);
 	if (rc)
 		return rc;
+	{
+		struct osd_device  *osd = osd_dt_dev(dt);
+		struct osd_sync_cb_data *cb;
+		dmu_tx_t *tx;
+
+		OBD_ALLOC_PTR(cb);
+		cb->time = ktime_get();
+		tx = dmu_tx_create(osd->od_os);
+		dmu_tx_assign(tx, TXG_WAIT);
+		dmu_tx_callback_register(tx, osd_sync_cb, cb);
+		dmu_tx_commit(tx);
+	}

i.e. I registered a commit callback and simply compute and printk the elapsed time when it is called.

txg_wait_synced() asks the sync thread to initiate a TXG commit:

	if (tx->tx_sync_txg_waiting < txg)
		tx->tx_sync_txg_waiting = txg;
...
		cv_broadcast(&tx->tx_sync_more_cv);

 

Comment by Andreas Dilger [ 03/Sep/18 ]

Alex, is your testing running on an HDD, SSD, or ramdisk? On an HDD I can't see how a TXG commit could ever complete in less than 4x the seek time (just for the überblock writes, 2x each at the start and end of the device), which is about 32-40 ms.

Comment by Gerrit Updater [ 03/Sep/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33106
Subject: LU-10740 tests: disable tests for replay-dne-zfs-part-4
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a7675ca33ec579525b4d39ad114b54ca3b462a62

Comment by Gerrit Updater [ 08/Sep/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33106/
Subject: LU-10740 tests: disable tests for replay-dne-zfs-part-4
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 16e92e2d01a71c2a97cae89c70c58abf409c12c0

Comment by Alex Zhuravlev [ 10/Oct/18 ]

adilger, sorry, I missed your question. I'm using a ramdisk.

Comment by Alex Zhuravlev [ 24/Jan/19 ]

I think this is a dup of LU-10143
