[LU-10740] replay-single test_2d: FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
Labels: dne, zfs
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

2.7 (IEEL) based server, 2.10.58 client

== replay-single test 2d: setdirstripe replay ======================================================== 00:25:51 (1519604751)
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID       344152        2752      317200   1% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       344152        2492      317460   1% /mnt/lustre[MDT:1]
lustre-OST0000_UUID       664400       17712      597688   3% /mnt/lustre[OST:0]
lustre-OST0001_UUID       664400       17712      597688   3% /mnt/lustre[OST:1]

filesystem_summary:      1328800       35424     1195376   3% /mnt/lustre

Failing mds1 on fre1209
Stopping /mnt/lustre-mds1 (opts:) on fre1209
pdsh@fre1211: fre1209: ssh exited with exit code 1
reboot facets: mds1
Failover mds1 to fre1209
00:26:02 (1519604762) waiting for fre1209 network 900 secs ...
00:26:02 (1519604762) network interface is UP
mount facets: mds1
Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/lustre-mds1
pdsh@fre1211: fre1209: ssh exited with exit code 1
pdsh@fre1211: fre1209: ssh exited with exit code 1
Started lustre-MDT0000
fre1212: fre1212: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
fre1211: fre1211: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
fre1212: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
fre1211: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
Can't lstat /mnt/lustre/d2d.replay-single: No such file or directory
 replay-single test_2d: @@@@@@ FAIL: checkstat -v  /mnt/lustre/d2d.replay-single check failed

100% reproducible with review-dne-zfs-part-4

Attachments

5a935964f72e62d872670430.zip (424 kB, 28/Feb/18 4:06 PM)

Issue Links

duplicates

LU-10143 LBUG dt_object.h:2166:dt_declare_record_write (Resolved)

is duplicated by

LU-9157 replay-single test_80c: rmdir failed (Resolved)

is related to

LU-11336 replay-single test 80d hangs on MDT unmount (Open)

LU-11366 replay-single timeout test 80f: rm: cannot remove '/mnt/lustre/d80f.replay-single/remote_dir': Input/output error (Resolved)

Activity

Alex Zhuravlev added a comment - 24/Jan/19 6:43 AM

I think this is a duplicate of LU-10143.

Alex Zhuravlev added a comment - 10/Oct/18 4:19 PM

adilger, sorry, I missed your question. I'm using a ramdisk.

Gerrit Updater added a comment - 08/Sep/18 5:33 AM

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33106/
Subject: LU-10740 tests: disable tests for replay-dne-zfs-part-4
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 16e92e2d01a71c2a97cae89c70c58abf409c12c0

Gerrit Updater added a comment - 03/Sep/18 11:21 PM

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33106
Subject: LU-10740 tests: disable tests for replay-dne-zfs-part-4
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a7675ca33ec579525b4d39ad114b54ca3b462a62

Andreas Dilger added a comment - 03/Sep/18 10:27 PM

Alex, is your testing running on an HDD, an SSD, or a ramdisk? On an HDD I can't see how a TXG commit could ever complete in less than 4x the seek time (just for the überblock writes, 2x each at the start and end of the device), which is about 32-40 ms.
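
The 32-40 ms figure is four seeks at roughly 8-10 ms each. The per-seek value is inferred from the quoted total, not stated in the comment, and the helper name below is hypothetical. A minimal sketch of the arithmetic:

```c
#include <assert.h>

/* Seeks needed for the uberblock writes: 2 at the start and 2 at the
 * end of the device, as described in the comment above. */
#define UBERBLOCK_SEEKS 4

/* Hypothetical helper: lower bound on TXG commit time, in ms, for a
 * device with the given average seek time in ms (assumed, not measured). */
double min_txg_commit_ms(double seek_ms)
{
	assert(seek_ms >= 0.0);
	return UBERBLOCK_SEEKS * seek_ms;
}
```

Calling min_txg_commit_ms() with 8.0 and 10.0 gives the two ends of the quoted range. On a ramdisk the seek term is effectively zero, so this bound says nothing about how fast a commit can complete there.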

Alex Zhuravlev added a comment - 29/Aug/18 3:31 AM - edited

The patch I used:

@@ -276,6 +276,17 @@ static int zfs_osd_mntdev_seq_show(struct seq_file *m, void *data)
 }
 LPROC_SEQ_FOPS_RO(zfs_osd_mntdev);
 
+struct osd_sync_cb_data {
+	ktime_t time;
+};
+
+static void osd_sync_cb(void *cb_data, int error)
+{
+	struct osd_sync_cb_data *cb = cb_data;
+	printk("sync took %llu usec\n", ktime_us_delta(ktime_get(), cb->time));
+	OBD_FREE_PTR(cb);
+}
+
 static ssize_t
 lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
 				size_t count, loff_t *off)
@@ -288,6 +299,19 @@ lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
 	rc = lu_env_init(&env, LCT_LOCAL);
 	if (rc)
 		return rc;
+	{
+		struct osd_device  *osd = osd_dt_dev(dt);
+		struct osd_sync_cb_data *cb;
+		dmu_tx_t *tx;
+
+		OBD_ALLOC_PTR(cb);
+		cb->time = ktime_get();
+		tx = dmu_tx_create(osd->od_os);
+		dmu_tx_assign(tx, TXG_WAIT);
+		dmu_tx_callback_register(tx, osd_sync_cb, cb);
+		dmu_tx_commit(tx);
+	}

That is, I registered a commit callback that computes the elapsed time and prints it with printk when it is called.
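
The same timing pattern can be tried outside the kernel. The sketch below is a userspace analogue only: it swaps ktime_get() for clock_gettime(CLOCK_MONOTONIC) and OBD_ALLOC_PTR/OBD_FREE_PTR for malloc/free, and the names sync_cb_data, sync_cb_data_start, elapsed_usec, and sync_cb are hypothetical stand-ins, not Lustre or ZFS APIs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical userspace stand-in for osd_sync_cb_data in the patch. */
struct sync_cb_data {
	struct timespec start;
};

/* Microseconds elapsed between two monotonic timestamps. */
int64_t elapsed_usec(const struct timespec *start, const struct timespec *end)
{
	int64_t nsec = (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL +
		       (int64_t)(end->tv_nsec - start->tv_nsec);

	return nsec / 1000;
}

/* Allocate the tracking data and stamp the start time; NULL on failure. */
struct sync_cb_data *sync_cb_data_start(void)
{
	struct sync_cb_data *cb = malloc(sizeof(*cb));

	if (cb != NULL)
		clock_gettime(CLOCK_MONOTONIC, &cb->start);
	return cb;
}

/* Commit callback: print the latency, then free the tracking data. */
void sync_cb(void *cb_data, int error)
{
	struct sync_cb_data *cb = cb_data;
	struct timespec now;

	assert(cb != NULL);
	clock_gettime(CLOCK_MONOTONIC, &now);
	printf("sync took %lld usec (error %d)\n",
	       (long long)elapsed_usec(&cb->start, &now), error);
	free(cb);
}
```

A caller would obtain the data with sync_cb_data_start(), check it for NULL, and hand it to whatever invokes sync_cb() on completion. Unlike the patch as posted, this sketch checks the allocation result before stamping the start time.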

txg_wait_synced() asks the sync thread to initiate a txg commit:

	if (tx->tx_sync_txg_waiting < txg)
		tx->tx_sync_txg_waiting = txg;
...
		cv_broadcast(&tx->tx_sync_more_cv);
