replay-single test_2d: FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.11.0, Lustre 2.12.0
    • 3
    • 9223372036854775807

    Description

      2.7 (IEEL) based server, 2.10.58 client

      == replay-single test 2d: setdirstripe replay ======================================================== 00:25:51 (1519604751)
      UUID                   1K-blocks        Used   Available Use% Mounted on
      lustre-MDT0000_UUID       344152        2752      317200   1% /mnt/lustre[MDT:0]
      lustre-MDT0001_UUID       344152        2492      317460   1% /mnt/lustre[MDT:1]
      lustre-OST0000_UUID       664400       17712      597688   3% /mnt/lustre[OST:0]
      lustre-OST0001_UUID       664400       17712      597688   3% /mnt/lustre[OST:1]
      
      filesystem_summary:      1328800       35424     1195376   3% /mnt/lustre
      
      Failing mds1 on fre1209
      Stopping /mnt/lustre-mds1 (opts:) on fre1209
      pdsh@fre1211: fre1209: ssh exited with exit code 1
      reboot facets: mds1
      Failover mds1 to fre1209
      00:26:02 (1519604762) waiting for fre1209 network 900 secs ...
      00:26:02 (1519604762) network interface is UP
      mount facets: mds1
      Starting mds1: -o rw,user_xattr  /dev/vdb /mnt/lustre-mds1
      pdsh@fre1211: fre1209: ssh exited with exit code 1
      pdsh@fre1211: fre1209: ssh exited with exit code 1
      Started lustre-MDT0000
      fre1212: fre1212: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      fre1211: fre1211: executing wait_import_state_mount FULL mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      fre1212: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      fre1211: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 4 sec
      Can't lstat /mnt/lustre/d2d.replay-single: No such file or directory
       replay-single test_2d: @@@@@@ FAIL: checkstat -v  /mnt/lustre/d2d.replay-single check failed 
       
      

      100% reproducible with review-dne-zfs-part-4

          Activity

            [LU-10740] replay-single test_2d: FAIL: checkstat -v /mnt/lustre/d2d.replay-single check failed

            bzzz Alex Zhuravlev added a comment - I think this is a dup of LU-10143

            bzzz Alex Zhuravlev added a comment - adilger, sorry, missed your question. I'm using a ramdisk.

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33106/
            Subject: LU-10740 tests: disable tests for replay-dne-zfs-part-4
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 16e92e2d01a71c2a97cae89c70c58abf409c12c0

            gerrit Gerrit Updater added a comment - Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33106
            Subject: LU-10740 tests: disable tests for replay-dne-zfs-part-4
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a7675ca33ec579525b4d39ad114b54ca3b462a62

            adilger Andreas Dilger added a comment - Alex, is your testing running with an HDD or SSD or ramdisk? On an HDD I can't see how a TXG commit could ever complete in less than 4x seek time (just for the überblock writes, 2x each at the start and end of the device), which is about 32-40 ms.
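
The estimate in the comment above can be written out as a small calculation. This is only a sketch of the arithmetic: the count of four seeks comes from the comment (two überblock writes at the start of the device and two at the end), while the 8-10 ms seek time is what the quoted 32-40 ms figure implies when divided by four. The function name is hypothetical, and real commit latency would also include the data and metadata writes that this bound ignores.

```c
/* Seeks needed for the überblock writes alone, per the comment above:
 * 2 at the start of the device and 2 at the end. */
#define UBERBLOCK_SEEKS 4

/* Lower bound on TXG commit latency in milliseconds for a given average
 * seek time, counting only the überblock writes (hypothetical helper). */
static double min_txg_commit_ms(double seek_ms)
{
	return UBERBLOCK_SEEKS * seek_ms;
}
```

With an assumed seek time of 8 ms the bound is 32 ms, and with 10 ms it is 40 ms, matching the range quoted in the comment. A ramdisk has no seek cost, so this bound would not apply there.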
            bzzz Alex Zhuravlev added a comment - - edited

            the patch I used:

            @@ -276,6 +276,17 @@ static int zfs_osd_mntdev_seq_show(struct seq_file *m, void *data)
             }
             LPROC_SEQ_FOPS_RO(zfs_osd_mntdev);
             
            +struct osd_sync_cb_data {
            +	ktime_t time;
            +};
            +
            +static void osd_sync_cb(void *cb_data, int error)
            +{
            +	struct osd_sync_cb_data *cb = cb_data;
            +	printk("sync took %llu usec\n", ktime_us_delta(ktime_get(), cb->time));
            +	OBD_FREE_PTR(cb);
            +}
            +
             static ssize_t
             lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
             				size_t count, loff_t *off)
            @@ -288,6 +299,19 @@ lprocfs_osd_force_sync_seq_write(struct file *file, const char __user *buffer,
             	rc = lu_env_init(&env, LCT_LOCAL);
             	if (rc)
             		return rc;
            +	{
            +		struct osd_device  *osd = osd_dt_dev(dt);
            +		struct osd_sync_cb_data *cb;
            +		dmu_tx_t *tx;
            +
            +		OBD_ALLOC_PTR(cb);
            +		cb->time = ktime_get();
            +		tx = dmu_tx_create(osd->od_os);
            +		dmu_tx_assign(tx, TXG_WAIT);
            +		dmu_tx_callback_register(tx, osd_sync_cb, cb);
            +		dmu_tx_commit(tx);
            +	}
            

            I.e., I registered a commit callback that computes the elapsed time and printk's it when it's called.

            txg_wait_synced() asks the sync thread to initiate a txg commit:

            	if (tx->tx_sync_txg_waiting < txg)
            		tx->tx_sync_txg_waiting = txg;
            ...
            		cv_broadcast(&tx->tx_sync_more_cv);
            

             


            People

              laisiyao Lai Siyao
              egryaznova Elena Gryaznova