[LU-763] DIO write does not force sync journal commit on OST Created: 14/Oct/11  Updated: 29/May/17  Resolved: 29/May/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Eric Barton (Inactive) Assignee: Oleg Drokin
Resolution: Cannot Reproduce Votes: 1
Labels: None

Severity: 2
Rank (Obsolete): 10463

 Description   

using sgp_dd to write to a Lustre file with async journal commits enabled, the JBD history shows that journal entries are being committed only at the timeout interval, and not synchronously on each write. This will cause data loss if the OST crashes after the data is written but before the journal is flushed.



 Comments   
Comment by Peter Jones [ 14/Oct/11 ]

Oleg

can you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 14/Oct/11 ]

Eric, the clients will do bulk RPC recovery if the io does not commit, because the transno for the RPC is not reported as committed yet.

This is needed for ZFS also, which would suffer terribly if it had to do sync writes w/o the ZIL.

Comment by Johann Lombardi (Inactive) [ 14/Oct/11 ]

hm, weird, OBD_BRW_ASYNC is not set for dios, so we should really be triggering a journal flush & wait.

Comment by Eric Barton (Inactive) [ 14/Oct/11 ]

Sorry my bad, but forgot to say the whole point of this bug which is that sgp_dd had "dio=1" set - i.e. the client was doing O_DIRECT writes which should force sync journal writes since the client cannot replay such writes.

Comment by Oleg Drokin [ 14/Oct/11 ]

I just performed a number of tests and I don't think I can reproduce this at all. Tried on 1.8.6-wc and 1.8.7-rc1

Here is my testcase:
mount lustre, echo 0 to /proc/fs/lustre/obdfilter/*/sync_journal
dd if=/dev/zero of=/mnt/lustre/file bs=1024k count=100 oflags=direct # this is directio write
dd if=/dev/zero of=/mnt/lustre/file1 bs=1024k count=100 # this is normal write

  1. both writes happened to be to the same ost, now check jornal history and we see first write had a bunch of small transactions for directio write followed by just a single transaction for the non-directio write.
    R/C tid wait run lock flush log hndls block inlog ctime write drop close
    R 21 0 0 0 0 0 1 4 5
    R 22 0 0 0 0 0 1 4 5
    R 23 0 0 0 0 0 1 4 5
    R 24 0 0 0 0 0 1 4 5
    R 25 0 10 0 0 0 1 4 5
    R 26 0 0 0 0 0 1 4 5
    R 27 0 0 0 0 0 1 4 5
    R 28 0 10 0 0 0 1 4 5
    R 29 0 0 0 0 0 1 4 5
    R 30 0 10 0 0 0 1 4 5
    R 31 0 0 0 0 0 1 4 5
    R 32 0 10 0 0 0 1 4 5
    R 33 0 0 0 0 0 1 4 5
    R 34 0 10 0 0 0 1 4 5
    R 35 0 0 0 0 0 1 4 5
    R 36 0 10 0 0 0 1 4 5
    R 37 0 10 0 0 0 1 4 5
    R 38 0 0 0 0 0 1 4 5
    R 39 0 0 0 0 0 1 4 5
    R 40 0 10 0 0 0 1 4 5
    R 41 0 10 0 0 0 1 4 5
    R 42 0 0 0 0 0 1 4 5
    R 43 0 0 0 0 0 1 4 5
    R 44 0 10 0 0 0 1 4 5
    R 45 0 0 0 0 0 1 4 5
    R 46 0 10 0 0 0 1 4 5
    R 47 0 10 0 0 0 1 4 5
    R 48 0 0 0 0 0 1 4 5
    R 49 0 0 0 0 0 1 4 5
    R 50 0 0 0 0 0 1 4 5
    R 51 0 10 0 0 0 1 4 5
    R 52 0 0 0 0 0 1 4 5
    R 53 0 0 0 0 10 1 4 5
    R 54 0 10 0 0 0 1 4 5
    R 55 0 0 0 0 0 1 4 5
    R 56 0 0 0 0 0 1 4 5
    R 57 0 10 0 0 0 1 4 5
    R 58 0 0 0 0 10 1 4 5
    R 59 0 0 0 0 0 1 4 5
    R 60 0 0 0 0 0 1 4 5
    R 61 0 0 0 0 0 1 4 5
    R 62 0 10 0 0 0 1 4 5
    R 63 0 0 0 0 100 1 4 5
    R 64 0 0 0 0 10 1 4 5
    R 65 0 0 0 0 0 1 4 5
    R 66 0 10 0 0 0 1 4 5
    R 67 0 10 0 0 0 1 4 5
    R 68 0 0 0 0 10 1 4 5
    R 69 0 0 0 0 0 1 4 5
    R 70 0 0 0 0 0 1 4 5
    R 71 0 10 0 0 0 1 4 5
    R 72 0 10 0 0 0 1 4 5
    R 73 0 0 0 0 0 1 4 5
    R 74 0 10 0 0 0 1 4 5
    R 75 0 0 0 0 0 1 4 5
    R 76 0 0 0 0 0 1 4 5
    R 77 0 0 0 0 0 1 4 5
    R 78 0 0 0 0 0 1 4 5
    R 79 0 0 0 0 10 1 4 5
    R 80 0 10 0 0 0 1 4 5
    R 81 0 0 0 0 0 1 4 5
    R 82 0 0 0 0 0 1 4 5
    R 83 0 10 0 0 0 1 4 5
    R 84 0 0 0 0 10 1 4 5
    R 85 0 0 0 0 0 1 4 5
    R 86 0 0 0 0 0 1 4 5
    R 87 0 0 0 0 0 1 4 5
    R 88 0 10 0 0 0 1 4 5
    R 89 0 10 0 0 0 1 4 5
    R 90 0 0 0 0 0 1 4 5
    R 91 0 0 0 0 0 1 4 5
    R 92 0 10 0 0 0 1 4 5
    R 93 0 0 0 0 0 1 4 5
    R 94 0 10 0 0 0 1 4 5
    R 95 0 0 0 0 0 1 4 5
    R 96 0 0 0 0 0 1 4 5
    R 97 0 0 0 0 0 1 4 5
    R 98 0 0 0 0 0 1 4 5
    R 99 0 10 0 0 0 1 4 5
    R 100 0 0 0 0 0 1 4 5
    R 101 0 0 0 0 0 1 4 5
    R 102 0 0 0 0 0 1 4 5
    R 103 0 10 0 0 0 1 4 5
    R 104 0 0 0 0 0 1 4 5
    R 105 0 10 0 0 0 1 4 5
    R 106 0 0 0 0 0 1 4 5
    R 107 0 0 0 0 0 1 4 5
    R 108 0 0 0 0 10 1 4 5
    R 109 0 0 0 0 0 1 4 5
    R 110 0 10 0 0 0 1 4 5
    R 111 0 0 0 0 0 1 4 5
    R 112 0 10 0 0 0 1 4 5
    R 113 0 0 0 0 0 1 4 5
    R 114 0 0 0 0 0 1 4 5
    R 115 0 10 0 0 0 1 4 5
    R 116 0 0 0 0 0 1 4 5
    R 117 0 0 0 0 0 1 4 5
    R 118 0 10 0 0 0 1 4 5
    R 119 0 5010 0 0 10 117 6 7
Comment by Eric Barton (Inactive) [ 16/Oct/11 ]

Shame there aren't timestamps in the transaction log to prove that the non-dio I/O all happened after the last small transaction.

The program doing the writing when we observed DIO writes not doing sync journal commits was sgp_dd running as follows...

sgp_dd time=1 bs=512 bpt=2048 thr=16 dio=1 of=/mnt/lustre/ost0/tf2.out if=/dev/zero count=8192000 &

...and the file was in a directory where the default layout was 1 stripe on ost0. I can't imagine why any of that would make a difference, but I don't think we can rule out that it doesn't.

For the full evidence, look in the commitT-OUT5sec.tar.bz2 attached to NTAP-3. The transaction history is printed in writeJBDstats.before.out and writeJBDstats.after.out, and you can compare the 2 to see what occurred during the test run. See runWrite8OSTs.ksh for what got run.

<Note - edited this comment because commitT-OUT10sec.tar.bz2 didn't set dio=1 in the sgp_dd commands run>

Comment by Oleg Drokin [ 16/Oct/11 ]

I just checked sgp_dd man page here: http://linux.die.net/man/8/sgp_dd

The man page tells me that sgp_dd defaults to non-direct io and if you want direct io you must pass in dio=1 flag that seems to be absent in the command line provided.

Regarding the timestamp in transaction history, you can fully trust me that I did check the history in between the dd commands (and I did a sync before too) and at the completion of directio dd there were only short transactions in the log, the long one was added after the normal io dd was run.

Comment by Andreas Dilger [ 29/May/17 ]

Close old ticket.

Generated at Sat Feb 10 01:10:11 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.