[LU-13227] sanityn 16a FAIL: fsx with O_DIRECT failed. Created: 10/Feb/20  Updated: 08/Dec/22  Resolved: 11/Mar/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Wang Shilong (Inactive) Assignee: Wang Shilong (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File sanityn.test_16a.debug_log.tmp.1581518737.log.bz2     File sanityn.test_16a.debug_log.tmp.1581518998.log.bz2     File sanityn.test_16a.debug_log.tmp.1583187625.log.gz     Text File serial.txt     Text File stdout.txt    
Issue Links:
Related
is related to LU-14314 interop: sanityn test 16e fails with ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Size error: expected 0x659000 stat 0x145d12 seek 0x659000
LOG DUMP (849 total operations):

Full logs can be found here:

https://testing-archive.whamcloud.com/gerrit-janitor/6308/testresults/sanityn-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/sanityn.test_16a.test_log.oleg357-client.log

The problem might be related to the following sequence (a minimal reproducer sketch follows the list):

1) Client A ftruncates the file to file_size.
2) Client B does an O_DIRECT write (punching a hole) that updates the size to new_file_size.
3) Client A stats the file to check whether it gets new_file_size, but gets the old file_size instead.
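
To make the suspected sequence concrete, here is a minimal userspace sketch of the race, assuming the sanityn convention of the same file being visible through two client mounts (/mnt/lustre and /mnt/lustre2). The paths are assumptions, and the offsets only mirror the numbers in the failure above, not the actual fsx operation log:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        const char *patha = "/mnt/lustre/f16a";  /* client A's view */
        const char *pathb = "/mnt/lustre2/f16a"; /* client B's view */
        const off_t old_size = 0x145d12;         /* file_size */
        const off_t new_size = 0x659000;         /* new_file_size */
        struct stat st;
        void *buf;

        /* 1) Client A ftruncates the file to file_size. */
        int fda = open(patha, O_CREAT | O_RDWR, 0644);
        if (fda < 0 || ftruncate(fda, old_size) < 0)
                return 1;

        /* 2) Client B extends the file with an O_DIRECT write, leaving a
         *    hole behind it; this goes through the lockless direct IO
         *    path introduced by LU-4198. */
        int fdb = open(pathb, O_RDWR | O_DIRECT);
        if (fdb < 0 || posix_memalign(&buf, 4096, 4096) != 0)
                return 1;
        memset(buf, 0xab, 4096);
        if (pwrite(fdb, buf, 4096, new_size - 4096) != 4096)
                return 1;

        /* 3) Client A stats the file expecting new_file_size; the bug is
         *    that st_size can still be the old size because the glimpse
         *    enqueue was aborted and the cached size was never updated. */
        if (stat(patha, &st) == 0 && st.st_size != new_size)
                fprintf(stderr, "stale size 0x%llx, expected 0x%llx\n",
                        (unsigned long long)st.st_size,
                        (unsigned long long)new_size);
        return 0;
}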



 Comments   
Comment by Alex Zhuravlev [ 10/Feb/20 ]

Reverting "LU-4198 clio: turn on lockless for some kind of IO" fixes the problem.

Comment by Wang Shilong (Inactive) [ 10/Feb/20 ]

Yup, that patch looks very likely related, but I am still trying to understand the reasons.

Comment by Wang Shilong (Inactive) [ 12/Feb/20 ]

bzzz Alex, would you mind sharing your configuration to reproduce the problem? Really strange for me; I could not reproduce this.

Comment by Alex Zhuravlev [ 12/Feb/20 ]

in kvm:
FSTYPE=ldiskfs MDSCOUNT=2 OSTSIZE=2800000 REFORMAT=yes bash sanityn.sh
or FSTYPE=ldiskfs MDSCOUNT=1 OSTSIZE=2800000 REFORMAT=yes bash sanityn.sh

Comment by Alex Zhuravlev [ 12/Feb/20 ]

I can reproduce it quite easily.

Comment by Wang Shilong (Inactive) [ 13/Feb/20 ]

Still no luck reproducing this locally.

Looking at the existing client debug log, I see the glimpse lock is aborted, which might be the reason why stat did not get the correct inode size:

00010000:2.0:1581153364.494156:0:25749:0:(glimpse.c:91:cl_glimpse_lock()) Glimpsing inode [0x200000402:0x2a:0x0]
00000008:00010000:2.0:1581153364.494169:0:25749:0:(lustre_osc.h:189:osc_enq2ldlm_flags()) flags: a
00000008:00010000:2.0:1581153364.494171:0:25749:0:(osc_lock.c:1263:osc_lock_init()) ### lock ffff8800ad8f9470, osc lock ffff88009de1cec0, flags 0x1002
00010000:00010000:2.0:1581153364.494216:0:25749:0:(ldlm_lock.c:767:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 3/1,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.494220:0:25749:0:(ldlm_request.c:1015:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x1000 ns: lustre-OST0000-osc-ffff88009f100800 lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 3/1,0 mode: --/PR res: [0xf:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00100000:2.0:1581153364.494283:0:3946:0:(client.c:1736:ptlrpc_send_new_req()) Sending RPC req@ffff8800a3aa7b40 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_01_01:b19943a4-a4de-4:3946:1657958989738368:192.168.203.157@tcp:101:fsx.0
00010000:00010000:2.0:1581153364.500520:0:3946:0:(ldlm_lock.c:767:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500532:0:3946:0:(ldlm_request.c:631:ldlm_cli_enqueue_fini()) ### client-side enqueue END (ABORTED) ns: lustre-OST0000-osc-ffff88009f100800 lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: [0xf:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500544:0:3946:0:(ldlm_request.c:558:failed_lock_cleanup()) ### setting FL_LOCAL_ONLY | LDLM_FL_FAILED | LDLM_FL_ATOMIC_CB | LDLM_FL_CBPENDING ns: lustre-OST0000-osc-ffff88009f100800 lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: [0xf:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500548:0:3946:0:(ldlm_lock.c:821:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10206400000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500552:0:3946:0:(ldlm_lock.c:912:ldlm_lock_decref_internal()) ### do not add lock into lru list ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 5/1,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10206400000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00000008:00010000:2.0:1581153364.500561:0:3946:0:(osc_lock.c:197:osc_lock_lvb_update()) ### no dlm lock: acquired size 1334546, setting rss=1334546; leaving kms=0, end=18446744073709551615

Comment by Wang Shilong (Inactive) [ 13/Feb/20 ]

It looks like EABORTED is a normal case for a glimpse lock, but it is not clear where ofd_intent_policy() returns that kind of error. sanityn.sh uses the default debug mask; it might make sense to enable more debugging.
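
For context, a rough model of the glimpse flow being discussed (a simplified, self-contained sketch; the types and names are hypothetical stand-ins, not the actual Lustre symbols, and the real server may also glimpse the conflicting client via a callback before replying):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; not the actual Lustre types or symbols. */
struct res { uint64_t lvb_size; bool conflicting; };
struct reply { uint64_t lvb_size; };

enum { GLIMPSE_GRANTED = 0, GLIMPSE_ABORTED = 1 };

static int server_handle_glimpse(struct res *res, struct reply *rep)
{
        /* If another client holds a conflicting extent lock, the server
         * does not grant the glimpse lock; it answers ABORTED and ships
         * its cached size (the LVB). That cached copy must have been
         * refreshed before the server-side lock was dropped, otherwise
         * the requester caches a stale size -- the failure seen above. */
        rep->lvb_size = res->lvb_size;
        if (res->conflicting)
                return GLIMPSE_ABORTED;         /* normal, not an error */
        return GLIMPSE_GRANTED;
}

int main(void)
{
        /* 1334546 == 0x145d12, the stale size from the log above. */
        struct res r = { .lvb_size = 1334546, .conflicting = true };
        struct reply rep = { 0 };

        int rc = server_handle_glimpse(&r, &rep);
        printf("glimpse %s, size %llu\n",
               rc == GLIMPSE_ABORTED ? "ABORTED" : "granted",
               (unsigned long long)rep.lvb_size);
        return 0;
}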

Comment by Alex Zhuravlev [ 13/Feb/20 ]

The second log should be full?

Comment by Gerrit Updater [ 13/Feb/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37558
Subject: LU-13227 test: enable trace for sanityn 16a
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bddff67f203e875c3618035e6777c85115103a2a

Comment by Wang Shilong (Inactive) [ 13/Feb/20 ]

bzzz you are right, I missed the second log.

Comment by Alex Zhuravlev [ 17/Feb/20 ]

I suggest reverting the patch, given that the problem behind it isn't that obvious.

Comment by Gerrit Updater [ 17/Feb/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37611
Subject: LU-13227 llite: still enqueue lock for DIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: cba494ba69b8aefa060e0f9a09d5c6b4ab7cc7c0

Comment by Gerrit Updater [ 01/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37611/
Subject: LU-13227 LDLM: update LVB if it is server lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3b5c670e65b5f3924e014ab1d030e6f45e2999d3

Comment by Peter Jones [ 01/Mar/20 ]

Landed for 2.14

Comment by Alex Zhuravlev [ 02/Mar/20 ]

Still failing on my setup.

Comment by Wang Shilong (Inactive) [ 03/Mar/20 ]

bzzz Would you double-check that these are not ENOSPC errors? Potentially that could be caused by:

https://review.whamcloud.com/#/c/35896/

Comment by Alex Zhuravlev [ 03/Mar/20 ]

I think so. You can find the logs here: https://disk.yandex.com/d/z2s2BUC-M4EIgQ
This is the latest master:

  • bfaa8a0686 2020-03-02 | LU-13314 test: add 56ob to ALWAYS_EXCEPT for now (origin/master, origin/HEAD) [Wang Shilong]
Comment by Wang Shilong (Inactive) [ 03/Mar/20 ]

Hmm, I could not access the logs at that URL; would you mind uploading them here?

Comment by Alex Zhuravlev [ 03/Mar/20 ]

Sure, please find them attached.

Comment by Gerrit Updater [ 04/Mar/20 ]

Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37790
Subject: LU-13227 ldlm: update lvb before dropping server lock
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 75eafe50f942e1c412b673579e57a311cc46cac0
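
Going by the patch subject, the idea is to refresh the LVB from the current attributes before the server-side lock is dropped, so a later glimpse answered with ABORTED still ships a fresh size. A minimal sketch of that ordering, with hypothetical names (not the actual Lustre symbols):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; not the actual Lustre types or symbols. */
struct res { uint64_t lvb_size; };
struct srv_lock { struct res *res; bool server_side; };

static uint64_t disk_size = 0x659000;   /* pretend on-disk inode size */

/* Refresh the cached size (LVB) from the authoritative attributes. */
static void lvb_refresh(struct res *res)
{
        res->lvb_size = disk_size;
}

/* The patch's ordering: for a server-side lock, refresh the LVB before
 * the lock is dropped, so a later glimpse answered with ABORTED still
 * ships a fresh size instead of whatever was cached earlier. */
static void server_lock_drop(struct srv_lock *lk)
{
        if (lk->server_side)
                lvb_refresh(lk->res);
        /* ... actual lock teardown would happen here ... */
}

int main(void)
{
        struct res r = { .lvb_size = 0x145d12 };        /* stale */
        struct srv_lock lk = { .res = &r, .server_side = true };

        server_lock_drop(&lk);
        printf("LVB size after drop: 0x%llx\n",
               (unsigned long long)r.lvb_size);         /* 0x659000 */
        return 0;
}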

Comment by Wang Shilong (Inactive) [ 04/Mar/20 ]

bzzz Would you mind applying the above new patch to see if it helps fix your problem?
At least with a RAM OST and the test repeated 200 times, I did not reproduce it.

Thanks in advance!

Comment by Alex Zhuravlev [ 04/Mar/20 ]

Sure, it's running now.

Comment by Alex Zhuravlev [ 04/Mar/20 ]

failed with an assertion:

Lustre: DEBUG MARKER: == replay-single test 20b: write, unlink, eviction, replay (test mds_cleanup_orphans) ================ 17:33:05 (1583325185)
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: 28814:0:(genops.c:1709:obd_export_evict_by_uuid()) lustre-MDT0000: evicting 8ef1b02c-3fd5-4 at adminstrative request
LustreError: 28814:0:(mdt_lvb.c:176:mdt_dom_lvbo_update()) ASSERTION( env ) failed: 
LustreError: 28814:0:(mdt_lvb.c:176:mdt_dom_lvbo_update()) LBUG
Pid: 28814, comm: lctl 4.18.0 #32 SMP Wed Jan 15 22:22:45 MSK 2020
Call Trace:
 libcfs_call_trace+0x71/0x90 [libcfs]
 lbug_with_loc+0x3e/0x80 [libcfs]
 ? mdt_dom_lvbo_update+0x8ab/0x9c0 [mdt]
 ? ldlm_lock_decref_internal+0x33f/0xb70 [ptlrpc]
 ? ldlm_lock_decref_and_cancel+0x6f/0x130 [ptlrpc]
 ? mdt_dom_discard_data+0x490/0x550 [mdt]
 ? barrier_exit+0x27/0x60 [ptlrpc]
 ? mdd_trans_stop+0x28/0x15d [mdd]
 ? mdt_mfd_close+0x6ab/0x3080 [mdt]
 ? lu_context_fini+0x72/0x180 [obdclass]
 ? mdt_ctxt_add_dirty_flag.isra.0+0x111/0x160 [mdt]
 ? mdt_obd_disconnect+0x3a3/0x510 [mdt]
 ? class_fail_export+0x1ce/0x4e0 [obdclass]
 ? obd_export_evict_by_uuid+0xdb/0x1e0 [obdclass]
 ? lprocfs_evict_client_seq_write+0x1e8/0x2a0 [obdclass]
 ? mdt_mds_evict_client_write+0x417/0x6e0 [mdt]
 ? proc_reg_write+0x35/0x60
 ? __vfs_write+0x1f/0x160
 ? rcu_sync_lockdep_assert+0x9/0x50
 ? __sb_start_write+0x13f/0x1a0
 ? vfs_write+0x183/0x1b0
 ? vfs_write+0xba/0x1b0
 ? ksys_write+0x3d/0xa0
 ? do_syscall_64+0x4b/0x1a0
 ? entry_SYSCALL_64_after_hwframe+0x6a/0xdf
Comment by Wang Shilong (Inactive) [ 04/Mar/20 ]

Sorry, we need to make sure the LVB is only updated with an IO context. The patch is updated again; https://review.whamcloud.com/#/c/37790/5
shall fix the problem.
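
In other words, the updated patch set only refreshes the LVB when the caller has an IO environment, so paths like the eviction trace above (where env is NULL) no longer trip ASSERTION(env). A minimal sketch of the guard, with hypothetical names (not the actual Lustre symbols):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; not the actual Lustre types or symbols. */
struct lu_env_like { int dummy; };
struct res { uint64_t lvb_size; };

/* Stand-in for an LVB refresh that, like mdt_dom_lvbo_update(), requires
 * a non-NULL environment to do disk IO. */
static void lvb_refresh(struct lu_env_like *env, struct res *res)
{
        (void)env;
        res->lvb_size = 0x659000;       /* pretend: re-read from disk */
}

static void lock_drop(struct lu_env_like *env, struct res *res)
{
        /* The guard: only update the LVB when an IO context exists;
         * eviction paths pass no env and must skip the update. */
        if (env != NULL)
                lvb_refresh(env, res);
        /* ... actual lock teardown ... */
}

int main(void)
{
        struct res r = { .lvb_size = 0x145d12 };

        lock_drop(NULL, &r);    /* eviction-like call: LVB left alone */
        printf("size 0x%llx\n", (unsigned long long)r.lvb_size);
        return 0;
}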

Comment by Wang Shilong (Inactive) [ 05/Mar/20 ]

I've added a simple reproducer test to the patch and verified that it passes with the patch and fails without it.

Comment by Gerrit Updater [ 11/Mar/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37790/
Subject: LU-13227 ofd: update lvb before dropping server lock
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 92d799217aeac4e1e859264d157c62e0f02b83a8

Comment by Alex Zhuravlev [ 08/Dec/22 ]

I'm still hitting this on master:

065500[0] 1670492005.778405 trunc from 0x870187 to 0x19dc98 (0x6d24ee bytes)
Size error: expected 0x94fbcf stat 0x92b000 seek 0x92b000
