[LU-13227] sanityn 16a FAIL: fsx with O_DIRECT failed. Created: 10/Feb/20 Updated: 08/Dec/22 Resolved: 11/Mar/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Wang Shilong (Inactive) | Assignee: | Wang Shilong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Size error: expected 0x659000 stat 0x145d12 seek 0x659000

Full logs could be found here:

The problem might be related to the following cases:

1) Client A tries to ftruncate the file to file_size. |
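To make the suspected two-client interaction concrete, here is a minimal sketch in plain POSIX C. It only illustrates the scenario described above and is not the reproducer used by the test; the paths /mnt/lustre and /mnt/lustre2 are assumed to be two client mounts of the same filesystem (like $DIR1 and $DIR2 in sanityn.sh), and the file name is hypothetical.

```c
/* Hypothetical illustration of the suspected race: one client extends the
 * file with ftruncate() while another client stats it through a second
 * mount of the same filesystem.  Paths and sizes are illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;

	/* "Client A" (first mount): extend the file, as fsx does. */
	int fd = open("/mnt/lustre/d16a/fsxfile",
		      O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || ftruncate(fd, 0x659000) < 0) {
		perror("client A");
		return 1;
	}

	/* "Client B" (second mount): stat the same file.  The size should
	 * come back as 0x659000; in this failure the glimpse returned a
	 * stale value (0x145d12 above). */
	if (stat("/mnt/lustre2/d16a/fsxfile", &st) < 0) {
		perror("client B");
		return 1;
	}
	printf("expected 0x659000, got 0x%llx\n",
	       (unsigned long long)st.st_size);

	close(fd);
	return 0;
}
```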
| Comments |
| Comment by Alex Zhuravlev [ 10/Feb/20 ] |
|
reverting |
| Comment by Wang Shilong (Inactive) [ 10/Feb/20 ] |
|
Yup, that patch looks very likely related, but I am still trying to understand the reasons. |
| Comment by Wang Shilong (Inactive) [ 12/Feb/20 ] |
|
bzzz Alex, would you mind sharing your configuration to reproduce the problem? It is really strange for me; I could not reproduce this. |
| Comment by Alex Zhuravlev [ 12/Feb/20 ] |
|
in kvm: |
| Comment by Alex Zhuravlev [ 12/Feb/20 ] |
|
I can reproduce quite easily.. |
| Comment by Wang Shilong (Inactive) [ 13/Feb/20 ] |
|
Still no luck reproducing locally. Looking at the existing client debug log, I see the glimpse lock is aborted, which might be the reason why stat did not get the correct inode size:

00010000:2.0:1581153364.494156:0:25749:0:(glimpse.c:91:cl_glimpse_lock()) Glimpsing inode [0x200000402:0x2a:0x0]
00000008:00010000:2.0:1581153364.494169:0:25749:0:(lustre_osc.h:189:osc_enq2ldlm_flags()) flags: a
00000008:00010000:2.0:1581153364.494171:0:25749:0:(osc_lock.c:1263:osc_lock_init()) ### lock ffff8800ad8f9470, osc lock ffff88009de1cec0, flags 0x1002
00010000:00010000:2.0:1581153364.494216:0:25749:0:(ldlm_lock.c:767:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 3/1,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.494220:0:25749:0:(ldlm_request.c:1015:ldlm_cli_enqueue()) ### client-side enqueue START, flags 0x1000 ns: lustre-OST0000-osc-ffff88009f100800 lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 3/1,0 mode: --/PR res: [0xf:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00100000:2.0:1581153364.494283:0:3946:0:(client.c:1736:ptlrpc_send_new_req()) Sending RPC req@ffff8800a3aa7b40 pname:cluuid:pid:xid:nid:opc:job ptlrpcd_01_01:b19943a4-a4de-4:3946:1657958989738368:192.168.203.157@tcp:101:fsx.0
00010000:00010000:2.0:1581153364.500520:0:3946:0:(ldlm_lock.c:767:ldlm_lock_addref_internal_nolock()) ### ldlm_lock_addref(PR) ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10000000000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500532:0:3946:0:(ldlm_request.c:631:ldlm_cli_enqueue_fini()) ### client-side enqueue END (ABORTED) ns: lustre-OST0000-osc-ffff88009f100800 lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: [0xf:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500544:0:3946:0:(ldlm_request.c:558:failed_lock_cleanup()) ### setting FL_LOCAL_ONLY | LDLM_FL_FAILED | LDLM_FL_ATOMIC_CB | LDLM_FL_CBPENDING ns: lustre-OST0000-osc-ffff88009f100800 lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: [0xf:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x206400000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500548:0:3946:0:(ldlm_lock.c:821:ldlm_lock_decref_internal_nolock()) ### ldlm_lock_decref(PR) ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 6/2,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10206400000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00010000:00010000:2.0:1581153364.500552:0:3946:0:(ldlm_lock.c:912:ldlm_lock_decref_internal()) ### do not add lock into lru list ns: ?? lock: ffff8800a26fed80/0x596038eac042fda6 lrc: 5/1,0 mode: --/PR res: ?? rrc=?? type: ??? flags: 0x10206400000000 nid: local remote: 0x0 expref: -99 pid: 25749 timeout: 0 lvb_type: 1
00000008:00010000:2.0:1581153364.500561:0:3946:0:(osc_lock.c:197:osc_lock_lvb_update()) ### no dlm lock: acquired size 1334546, setting rss=1334546; leaving kms=0, end=18446744073709551615 |
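For readers unfamiliar with the glimpse path in the log above: the enqueue ending in ABORTED is how a glimpse normally completes, and the client then takes the object size from the LVB carried in the reply rather than from a granted lock. The sketch below is a heavily simplified, illustrative-only rendering of that client-side step (the types, values, and function are hypothetical, not the real osc_lock_lvb_update() code); it just shows why a stale LVB translates directly into a wrong stat size.

```c
/* Illustrative-only sketch of client-side glimpse handling; the types,
 * values, and function below are hypothetical, not Lustre's actual code. */
struct lvb          { unsigned long long size; };     /* LVB from the reply */
struct object_attrs { unsigned long long rss, kms; }; /* client-cached attrs */

enum glimpse_status {
	GL_GRANTED,	/* stands in for ELDLM_OK           */
	GL_ABORTED,	/* stands in for ELDLM_LOCK_ABORTED */
};

static void update_size_from_glimpse(struct object_attrs *attrs,
				     enum glimpse_status status,
				     const struct lvb *lvb)
{
	if (status == GL_ABORTED) {
		/* No DLM lock was granted, so the only size information is
		 * the LVB in the reply.  This corresponds to the log line
		 * "no dlm lock: acquired size 1334546, setting rss=1334546;
		 * leaving kms=0": if the server published a stale size,
		 * stat() on this client reports it. */
		attrs->rss = lvb->size;
	} else {
		/* With a granted lock the known minimum size (kms) can be
		 * updated as well (details omitted in this sketch). */
		attrs->rss = lvb->size;
		attrs->kms = lvb->size;
	}
}
```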
| Comment by Wang Shilong (Inactive) [ 13/Feb/20 ] |
|
It looks like EABORTED is a normal case for a glimpse lock, but it is not clear where ofd_intent_policy() returns that kind of error; sanityn.sh is using |
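For context on the server side: a glimpse enqueue is normally not granted at all; the intent policy fills the reply LVB with the best size it can obtain (typically by sending a glimpse callback to the client holding the highest-offset write lock) and returns the "lock aborted" status, which the client sees as the EABORTED case above. The outline below is illustrative only, with hypothetical names; it is not the actual ofd_intent_policy() code.

```c
/* Illustrative-only outline of a server-side glimpse intent policy; the
 * names and helper below are hypothetical, not Lustre's real functions. */
struct lvb   { unsigned long long size; };
struct reply { int granted; struct lvb lvb; };

/* Assumed helper: obtain the freshest size, e.g. by glimpsing the client
 * that holds the highest-offset write lock, else the on-disk size. */
extern unsigned long long best_known_size(void);

static void handle_glimpse_intent(struct reply *rep)
{
	/* Publish the size in the reply LVB... */
	rep->lvb.size = best_known_size();

	/* ...and do not grant the lock: the enqueue comes back "aborted",
	 * which is the normal outcome for a glimpse, not an error. */
	rep->granted = 0;
}
```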
| Comment by Alex Zhuravlev [ 13/Feb/20 ] |
|
the second log should be full? |
| Comment by Gerrit Updater [ 13/Feb/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37558 |
| Comment by Wang Shilong (Inactive) [ 13/Feb/20 ] |
|
bzzz you are right, I missed the second log. |
| Comment by Alex Zhuravlev [ 17/Feb/20 ] |
|
I'm suggesting we revert the patch, given that the underlying problem isn't that obvious. |
| Comment by Gerrit Updater [ 17/Feb/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37611 |
| Comment by Gerrit Updater [ 01/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37611/ |
| Comment by Peter Jones [ 01/Mar/20 ] |
|
Landed for 2.14 |
| Comment by Alex Zhuravlev [ 02/Mar/20 ] |
|
still failing on my setup.. |
| Comment by Wang Shilong (Inactive) [ 03/Mar/20 ] |
|
bzzz Would you double-check that these are not ENOSPC errors? Potentially that could be caused by: |
| Comment by Alex Zhuravlev [ 03/Mar/20 ] |
|
I think so.. you can find logs here: https://disk.yandex.com/d/z2s2BUC-M4EIgQ
|
| Comment by Wang Shilong (Inactive) [ 03/Mar/20 ] |
|
Hmm.. I could not access the logs at that URL; would you mind uploading them here? |
| Comment by Alex Zhuravlev [ 03/Mar/20 ] |
|
sure, please find them attached. |
| Comment by Gerrit Updater [ 04/Mar/20 ] |
|
Wang Shilong (wshilong@ddn.com) uploaded a new patch: https://review.whamcloud.com/37790 |
| Comment by Wang Shilong (Inactive) [ 04/Mar/20 ] |
|
bzzz Would you mind applying the above new patch to see if it helps fix your problem? Thanks in advance! |
| Comment by Alex Zhuravlev [ 04/Mar/20 ] |
|
sure, it's running now.. |
| Comment by Alex Zhuravlev [ 04/Mar/20 ] |
|
failed with an assertion:

Lustre: DEBUG MARKER: == replay-single test 20b: write, unlink, eviction, replay (test mds_cleanup_orphans) ================ 17:33:05 (1583325185)
Lustre: Setting parameter lustre-MDT0000-mdtlov.lov.stripesize in log lustre-MDT0000
Lustre: 28814:0:(genops.c:1709:obd_export_evict_by_uuid()) lustre-MDT0000: evicting 8ef1b02c-3fd5-4 at administrative request
LustreError: 28814:0:(mdt_lvb.c:176:mdt_dom_lvbo_update()) ASSERTION( env ) failed:
LustreError: 28814:0:(mdt_lvb.c:176:mdt_dom_lvbo_update()) LBUG
Pid: 28814, comm: lctl 4.18.0 #32 SMP Wed Jan 15 22:22:45 MSK 2020
Call Trace:
libcfs_call_trace+0x71/0x90 [libcfs]
lbug_with_loc+0x3e/0x80 [libcfs]
? mdt_dom_lvbo_update+0x8ab/0x9c0 [mdt]
? ldlm_lock_decref_internal+0x33f/0xb70 [ptlrpc]
? ldlm_lock_decref_and_cancel+0x6f/0x130 [ptlrpc]
? mdt_dom_discard_data+0x490/0x550 [mdt]
? barrier_exit+0x27/0x60 [ptlrpc]
? mdd_trans_stop+0x28/0x15d [mdd]
? mdt_mfd_close+0x6ab/0x3080 [mdt]
? lu_context_fini+0x72/0x180 [obdclass]
? mdt_ctxt_add_dirty_flag.isra.0+0x111/0x160 [mdt]
? mdt_obd_disconnect+0x3a3/0x510 [mdt]
? class_fail_export+0x1ce/0x4e0 [obdclass]
? obd_export_evict_by_uuid+0xdb/0x1e0 [obdclass]
? lprocfs_evict_client_seq_write+0x1e8/0x2a0 [obdclass]
? mdt_mds_evict_client_write+0x417/0x6e0 [mdt]
? proc_reg_write+0x35/0x60
? __vfs_write+0x1f/0x160
? rcu_sync_lockdep_assert+0x9/0x50
? __sb_start_write+0x13f/0x1a0
? vfs_write+0x183/0x1b0
? vfs_write+0xba/0x1b0
? ksys_write+0x3d/0xa0
? do_syscall_64+0x4b/0x1a0
? entry_SYSCALL_64_after_hwframe+0x6a/0xdf |
| Comment by Wang Shilong (Inactive) [ 04/Mar/20 ] |
|
Sorry, I need to make sure we only update the LVB with an IO context; updated the patch again: https://review.whamcloud.com/#/c/37790/5 |
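A minimal sketch of the idea stated above, assuming the intent is to skip the DOM LVB refresh whenever no IO environment is available (for example on the eviction path shown in the earlier LBUG trace). Names are illustrative; this is not the actual mdt_dom_lvbo_update() code from the patch.

```c
/* Illustrative-only sketch: refresh the LVB only when an IO context (env)
 * is available, instead of asserting that one always exists. */
struct lu_env;                               /* opaque in this sketch */
struct lvb { unsigned long long size; };

static void dom_lvb_update(struct lvb *lvb, const struct lu_env *env,
			   unsigned long long new_size)
{
	if (env == NULL)
		/* Non-IO caller, e.g. lock cancel during client eviction
		 * (mdt_dom_discard_data() in the trace above): skip the
		 * update rather than hitting ASSERTION( env ). */
		return;

	if (new_size > lvb->size)
		lvb->size = new_size;
}
```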
| Comment by Wang Shilong (Inactive) [ 05/Mar/20 ] |
|
I've added a simple reproducer test to the patch and verified it both with and without the patch. |
| Comment by Gerrit Updater [ 11/Mar/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/37790/ |
| Comment by Alex Zhuravlev [ 08/Dec/22 ] |
|
I'm still hitting this on master:
|