[LU-16264] assertion in osp_precreate_send req->rq_transno == 0 Created: 25/Oct/22  Updated: 25/Oct/23

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.2, Lustre 2.15.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oleg Drokin Assignee: Sergey Cheremencev
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-15671 conf-sanity test_30b: MDS assertion i... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

There's a regular crash in conf-sanity test 30a (janitor) or 30b (maloo) that unfolds like this https://testing.whamcloud.com/test_sets/683c15ea-ec59-46d0-89c1-7f10d715f355

[ 5673.591855] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-60vm1.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
[ 5673.592761] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-60vm2.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
[ 5673.882066] Lustre: DEBUG MARKER: onyx-60vm1.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
[ 5673.948710] Lustre: DEBUG MARKER: onyx-60vm2.onyx.whamcloud.com: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
[ 5679.785437] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre-OST0000.failover.node='10.240.25.54@tcp'
[ 5684.257444] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param -d lustre-OST0000.failover.node
[ 5686.387703] LustreError: 167-0: lustre-OST0000-osc-MDT0002: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[ 5686.390343] LustreError: Skipped 1 previous similar message
[ 5686.391768] Lustre: lustre-OST0000-osc-MDT0002: Connection restored to 10.240.25.54@tcp (at 10.240.25.54@tcp)
[ 5686.393594] Lustre: Skipped 1 previous similar message
[ 5686.576332] LustreError: 15573:0:(osp_precreate.c:684:osp_precreate_send()) ASSERTION( req->rq_transno == 0 ) failed: 
[ 5686.578284] LustreError: 15573:0:(osp_precreate.c:684:osp_precreate_send()) LBUG
[ 5686.579611] Pid: 15573, comm: osp-pre-0-2 3.10.0-1160.76.1.el7_lustre.ddn17.x86_64 #1 SMP Fri Oct 14 20:32:07 UTC 2022
[ 5686.581497] Call Trace:
[ 5686.582054] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
[ 5686.582998] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5686.583926] [<0>] osp_precreate_send+0x1106/0x11b0 [osp]
[ 5686.584908] [<0>] osp_precreate_thread+0x6ba/0x13b0 [osp]
[ 5686.585956] [<0>] kthread+0xd1/0xe0
[ 5686.586645] [<0>] ret_from_fork_nospec_begin+0x21/0x21
[ 5686.587621] [<0>] 0xfffffffffffffffe
[ 5686.588300] Kernel panic - not syncing: LBUG
[ 5686.589082] CPU: 0 PID: 15573 Comm: osp-pre-0-2 Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.76.1.el7_lustre.ddn17.x86_64 #1
[ 5686.591324] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 5686.592332] Call Trace:
[ 5686.592803]  [<ffffffff875865c9>] dump_stack+0x19/0x1b
[ 5686.593723]  [<ffffffff875802d1>] panic+0xe8/0x21f
[ 5686.594578]  [<ffffffffc075ba1b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[ 5686.595672]  [<ffffffffc12e1f56>] osp_precreate_send+0x1106/0x11b0 [osp]
[ 5686.596843]  [<ffffffffc12e464a>] osp_precreate_thread+0x6ba/0x13b0 [osp]
[ 5686.598027]  [<ffffffff86ec7080>] ? wake_up_atomic_t+0x30/0x30
[ 5686.599052]  [<ffffffffc12e3f90>] ? osp_init_pre_fid+0x630/0x630 [osp]
[ 5686.600244]  [<ffffffff86ec5f91>] kthread+0xd1/0xe0
[ 5686.601114]  [<ffffffff86ec5ec0>] ? insert_kthread_work+0x40/0x40
[ 5686.602190]  [<ffffffff87599df7>] ret_from_fork_nospec_begin+0x21/0x21
[ 5686.603335]  [<ffffffff86ec5ec0>] ? insert_kthread_work+0x40/0x40 

or http://testing.linuxhacker.ru/lustre-reports/25993/testresults/conf-sanity1-ldiskfs-DNE-centos7_x86_64-centos7_x86_64/

[ 3227.316101] Lustre: Skipped 1 previous similar message
[ 3227.321723] Lustre: lustre-OST0000: Imperative Recovery not enabled, recovery window 60-180
[ 3228.406622] Lustre: DEBUG MARKER: oleg208-server.virtnet: executing set_default_debug -1 all 8
[ 3230.985273] Lustre: DEBUG MARKER: oleg208-client.virtnet: executing wait_import_state_mount FULL osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
[ 3231.624671] Lustre: lustre-OST0000: Connection restored to 50155b09-b14b-6aa1-56c4-bebc671f968e (at 192.168.201.8@tcp)
[ 3231.627520] Lustre: Skipped 3 previous similar messages
[ 3242.148194] Lustre: MGS: Connection restored to 50155b09-b14b-6aa1-56c4-bebc671f968e (at 192.168.201.8@tcp)
[ 3242.149935] Lustre: Skipped 2 previous similar messages
[ 3243.345638] LustreError: 11-0: lustre-OST0000-osc-MDT0000: operation ost_statfs to node 0@lo failed: rc = -107
[ 3243.351508] Lustre: lustre-OST0000-osc-MDT0000: Connection to lustre-OST0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[ 3243.361084] Lustre: lustre-OST0000: Not available for connect from 0@lo (stopping)
[ 3243.370317] LustreError: 167-0: lustre-OST0000-osc-MDT0000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[ 3243.404559] LustreError: 10962:0:(osp_precreate.c:683:osp_precreate_send()) ASSERTION( req->rq_transno == 0 ) failed: 
[ 3243.411037] LustreError: 10962:0:(osp_precreate.c:683:osp_precreate_send()) LBUG
[ 3243.414422] Pid: 10962, comm: osp-pre-0-1 3.10.0-7.9-debug #1 SMP Sat Mar 26 23:28:42 EDT 2022
[ 3243.418602] Call Trace:
[ 3243.419815]  [<ffffffffa018d7dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 3243.422930]  [<ffffffffa018d88c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 3243.425789]  [<ffffffffa0e7a8b6>] osp_precreate_send+0x11a6/0x11c0 [osp]
[ 3243.428811]  [<ffffffffa0e7c218>] osp_precreate_thread+0x818/0x1360 [osp]
[ 3243.431581]  [<ffffffff810ba114>] kthread+0xe4/0xf0
[ 3243.433857]  [<ffffffff817f3e5d>] ret_from_fork_nospec_begin+0x7/0x21
[ 3243.437050]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 3243.439596] Kernel panic - not syncing: LBUG 

Happens on a variety of branches including b2_12, b2_15 and even master at times, and all the derivatives.



 Comments   
Comment by Minh Diep [ 06/Jan/23 ]

+1 on 2.15.2 https://testing.whamcloud.com/test_sets/0b143d9e-b9a4-423e-be85-5fb590667675

Comment by Serguei Smirnov [ 25/May/23 ]

+1 on b2_15: https://testing.whamcloud.com/test_sets/83ce8d03-abd2-4bc7-b957-f651b84bc179

Comment by Etienne Aujames [ 20/Jun/23 ]

+1 on 2.15.3-RC1: https://testing.whamcloud.com/test_sets/179ae50b-8ea7-4be8-807f-16b73799b897

Comment by Andreas Dilger [ 20/Jun/23 ]

If this is being hit on b2_15 without master/2.16 servers, then it seems to be a different problem than LU-15671, which AFAIK was caused by the "resend precreate RPCs" patch and that was never landed to b2_15, so I don't think the patch https://review.whamcloud.com/51056 "LU-15671 mds: do not send OST_CREATE transno interop" will fix it.

Comment by Sergey Cheremencev [ 25/Oct/23 ]

Hi green , do you still face this issue or we can close this ticket?

Generated at Sat Feb 10 03:25:28 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.