[LU-15794] Downgrade client fails: LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) ASSERTION( index == 0 ) failed: Created: 27/Apr/22  Updated: 15/Jun/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cliff White (Inactive) Assignee: Zhenyu Xu
Resolution: Unresolved Votes: 0
Labels: None
Environment:

trevis-86 - servers with version=2.15.0_RC3, client b2_12 build 150 version=2.12.8


Attachments: File lbug.downgrade.2.12.txt.gz    
Issue Links:
Related
is related to LU-15219 DoM: lfs migrate doesn't work as expe... Resolved
is related to LU-11421 DoM: manual migration OST-MDT, MDT-MDT Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Immediately on starting sanity, after downgrading the client:

 

[  368.034864] Lustre: DEBUG MARKER: trevis-86vm1.trevis.whamcloud.com: executing check_config_client /mnt/lustre
[  370.048716] Lustre: DEBUG MARKER: Using TIMEOUT=100
[  371.064948] LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) ASSERTION( index == 0 ) failed:
[  371.066828] LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) LBUG
[  371.068127] Pid: 11993, comm: rm 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021
[  371.069809] Call Trace:
[  371.070360]  [<ffffffffc08b57cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[  371.071728]  [<ffffffffc08b587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[  371.073141]  [<ffffffffc0d3df99>] lov_init_dom+0x8f9/0x990 [lov]
[  371.074443]  [<ffffffffc0d3e85c>] lov_init_composite+0x82c/0xbe0 [lov]
[  371.075827]  [<ffffffffc0d3ad00>] lov_object_init+0x130/0x300 [lov]
[  371.077174]  [<ffffffffc0a3454b>] lu_object_start.isra.31+0x8b/0x120 [obdclass]
[  371.078719]  [<ffffffffc0a37a54>] lu_object_find_at+0x234/0xab0 [obdclass]
[  371.080157]  [<ffffffffc0a3830f>] lu_object_find_slice+0x1f/0x90 [obdclass]
[  371.081625]  [<ffffffffc0a3cba2>] cl_object_find+0x32/0x60 [obdclass]
[  371.082966]  [<ffffffffc0e9e599>] cl_file_inode_init+0x219/0x380 [lustre]
[  371.084399]  [<ffffffffc0e78595>] ll_update_inode+0x2d5/0x5e0 [lustre]
[  371.085759]  [<ffffffffc0e78907>] ll_read_inode2+0x67/0x420 [lustre]
[  371.087094]  [<ffffffffc0e8661b>] ll_iget+0xdb/0x350 [lustre]
[  371.088315]  [<ffffffffc0e7a6f3>] ll_prep_inode+0x253/0x970 [lustre]
[  371.089648]  [<ffffffffc0e87653>] ll_lookup_it+0x523/0x1a20 [lustre]
[  371.090989]  [<ffffffffc0e89f9b>] ll_lookup_nd+0xbb/0x190 [lustre]
[  371.092301]  [<ffffffffa0c591d3>] lookup_real+0x23/0x60
[  371.093424]  [<ffffffffa0c59bf2>] __lookup_hash+0x42/0x60
[  371.094571]  [<ffffffffa0c60adc>] do_unlinkat+0x14c/0x2d0
[  371.095717]  [<ffffffffa0c61bbb>] SyS_unlinkat+0x1b/0x40
[  371.096862]  [<ffffffffa1195f92>] system_call_fastpath+0x25/0x2a
[  371.098126]  [<ffffffffffffffff>] 0xffffffffffffffff
[  371.099214] Kernel panic - not syncing: LBUG
[  371.100077] CPU: 0 PID: 11993 Comm: rm Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.45.1.el7.x86_64 #1
[  371.102139] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011


 Comments   
Comment by Peter Jones [ 27/Apr/22 ]

Bobijam

What are your thoughts on this error? Could it be related to LU-13249?

Peter

Comment by Andreas Dilger [ 27/Apr/22 ]

Cliff, presumably the servers were originally running 2.12.8 before upgrading to 2.15.0-RC3? Were the DOM files that are being removed originally created on 2.15.0 or 2.12.8?

I think this looks like LU-15513 (something strange with the DoM layout on the client), and not related to llog at all (LU-13249 is on the server). I see that this LASSERT is not present on master, having been removed by patch https://review.whamcloud.com/35359 "LU-11421 dom: manual OST-to-DOM migration via mirroring":

-       LASSERT(index == 0);
+       /* DOM entry may be not zero index due to FLR but must start from 0 */
+       if (unlikely(lle->lle_extent->e_start != 0)) {
+               CERROR("%s: DOM entry must be the first stripe in a mirror\n",
+                      lov2obd(dev->ld_lov)->obd_name);
+               dump_lsm(D_ERROR, lov->lo_lsm);
+               RETURN(-EINVAL);
+       }

Cliff, was the problematic file mirrored or migrated while running 2.15.0-RC3? What does "lfs getstripe" (from a 2.15 client) show?
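
For reference, a minimal sketch (from a 2.15 client) of how the layout could be inspected; the file path is hypothetical and the commands are generic lfs usage, not taken from this run:

# Hypothetical path; substitute the file that "rm" was unlinking when the LBUG hit.
FILE=/mnt/lustre/d0.sanity/somefile

# Show the full composite layout, including per-component extents and mirror ids.
# A DoM component that is not the first entry of the layout (e.g. the first
# component of a second mirror) is exactly the case the b2_12
# LASSERT(index == 0) trips over.
lfs getstripe -v "$FILE"

# Quick checks: component count and whether the file is mirrored (FLR).
lfs getstripe --component-count "$FILE"
lfs getstripe --mirror-count "$FILE"

# One way such a layout can arise on 2.15 (per LU-11421): adding a DoM mirror
# to an existing OST-striped file, which puts the DoM entry at a non-zero index.
lfs mirror extend -N -E 1M -L mdt -E eof "$FILE"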

I think this looks more like a bug in 2.12 rather than 2.15 (the patch landed as commit v2_12_58-12-g44a721b8c1). I think the patch has other dependencies (at least https://review.whamcloud.com/45549). At a minimum it would make sense to backport the LASSERT cleanups to b2_12 for 2.12.9?
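
A rough sketch of what that backport could look like in a Lustre git tree (44a721b8c1 is the commit id quoted above; conflict resolution and the https://review.whamcloud.com/45549 dependency would still need to be handled by hand):

# Start from the b2_12 branch of the Lustre tree.
git checkout b2_12

# 44a721b8c1 is the master commit referenced above (v2_12_58-12-g44a721b8c1),
# i.e. the LU-11421 change that replaced LASSERT(index == 0) with an -EINVAL check.
git cherry-pick -x 44a721b8c1

# If it does not apply cleanly, pick any prerequisite patches first and keep
# only the LASSERT cleanup hunks.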

Comment by Cliff White (Inactive) [ 28/Apr/22 ]

Sequence:

Installed system with 2.12.8 (servers and client)

Upgraded OSS, then MDS, then client to 2.15.0-RC3

Upon downgrade of the client, the LBUG is triggered

The LBUG continues after server downgrade.

No errors at all on the server side; this appears to be triggered by something in the initial sanity.sh setup, as we never get past startup.
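
For completeness, a minimal sketch of how the running version on each node could be confirmed around the downgrade (node names are taken from this ticket; pdsh and lctl are already part of the test setup):

# Confirm which Lustre version each node is actually running after the
# upgrade/downgrade steps (trevis-86vm1 is the client, vm2/vm3 the servers).
pdsh -w trevis-86vm[1-3] 'lctl get_param -n version' | dshbak -c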

MDS log:

[ 2325.482110] Lustre: DEBUG MARKER: trevis-86vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
[ 2553.048674] Lustre: MGS: haven't heard from client 84d85de2-7af6-f334-55a2-6f84ca2afbe9 (at 10.240.43.40@tcp) in 230 seconds. I think it's dead, and I am evicting it. exp ffff89d48bab8c00, cur 1651103424 expire 1651103274 last 1651103194 

 Client suite log:

-----============= acceptance-small: sanity ============----- Wed Apr 27 23:46:31 UTC 2022
excepting tests: 103a 103b 103c 104a 160c 161a 161b 161c 208 220 225a 225b 228b 255a 255b 407 253 312 42a 42b 42c 77k
skipping tests SLOW=no: 27m 64b 68 71 115 300o
trevis-86vm1.trevis.whamcloud.com: executing check_config_client /mnt/lustre
trevis-86vm1.trevis.whamcloud.com: Checking config lustre mounted on /mnt/lustre
Checking servers environments
Checking clients trevis-86vm1.trevis.whamcloud.com environments
Using TIMEOUT=100
osc.lustre-OST0000-osc-ffff974f7a0eb800.idle_timeout=debug
disable quota as required
trevis-86vm2: trevis-86vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
trevis-86vm3: trevis-86vm3.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
osd-ldiskfs.track_declares_assert=1
osd-ldiskfs.track_declares_assert=1
Comment by Cliff White (Inactive) [ 28/Apr/22 ]

Tail of the client suite log (set -x):

++ for var in LNETLND NETTYPE
++ '[' -n tcp ']'
++ echo -n ' NETTYPE=tcp'
+ pdsh -t 300 -S -w trevis-86vm2,trevis-86vm3 '(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  mds1_FSTYPE=ldiskfs ost1_FSTYPE=ldiskfs VERBOSE=false FSTYPE=ldiskfs NETTYPE=tcp sh -c "/usr/sbin/lctl set_param
     osd-ldiskfs.track_declares_assert=1 || true")'
osd-ldiskfs.track_declares_assert=1 

Systems are still up (trevis-86vm[1-3])  

I will start the 2.14 up/down testing tomorrow on these nodes unless otherwise advised. 
