[LU-15794] Downgrade client fails: LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) ASSERTION( index == 0 ) failed:
Created: 27/Apr/22 Updated: 15/Jun/22 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cliff White (Inactive) | Assignee: | Zhenyu Xu |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
trevis-86 - servers with version=2.15.0_RC3, client b2_12 build 150 version=2.12.8 |
||
| Severity: | 3 | ||||||||||||
| Description |
|
Immediately on starting sanity, after downgrading the client -
[ 368.034864] Lustre: DEBUG MARKER: trevis-86vm1.trevis.whamcloud.com: executing check_config_client /mnt/lustre
[ 370.048716] Lustre: DEBUG MARKER: Using TIMEOUT=100
[ 371.064948] LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) ASSERTION( index == 0 ) failed:
[ 371.066828] LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) LBUG
[ 371.068127] Pid: 11993, comm: rm 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021
[ 371.069809] Call Trace:
[ 371.070360] [<ffffffffc08b57cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 371.071728] [<ffffffffc08b587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 371.073141] [<ffffffffc0d3df99>] lov_init_dom+0x8f9/0x990 [lov]
[ 371.074443] [<ffffffffc0d3e85c>] lov_init_composite+0x82c/0xbe0 [lov]
[ 371.075827] [<ffffffffc0d3ad00>] lov_object_init+0x130/0x300 [lov]
[ 371.077174] [<ffffffffc0a3454b>] lu_object_start.isra.31+0x8b/0x120 [obdclass]
[ 371.078719] [<ffffffffc0a37a54>] lu_object_find_at+0x234/0xab0 [obdclass]
[ 371.080157] [<ffffffffc0a3830f>] lu_object_find_slice+0x1f/0x90 [obdclass]
[ 371.081625] [<ffffffffc0a3cba2>] cl_object_find+0x32/0x60 [obdclass]
[ 371.082966] [<ffffffffc0e9e599>] cl_file_inode_init+0x219/0x380 [lustre]
[ 371.084399] [<ffffffffc0e78595>] ll_update_inode+0x2d5/0x5e0 [lustre]
[ 371.085759] [<ffffffffc0e78907>] ll_read_inode2+0x67/0x420 [lustre]
[ 371.087094] [<ffffffffc0e8661b>] ll_iget+0xdb/0x350 [lustre]
[ 371.088315] [<ffffffffc0e7a6f3>] ll_prep_inode+0x253/0x970 [lustre]
[ 371.089648] [<ffffffffc0e87653>] ll_lookup_it+0x523/0x1a20 [lustre]
[ 371.090989] [<ffffffffc0e89f9b>] ll_lookup_nd+0xbb/0x190 [lustre]
[ 371.092301] [<ffffffffa0c591d3>] lookup_real+0x23/0x60
[ 371.093424] [<ffffffffa0c59bf2>] __lookup_hash+0x42/0x60
[ 371.094571] [<ffffffffa0c60adc>] do_unlinkat+0x14c/0x2d0
[ 371.095717] [<ffffffffa0c61bbb>] SyS_unlinkat+0x1b/0x40
[ 371.096862] [<ffffffffa1195f92>] system_call_fastpath+0x25/0x2a
[ 371.098126] [<ffffffffffffffff>] 0xffffffffffffffff
[ 371.099214] Kernel panic - not syncing: LBUG
[ 371.100077] CPU: 0 PID: 11993 Comm: rm Kdump: loaded Tainted: G OE ------------ 3.10.0-1160.45.1.el7.x86_64 #1
[ 371.102139] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 |
| Comments |
| Comment by Peter Jones [ 27/Apr/22 ] |
|
Bobijam,
What are your thoughts on this error? Could it be related to LU-13249?
Peter |
| Comment by Andreas Dilger [ 27/Apr/22 ] |
|
Cliff, presumably the servers were originally running 2.12.8 before upgrading to 2.15.0-RC3? Were the DOM files that are being removed originally created on 2.15.0 or 2.12.8?

I think this looks like:

-       LASSERT(index == 0);
+       /* DOM entry may be not zero index due to FLR but must start from 0 */
+       if (unlikely(lle->lle_extent->e_start != 0)) {
+               CERROR("%s: DOM entry must be the first stripe in a mirror\n",
+                      lov2obd(dev->ld_lov)->obd_name);
+               dump_lsm(D_ERROR, lov->lo_lsm);
+               RETURN(-EINVAL);
+       }

Cliff, was the problematic file mirrored or migrated while running 2.15.0-RC3? What does "lfs getstripe" (from a 2.15 client) show?

I think this looks more like a bug in 2.12 rather than 2.15 (patch was landed as commit v2_12_58-12-g44a721b8c1). I think the patch has other dependencies (at least https://review.whamcloud.com/45549). At a minimum it would make sense to backport the LASSERT cleanups to b2_12 for 2.12.9? |
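To illustrate the difference between the two checks in the patch quoted above, here is a minimal user-space sketch. It is not Lustre code: the types `sketch_extent` and `sketch_entry`, the functions `check_dom_strict` and `check_dom_relaxed`, and the four-entry `sketch_mirrored_layout` are all hypothetical stand-ins, and the layout (a DoM component leading the second mirror of an FLR file) is an assumed example rather than the layout of the file that triggered this ticket. The strict check also returns an error here for illustration, whereas the 2.12 client asserts and panics.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real Lustre layout types. */
struct sketch_extent {
	uint64_t e_start;	/* first byte covered by the component */
	uint64_t e_end;		/* one past the last byte covered */
};

struct sketch_entry {
	struct sketch_extent extent;
	bool is_dom;		/* true if the component lives on the MDT */
};

#define SKETCH_MIB (1024ULL * 1024ULL)

/*
 * Assumed example layout: two mirrors of two components each.
 * The DoM component is entry 2, i.e. the first component of mirror 2.
 */
static const struct sketch_entry sketch_mirrored_layout[4] = {
	{ { 0,          SKETCH_MIB }, false },	/* mirror 1, component 1 */
	{ { SKETCH_MIB, UINT64_MAX }, false },	/* mirror 1, component 2 */
	{ { 0,          SKETCH_MIB }, true  },	/* mirror 2, DoM component */
	{ { SKETCH_MIB, UINT64_MAX }, false },	/* mirror 2, component 2 */
};

/*
 * Strict check, modelled on LASSERT(index == 0): the DoM entry must be
 * entry 0 of the whole layout. Returns -EINVAL instead of panicking.
 */
static int check_dom_strict(const struct sketch_entry *entries,
			    unsigned int index)
{
	assert(entries != NULL);
	return index == 0 ? 0 : -EINVAL;
}

/*
 * Relaxed check, modelled on the quoted patch: the DoM entry may sit at
 * any index, but its extent must start at offset 0.
 */
static int check_dom_relaxed(const struct sketch_entry *entries,
			     unsigned int index)
{
	assert(entries != NULL);
	return entries[index].extent.e_start == 0 ? 0 : -EINVAL;
}
```

With this assumed layout, `check_dom_strict(sketch_mirrored_layout, 2)` rejects the DoM entry because its index is nonzero, while `check_dom_relaxed(sketch_mirrored_layout, 2)` accepts it because its extent starts at 0. That would be consistent with a 2.12 client tripping over a file that was mirrored or migrated under 2.15, if that is indeed what happened here.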
| Comment by Cliff White (Inactive) [ 28/Apr/22 ] |
|
Sequence
- Installed system w/2.12.8 (servers and client)
- Upgraded OSS then MDS then client to 2.15.0-RC3
Upon downgrade of client, LBUG is triggered
LBUG continues after server downgrade.
No errors at all on the server side, appears to be triggered by something in initial sanity.sh setup as we never get past the startup.

MDS log:
[ 2325.482110] Lustre: DEBUG MARKER: trevis-86vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
[ 2553.048674] Lustre: MGS: haven't heard from client 84d85de2-7af6-f334-55a2-6f84ca2afbe9 (at 10.240.43.40@tcp) in 230 seconds. I think it's dead, and I am evicting it. exp ffff89d48bab8c00, cur 1651103424 expire 1651103274 last 1651103194

Client suite log:
-----============= acceptance-small: sanity ============----- Wed Apr 27 23:46:31 UTC 2022
excepting tests: 103a 103b 103c 104a 160c 161a 161b 161c 208 220 225a 225b 228b 255a 255b 407 253 312 42a 42b 42c 77k
skipping tests SLOW=no: 27m 64b 68 71 115 300o
trevis-86vm1.trevis.whamcloud.com: executing check_config_client /mnt/lustre
trevis-86vm1.trevis.whamcloud.com: Checking config lustre mounted on /mnt/lustre
Checking servers environments
Checking clients trevis-86vm1.trevis.whamcloud.com environments
Using TIMEOUT=100
osc.lustre-OST0000-osc-ffff974f7a0eb800.idle_timeout=debug
disable quota as required
trevis-86vm2: trevis-86vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
trevis-86vm3: trevis-86vm3.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
osd-ldiskfs.track_declares_assert=1
osd-ldiskfs.track_declares_assert=1
~ |
| Comment by Cliff White (Inactive) [ 28/Apr/22 ] |
|
tail of Client suite log, set -x
++ for var in LNETLND NETTYPE
++ '[' -n tcp ']'
++ echo -n ' NETTYPE=tcp'
+ pdsh -t 300 -S -w trevis-86vm2,trevis-86vm3 '(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" mds1_FSTYPE=ldiskfs ost1_FSTYPE=ldiskfs VERBOSE=false FSTYPE=ldiskfs NETTYPE=tcp sh -c "/usr/sbin/lctl set_param osd-ldiskfs.track_declares_assert=1 || true")'
osd-ldiskfs.track_declares_assert=1

Systems are still up (trevis-86vm[1-3]). I will start the 2.14 up/down testing tomorrow on these nodes unless otherwise advised. |