  Lustre / LU-15794

Downgrade client fails: LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) ASSERTION( index == 0 ) failed:

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.15.0
    • Labels: None
    • Environment: trevis-86 - servers with version=2.15.0_RC3, client b2_12 build 150 version=2.12.8
    • Severity: 3

    Description

      Immediately on starting sanity, after downgrading the client:

      [  368.034864] Lustre: DEBUG MARKER: trevis-86vm1.trevis.whamcloud.com: executing check_config_client /mnt/lustre
      [  370.048716] Lustre: DEBUG MARKER: Using TIMEOUT=100
      [  371.064948] LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) ASSERTION( index == 0 ) failed:
      [  371.066828] LustreError: 11993:0:(lov_object.c:551:lov_init_dom()) LBUG
      [  371.068127] Pid: 11993, comm: rm 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021
      [  371.069809] Call Trace:
      [  371.070360]  [<ffffffffc08b57cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [  371.071728]  [<ffffffffc08b587c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [  371.073141]  [<ffffffffc0d3df99>] lov_init_dom+0x8f9/0x990 [lov]
      [  371.074443]  [<ffffffffc0d3e85c>] lov_init_composite+0x82c/0xbe0 [lov]
      [  371.075827]  [<ffffffffc0d3ad00>] lov_object_init+0x130/0x300 [lov]
      [  371.077174]  [<ffffffffc0a3454b>] lu_object_start.isra.31+0x8b/0x120 [obdclass]
      [  371.078719]  [<ffffffffc0a37a54>] lu_object_find_at+0x234/0xab0 [obdclass]
      [  371.080157]  [<ffffffffc0a3830f>] lu_object_find_slice+0x1f/0x90 [obdclass]
      [  371.081625]  [<ffffffffc0a3cba2>] cl_object_find+0x32/0x60 [obdclass]
      [  371.082966]  [<ffffffffc0e9e599>] cl_file_inode_init+0x219/0x380 [lustre]
      [  371.084399]  [<ffffffffc0e78595>] ll_update_inode+0x2d5/0x5e0 [lustre]
      [  371.085759]  [<ffffffffc0e78907>] ll_read_inode2+0x67/0x420 [lustre]
      [  371.087094]  [<ffffffffc0e8661b>] ll_iget+0xdb/0x350 [lustre]
      [  371.088315]  [<ffffffffc0e7a6f3>] ll_prep_inode+0x253/0x970 [lustre]
      [  371.089648]  [<ffffffffc0e87653>] ll_lookup_it+0x523/0x1a20 [lustre]
      [  371.090989]  [<ffffffffc0e89f9b>] ll_lookup_nd+0xbb/0x190 [lustre]
      [  371.092301]  [<ffffffffa0c591d3>] lookup_real+0x23/0x60
      [  371.093424]  [<ffffffffa0c59bf2>] __lookup_hash+0x42/0x60
      [  371.094571]  [<ffffffffa0c60adc>] do_unlinkat+0x14c/0x2d0
      [  371.095717]  [<ffffffffa0c61bbb>] SyS_unlinkat+0x1b/0x40
      [  371.096862]  [<ffffffffa1195f92>] system_call_fastpath+0x25/0x2a
      [  371.098126]  [<ffffffffffffffff>] 0xffffffffffffffff
      [  371.099214] Kernel panic - not syncing: LBUG
      [  371.100077] CPU: 0 PID: 11993 Comm: rm Kdump: loaded Tainted: G           OE  ------------   3.10.0-1160.45.1.el7.x86_64 #1
      [  371.102139] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011


          Activity


            cwhite_wc Cliff White (Inactive) added a comment -

            tail of Client suite log, set -x

            ++ for var in LNETLND NETTYPE
            ++ '[' -n tcp ']'
            ++ echo -n ' NETTYPE=tcp'
            + pdsh -t 300 -S -w trevis-86vm2,trevis-86vm3 '(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  mds1_FSTYPE=ldiskfs ost1_FSTYPE=ldiskfs VERBOSE=false FSTYPE=ldiskfs NETTYPE=tcp sh -c "/usr/sbin/lctl set_param
                 osd-ldiskfs.track_declares_assert=1 || true")'
            osd-ldiskfs.track_declares_assert=1 

            Systems are still up (trevis-86vm[1-3])  

            I will start the 2.14 up/down testing tomorrow on these nodes unless otherwise advised. 


            cwhite_wc Cliff White (Inactive) added a comment -

            Sequence:

            Installed the system with 2.12.8 (servers and client).

            Upgraded the OSS, then the MDS, then the client to 2.15.0-RC3.

            Upon downgrade of the client, the LBUG is triggered.

            The LBUG continues after the server downgrade.

            There are no errors at all on the server side; this appears to be triggered by something in the initial sanity.sh setup, as we never get past the startup.

            MDS log:

            [ 2325.482110] Lustre: DEBUG MARKER: trevis-86vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
            [ 2553.048674] Lustre: MGS: haven't heard from client 84d85de2-7af6-f334-55a2-6f84ca2afbe9 (at 10.240.43.40@tcp) in 230 seconds. I think it's dead, and I am evicting it. exp ffff89d48bab8c00, cur 1651103424 expire 1651103274 last 1651103194 

             Client suite log:

            -----============= acceptance-small: sanity ============----- Wed Apr 27 23:46:31 UTC 2022
            excepting tests: 103a 103b 103c 104a 160c 161a 161b 161c 208 220 225a 225b 228b 255a 255b 407 253 312 42a 42b 42c 77k
            skipping tests SLOW=no: 27m 64b 68 71 115 300o
            trevis-86vm1.trevis.whamcloud.com: executing check_config_client /mnt/lustre
            trevis-86vm1.trevis.whamcloud.com: Checking config lustre mounted on /mnt/lustre
            Checking servers environments
            Checking clients trevis-86vm1.trevis.whamcloud.com environments
            Using TIMEOUT=100
            osc.lustre-OST0000-osc-ffff974f7a0eb800.idle_timeout=debug
            disable quota as required
            trevis-86vm2: trevis-86vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
            trevis-86vm3: trevis-86vm3.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 48
            osd-ldiskfs.track_declares_assert=1
            osd-ldiskfs.track_declares_assert=1
            adilger Andreas Dilger added a comment - - edited

            Cliff, presumably the servers were originally running 2.12.8 before upgrading to 2.15.0-RC3? Were the DOM files that are being removed originally created on 2.15.0 or 2.12.8?

            I think this looks like LU-15513 (something strange with the DoM layout on the client), and not related to llog at all (LU-13249 is on the server). I see that this LASSERT is not present on master, having been removed by patch https://review.whamcloud.com/35359 "LU-11421 dom: manual OST-to-DOM migration via mirroring":

            -       LASSERT(index == 0);
            +       /* DOM entry may be not zero index due to FLR but must start from 0 */
            +       if (unlikely(lle->lle_extent->e_start != 0)) {
            +               CERROR("%s: DOM entry must be the first stripe in a mirror\n",
            +                      lov2obd(dev->ld_lov)->obd_name);
            +               dump_lsm(D_ERROR, lov->lo_lsm);
            +               RETURN(-EINVAL);
            +       }
            

            Cliff, was the problematic file mirrored or migrated while running 2.15.0-RC3? What does "lfs getstripe" (from a 2.15 client) show?
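
            For reference, a layout of the kind this check (and the question above) is concerned with would look roughly as follows in lfs getstripe -v output; all values here are illustrative, not taken from the failing system. The point is that the "mdt" component sits at a non-zero entry index while its extent still starts at 0:

            # Hypothetical "lfs getstripe -v" output for a file whose first
            # mirror is a plain OST object and whose second mirror carries
            # the DoM component; the "mdt" entry is not entry 0, which trips
            # the 2.12 LASSERT(index == 0) but passes the master e_start check.
            lcm_layout_gen:    3
            lcm_mirror_count:  2
            lcm_entry_count:   3
              lcme_id:             65537
              lcme_mirror_id:      1
              lcme_extent.e_start: 0
              lcme_extent.e_end:   EOF
                lmm_pattern:       raid0
              lcme_id:             131074
              lcme_mirror_id:      2
              lcme_extent.e_start: 0
              lcme_extent.e_end:   1048576
                lmm_pattern:       mdt
              lcme_id:             131075
              lcme_mirror_id:      2
              lcme_extent.e_start: 1048576
              lcme_extent.e_end:   EOF
                lmm_pattern:       raid0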

            I think this looks more like a bug in 2.12 rather than 2.15 (the patch was landed as commit v2_12_58-12-g44a721b8c1). I think the patch has other dependencies (at least https://review.whamcloud.com/45549). At a minimum it would make sense to backport the LASSERT cleanups to b2_12 for 2.12.9?
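
            If the FLR/DoM hypothesis above is right, a reproducer along these lines (hypothetical; the file name and DoM size are illustrative, and the mirror syntax is the LU-11421 one) should hit the same LBUG without running sanity at all:

            # On the 2.15.0-RC3 client: create a plain OST file, then attach
            # a DoM mirror, which places the "mdt" component at a non-zero
            # layout entry index.
            lfs setstripe -c 1 /mnt/lustre/f.dom-mirror
            lfs mirror extend -N -E 1M -L mdt -E eof /mnt/lustre/f.dom-mirror

            # Downgrade the client to 2.12.8 and remount; any lookup of the
            # file (stat, ls -l, rm) should then trip LASSERT(index == 0) in
            # lov_init_dom(), matching the stack trace in the description.
            rm /mnt/lustre/f.dom-mirror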

            pjones Peter Jones added a comment -

            Bobijam

            What are your thoughts on this error? Could it be related to LU-13249?

            Peter


            People

              Assignee: bobijam Zhenyu Xu
              Reporter: cwhite_wc Cliff White (Inactive)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated: