
Interop 2.1.4<->2.4 failure on test suite lustre-initialization-1: ASSERTION( fid_seq_is_mdt(loi->loi_oi.oi_seq) ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.4.0, Lustre 2.1.5
    • Affects Version/s: Lustre 2.4.0, Lustre 2.1.4, Lustre 2.1.5
    • Environment: 2.1.4 client vs 2.4 server
    • 3
    • 6728

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/fce1d212-6f52-11e2-a955-52540035b04c.

      03:44:53:Lustre: DEBUG MARKER: -----============= acceptance-small: runtests ============----- Sat Feb 2 03:44:51 PST 2013
      03:44:53:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/sbin:/usr/sbin:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre
      03:44:53:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
      03:44:53:mpts=$(mount | grep -c /mnt/lustre' ');
      03:44:53:if [ $running -ne $mpts ]; then
      03:44:53:    echo $(hostname) env are INSANE!;
      03:44:53:    exit 1;
      03:44:53:fi
      03:44:53:Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre2' ' /proc/mounts);
      03:44:53:mpts=$(mount | grep -c /mnt/lustre2' ');
      03:44:53:if [ $running -ne $mpts ]; then
      03:44:53:    echo $(hostname) env are INSANE!;
      03:44:53:    exit 1;
      03:44:53:fi
      03:45:04:Lustre: DEBUG MARKER: /usr/sbin/lctl mark Using TIMEOUT=20
      03:45:04:Lustre: DEBUG MARKER: Using TIMEOUT=20
      03:45:05:Lustre: DEBUG MARKER: lctl dl | grep ' IN osc ' 2>/dev/null | wc -l
      03:45:05:Lustre: 14464:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
      03:45:05:Lustre: 14464:0:(debug.c:326:libcfs_debug_str2mask()) Skipped 3 previous similar messages
      03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark touching \/mnt\/lustre at Sat Feb  2 03:44:58 PST 2013
      03:45:05:Lustre: DEBUG MARKER: touching /mnt/lustre at Sat Feb 2 03:44:58 PST 2013
      03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark create an empty file \/mnt\/lustre\/hosts.12756
      03:45:05:Lustre: DEBUG MARKER: create an empty file /mnt/lustre/hosts.12756
      03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark copying \/etc\/hosts to \/mnt\/lustre\/hosts.12756
      03:45:05:Lustre: DEBUG MARKER: copying /etc/hosts to /mnt/lustre/hosts.12756
      03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark comparing \/etc\/hosts and \/mnt\/lustre\/hosts.12756
      03:45:05:Lustre: DEBUG MARKER: comparing /etc/hosts and /mnt/lustre/hosts.12756
      03:45:05:Lustre: DEBUG MARKER: /usr/sbin/lctl mark renaming \/mnt\/lustre\/hosts.12756 to \/mnt\/lustre\/hosts.12756.ren
      03:45:05:Lustre: DEBUG MARKER: renaming /mnt/lustre/hosts.12756 to /mnt/lustre/hosts.12756.ren
      03:45:05:LustreError: 15228:0:(lov_obd.c:1827:lov_find_cbdata()) ASSERTION( fid_seq_is_mdt(loi->loi_oi.oi_seq) ) failed: 
      03:45:05:LustreError: 15228:0:(lov_obd.c:1827:lov_find_cbdata()) LBUG
      03:45:05:Pid: 15228, comm: mv
      03:45:05:
      03:45:05:Call Trace:
      03:45:05: [<ffffffffa04357f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      03:45:05: [<ffffffffa0435e07>] lbug_with_loc+0x47/0xb0 [libcfs]
      03:45:05: [<ffffffffa095eaba>] lov_find_cbdata+0x63a/0x720 [lov]
      03:45:05: [<ffffffffa0a3c000>] ? return_if_equal+0x0/0x30 [lustre]
      03:45:05: [<ffffffffa08dd361>] ? lmv_find_cbdata+0x1c1/0x5e0 [lmv]
      03:45:05: [<ffffffffa0a3dec2>] find_cbdata+0x212/0x940 [lustre]
      03:45:05: [<ffffffffa0a3e641>] ll_ddelete+0x51/0x2b0 [lustre]
      03:45:05: [<ffffffff81272095>] ? _atomic_dec_and_lock+0x55/0x80
      03:45:05: [<ffffffff81193eea>] dput+0xca/0x150
      03:45:05: [<ffffffff8118b56a>] sys_renameat+0x1fa/0x260
      03:45:05: [<ffffffff811808b4>] ? cp_new_stat+0xe4/0x100
      03:45:05: [<ffffffff81180b8e>] ? vfs_lstat+0x1e/0x20
      03:45:05: [<ffffffff810d6d42>] ? audit_syscall_entry+0x272/0x2a0
      03:45:05: [<ffffffff81503ade>] ? do_page_fault+0x3e/0xa0
      03:45:05: [<ffffffff8118b5eb>] sys_rename+0x1b/0x20
      03:45:05: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
      03:45:05:
      03:46:27:BUG: soft lockup - CPU#0 stuck for 67s! [khelper:15230]
      03:46:27:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      03:46:27:CPU 0 
      03:46:27:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lquota(U) lmv(U) fid(U) fld(U) ptlrpc(U) obdclass(U) lvfs(U) ksocklnd(U) lnet(U) libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
      03:46:27:
      03:46:27:Pid: 15230, comm: khelper Not tainted 2.6.32-279.14.1.el6.x86_64 #1 Red Hat KVM
      03:46:27:RIP: 0010:[<ffffffff8150098e>]  [<ffffffff8150098e>] _spin_lock+0x1e/0x30
      03:46:27:RSP: 0018:ffff88007a011aa0  EFLAGS: 00000206
      03:46:27:RAX: 0000000000000001 RBX: ffff88007a011aa0 RCX: 0000000000000000
      03:46:27:RDX: 0000000000000000 RSI: ffffffff81a83fc0 RDI: ffffffff81a83fc0
      03:46:27:RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
      03:46:27:R10: ffff8800738f9200 R11: 00000000000000c0 R12: ffffffffa010a5d0
      03:46:27:R13: 0000000000000018 R14: ffff88007a011a18 R15: 0000000000010282
      03:46:27:FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
      03:46:27:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      03:46:27:CR2: 000000000107e228 CR3: 000000007c3cc000 CR4: 00000000000006f0
      03:46:27:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      03:46:27:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      03:46:27:Process khelper (pid: 15230, threadinfo ffff88007a010000, task ffff88007c1bd540)
      03:46:27:Stack:
      03:46:27: ffff88007a011ad0 ffffffff81272095 ffff88007a011b10 ffffffff81a83fc0
      03:46:27:<d> 00007fa32298b000 ffff88007a9e2380 ffff88007a011af0 ffffffff81193eba
      03:46:27:<d> ffff88007a011b90 ffff88007a011ca0 ffff88007a011b10 ffffffff81188b35
      03:46:27:Call Trace:
      


        Activity

          [LU-2775] Interop 2.1.4<->2.4 failure on test suite lustre-initialization-1: ASSERTION( fid_seq_is_mdt(loi->loi_oi.oi_seq) ) failed

          niu Niu Yawei (Inactive) added a comment - It was because my 1.8 image disk size is wrong, sorry for the noise.

          niu Niu Yawei (Inactive) added a comment -

          conf-sanity test 32b still fails with -ENOSPC in the 1.8 -> 2.4 test: https://maloo.whamcloud.com/test_sets/6efb0284-81d1-11e2-8564-52540035b04c

          dd: writing `/tmp/t32/mnt/lustre/tmp_file': No space left on device
          1+0 records in
          0+0 records out
          0 bytes (0 B) copied, 0.0667586 s, 0.0 kB/s
           conf-sanity test_32b: @@@@@@ FAIL: dd failed

          It's similar to LU-2768, but not exactly the same.

          adilger Andreas Dilger added a comment - There is still a patch for master to land to remove the LASSERT() on data passed from the network: http://review.whamcloud.com/5456

          jlevi Jodi Levi (Inactive) added a comment - Landed to master and b2_1
          di.wang Di Wang added a comment - http://review.whamcloud.com/5307 for master
          adilger Andreas Dilger added a comment (edited) - http://review.whamcloud.com/5304 for b2_1
          di.wang Di Wang added a comment -

          Hmm, I thought FID-on-OST should be enabled once both the MDS and OST are 2.4, though we still keep using IDIF if there are existing files. Denying 2.1 clients access to other MDTs will be done by returning EIO (or another error) when they try to access a remote directory. Thanks.

          adilger Andreas Dilger added a comment -

          Di, I thought FID-on-OST was only enabled when multiple MDTs are enabled? That would avoid this problem for now, because we don't have interop between 2.1 clients and multiple MDTs, though there is still a need to fix the LASSERT() on the client.

          It would actually be better NOT to LASSERT() on bad data from the network, and instead just return an error.
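The suggestion above (return an error instead of asserting on data that came from the network) could look roughly like the following. This is only a sketch under stated assumptions, not the change made in any of the linked patches: every `sketch_` name is hypothetical, the sequence bounds are illustrative values, and `-EPROTO` is an assumed choice of error code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sequence bounds, for illustration only (not taken from the ticket). */
#define SKETCH_SEQ_MDT0 0ULL
#define SKETCH_SEQ_MDT1 3ULL
#define SKETCH_SEQ_MAX  9ULL

/* Hypothetical stand-in for the per-stripe object info received by the client. */
struct sketch_oinfo {
        uint64_t oi_seq;
};

/* Range check in the style of the 2.1 client. */
static int sketch_seq_is_mdt(uint64_t seq)
{
        return seq == SKETCH_SEQ_MDT0 ||
               (seq >= SKETCH_SEQ_MDT1 && seq <= SKETCH_SEQ_MAX);
}

/* Validate instead of asserting: a bad value yields an error, not an LBUG. */
static int sketch_validate_oinfo(const struct sketch_oinfo *loi)
{
        if (loi == NULL)
                return -EINVAL;
        if (!sketch_seq_is_mdt(loi->oi_seq))
                return -EPROTO; /* assumed error code for unexpected wire data */
        return 0;
}

/* Hypothetical caller: stop at the first stripe that fails validation. */
static int sketch_check_stripes(const struct sketch_oinfo *stripes, int count)
{
        int i, rc;

        for (i = 0; i < count; i++) {
                rc = sketch_validate_oinfo(&stripes[i]);
                if (rc != 0)
                        return rc;
        }
        return 0;
}
```

With this shape, an unrecognized sequence makes the calling operation fail with an error that can be reported to the application, rather than taking down the client node.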
          di.wang Di Wang added a comment -

          It seems 2.1 and 2.4 have different definitions of fid_seq_is_mdt().

          2.1

          static inline int fid_seq_is_mdt(const __u64 seq)
          {
                  return seq == FID_SEQ_OST_MDT0 ||
                         (seq >= FID_SEQ_OST_MDT1 && seq <= FID_SEQ_OST_MAX);
          };

          master

          static inline int fid_seq_is_mdt(const __u64 seq)
          {
                  return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
          };

          Clearly the 2.1 client cannot recognize the new FID here. Also, the client still uses the OST index inside loi to locate the OST, rather than an FLD lookup, so a 2.1 client should be able to live with the new normal sequences apart from this assertion. Probably we just need to fix the LASSERT in 2.1?
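To make the mismatch concrete, the two definitions quoted above can be placed side by side and compared for a given sequence. This is a minimal sketch: the `_v21` and `_master` suffixes and the helper `seq_trips_old_client()` are hypothetical names, and the numeric values of the `FID_SEQ_*` constants are assumptions that should be checked against `lustre_idl.h`.

```c
#include <stdint.h>

/* Assumed values for illustration; verify against lustre_idl.h. */
#define FID_SEQ_OST_MDT0 0ULL
#define FID_SEQ_OST_MDT1 3ULL
#define FID_SEQ_OST_MAX  9ULL
#define FID_SEQ_NORMAL   0x200000400ULL

/* 2.1 definition, as quoted in the comment above. */
static inline int fid_seq_is_mdt_v21(uint64_t seq)
{
        return seq == FID_SEQ_OST_MDT0 ||
               (seq >= FID_SEQ_OST_MDT1 && seq <= FID_SEQ_OST_MAX);
}

/* master definition, as quoted in the comment above. */
static inline int fid_seq_is_mdt_master(uint64_t seq)
{
        return seq == FID_SEQ_OST_MDT0 || seq >= FID_SEQ_NORMAL;
}

/*
 * Hypothetical helper: returns 1 when a sequence that the master
 * definition accepts would fail the 2.1 client's check.
 */
static inline int seq_trips_old_client(uint64_t seq)
{
        return fid_seq_is_mdt_master(seq) && !fid_seq_is_mdt_v21(seq);
}
```

Under these assumed values, any sequence at or above `FID_SEQ_NORMAL` passes the master check but fails the 2.1 check, which matches the assertion failure in the log, while sequence 0 is accepted by both.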


          People

            Assignee: di.wang Di Wang
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 5
