LU-9370: Lustre 2.9 + zfs 0.7 + draid = OSS hangup

Details

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Environment: CentOS 7 in a Hyper-V VM

    Description

      I'm trying to build a dRAID-based OST for Lustre.

      Initially created as a thegreatgazoo/zfs issue: https://github.com/thegreatgazoo/zfs/issues/2.

      A generic Lustre MGS/MDT is up and running.

      The OSS is a fresh VM (4 CPU, 4 GB RAM) with a CentOS 7 "minimal" install and 18 SCSI disks (image files).

      I perform yum -y update ; reboot, then run setup-node.sh NODE from the workstation,
      ssh to the NODE, and run ./mkzpool.sh:

      [root@node26 ~]# ./mkzpool.sh 
      + zpool list
      + grep -w 'no pools available'
      + zpool destroy oss3pool
      + zpool list
      + grep -w 'no pools available'
      no pools available
      + '[' -f 17.nvl ']'
      + draidcfg -r 17.nvl
      dRAID1 vdev of 17 child drives: 3 x (4 data + 1 parity) and 2 distributed spare
      Using 32 base permutations
        15, 2, 8, 7,10, 5, 4,16, 1,13,14, 9,11,12, 3, 6, 0,
         5,15,14, 9, 0,11,13, 4, 3,12, 8,10, 7, 1, 6, 2,16,
        10,11,14, 5,15, 2,13, 6, 1, 3, 4, 7,12,16, 9, 0, 8,
        13, 2,12,14, 8, 0, 7, 4, 9,15,11, 6, 3,16, 1, 5,10,
        13, 5, 2,16, 6, 0, 4, 8,10, 1, 3,14, 9,11,12, 7,15,
         8,12, 3,14, 0, 4,16, 6, 2,11, 1, 7, 9,15,13, 5,10,
        16,14, 2, 9, 7, 4,11, 0, 6,12,10, 8, 1,13,15, 5, 3,
         5,16, 6, 1,10,15,11, 3, 8,14, 2,12, 0, 7, 9, 4,13,
         4,12, 8,10,14, 9, 6,11,15, 0, 3,13, 7, 2, 5,16, 1,
        10,14,16,11,12, 2, 5, 3, 4, 7, 0, 1, 6, 9,13, 8,15,
         2, 1,11,15,16, 6,12, 3,10,13, 8, 5, 4, 0, 7, 9,14,
        15,14, 1, 5,16, 2,12, 8, 9, 6,11,10, 3, 0, 7, 4,13,
         1, 5,10, 9, 2, 8, 4,16, 7,11, 3,12, 6,14, 0,13,15,
         3, 7,16,10,13, 2, 6, 8,14,15,12,11, 0, 9, 1, 4, 5,
        15, 2,14, 8, 5,16, 3,13, 4, 1, 9,12,10, 0, 6, 7,11,
        14,12,11,15,16,10, 2, 9, 8, 4, 3, 1,13, 5, 7, 0, 6,
         7,13, 2,11,14, 0, 1, 8, 9,10,16, 4, 6,12, 5, 3,15,
        16, 1,11, 4, 3, 9, 6,13, 5, 7,10,15,14,12, 2, 0, 8,
         0, 5, 2,10,16,12, 6, 3,11,14, 1, 9, 7,15, 4, 8,13,
         8,13,11, 4,10, 6, 7,16, 5,12, 9,14, 2, 3, 0,15, 1,
         9, 6,12,16, 4, 7, 3, 0, 2,15,13, 8,11,14, 5,10, 1,
         8,12, 0, 6,15, 7, 4,13,14,10, 1, 9, 5, 3,11, 2,16,
         5,15, 9,10,16, 6,11, 0, 7,13, 8,14, 3, 4, 1,12, 2,
        15,14, 2, 9, 4,11, 7, 1, 6,10, 5, 0, 8,12,13,16, 3,
        15,16, 0,10, 3,12,11, 7, 1, 8, 6,13, 4, 5, 9, 2,14,
        15, 4, 7,13,14, 2, 9,10,16, 1,11,12, 8, 0, 3, 5, 6,
        15, 8,13, 0, 4, 7, 3,14, 5,12, 2, 9,10,11, 6,16, 1,
         0, 7, 5, 3, 1,14,16, 4, 2,15,12, 8,10, 6, 9,11,13,
         7, 6, 0,15,16,11, 8, 1, 5,12,13,14,10, 9, 3, 2, 4,
        14,16,10, 6, 4,13, 3, 1,15,12,11, 8, 9, 5, 0, 7, 2,
         9, 3, 5,15,10,11, 8, 7, 2,14, 6,13, 0, 4, 1,12,16,
         4, 6, 7,14, 5, 3,12, 1,13, 9,16, 2, 0,10, 8,11,15,
      + zpool create -f oss3pool draid1 cfg=17.nvl /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr
      + zpool list
      NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
      oss3pool  14,7G   612K  14,7G         -     0%     0%  1.00x  ONLINE  -
      + zpool status
        pool: oss3pool
       state: ONLINE
        scan: none requested
      config:
      
      	NAME            STATE     READ WRITE CKSUM
      	oss3pool        ONLINE       0     0     0
      	  draid1-0      ONLINE       0     0     0
      	    sdb         ONLINE       0     0     0
      	    sdc         ONLINE       0     0     0
      	    sdd         ONLINE       0     0     0
      	    sde         ONLINE       0     0     0
      	    sdf         ONLINE       0     0     0
      	    sdg         ONLINE       0     0     0
      	    sdh         ONLINE       0     0     0
      	    sdi         ONLINE       0     0     0
      	    sdj         ONLINE       0     0     0
      	    sdk         ONLINE       0     0     0
      	    sdl         ONLINE       0     0     0
      	    sdm         ONLINE       0     0     0
      	    sdn         ONLINE       0     0     0
      	    sdo         ONLINE       0     0     0
      	    sdp         ONLINE       0     0     0
      	    sdq         ONLINE       0     0     0
      	    sdr         ONLINE       0     0     0
      	spares
      	  $draid1-0-s0  AVAIL   
      	  $draid1-0-s1  AVAIL   
      
      errors: No known data errors
      + grep oss3pool
      + mount
      oss3pool on /oss3pool type zfs (rw,xattr,noacl)
      + mkfs.lustre --reformat --ost --backfstype=zfs --fsname=ZFS01 --index=3 --mgsnode=mgs@tcp0 oss3pool/ZFS01
      
         Permanent disk data:
      Target:     ZFS01:OST0003
      Index:      3
      Lustre FS:  ZFS01
      Mount type: zfs
      Flags:      0x62
                    (OST first_time update )
      Persistent mount opts: 
      Parameters: mgsnode=172.17.32.220@tcp
      
      mkfs_cmd = zfs create -o canmount=off -o xattr=sa oss3pool/ZFS01
      Writing oss3pool/ZFS01 properties
        lustre:version=1
        lustre:flags=98
        lustre:index=3
        lustre:fsname=ZFS01
        lustre:svname=ZFS01:OST0003
        lustre:mgsnode=172.17.32.220@tcp
      + '[' -d /lustre/ZFS01/. ']'
      + mount -v -t lustre oss3pool/ZFS01 /lustre/ZFS01
      arg[0] = /sbin/mount.lustre
      arg[1] = -v
      arg[2] = -o
      arg[3] = rw
      arg[4] = oss3pool/ZFS01
      arg[5] = /lustre/ZFS01
      source = oss3pool/ZFS01 (oss3pool/ZFS01), target = /lustre/ZFS01
      options = rw
      checking for existing Lustre data: found
      Writing oss3pool/ZFS01 properties
        lustre:version=1
        lustre:flags=34
        lustre:index=3
        lustre:fsname=ZFS01
        lustre:svname=ZFS01:OST0003
        lustre:mgsnode=172.17.32.220@tcp
      mounting device oss3pool/ZFS01 at /lustre/ZFS01, flags=0x1000000 options=osd=osd-zfs,,mgsnode=172.17.32.220@tcp,virgin,update,param=mgsnode=172.17.32.220@tcp,svname=ZFS01-OST0003,device=oss3pool/ZFS01
      mount.lustre: mount oss3pool/ZFS01 at /lustre/ZFS01 failed: Address already in use retries left: 0
      mount.lustre: mount oss3pool/ZFS01 at /lustre/ZFS01 failed: Address already in use
      The target service's index is already in use. (oss3pool/ZFS01)
      [root@node26 ~]# mount -v -t lustre oss3pool/ZFS01 /lustre/ZFS01
      arg[0] = /sbin/mount.lustre
      arg[1] = -v
      arg[2] = -o
      arg[3] = rw
      arg[4] = oss3pool/ZFS01
      arg[5] = /lustre/ZFS01
      source = oss3pool/ZFS01 (oss3pool/ZFS01), target = /lustre/ZFS01
      options = rw
      checking for existing Lustre data: found
      mounting device oss3pool/ZFS01 at /lustre/ZFS01, flags=0x1000000 options=osd=osd-zfs,,mgsnode=172.17.32.220@tcp,virgin,param=mgsnode=172.17.32.220@tcp,svname=ZFS01-OST0003,device=oss3pool/ZFS01
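
      For reference, a minimal sketch of what mkzpool.sh appears to do, reconstructed from the set -x trace above (the real script may differ, e.g. in error handling and in how 17.nvl is generated):

      #!/bin/bash
      set -x

      # Start clean: destroy the previous pool if one exists
      zpool list 2>&1 | grep -w 'no pools available' || zpool destroy oss3pool
      zpool list 2>&1 | grep -w 'no pools available'

      # Print the dRAID1 layout stored in 17.nvl (prepared earlier with draidcfg)
      [ -f 17.nvl ] && draidcfg -r 17.nvl

      # Create the dRAID1 pool over the 17 child drives and check it
      zpool create -f oss3pool draid1 cfg=17.nvl /dev/sd{b..r}
      zpool list
      zpool status
      mount | grep oss3pool

      # Format the pool as a Lustre OST and mount it
      mkfs.lustre --reformat --ost --backfstype=zfs --fsname=ZFS01 --index=3 \
          --mgsnode=mgs@tcp0 oss3pool/ZFS01
      [ -d /lustre/ZFS01/. ] && mount -v -t lustre oss3pool/ZFS01 /lustre/ZFS01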
      
      

      Attachments

        Activity

          [LU-9370] Lustre 2.9 + zfs 0.7 + draid = OSS hangup

          jno jno (Inactive) added a comment -

          BTW, there are quite a few (37) modules here:

          [root@node26 ~]# find . -name '*.ko' 
          ./spl/module/spl/spl.ko
          ./spl/module/splat/splat.ko
          ./zfs/module/avl/zavl.ko
          ./zfs/module/icp/icp.ko
          ./zfs/module/nvpair/znvpair.ko
          ./zfs/module/unicode/zunicode.ko
          ./zfs/module/zcommon/zcommon.ko
          ./zfs/module/zfs/zfs.ko
          ./zfs/module/zpios/zpios.ko
          ./lustre-release/libcfs/libcfs/libcfs.ko
          ./lustre-release/lnet/klnds/o2iblnd/ko2iblnd.ko
          ./lustre-release/lnet/klnds/socklnd/ksocklnd.ko
          ./lustre-release/lnet/lnet/lnet.ko
          ./lustre-release/lnet/selftest/lnet_selftest.ko
          ./lustre-release/lustre/fid/fid.ko
          ./lustre-release/lustre/fld/fld.ko
          ./lustre-release/lustre/lfsck/lfsck.ko
          ./lustre-release/lustre/llite/llite_lloop.ko
          ./lustre-release/lustre/llite/lustre.ko
          ./lustre-release/lustre/lmv/lmv.ko
          ./lustre-release/lustre/lod/lod.ko
          ./lustre-release/lustre/lov/lov.ko
          ./lustre-release/lustre/mdc/mdc.ko
          ./lustre-release/lustre/mdd/mdd.ko
          ./lustre-release/lustre/mdt/mdt.ko
          ./lustre-release/lustre/mgc/mgc.ko
          ./lustre-release/lustre/mgs/mgs.ko
          ./lustre-release/lustre/obdclass/obdclass.ko
          ./lustre-release/lustre/obdclass/llog_test.ko
          ./lustre-release/lustre/obdecho/obdecho.ko
          ./lustre-release/lustre/ofd/ofd.ko
          ./lustre-release/lustre/osc/osc.ko
          ./lustre-release/lustre/osd-zfs/osd_zfs.ko
          ./lustre-release/lustre/osp/osp.ko
          ./lustre-release/lustre/ost/ost.ko
          ./lustre-release/lustre/ptlrpc/ptlrpc.ko
          ./lustre-release/lustre/quota/lquota.ko
          

          and it's not obvious to me which ones to load, and in what order...
          In the debug_info.20170424_044934.896525307_0400-5362-node26.zip one may see that the spl and zfs modules are loaded.

          OK, I'll try to load all or some of them now and retry.
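
          A possible shortcut, sketched under the assumption that make install put the modules under /lib/modules/$(uname -r) for the running kernel: rebuild the module index and let modprobe pull in the dependency chain, rather than loading individual .ko files by hand.

          depmod -a                    # re-index the freshly installed modules
          modprobe lnet
          lctl network up              # bring up LNet (tcp0 in this setup)
          modprobe lustre              # pulls in libcfs, obdclass, ptlrpc, ...
          modprobe osd_zfs             # ZFS OSD, needed on the OSS before mounting the OST
          lsmod | grep -E 'lustre|lnet|osd_zfs'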

          jsalians_intel John Salinas (Inactive) added a comment - - edited

          Right but look in your lsmod output – it does not appear lustre or lnet are there.

          $ grep lustre OUTPUT.lsmod.txt
          $

          This is why all of your Lustre commands are failing – such as: invalid parameter 'dump_kernel'
          open(dump_kernel) failed: No such file or directory

          Could you please load the Lustre & LNet kernel modules and try this again? Also, I do not see output from the MDS; if there are still issues, that output would be helpful.

          Thank you


          jno jno (Inactive) added a comment -

          Hi there,

          Yes, it is installed, from a source build (make install).

          E.g., one may see:

          [root@node26 ~]# lustre_
          lustre_req_history lustre_routes_config lustre_rsync
          lustre_rmmod lustre_routes_conversion lustre_start
          [root@node26 ~]# lustre_
          

          jsalians_intel John Salinas (Inactive) added a comment -

          Greetings,

          Maybe I am not looking at this right, but it does not look like Lustre is installed on the OSS node. Can you confirm? In the rpm list I didn't see the Lustre RPMs, and the Lustre commands did not appear to be able to run.
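
          A few quick checks that would settle this either way (a sketch; adjust to however the bits were installed, since a make install from source will not show up in rpm -qa):

          rpm -qa | grep -iE 'lustre|zfs|spl'
          which mkfs.lustre mount.lustre lctl
          lsmod | grep -E 'lustre|lnet|osd_zfs'
          lctl get_param version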

          jno jno (Inactive) added a comment - - edited

          I've added calls to collect-info.sh right into the mkzpool.sh script (with sleep/sync/sleep magic so the last zip is kept).

          Here we are:

          • debug_info.20170424_044901.648221747_0400-3235-node26.zip
          • debug_info.20170424_044924.231035970_0400-4268-node26.zip
          • debug_info.20170424_044934.896525307_0400-5362-node26.zip
          • - console at hang (I dunno what "dcla" means here)
             
            [root@node26 ~]# ./mkzpool.sh 
            + ./collect-info.sh
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/ (stored 0%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/Now (deflated 51%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.script.log (deflated 89%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.rpm-qa.txt (deflated 69%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.lsmod.txt (deflated 66%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.lsblk.txt (deflated 79%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.df.txt (deflated 54%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.mount.txt (deflated 74%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.show_kernelmod_params.txt (deflated 67%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.kernel_debug_trace.txt (deflated 57%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.dmesg.txt (deflated 73%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.zpool_events.txt (deflated 62%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.zpool_events_verbose.txt (deflated 79%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.lctl_dl.txt (stored 0%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.lctl_dk.txt (deflated 12%)
              adding: debug_info.20170424_044901.648221747_0400-3235-node26/OUTPUT.messages (deflated 83%)
            + zpool list
            + grep -w 'no pools available'
            + zpool destroy oss3pool
            + zpool list
            + grep -w 'no pools available'
            no pools available
            + '[' -f 17.nvl ']'
            + draidcfg -r 17.nvl
            dRAID1 vdev of 17 child drives: 3 x (4 data + 1 parity) and 2 distributed spare
            Using 32 base permutations
              15, 2, 8, 7,10, 5, 4,16, 1,13,14, 9,11,12, 3, 6, 0,
               5,15,14, 9, 0,11,13, 4, 3,12, 8,10, 7, 1, 6, 2,16,
              10,11,14, 5,15, 2,13, 6, 1, 3, 4, 7,12,16, 9, 0, 8,
              13, 2,12,14, 8, 0, 7, 4, 9,15,11, 6, 3,16, 1, 5,10,
              13, 5, 2,16, 6, 0, 4, 8,10, 1, 3,14, 9,11,12, 7,15,
               8,12, 3,14, 0, 4,16, 6, 2,11, 1, 7, 9,15,13, 5,10,
              16,14, 2, 9, 7, 4,11, 0, 6,12,10, 8, 1,13,15, 5, 3,
               5,16, 6, 1,10,15,11, 3, 8,14, 2,12, 0, 7, 9, 4,13,
               4,12, 8,10,14, 9, 6,11,15, 0, 3,13, 7, 2, 5,16, 1,
              10,14,16,11,12, 2, 5, 3, 4, 7, 0, 1, 6, 9,13, 8,15,
               2, 1,11,15,16, 6,12, 3,10,13, 8, 5, 4, 0, 7, 9,14,
              15,14, 1, 5,16, 2,12, 8, 9, 6,11,10, 3, 0, 7, 4,13,
               1, 5,10, 9, 2, 8, 4,16, 7,11, 3,12, 6,14, 0,13,15,
               3, 7,16,10,13, 2, 6, 8,14,15,12,11, 0, 9, 1, 4, 5,
              15, 2,14, 8, 5,16, 3,13, 4, 1, 9,12,10, 0, 6, 7,11,
              14,12,11,15,16,10, 2, 9, 8, 4, 3, 1,13, 5, 7, 0, 6,
               7,13, 2,11,14, 0, 1, 8, 9,10,16, 4, 6,12, 5, 3,15,
              16, 1,11, 4, 3, 9, 6,13, 5, 7,10,15,14,12, 2, 0, 8,
               0, 5, 2,10,16,12, 6, 3,11,14, 1, 9, 7,15, 4, 8,13,
               8,13,11, 4,10, 6, 7,16, 5,12, 9,14, 2, 3, 0,15, 1,
               9, 6,12,16, 4, 7, 3, 0, 2,15,13, 8,11,14, 5,10, 1,
               8,12, 0, 6,15, 7, 4,13,14,10, 1, 9, 5, 3,11, 2,16,
               5,15, 9,10,16, 6,11, 0, 7,13, 8,14, 3, 4, 1,12, 2,
              15,14, 2, 9, 4,11, 7, 1, 6,10, 5, 0, 8,12,13,16, 3,
              15,16, 0,10, 3,12,11, 7, 1, 8, 6,13, 4, 5, 9, 2,14,
              15, 4, 7,13,14, 2, 9,10,16, 1,11,12, 8, 0, 3, 5, 6,
              15, 8,13, 0, 4, 7, 3,14, 5,12, 2, 9,10,11, 6,16, 1,
               0, 7, 5, 3, 1,14,16, 4, 2,15,12, 8,10, 6, 9,11,13,
               7, 6, 0,15,16,11, 8, 1, 5,12,13,14,10, 9, 3, 2, 4,
              14,16,10, 6, 4,13, 3, 1,15,12,11, 8, 9, 5, 0, 7, 2,
               9, 3, 5,15,10,11, 8, 7, 2,14, 6,13, 0, 4, 1,12,16,
               4, 6, 7,14, 5, 3,12, 1,13, 9,16, 2, 0,10, 8,11,15,
            + zpool create -f oss3pool draid1 cfg=17.nvl /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr
            + zpool list
            NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
            oss3pool  14,7G   612K  14,7G         -     0%     0%  1.00x  ONLINE  -
            + zpool status
              pool: oss3pool
             state: ONLINE
              scan: none requested
            config:
            
            	NAME            STATE     READ WRITE CKSUM
            	oss3pool        ONLINE       0     0     0
            	  draid1-0      ONLINE       0     0     0
            	    sdb         ONLINE       0     0     0
            	    sdc         ONLINE       0     0     0
            	    sdd         ONLINE       0     0     0
            	    sde         ONLINE       0     0     0
            	    sdf         ONLINE       0     0     0
            	    sdg         ONLINE       0     0     0
            	    sdh         ONLINE       0     0     0
            	    sdi         ONLINE       0     0     0
            	    sdj         ONLINE       0     0     0
            	    sdk         ONLINE       0     0     0
            	    sdl         ONLINE       0     0     0
            	    sdm         ONLINE       0     0     0
            	    sdn         ONLINE       0     0     0
            	    sdo         ONLINE       0     0     0
            	    sdp         ONLINE       0     0     0
            	    sdq         ONLINE       0     0     0
            	    sdr         ONLINE       0     0     0
            	spares
            	  $draid1-0-s0  AVAIL   
            	  $draid1-0-s1  AVAIL   
            
            errors: No known data errors
            + grep oss3pool
            + mount
            oss3pool on /oss3pool type zfs (rw,xattr,noacl)
            + ./collect-info.sh
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/ (stored 0%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/Now (deflated 51%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.script.log (deflated 89%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.rpm-qa.txt (deflated 69%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.lsmod.txt (deflated 66%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.lsblk.txt (deflated 79%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.df.txt (deflated 55%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.mount.txt (deflated 74%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.show_kernelmod_params.txt (deflated 67%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.kernel_debug_trace.txt (deflated 57%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.dmesg.txt (deflated 73%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.zpool_events.txt (deflated 71%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.zpool_events_verbose.txt (deflated 85%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.lctl_dl.txt (stored 0%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.lctl_dk.txt (deflated 12%)
              adding: debug_info.20170424_044924.231035970_0400-4268-node26/OUTPUT.messages (deflated 83%)
            + mkfs.lustre --reformat --replace --ost --backfstype=zfs --fsname=ZFS01 --index=3 --mgsnode=mgs@tcp0 oss3pool/ZFS01
            
               Permanent disk data:
            Target:     ZFS01-OST0003
            Index:      3
            Lustre FS:  ZFS01
            Mount type: zfs
            Flags:      0x42
                          (OST update )
            Persistent mount opts: 
            Parameters: mgsnode=172.17.32.220@tcp
            
            mkfs_cmd = zfs create -o canmount=off -o xattr=sa oss3pool/ZFS01
            Writing oss3pool/ZFS01 properties
              lustre:version=1
              lustre:flags=66
              lustre:index=3
              lustre:fsname=ZFS01
              lustre:svname=ZFS01-OST0003
              lustre:mgsnode=172.17.32.220@tcp
            + ./collect-info.sh
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/ (stored 0%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/Now (deflated 51%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.script.log (deflated 89%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.rpm-qa.txt (deflated 69%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.lsmod.txt (deflated 66%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.lsblk.txt (deflated 79%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.df.txt (deflated 55%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.mount.txt (deflated 74%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.show_kernelmod_params.txt (deflated 67%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.kernel_debug_trace.txt (deflated 57%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.dmesg.txt (deflated 73%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.zpool_events.txt (deflated 72%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.zpool_events_verbose.txt (deflated 85%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.lctl_dl.txt (stored 0%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.lctl_dk.txt (deflated 12%)
              adding: debug_info.20170424_044934.896525307_0400-5362-node26/OUTPUT.messages (deflated 83%)
            + '[' -d /lustre/ZFS01/. ']'
            + mount -v -t lustre oss3pool/ZFS01 /lustre/ZFS01
            arg[0] = /sbin/mount.lustre
            arg[1] = -v
            arg[2] = -o
            arg[3] = rw
            arg[4] = oss3pool/ZFS01
            arg[5] = /lustre/ZFS01
            source = oss3pool/ZFS01 (oss3pool/ZFS01), target = /lustre/ZFS01
            options = rw
            checking for existing Lustre data: found
            Writing oss3pool/ZFS01 properties
              lustre:version=1
              lustre:flags=2
              lustre:index=3
              lustre:fsname=ZFS01
              lustre:svname=ZFS01-OST0003
              lustre:mgsnode=172.17.32.220@tcp
            mounting device oss3pool/ZFS01 at /lustre/ZFS01, flags=0x1000000 options=osd=osd-zfs,,mgsnode=172.17.32.220@tcp,update,param=mgsnode=172.17.32.220@tcp,svname=ZFS01-OST0003,device=oss3pool/ZFS01
            
            
            
            
            

             And yes, now (after fixing my mistake with --index) it crashes on the first mount.
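
            One way to make the next crash more useful, sketched for stock CentOS 7 (the crashkernel size is only a guess for a 4 GB VM): set up kdump so the panic backtrace and a vmcore survive the reboot.

            yum install -y kexec-tools
            grubby --update-kernel=ALL --args="crashkernel=256M"   # reserves RAM; takes effect after a reboot
            systemctl enable kdump && systemctl start kdump
            # after the next panic, look under /var/crash/ for the vmcore and vmcore-dmesg.txt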


          jno jno (Inactive) added a comment -

          Well,

          It's the entire system (the VM) that hangs (crashes), not just a mount, an app, or a session.

          I'm not always lucky enough to see the panic on the console; usually it just hangs.
           

          There are 

          • collect-info.sh - the script used to collect debug info
          • debug_info.20170424_042703.274836974_0400-4268-node26.zip - collected info before the crash
          • debug_info.20170424_043500.551470248_0400-3183-node26.zip - collected info after the crash
          • - console at crash scrolled up
          • - console at crash scrolled down
             
            [root@node26 ~]# ./mkzpool.sh 
             + zpool list
             + grep -w 'no pools available'
             + zpool destroy oss3pool
             + zpool list
             + grep -w 'no pools available'
             no pools available
             + '[' -f 17.nvl ']'
             + draidcfg -r 17.nvl
             dRAID1 vdev of 17 child drives: 3 x (4 data + 1 parity) and 2 distributed spare
             Using 32 base permutations
             15, 2, 8, 7,10, 5, 4,16, 1,13,14, 9,11,12, 3, 6, 0,
             5,15,14, 9, 0,11,13, 4, 3,12, 8,10, 7, 1, 6, 2,16,
             10,11,14, 5,15, 2,13, 6, 1, 3, 4, 7,12,16, 9, 0, 8,
             13, 2,12,14, 8, 0, 7, 4, 9,15,11, 6, 3,16, 1, 5,10,
             13, 5, 2,16, 6, 0, 4, 8,10, 1, 3,14, 9,11,12, 7,15,
             8,12, 3,14, 0, 4,16, 6, 2,11, 1, 7, 9,15,13, 5,10,
             16,14, 2, 9, 7, 4,11, 0, 6,12,10, 8, 1,13,15, 5, 3,
             5,16, 6, 1,10,15,11, 3, 8,14, 2,12, 0, 7, 9, 4,13,
             4,12, 8,10,14, 9, 6,11,15, 0, 3,13, 7, 2, 5,16, 1,
             10,14,16,11,12, 2, 5, 3, 4, 7, 0, 1, 6, 9,13, 8,15,
             2, 1,11,15,16, 6,12, 3,10,13, 8, 5, 4, 0, 7, 9,14,
             15,14, 1, 5,16, 2,12, 8, 9, 6,11,10, 3, 0, 7, 4,13,
             1, 5,10, 9, 2, 8, 4,16, 7,11, 3,12, 6,14, 0,13,15,
             3, 7,16,10,13, 2, 6, 8,14,15,12,11, 0, 9, 1, 4, 5,
             15, 2,14, 8, 5,16, 3,13, 4, 1, 9,12,10, 0, 6, 7,11,
             14,12,11,15,16,10, 2, 9, 8, 4, 3, 1,13, 5, 7, 0, 6,
             7,13, 2,11,14, 0, 1, 8, 9,10,16, 4, 6,12, 5, 3,15,
             16, 1,11, 4, 3, 9, 6,13, 5, 7,10,15,14,12, 2, 0, 8,
             0, 5, 2,10,16,12, 6, 3,11,14, 1, 9, 7,15, 4, 8,13,
             8,13,11, 4,10, 6, 7,16, 5,12, 9,14, 2, 3, 0,15, 1,
             9, 6,12,16, 4, 7, 3, 0, 2,15,13, 8,11,14, 5,10, 1,
             8,12, 0, 6,15, 7, 4,13,14,10, 1, 9, 5, 3,11, 2,16,
             5,15, 9,10,16, 6,11, 0, 7,13, 8,14, 3, 4, 1,12, 2,
             15,14, 2, 9, 4,11, 7, 1, 6,10, 5, 0, 8,12,13,16, 3,
             15,16, 0,10, 3,12,11, 7, 1, 8, 6,13, 4, 5, 9, 2,14,
             15, 4, 7,13,14, 2, 9,10,16, 1,11,12, 8, 0, 3, 5, 6,
             15, 8,13, 0, 4, 7, 3,14, 5,12, 2, 9,10,11, 6,16, 1,
             0, 7, 5, 3, 1,14,16, 4, 2,15,12, 8,10, 6, 9,11,13,
             7, 6, 0,15,16,11, 8, 1, 5,12,13,14,10, 9, 3, 2, 4,
             14,16,10, 6, 4,13, 3, 1,15,12,11, 8, 9, 5, 0, 7, 2,
             9, 3, 5,15,10,11, 8, 7, 2,14, 6,13, 0, 4, 1,12,16,
             4, 6, 7,14, 5, 3,12, 1,13, 9,16, 2, 0,10, 8,11,15,
             + zpool create -f oss3pool draid1 cfg=17.nvl /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr
             + zpool list
             NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
             oss3pool 14,7G 612K 14,7G - 0% 0% 1.00x ONLINE -
             + zpool status
             pool: oss3pool
             state: ONLINE
             scan: none requested
             config:
            NAME STATE READ WRITE CKSUM
             oss3pool ONLINE 0 0 0
             draid1-0 ONLINE 0 0 0
             sdb ONLINE 0 0 0
             sdc ONLINE 0 0 0
             sdd ONLINE 0 0 0
             sde ONLINE 0 0 0
             sdf ONLINE 0 0 0
             sdg ONLINE 0 0 0
             sdh ONLINE 0 0 0
             sdi ONLINE 0 0 0
             sdj ONLINE 0 0 0
             sdk ONLINE 0 0 0
             sdl ONLINE 0 0 0
             sdm ONLINE 0 0 0
             sdn ONLINE 0 0 0
             sdo ONLINE 0 0 0
             sdp ONLINE 0 0 0
             sdq ONLINE 0 0 0
             sdr ONLINE 0 0 0
             spares
             $draid1-0-s0 AVAIL 
             $draid1-0-s1 AVAIL
            errors: No known data errors
             + mount
             + grep oss3pool
             oss3pool on /oss3pool type zfs (rw,xattr,noacl)
             + mkfs.lustre --reformat --replace --ost --backfstype=zfs --fsname=ZFS01 --index=3 --mgsnode=mgs@tcp0 oss3pool/ZFS01
            Permanent disk data:
             Target: ZFS01-OST0003
             Index: 3
             Lustre FS: ZFS01
             Mount type: zfs
             Flags: 0x42
             (OST update )
             Persistent mount opts: 
             Parameters: mgsnode=172.17.32.220@tcp
            mkfs_cmd = zfs create -o canmount=off -o xattr=sa oss3pool/ZFS01
             Writing oss3pool/ZFS01 properties
             lustre:version=1
             lustre:flags=66
             lustre:index=3
             lustre:fsname=ZFS01
             lustre:svname=ZFS01-OST0003
             lustre:mgsnode=172.17.32.220@tcp
             + '[' -d /lustre/ZFS01/. ']'
             + mount -v -t lustre oss3pool/ZFS01 /lustre/ZFS01
             arg[0] = /sbin/mount.lustre
             arg[1] = -v
             arg[2] = -o
             arg[3] = rw
             arg[4] = oss3pool/ZFS01
             arg[5] = /lustre/ZFS01
             source = oss3pool/ZFS01 (oss3pool/ZFS01), target = /lustre/ZFS01
             options = rw
             checking for existing Lustre data: found
             Writing oss3pool/ZFS01 properties
             lustre:version=1
             lustre:flags=2
             lustre:index=3
             lustre:fsname=ZFS01
             lustre:svname=ZFS01-OST0003
             lustre:mgsnode=172.17.32.220@tcp
             mounting device oss3pool/ZFS01 at /lustre/ZFS01, flags=0x1000000 options=osd=osd-zfs,,mgsnode=172.17.32.220@tcp,update,param=mgsnode=172.17.32.220@tcp,svname=ZFS01-OST0003,device=oss3pool/ZFS01
            
            

             
            Here it hangs.
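
            If the VM still responds on the console or another ssh session at that point, sysrq can dump the blocked tasks, which usually shows where mount.lustre and the zfs/lustre kernel threads are stuck (a sketch, run as root):

            echo 1 > /proc/sys/kernel/sysrq
            echo w > /proc/sysrq-trigger     # dump blocked (D-state) tasks to the console/dmesg
            echo t > /proc/sysrq-trigger     # dump all task states (much more verbose)
            dmesg | tail -n 200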


          jsalians_intel John Salinas (Inactive) added a comment -

          Is the mount hanging or the system hanging? Can you provide the following for each node:

          1. cat collect_info.sh
            Now=$(date +%Y%m%d_%H%M%S |tr '\n' ' ';echo -n "$$ ";echo $HOSTNAME)
            Dir="debug_info.$Now"

          mkdir -p "$Dir"

          ./show_kernelmod_params.sh > "$Dir"/OUTPUT.show_kernelmod_params.txt
          cat /sys/kernel/debug/tracing/trace > "$Dir"/OUTPUT.kernel_debug_trace.txt
          dmesg > "$Dir"/OUTPUT.dmesg.txt 2>&1
          zpool events > "$Dir"/OUTPUT.zpool_events.txt 2>&1
          zpool events -v > "$Dir"/OUTPUT.zpool_events_verbose.txt 2>&1
          lctl dl > "$Dir"/OUTPUT.lctl_dl.txt 2>&1
          lctl dk > "$Dir"/OUTPUT.lctl_dk.txt 2>&1
          cp /var/log/messages "$Dir"/OUTPUT.messages

          jno jno (Inactive) added a comment - - edited

          OK, but why does it hang??

          PS. Yes, the EADDRINUSE error is gone, but it still hangs:

          + mkfs.lustre --reformat --replace --ost --backfstype=zfs --fsname=ZFS01 --index=3 --mgsnode=mgs@tcp0 oss3pool/ZFS01
          Permanent disk data:
          Target: ZFS01-OST0003
          Index: 3
          Lustre FS: ZFS01
          Mount type: zfs
          Flags: 0x42
           (OST update )
          Persistent mount opts: 
          Parameters: mgsnode=172.17.32.220@tcp
          mkfs_cmd = zfs create -o canmount=off -o xattr=sa oss3pool/ZFS01
          Writing oss3pool/ZFS01 properties
           lustre:version=1
           lustre:flags=66
           lustre:index=3
           lustre:fsname=ZFS01
           lustre:svname=ZFS01-OST0003
           lustre:mgsnode=172.17.32.220@tcp
          + '[' -d /lustre/ZFS01/. ']'
          + mount -v -t lustre oss3pool/ZFS01 /lustre/ZFS01
          arg[0] = /sbin/mount.lustre
          arg[1] = -v
          arg[2] = -o
          arg[3] = rw
          arg[4] = oss3pool/ZFS01
          arg[5] = /lustre/ZFS01
          source = oss3pool/ZFS01 (oss3pool/ZFS01), target = /lustre/ZFS01
          options = rw
          checking for existing Lustre data: found
          Writing oss3pool/ZFS01 properties
           lustre:version=1
           lustre:flags=2
           lustre:index=3
           lustre:fsname=ZFS01
           lustre:svname=ZFS01-OST0003
           lustre:mgsnode=172.17.32.220@tcp
          mounting device oss3pool/ZFS01 at /lustre/ZFS01, flags=0x1000000 options=osd=osd-zfs,,mgsnode=172.17.32.220@tcp,update,param=mgsnode=172.17.32.220@tcp,svname=ZFS01-OST0003,device=oss3pool/ZFS01
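
          If the node itself is still alive at this point and only the mount is stuck, the stack of the hung mount.lustre process can be grabbed from another session (a sketch, run as root):

          pid=$(pidof mount.lustre)
          cat /proc/$pid/stack
          ps axo pid,stat,wchan:32,comm | awk '$2 ~ /D/'   # list uninterruptible (D-state) tasks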
          

           


          adilger Andreas Dilger added a comment -

          The "address already in use" message means that you have previously formatted an OST with the same index for this filesystem. You could use a new index, or use the --replace option to re-use the same index.
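
          Concretely, either of the following avoids the EADDRINUSE; the --replace form matches what is used elsewhere in this ticket, and --index=4 is only an example of picking an unused index:

          # re-use index 3, telling the MGS this OST replaces the earlier registration
          mkfs.lustre --reformat --replace --ost --backfstype=zfs --fsname=ZFS01 \
              --index=3 --mgsnode=mgs@tcp0 oss3pool/ZFS01

          # or format with a previously unused index
          mkfs.lustre --reformat --ost --backfstype=zfs --fsname=ZFS01 \
              --index=4 --mgsnode=mgs@tcp0 oss3pool/ZFS01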


          People

            Assignee: wc-triage WC Triage
            Reporter: jno jno (Inactive)
            Votes: 1
            Watchers: 4
