Lustre / LU-11832

ARM servers crashing on MDS startup

Details

    • Type: Bug
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: Lustre 2.14.0
    • Affects Version/s: Lustre 2.12.0
    • Severity: 3

    Description

      I am trying to add some CentOS 7 ARM server testing to my setup, and the servers crash when I mount the MDS.

      With SELinux enabled, the mount oopses like this:

      [  617.809020] Unable to handle kernel NULL pointer dereference at virtual address 00000000
      [  617.809487] Mem abort info:
      [  617.809701]   Exception class = DABT (current EL), IL = 32 bits
      [  617.809968]   SET = 0, FnV = 0
      [  617.810146]   EA = 0, S1PTW = 0
      [  617.810312] Data abort info:
      [  617.810463]   ISV = 0, ISS = 0x00000007
      [  617.810665]   CM = 0, WnR = 0
      [  617.810864] user pgtable: 64k pages, 48-bit VAs, pgd = ffff8000c76d9200
      [  617.811221] [0000000000000000] *pgd=00000000a1370003, *pud=00000000a1370003, *pmd=00000000a1e90003, *pte=0000000000000000
      [  617.812422] Internal error: Oops: 96000007 [#1] SMP
      [  617.812842] Modules linked in: loop dm_flakey dm_mod lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) ext4 mbcache jbd2 sunrpc vfat fat crc32_ce ghash_ce sha2_ce sha256_arm64 sha1_ce sg virtio_rng ip_tables xfs libcrc32c virtio_scsi virtio_net virtio_blk virtio_console virtio_pci virtio_mmio virtio_ring virtio
      [  617.815784] CPU: 0 PID: 5095 Comm: mount.lustre Tainted: G           OE  ------------   4.14.0 #1
      [  617.816218] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
      [  617.816626] task: ffff8000c8f5dc00 task.stack: ffff0000211a0000
      [  617.817343] PC is at selinux_file_permission+0x68/0x154
      [  617.817631] LR is at selinux_file_permission+0x68/0x154
      [  617.817893] pc : [<ffff0000083614e0>] lr : [<ffff0000083614e0>] pstate: 60000005
      [  617.818238] sp : ffff0000211af380
      [  617.818406] x29: ffff0000211af380 x28: ffff000000b81000 
      [  617.818723] x27: ffff8000dc101000 x26: ffff000000b81004 
      [  617.819006] x25: ffff0000014c0440 x24: ffff000008d13c08 
      [  617.819273] x23: 0000000000000893 x22: 0000000000000000 
      [  617.819537] x21: ffff8000ddee1280 x20: 0000000000000004 
      [  617.819811] x19: ffff8000c3229248 x18: 0000ffff917be400 
      [  617.820104] x17: 0000000000000000 x16: ffff8000c8f5dc00 
      [  617.820380] x15: 000000000284cc40 x14: 0000000000000000 
      [  617.820647] x13: 0000000000000b88 x12: 0000000000000018 
      [  617.820911] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
      [  617.821189] x9 : 0000000000000000 x8 : ffff000008d13c08 
      [  617.821500] x7 : 0000000000000040 x6 : 7460e9b027bc9900 
      [  617.821846] x5 : 0000000000010000 x4 : 000000000000f450 
      [  617.822159] x3 : ffff000000b81000 x2 : 0000000000000000 
      [  617.822428] x1 : 0000000000000000 x0 : 0000000000000000 
      [  617.822787] Process mount.lustre (pid: 5095, stack limit = 0xffff0000211a0000)
      [  617.823305] Call trace:
      [  617.823657] Exception stack(0xffff0000211af240 to 0xffff0000211af380)
      [  617.824175] f240: 0000000000000000 0000000000000000 0000000000000000 ffff000000b81000
      [  617.824586] f260: 000000000000f450 0000000000010000 7460e9b027bc9900 0000000000000040
      [  617.824946] f280: ffff000008d13c08 0000000000000000 7f7f7f7f7f7f7f7f 0101010101010101
      [  617.825309] f2a0: 0000000000000018 0000000000000b88 0000000000000000 000000000284cc40
      [  617.825667] f2c0: ffff8000c8f5dc00 0000000000000000 0000ffff917be400 ffff8000c3229248
      [  617.826057] f2e0: 0000000000000004 ffff8000ddee1280 0000000000000000 0000000000000893
      [  617.826497] f300: ffff000008d13c08 ffff0000014c0440 ffff000000b81004 ffff8000dc101000
      [  617.826884] f320: ffff000000b81000 ffff0000211af380 ffff0000083614e0 ffff0000211af380
      [  617.827322] f340: ffff0000083614e0 0000000060000005 ffff8000c3229248 0000000000000004
      [  617.827729] f360: 0001000000000000 0000000000000000 ffff0000211af380 ffff0000083614e0
      [  617.828421] [<ffff0000083614e0>] selinux_file_permission+0x68/0x154
      [  617.828765] [<ffff000008356848>] security_file_permission+0x58/0xf8
      [  617.829110] [<ffff0000082b1798>] iterate_dir+0x44/0x1b8
      [  617.830573] [<ffff000001e529f0>] osd_ios_general_scan+0xf8/0x2b0 [osd_ldiskfs]
      [  617.831760] [<ffff000001e5b8d4>] osd_initial_OI_scrub+0x9c/0x13e0 [osd_ldiskfs]
      [  617.832909] [<ffff000001e5daac>] osd_scrub_setup+0xb44/0x1118 [osd_ldiskfs]
      [  617.833977] [<ffff000001e2d4ec>] osd_device_alloc+0x544/0x950 [osd_ldiskfs]
      [  617.836078] [<ffff000000eb9d9c>] class_setup+0x7bc/0xd20 [obdclass]
      [  617.838397] [<ffff000000ec3a20>] class_process_config+0x1708/0x2e90 [obdclass]
      [  617.840457] [<ffff000000eca358>] do_lcfg+0x2b0/0x6d8 [obdclass]
      [  617.842867] [<ffff000000ecf48c>] lustre_start_simple+0x154/0x3f8 [obdclass]
      [  617.844903] [<ffff000000f04ed0>] osd_start+0x500/0xa40 [obdclass]
      [  617.847245] [<ffff000000f10a64>] server_fill_super+0x1d4/0x1848 [obdclass]
      [  617.849294] [<ffff000000ed3794>] lustre_fill_super+0x62c/0xdb0 [obdclass]
      [  617.849680] [<ffff0000082a02b4>] mount_nodev+0x5c/0xbc
      [  617.852008] [<ffff000000ecadb4>] lustre_mount+0x4c/0x80 [obdclass]
      [  617.852371] [<ffff0000082a12f8>] mount_fs+0x54/0x16c
      [  617.852627] [<ffff0000082bfb40>] vfs_kern_mount+0x58/0x154
      [  617.852886] [<ffff0000082c2fcc>] do_mount+0x1cc/0xbac
      [  617.853191] [<ffff0000082c3d34>] SyS_mount+0x88/0xd4
      [  617.853463] Exception stack(0xffff0000211afec0 to 0xffff0000211b0000)
      [  617.853771] fec0: 00000000315a0050 0000ffffd141a180 000000000040e098 0000000001000000
      [  617.854155] fee0: 00000000315a0070 0000000000000bd0 0000ffff93f0add4 0000000000000000
      [  617.854520] ff00: 0000000000000028 1999999999999999 00000000ffffffff 0000000000000005
      [  617.854871] ff20: 0000000000000005 ffffffffffffffff 000000008408bd8e 0000001ffa1dea16
      [  617.855231] ff40: 0000ffff93fa0000 00000000004301d0 0000ffffd1413860 0000ffffd1417168
      [  617.855588] ff60: 0000ffffd14171a0 0000000000000000 00000000315a0070 0000000000000000
      [  617.855980] ff80: 000000000042f000 00000000fffffff5 0000ffffd141c168 000000000042f000
      [  617.856356] ffa0: 0000ffffd1413e40 0000ffffd1413a90 0000000000404868 0000ffffd1413a90
      [  617.856746] ffc0: 0000ffff93fa0008 0000000080000000 00000000315a0050 0000000000000028
      [  617.857121] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [  617.857501] [<ffff00000808359c>] __sys_trace_return+0x0/0x4
      [  617.857947] Code: d2800001 aa1503e0 52800022 97fff630 (b94002c0) 
      [  617.858923] ---[ end trace 007561cc33cd3443 ]---
      [  617.859326] Kernel panic - not syncing: Fatal exception
      [  617.859686] SMP: stopping secondary CPUs
      [  617.860166] Kernel Offset: disabled
      [  617.860311] CPU features: 0x1802082
      [  617.860452] Memory Limit: none
      [  617.860660] ---[ end Kernel panic - not syncing: Fatal exception
      

      This is not supposed to happen, since we have handled SELinux in the past.

      The other problem: once I disable SELinux, the MDS mount hangs instead:

      [  243.391052] INFO: task mount.lustre:2636 blocked for more than 120 seconds.
      [  243.393134]       Tainted: G           OE  ------------   4.14.0 #1
      [  243.394963] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  243.399802] mount.lustre    D    0  2636   2635 0x00000224
      [  243.401896] Call trace:
      [  243.403266] [<ffff000008085a8c>] __switch_to+0x8c/0xa8
      [  243.404571] [<ffff0000088154d0>] __schedule+0x328/0x860
      [  243.406277] [<ffff000008815a3c>] schedule+0x34/0x8c
      [  243.407956] [<ffff000008818f60>] rwsem_down_write_failed+0x134/0x238
      [  243.410015] [<ffff00000881839c>] down_write+0x54/0x58
      [  243.411962] [<ffff00000281b390>] osd_ios_root_fill+0xd0/0x578 [osd_ldiskfs]
      [  243.415804] [<ffff000002681798>] call_filldir+0xd8/0x148 [ldiskfs]
      [  243.419054] [<ffff000002682170>] ldiskfs_readdir+0x670/0x7b8 [ldiskfs]
      [  243.420477] [<ffff0000082b18a4>] iterate_dir+0x150/0x1b8
      [  243.421835] [<ffff0000028129f0>] osd_ios_general_scan+0xf8/0x2b0 [osd_ldiskfs]
      [  243.423755] [<ffff00000281b8d4>] osd_initial_OI_scrub+0x9c/0x13e0 [osd_ldiskfs]
      [  243.425517] [<ffff00000281daac>] osd_scrub_setup+0xb44/0x1118 [osd_ldiskfs]
      [  243.427166] [<ffff0000027ed4ec>] osd_device_alloc+0x544/0x950 [osd_ldiskfs]
      [  243.428993] [<ffff000001b79d9c>] class_setup+0x7bc/0xd20 [obdclass]
      [  243.430611] [<ffff000001b83a20>] class_process_config+0x1708/0x2e90 [obdclass]
      [  243.432616] [<ffff000001b8a358>] do_lcfg+0x2b0/0x6d8 [obdclass]
      [  243.434258] [<ffff000001b8f48c>] lustre_start_simple+0x154/0x3f8 [obdclass]
      [  243.436161] [<ffff000001bc4ed0>] osd_start+0x500/0xa40 [obdclass]
      [  243.438178] [<ffff000001bd0a64>] server_fill_super+0x1d4/0x1848 [obdclass]
      [  243.440867] [<ffff000001b93794>] lustre_fill_super+0x62c/0xdb0 [obdclass]
      [  243.443388] [<ffff0000082a02b4>] mount_nodev+0x5c/0xbc
      [  243.445407] [<ffff000001b8adb4>] lustre_mount+0x4c/0x80 [obdclass]
      [  243.447436] [<ffff0000082a12f8>] mount_fs+0x54/0x16c
      [  243.449257] [<ffff0000082bfb40>] vfs_kern_mount+0x58/0x154
      [  243.456371] [<ffff0000082c2fcc>] do_mount+0x1cc/0xbac
      [  243.458503] [<ffff0000082c3d34>] SyS_mount+0x88/0xd4
      [  243.460257] Exception stack(0xffff00001022fec0 to 0xffff000010230000)
      [  243.461894] fec0: 00000000057a0030 0000ffffd373c5b0 000000000040e098 0000000001000000
      [  243.464401] fee0: 00000000057a0050 0000000000000bd0 0000ffff8746add4 0000000000000000
      [  243.466183] ff00: 0000000000000028 1999999999999999 00000000ffffffff 0000000000000005
      [  243.467956] ff20: 0000000000000005 ffffffffffffffff 0000000098866d56 00000024f08e838c
      [  243.469742] ff40: 0000ffff87500000 00000000004301d0 0000ffffd3735c90 0000ffffd3739598
      [  243.471617] ff60: 0000ffffd37395d0 0000000000000000 00000000057a0050 0000000000000000
      [  243.474134] ff80: 000000000042f000 00000000fffffff5 0000ffffd373e598 000000000042f000
      [  243.476517] ffa0: 0000ffffd3736270 0000ffffd3735ec0 0000000000404868 0000ffffd3735ec0
      [  243.478290] ffc0: 0000ffff87500008 0000000080000000 00000000057a0030 0000000000000028
      [  243.480078] ffe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [  243.481970] [<ffff00000808359c>] __sys_trace_return+0x0/0x4
      

      I am using the patch from LU-11200 to enable server-side/ldiskfs builds.


            People

              simmonsja James A Simmons
              green Oleg Drokin
              Votes: 0
              Watchers: 6