Lustre / LU-15364

Kernel oops when creating a striped directory across multiple MDTs on an Arm64 server


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.15.0
    • Affects Version/s: None
    • Environment: Arm64, v8.0, virtual machine, all-in-one node

    Description

   All-in-one Arm64 node, with the configuration set up as below:

       

      mds_HOST="lustre-build.novalocal"
      MDSCOUNT=2
      mds1_HOST=$mds_HOST
      MDSDEV1=/dev/vdb
      mds1_MOUNT=/mnt/mdtb
      mds2_HOST=$mds_HOST
      MDSDEV2=/dev/vdc
      mds2_MOUNT=/mnt/mdtc
      
      OSTCOUNT=2
      ost_HOST="lustre-build.novalocal"
      ost1_HOST=$mds_HOST
      OSTDEV1=/dev/vdd
      ost1_MOUNT=/mnt/ost
      ost2_HOST=$mds_HOST
      OSTDEV2=/dev/vde
      ost2_MOUNT=/mnt/ostc 


      Set up 2 MDTs on this node.

      After running ./llmount.sh, we run the command:

      lfs mkdir -i1 -c2 -H crush /mnt/lustre/test.sanity

      We then see a kernel oops; the dmesg output looks like this:

      [67451.989655] Lustre: trans no 4294967299 committed transno 4294967299
      [67451.994582] Lustre: NRS stop fifo request from 12345-0@lo, seq: 37
      [67451.995916] Lustre: lustre-MDT0000-osp-MDT0001: committing for last_committed 4294967299 gen 2
      [67451.998986] Lustre: Completed RPC req@000000002a9095b6 pname:cluuid:pid:xid:nid:opc:job osp_up0-1:lustre-MDT0001-mdtlov_UUID:10524:1718654554184768:0@lo:1000:osp_up0-1.0
      [67452.002300] Lustre: ou 00000000edcca9a9 version 3 rpc_version 3
      [67452.003592] Lustre: Sending RPC req@00000000900684ad pname:cluuid:pid:xid:nid:opc:job osp_up0-1:lustre-MDT0001-mdtlov_UUID:10524:1718654554184832:0@lo:1000:osp_up0-1.0
      [67452.006792] Lustre: peer: 12345-0@lo (source: 12345-0@lo)
      [67452.007970] Lustre: set 000000003bee5318 going to sleep for 6 seconds
      [67452.007986] Lustre: got req x1718654554184832
      [67452.010276] Lustre: NRS start fifo request from 12345-0@lo, seq: 38
      [67452.011603] Lustre: Handling RPC req@0000000075731868 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out05_002:lustre-MDT0001-mdtlov_UUID+6:10524:x1718654554184832:12345-0@lo:1000:osp_up0-1.0
      [67452.021774] Lustre: lustre-MDT0000: transno 4294967300 is committed
      [67452.023272] Lustre: Handled RPC req@0000000075731868 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out05_002:lustre-MDT0001-mdtlov_UUID+6:10524:x1718654554184832:12345-0@lo:1000:osp_up0-1.0 Request processed in 11665us (16473us total) trans 4294967300 rc 0/0
      [67452.023284] Lustre: trans no 4294967300 committed transno 4294967300
      [67452.027833] Lustre: NRS stop fifo request from 12345-0@lo, seq: 38
      [67452.029139] Lustre: lustre-MDT0000-osp-MDT0001: committing for last_committed 4294967300 gen 2
      [67452.032292] Lustre: Completed RPC req@00000000900684ad pname:cluuid:pid:xid:nid:opc:job osp_up0-1:lustre-MDT0001-mdtlov_UUID:10524:1718654554184832:0@lo:1000:osp_up0-1.0
      [67452.035424] Lustre: ou 00000000edcca9a9 version 4 rpc_version 4
      [67452.036812] Lustre: Sending RPC req@00000000c14ec545 pname:cluuid:pid:xid:nid:opc:job osp_up0-1:lustre-MDT0001-mdtlov_UUID:10524:1718654554184896:0@lo:1000:osp_up0-1.0
      [67452.039895] Lustre: peer: 12345-0@lo (source: 12345-0@lo)
      [67452.041057] Lustre: set 0000000077eb37df going to sleep for 6 seconds
      [67452.041071] Lustre: got req x1718654554184896
      [67452.043452] Lustre: NRS start fifo request from 12345-0@lo, seq: 39
      [67452.044766] Lustre: Handling RPC req@000000008399fed6 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out05_002:lustre-MDT0001-mdtlov_UUID+6:10524:x1718654554184896:12345-0@lo:1000:osp_up0-1.0
      [67452.049965] Lustre: Handled RPC req@000000008399fed6 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out05_002:lustre-MDT0001-mdtlov_UUID+7:10524:x1718654554184896:12345-0@lo:1000:osp_up0-1.0 Request processed in 5184us (10052us total) trans 4294967301 rc 0/0
      [67452.050038] Lustre: Completed RPC req@00000000c14ec545 pname:cluuid:pid:xid:nid:opc:job osp_up0-1:lustre-MDT0001-mdtlov_UUID:10524:1718654554184896:0@lo:1000:osp_up0-1.0
      [67452.054980] Lustre: NRS stop fifo request from 12345-0@lo, seq: 39
      [67452.065583] Lustre: ### ldlm_lock_addref(PW) ns: mdt-lustre-MDT0001_UUID lock: 000000001c761af5/0xae74a97a4f2331dd lrc: 3/0,1 mode: --/PW res: [0x240000402:0x1:0x0].0x0 bits 0x0/0x0 rrc: 2 type: IBT gid 0 flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 10246 timeout: 0 lvb_type: 0
      [67452.070936] Lustre: ### About to add lock: ns: mdt-lustre-MDT0001_UUID lock: 000000001c761af5/0xae74a97a4f2331dd lrc: 3/0,1 mode: PW/PW res: [0x240000402:0x1:0x0].0x0 bits 0x2/0x0 rrc: 2 type: IBT gid 0 flags: 0x50210001000000 nid: local remote: 0x0 expref: -99 pid: 10246 timeout: 0 lvb_type: 0
      [67452.076701] Lustre: ### client-side local enqueue handler, new lock created ns: mdt-lustre-MDT0001_UUID lock: 000000001c761af5/0xae74a97a4f2331dd lrc: 3/0,1 mode: PW/PW res: [0x240000402:0x1:0x0].0x0 bits 0x2/0x0 rrc: 2 type: IBT gid 0 flags: 0x40210001000000 nid: local remote: 0x0 expref: -99 pid: 10246 timeout: 0 lvb_type: 0
      [67452.082993] Lustre: ### ldlm_lock_addref(PW) ns: mdt-lustre-MDT0001_UUID lock: 00000000324d3623/0xae74a97a4f2331e4 lrc: 3/0,1 mode: --/PW res: [0x240000400:0x2:0x0].0x0 bits 0x0/0x0 rrc: 2 type: IBT gid 0 flags: 0x40000000000000 nid: local remote: 0x0 expref: -99 pid: 10246 timeout: 0 lvb_type: 0
      [67452.088849] Lustre: ### About to add lock: ns: mdt-lustre-MDT0001_UUID lock: 00000000324d3623/0xae74a97a4f2331e4 lrc: 3/0,1 mode: PW/PW res: [0x240000400:0x2:0x0].0x0 bits 0x2/0x0 rrc: 2 type: IBT gid 0 flags: 0x50210001000000 nid: local remote: 0x0 expref: -99 pid: 10246 timeout: 0 lvb_type: 0
      [67452.094616] Lustre: ### client-side local enqueue handler, new lock created ns: mdt-lustre-MDT0001_UUID lock: 00000000324d3623/0xae74a97a4f2331e4 lrc: 3/0,1 mode: PW/PW res: [0x240000400:0x2:0x0].0x0 bits 0x2/0x0 rrc: 2 type: IBT gid 0 flags: 0x40210001000000 nid: local remote: 0x0 expref: -99 pid: 10246 timeout: 0 lvb_type: 0
      [67452.100790] Unable to handle kernel paging request at virtual address ffffb6d6a5c60804
      [67452.102606] Mem abort info:
      [67452.103219]   ESR = 0x96000021
      [67452.103865]   Exception class = DABT (current EL), IL = 32 bits
      [67452.105141]   SET = 0, FnV = 0
      [67452.105816]   EA = 0, S1PTW = 0
      [67452.106492] Data abort info:
      [67452.107096]   ISV = 0, ISS = 0x00000021
      [67452.107912]   CM = 0, WnR = 0
      [67452.108564] swapper pgtable: 64k pages, 48-bit VAs, pgdp = 000000008ce20289
      [67452.110150] [ffffb6d6a5c60804] pgd=000000083ffd0003, pud=000000083ffd0003, pmd=000000083ff30003, pte=00e8000165c60f13
      [67452.112534] Internal error: Oops: 96000021 [#1] SMP
      [67452.113564] Modules linked in: lustre(OE) ofd(OE) osp(OE) lod(OE) ost(OE) mdt(OE) mdd(OE) mgs(OE) osd_ldiskfs(OE) ldiskfs(OE) mbcache jbd2 lquota(OE) lfsck(OE) obdecho(OE) mgc(OE) mdc(OE) lov(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) crc32_generic libcfs(OE) dm_flakey vfat fat virtio_gpu crct10dif_ce drm_kms_helper ghash_ce sha2_ce drm sha256_arm64 fb_sys_fops syscopyarea sysfillrect sha1_ce sysimgblt virtio_balloon binfmt_misc xfs libcrc32c virtio_net net_failover virtio_blk failover virtio_mmio sunrpc dm_mirror dm_region_hash dm_log dm_mod
      [67452.124678] CPU: 3 PID: 10246 Comm: mdt01_002 Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-348.2.1.el8_lustre_debug_debug.aarch64 #1
      [67452.127604] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [67452.129116] pstate: 10000005 (nzcV daif -PAN -UAO)
      [67452.130187] pc : __ll_sc_atomic64_or+0x4/0x18
      [67452.131178] lr : lod_object_lock+0x81c/0x15c0 [lod]
      [67452.131804] Lustre: Sending RPC req@00000000880f234c pname:cluuid:pid:xid:nid:opc:job ptlrpcd_06_00:lustre-MDT0001-mdtlov_UUID:8379:1718654554184960:0@lo:41:osp-pre-0-1.0
      [67452.132273] sp : ffffb6d68a5a7270
      [67452.135807] Lustre: peer: 12345-0@lo (source: 12345-0@lo)
      [67452.136435] x29: ffffb6d68a5a72b0 x28: ffff20000ae70280
      [67452.137672] Lustre: got req x1718654554184960
      [67452.138704] x27: ffff20007517e888 x26: 0000000000000001
      [67452.139760] Lustre: NRS start fifo request from 12345-0@lo, seq: 67
      [67452.140824] x25: 0000000000000001 x24: 0000000000000001
      [67452.142263] Lustre: Handling RPC req@00000000ce02a9f7 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out06_002:lustre-MDT0001-mdtlov_UUID+7:8379:x1718654554184960:12345-0@lo:41:osp-pre-0-1.0
      [67452.143359] x23: ffffb6d80566a150 x22: ffffb6d6a766e150
      [67452.146813] Lustre: blocks cached 0 granted 2146304 pending 0 free 126251008 avail 114745344
      [67452.147920] x21: ffffb6d6bc9ca7d0 x20: ffffb6d8056415c8
      [67452.149808] Lustre: Handled RPC req@00000000ce02a9f7 pname:cluuid+ref:pid:xid:nid:opc:job mdt_out06_002:lustre-MDT0001-mdtlov_UUID+7:8379:x1718654554184960:12345-0@lo:41:osp-pre-0-1.0 Request processed in 7546us (13999us total) trans 0 rc 0/0
      [67452.149851] Lustre: Completed RPC req@00000000880f234c pname:cluuid:pid:xid:nid:opc:job ptlrpcd_06_00:lustre-MDT0001-mdtlov_UUID:8379:1718654554184960:0@lo:41:osp-pre-0-1.0
      [67452.150828] x19: 0000000000000008 x18: 0000000000000000
      [67452.155565] Lustre: NRS stop fifo request from 12345-0@lo, seq: 67
      [67452.158845] x17: 0000000000000000 x16: ffff200072dfc718
      [67452.162462] x15: dfff200000000000 x14: 636f6c203a64696e
      [67452.163623] x13: 0000000000000000 x12: ffff16dad4ecdc2e
      [67452.164801] x11: 1ffff6dad4ecdc2d x10: ffff16dad4ecdc2d
      [67452.165982] x9 : 0000000000000000 x8 : 0000000000000000
      [67452.167180] x7 : 1ffff6db00acd42a x6 : ffff16dad4ecdc2e
      [67452.168351] x5 : ffffb6d6a766e158 x4 : 0000000000000000
      [67452.169529] x3 : 0000000000000000 x2 : 0000000000000000
      [67452.170696] x1 : ffffb6d6a5c60804 x0 : 0000000000000002
      [67452.171910] Process mdt01_002 (pid: 10246, stack limit = 0x000000006659ab27)
      [67452.173530] Call trace:
      [67452.174090]  __ll_sc_atomic64_or+0x4/0x18
      [67452.174994]  mdd_object_lock+0xac/0x170 [mdd]
      [67452.175992]  mdt_reint_striped_lock+0x494/0xf10 [mdt]
      [67452.177185]  mdt_create+0x23c8/0x4818 [mdt]
      [67452.178150]  mdt_reint_create+0x6c4/0xbb8 [mdt]
      [67452.179201]  mdt_reint_rec+0x27c/0x708 [mdt]
      [67452.180176]  mdt_reint_internal+0xbd4/0x2408 [mdt]
      [67452.181292]  mdt_reint+0x190/0x378 [mdt]
      [67452.182314]  tgt_handle_request0+0x238/0x1368 [ptlrpc]
      [67452.183587]  tgt_request_handle+0x1364/0x3ec0 [ptlrpc]
      [67452.184872]  ptlrpc_server_handle_request+0x9ec/0x28d0 [ptlrpc]
      [67452.186329]  ptlrpc_main+0x1aa4/0x3f68 [ptlrpc]
      [67452.187371]  kthread+0x3b0/0x460
      [67452.188119]  ret_from_fork+0x10/0x18
      [67452.188955] Code: f84107fe d65f03c0 d503201f f9800031 (c85f7c31)
      [67452.190637] SMP: stopping secondary CPUs
      [67452.200205] Starting crashdump kernel...
      [67452.201386] Bye! 

       

            People

              Assignee: Kevin Zhao
              Reporter: Kevin Zhao