  Lustre / LU-11633

MDS0 crash when migrating a striped dir from MDS0 to MDS1


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.12.0
    • Severity: 3

    Description

      lustre-master build 3813, tag 2.11.56_55_g4afee32, ldiskfs.
      The system is configured as:
      MDS1 - MDT1 and MDT2
      MDS2 - MDT3 and MDT4
      OSS0 - OST
      client
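
      For reference, the layout above expressed as a setup sketch. The device paths and the /mnt/mds1 and /mnt/mds2 mount points are assumptions (only /mnt/mds3 and /mnt/mds4 appear in this report); note that MDT1-MDT4 here correspond to the 0-based indices MDT0000-MDT0003 seen in the logs below.

      # on MDS1 (MGS co-located with MDT0000, per the MGC10.9.6.157@tcp lines below)
      mkfs.lustre --fsname=lustre --mgs --mdt --index=0 /dev/vdb
      mkfs.lustre --fsname=lustre --mdt --index=1 --mgsnode=10.9.6.157@tcp /dev/vdc
      mount -t lustre /dev/vdb /mnt/mds1
      mount -t lustre /dev/vdc /mnt/mds2
      # on MDS2
      mkfs.lustre --fsname=lustre --mdt --index=2 --mgsnode=10.9.6.157@tcp /dev/vdb
      mkfs.lustre --fsname=lustre --mdt --index=3 --mgsnode=10.9.6.157@tcp /dev/vdc
      mount -t lustre /dev/vdb /mnt/mds3
      mount -t lustre /dev/vdc /mnt/mds4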

      test steps (a consolidated reproducer sketch follows this list):
      1. create a striped dir on MDS1 across MDT1 and MDT2
      lfs mkdir -c 2 -i 0 /mnt/lustre/test #on client
      2. create many files under the test dir
      createmany -m /mnt/lustre/test/fxxx #on client
      3. migrate the test dir to MDT3 and MDT4 on MDS2; at the same time, umount MDT3 and MDT4
      lfs migrate -m 2,3 /mnt/lustre/test #on client
      umount /mnt/mds3 #on MDS2
      umount /mnt/mds4 #on MDS2
      4. MDS1 crashes right away
      5. the client migration hangs
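
      A consolidated reproducer sketch of the steps above. The file count and the ssh-based host placement are assumptions; in the original run the migration and the umounts were raced by hand rather than scripted.

      #!/bin/bash
      # on client: striped dir over MDT0000/MDT0001 (lfs takes 0-based MDT indices)
      lfs mkdir -c 2 -i 0 /mnt/lustre/test
      # on client: populate the dir (10000 is an assumed count; createmany is from lustre-tests)
      createmany -m /mnt/lustre/test/f 10000
      # on client: start migrating the dir to MDT0002/MDT0003 in the background ...
      lfs migrate -m 2,3 /mnt/lustre/test &
      # ... and race it with the umounts of both target MDTs on MDS2
      ssh mds2 umount /mnt/mds3
      ssh mds2 umount /mnt/mds4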

      MDS1 log

      [root@trevis-60vm4 ~]# [17465.128984] ------------[ cut here ]------------
      [17465.129586] WARNING: CPU: 0 PID: 5988 at lib/list_debug.c:33 __list_add+0xac/0xc0
      [17465.130385] list_add corruption. prev->next should be next (ffffb2ac00c91018), but was 0000000500040028. (prev=ffff975ad2ba1030).
      [17465.131574] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ldiskfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 i2c_core auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel floppy ata_piix libata 8139too serio_raw 8139cp mii virtio_pci virtio_ring virtio
      [17465.141406] CPU: 0 PID: 5988 Comm: mdt00_000 Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.14.4.el7_lustre.x86_64 #1
      [17465.142647] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [17465.143240] Call Trace:
      [17465.143520]  [<ffffffffb3313754>] dump_stack+0x19/0x1b
      [17465.144075]  [<ffffffffb2c945d8>] __warn+0xd8/0x100
      [17465.144599]  [<ffffffffb2c9465f>] warn_slowpath_fmt+0x5f/0x80
      [17465.145203]  [<ffffffffb2f6d2dc>] __list_add+0xac/0xc0
      [17465.145982]  [<ffffffffc0858bd6>] lu_object_put+0x1d6/0x3d0 [obdclass]
      [17465.146774]  [<ffffffffc0e81740>] mdt_getattr_name_lock+0x1440/0x1c30 [mdt]
      [17465.147501]  [<ffffffffc0e8bb6c>] ? old_init_ucred_common+0xdc/0x2e0 [mdt]
      [17465.148235]  [<ffffffffc0e86eb4>] mdt_getattr_name+0xc4/0x2b0 [mdt]
      [17465.149107]  [<ffffffffc0af817a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [17465.149943]  [<ffffffffb2c6a14e>] ? kvm_clock_get_cycles+0x1e/0x20
      [17465.150623]  [<ffffffffc06f6bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [17465.151433]  [<ffffffffc0a9ec2b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [17465.152266]  [<ffffffffc0a9bab5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [17465.153053]  [<ffffffffb2cd2022>] ? default_wake_function+0x12/0x20
      [17465.153719]  [<ffffffffb2cc7c4b>] ? __wake_up_common+0x5b/0x90
      [17465.154439]  [<ffffffffc0aa252c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [17465.155114]  [<ffffffffb2ccc7f0>] ? finish_task_switch+0x50/0x170
      [17465.155839]  [<ffffffffc0aa1a30>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
      [17465.156623]  [<ffffffffb2cbdf21>] kthread+0xd1/0xe0
      [17465.157192]  [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
      [17465.157848]  [<ffffffffb33255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [17465.158579]  [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
      [17465.159205] ---[ end trace 056fdb3898b88da3 ]---
      [17465.202810] BUG: unable to handle kernel paging request at 0000000200000612
      [17465.203585] IP: [<ffffffffc0a4263a>] ldlm_lock_create+0x1da/0xa40 [ptlrpc]
      [17465.204356] PGD 0 
      [17465.204599] Oops: 0000 [#1] SMP 
      [17465.204988] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ldiskfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 i2c_core auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel floppy ata_piix libata 8139too serio_raw 8139cp mii virtio_pci virtio_ring virtio
      [17465.214377] CPU: 0 PID: 5988 Comm: mdt00_000 Kdump: loaded Tainted: G        W  OE  ------------   3.10.0-862.14.4.el7_lustre.x86_64 #1
      [17465.215602] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [17465.216203] task: ffff975af5d93f40 ti: ffff975b37e80000 task.ti: ffff975b37e80000
      [17465.216980] RIP: 0010:[<ffffffffc0a4263a>]  [<ffffffffc0a4263a>] ldlm_lock_create+0x1da/0xa40 [ptlrpc]
      [17465.217972] RSP: 0018:ffff975b37e837b0  EFLAGS: 00010246
      [17465.218517] RAX: 0000000200000402 RBX: ffff975adc5300c0 RCX: ffff975b37e83fd8
      [17465.219245] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff975ad00d5790
      [17465.219971] RBP: ffff975b37e837e8 R08: 0000000000021d20 R09: ffff975ad00d5680
      [17465.220695] R10: 0000000000000005 R11: 00000002c0000404 R12: ffff975b37e83858
      [17465.221426] R13: ffff975acf364400 R14: 000000000000000d R15: ffff975ad00d5680
      [17465.222154] FS:  0000000000000000(0000) GS:ffff975b3fc00000(0000) knlGS:0000000000000000
      [17465.222980] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [17465.223566] CR2: 0000000200000612 CR3: 000000004ce0e000 CR4: 00000000000606f0
      [17465.224301] Call Trace:
      [17465.224593]  [<ffffffffc0a61843>] ldlm_cli_enqueue+0xc3/0x920 [ptlrpc]
      [17465.225295]  [<ffffffffc0a61205>] ? ldlm_prep_elc_req+0x125/0x490 [ptlrpc]
      [17465.226044]  [<ffffffffc0a5cc20>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
      [17465.226864]  [<ffffffffc0e70920>] ? mdt_obd_reconnect+0x1e0/0x1e0 [mdt]
      [17465.227568]  [<ffffffffc1078452>] osp_md_object_lock+0x162/0x2d0 [osp]
      [17465.228291]  [<ffffffffc0f91293>] lod_object_lock+0xf3/0x7b0 [lod]
      [17465.228974]  [<ffffffffc100dcee>] mdd_object_lock+0x3e/0xe0 [mdd]
      [17465.229613]  [<ffffffffc0e7e281>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
      [17465.230388]  [<ffffffffc0a9092c>] ? lustre_msg_get_flags+0x2c/0xa0 [ptlrpc]
      [17465.231115]  [<ffffffffc0e7e81a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
      [17465.231821]  [<ffffffffc0e9bfa1>] mdt_reint_migrate_internal+0x1151/0x1350 [mdt]
      [17465.232615]  [<ffffffffc08384b9>] ? class_handle_hash+0xb9/0x2f0 [obdclass]
      [17465.233362]  [<ffffffffc0a754a0>] ? ldlm_inodebits_compat_queue+0x480/0x480 [ptlrpc]
      [17465.234175]  [<ffffffffc0e9c405>] mdt_reint_rename_or_migrate.isra.51+0x265/0x860 [mdt]
      [17465.235024]  [<ffffffffc0a629d0>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
      [17465.235837]  [<ffffffffc0a5cc20>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
      [17465.236658]  [<ffffffffc0e9ca10>] mdt_reint_migrate+0x10/0x20 [mdt]
      [17465.237314]  [<ffffffffc0e9dee3>] mdt_reint_rec+0x83/0x210 [mdt]
      [17465.237948]  [<ffffffffc0e7b112>] mdt_reint_internal+0x6b2/0xa80 [mdt]
      [17465.238632]  [<ffffffffc0e86387>] mdt_reint+0x67/0x140 [mdt]
      [17465.239255]  [<ffffffffc0af817a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [17465.239971]  [<ffffffffb2c6a14e>] ? kvm_clock_get_cycles+0x1e/0x20
      [17465.240613]  [<ffffffffc06f6bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [17465.241368]  [<ffffffffc0a9ec2b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [17465.242189]  [<ffffffffc0a9bab5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [17465.242894]  [<ffffffffb2cd2022>] ? default_wake_function+0x12/0x20
      [17465.243539]  [<ffffffffb2cc7c4b>] ? __wake_up_common+0x5b/0x90
      [17465.244169]  [<ffffffffc0aa252c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [17465.244824]  [<ffffffffb2ccc7f0>] ? finish_task_switch+0x50/0x170
      [17465.245488]  [<ffffffffc0aa1a30>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
      [17465.246251]  [<ffffffffb2cbdf21>] kthread+0xd1/0xe0
      [17465.246764]  [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
      [17465.247385]  [<ffffffffb33255f7>] ret_from_fork_nospec_begin+0x21/0x21
      [17465.248061]  [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
      [17465.248687] Code: 49 8d 87 10 02 00 00 49 c7 87 90 00 00 00 00 00 00 00 ba 01 00 00 00 49 89 87 10 02 00 00 49 89 87 18 02 00 00 48 8b 03 48 8b 00 <48> 8b b8 10 02 00 00 e8 1a 57 df ff 4d 89 3f 48 c7 c6 f0 e8 b6 
      [17465.251906] RIP  [<ffffffffc0a4263a>] ldlm_lock_create+0x1da/0xa40 [ptlrpc]
      [17465.252662]  RSP <ffff975b37e837b0>
      [17465.253026] CR2: 0000000200000612
      [    0.000000] Initializing cgroup subsys cpuset
      [    0.000000] Initializing cgroup subsys cpu
      [    0.000000] Initializing cgroup subsys cpuacct
      [    0.000000] Linux version 3.10.0-862.14.4.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Oct 12 14:51:33 UTC 2018
      [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.el7_lustre.x86_64 ro console=tty0 LANG=en_US.UTF-8 console=ttyS0,115200 net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr disable_cpu_apicid=0 elfcorehdr=867700K
      [    0.000000] e820: BIOS-provided physical RAM map:
      [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009f7ff] usable
      [    0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
      [    0.000000] BIOS-e820: [mem 0x0000000021000000-0x0000000034f5cfff] usable
      [    0.000000] BIOS-e820: [mem 0x0000000034fff800-0x0000000034ffffff] usable
      [    0.000000] BIOS-e820: [mem 0x000000007fffa000-0x000000007fffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
      [    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
      [    0.000000] NX (Execute Disable) protection: active
      [    0.000000] SMBIOS 2.4 present.
      [    0.000000] Hypervisor detected: KVM
      [    0.000000] e820: last_pfn = 0x35000 max_arch_pfn = 0x400000000
      [    0.000000] PAT configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- UC  
      [    0.000000] x2apic enabled by BIOS, switching to x2apic ops
      [    0.000000] found SMP MP-table at [mem 0x000f72e0-0x000f72ef] mapped at [ffffffffff2002e0]
      [    0.000000] Using GB pages for direct mapping
      [    0.000000] RAMDISK: [mem 0x321ea000-0x337fffff]
      
      

      client log

      [root@trevis-60vm7 ~]# lfs migrate -m 2,3 /mnt/lustre/test/
      [17823.455189] Lustre: 6089:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1541550002/real 1541550002]  req@ffff92227859b300 x1616413745888192/t0(0) o36->lustre-MDT0000-mdc-ffff9222b782c800@10.9.6.157@tcp:12/10 lens 728/624 e 0 to 1 dl 1541550009 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [17823.458255] Lustre: lustre-MDT0000-mdc-ffff9222b782c800: Connection to lustre-MDT0000 (at 10.9.6.157@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [17854.248785] Lustre: 4779:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1541550034/real 1541550040]  req@ffff922278599e00 x1616413745888288/t0(0) o400->lustre-MDT0001-mdc-ffff9222b782c800@10.9.6.157@tcp:12/10 lens 224/224 e 0 to 1 dl 1541550041 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [17854.251954] Lustre: lustre-MDT0001-mdc-ffff9222b782c800: Connection to lustre-MDT0001 (at 10.9.6.157@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [17854.253728] Lustre: Skipped 1 previous similar message
      [17854.254295] LustreError: 166-1: MGC10.9.6.157@tcp: Connection to MGS (at 10.9.6.157@tcp) was lost; in progress operations using this service will fail
      [17855.516636] Lustre: 4780:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1541550034/real 1541550034]  req@ffff922278598600 x1616413745888320/t0(0) o400->lustre-MDT0003-mdc-ffff9222b782c800@10.9.6.158@tcp:12/10 lens 224/224 e 0 to 1 dl 1541550041 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [17855.519693] Lustre: 4780:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [17855.520691] Lustre: lustre-MDT0003-mdc-ffff9222b782c800: Connection to lustre-MDT0003 (at 10.9.6.158@tcp) was lost; in progress operations using this service will wait for recovery to complete
      

      6. after I remount all MDTs (1-4), the client migration continues, but with errors:

      [18614.098819] Lustre: lustre-MDT0003-mdc-ffff9222b782c800: Connection restored to 10.9.6.158@tcp (at 10.9.6.158@tcp)
      [18633.134556] Lustre: Evicted from MGS (at 10.9.6.157@tcp) after server handle changed from 0xf72204ebf2d9e0fb to 0x91bae19fcc49671c
      [18633.135986] Lustre: MGC10.9.6.157@tcp: Connection restored to 10.9.6.157@tcp (at 10.9.6.157@tcp)
      [18634.125083] LustreError: 4778:0:(client.c:3023:ptlrpc_replay_interpret()) @@@ status 301, old was 0  req@ffff922278599b00 x1616413745888112/t4295026907(4295026907) o101->lustre-MDT0000-mdc-ffff9222b782c800@10.9.6.157@tcp:12/10 lens 640/560 e 0 to 0 dl 1541550827 ref 2 fl Interpret:R/4/0 rc 301/301
      [18634.541841] Lustre: lustre-MDT0001-mdc-ffff9222b782c800: Connection restored to 10.9.6.157@tcp (at 10.9.6.157@tcp)
      /mnt/lustre/test//58239 migrate failed: Stale file handle (-116)
      /mnt/lustre/test//71151 migrate failed: No data available (-61)
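
      As an aside (not part of the original run), after such a remount the recovery state of each MDT can be watched with a standard lctl query:

      # on each MDS: show recovery progress for its local MDTs
      lctl get_param mdt.*.recovery_status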
      

      7. "ls /mnt/lustre/test" can show the files; then I try to rm the whole dir and files, the files got removed, but return "dir not empty" error. "ls -al" shows "total 4444" but nothing under the dir

      [root@trevis-60vm7 lustre]# rm -rf test/
      rm: cannot remove ‘test/’: Directory not empty
      [root@trevis-60vm7 lustre]# ls test/
      [root@trevis-60vm7 lustre]# ls -al test/
      total 4444
      drwxr-xr-x 2 root root 4538368 Nov  7 00:50 .
      drwxr-xr-x 4 root root    4096 Nov  7 00:20 ..
      [root@trevis-60vm7 lustre]# 
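
      To inspect what is left of the half-migrated directory, the standard lfs queries below can be used (a hedged aside, not part of the original report):

      # show the directory's FID and which MDTs still hold its stripes
      lfs path2fid /mnt/lustre/test
      lfs getdirstripe /mnt/lustre/test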
      

            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: Sarah Liu (sarah)
              Votes: 0
              Watchers: 3
