[LU-11633] MDS0 crash when migrate stripe dir from MDS0 to MDS1 Created: 07/Nov/18 Updated: 12/Nov/18 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Sarah Liu | Assignee: | Lai Siyao |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
lustre-master build 3813, tag-2.11.56_55_g4afee32, ldiskfs.

test steps:

MDS1 log:

[root@trevis-60vm4 ~]#
[17465.128984] ------------[ cut here ]------------
[17465.129586] WARNING: CPU: 0 PID: 5988 at lib/list_debug.c:33 __list_add+0xac/0xc0
[17465.130385] list_add corruption. prev->next should be next (ffffb2ac00c91018), but was 0000000500040028. (prev=ffff975ad2ba1030).
[17465.131574] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ldiskfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 i2c_core auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel floppy ata_piix libata 8139too serio_raw 8139cp mii virtio_pci virtio_ring virtio
[17465.141406] CPU: 0 PID: 5988 Comm: mdt00_000 Kdump: loaded Tainted: G OE ------------ 3.10.0-862.14.4.el7_lustre.x86_64 #1
[17465.142647] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17465.143240] Call Trace:
[17465.143520] [<ffffffffb3313754>] dump_stack+0x19/0x1b
[17465.144075] [<ffffffffb2c945d8>] __warn+0xd8/0x100
[17465.144599] [<ffffffffb2c9465f>] warn_slowpath_fmt+0x5f/0x80
[17465.145203] [<ffffffffb2f6d2dc>] __list_add+0xac/0xc0
[17465.145982] [<ffffffffc0858bd6>] lu_object_put+0x1d6/0x3d0 [obdclass]
[17465.146774] [<ffffffffc0e81740>] mdt_getattr_name_lock+0x1440/0x1c30 [mdt]
[17465.147501] [<ffffffffc0e8bb6c>] ? old_init_ucred_common+0xdc/0x2e0 [mdt]
[17465.148235] [<ffffffffc0e86eb4>] mdt_getattr_name+0xc4/0x2b0 [mdt]
[17465.149107] [<ffffffffc0af817a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[17465.149943] [<ffffffffb2c6a14e>] ? kvm_clock_get_cycles+0x1e/0x20
[17465.150623] [<ffffffffc06f6bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[17465.151433] [<ffffffffc0a9ec2b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[17465.152266] [<ffffffffc0a9bab5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[17465.153053] [<ffffffffb2cd2022>] ? default_wake_function+0x12/0x20
[17465.153719] [<ffffffffb2cc7c4b>] ? __wake_up_common+0x5b/0x90
[17465.154439] [<ffffffffc0aa252c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[17465.155114] [<ffffffffb2ccc7f0>] ? finish_task_switch+0x50/0x170
[17465.155839] [<ffffffffc0aa1a30>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
[17465.156623] [<ffffffffb2cbdf21>] kthread+0xd1/0xe0
[17465.157192] [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
[17465.157848] [<ffffffffb33255f7>] ret_from_fork_nospec_begin+0x21/0x21
[17465.158579] [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
[17465.159205] ---[ end trace 056fdb3898b88da3 ]---
[17465.202810] BUG: unable to handle kernel paging request at 0000000200000612
[17465.203585] IP: [<ffffffffc0a4263a>] ldlm_lock_create+0x1da/0xa40 [ptlrpc]
[17465.204356] PGD 0
[17465.204599] Oops: 0000 [#1] SMP
[17465.204988] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) ldiskfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev virtio_balloon pcspkr parport_pc parport i2c_piix4 i2c_core auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 virtio_blk ata_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel floppy ata_piix libata 8139too serio_raw 8139cp mii virtio_pci virtio_ring virtio
[17465.214377] CPU: 0 PID: 5988 Comm: mdt00_000 Kdump: loaded Tainted: G W OE ------------ 3.10.0-862.14.4.el7_lustre.x86_64 #1
[17465.215602] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[17465.216203] task: ffff975af5d93f40 ti: ffff975b37e80000 task.ti: ffff975b37e80000
[17465.216980] RIP: 0010:[<ffffffffc0a4263a>] [<ffffffffc0a4263a>] ldlm_lock_create+0x1da/0xa40 [ptlrpc]
[17465.217972] RSP: 0018:ffff975b37e837b0 EFLAGS: 00010246
[17465.218517] RAX: 0000000200000402 RBX: ffff975adc5300c0 RCX: ffff975b37e83fd8
[17465.219245] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff975ad00d5790
[17465.219971] RBP: ffff975b37e837e8 R08: 0000000000021d20 R09: ffff975ad00d5680
[17465.220695] R10: 0000000000000005 R11: 00000002c0000404 R12: ffff975b37e83858
[17465.221426] R13: ffff975acf364400 R14: 000000000000000d R15: ffff975ad00d5680
[17465.222154] FS: 0000000000000000(0000) GS:ffff975b3fc00000(0000) knlGS:0000000000000000
[17465.222980] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17465.223566] CR2: 0000000200000612 CR3: 000000004ce0e000 CR4: 00000000000606f0
[17465.224301] Call Trace:
[17465.224593] [<ffffffffc0a61843>] ldlm_cli_enqueue+0xc3/0x920 [ptlrpc]
[17465.225295] [<ffffffffc0a61205>] ? ldlm_prep_elc_req+0x125/0x490 [ptlrpc]
[17465.226044] [<ffffffffc0a5cc20>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[17465.226864] [<ffffffffc0e70920>] ? mdt_obd_reconnect+0x1e0/0x1e0 [mdt]
[17465.227568] [<ffffffffc1078452>] osp_md_object_lock+0x162/0x2d0 [osp]
[17465.228291] [<ffffffffc0f91293>] lod_object_lock+0xf3/0x7b0 [lod]
[17465.228974] [<ffffffffc100dcee>] mdd_object_lock+0x3e/0xe0 [mdd]
[17465.229613] [<ffffffffc0e7e281>] mdt_remote_object_lock_try+0x1e1/0x750 [mdt]
[17465.230388] [<ffffffffc0a9092c>] ? lustre_msg_get_flags+0x2c/0xa0 [ptlrpc]
[17465.231115] [<ffffffffc0e7e81a>] mdt_remote_object_lock+0x2a/0x30 [mdt]
[17465.231821] [<ffffffffc0e9bfa1>] mdt_reint_migrate_internal+0x1151/0x1350 [mdt]
[17465.232615] [<ffffffffc08384b9>] ? class_handle_hash+0xb9/0x2f0 [obdclass]
[17465.233362] [<ffffffffc0a754a0>] ? ldlm_inodebits_compat_queue+0x480/0x480 [ptlrpc]
[17465.234175] [<ffffffffc0e9c405>] mdt_reint_rename_or_migrate.isra.51+0x265/0x860 [mdt]
[17465.235024] [<ffffffffc0a629d0>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
[17465.235837] [<ffffffffc0a5cc20>] ? ldlm_expired_completion_wait+0x220/0x220 [ptlrpc]
[17465.236658] [<ffffffffc0e9ca10>] mdt_reint_migrate+0x10/0x20 [mdt]
[17465.237314] [<ffffffffc0e9dee3>] mdt_reint_rec+0x83/0x210 [mdt]
[17465.237948] [<ffffffffc0e7b112>] mdt_reint_internal+0x6b2/0xa80 [mdt]
[17465.238632] [<ffffffffc0e86387>] mdt_reint+0x67/0x140 [mdt]
[17465.239255] [<ffffffffc0af817a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[17465.239971] [<ffffffffb2c6a14e>] ? kvm_clock_get_cycles+0x1e/0x20
[17465.240613] [<ffffffffc06f6bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[17465.241368] [<ffffffffc0a9ec2b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[17465.242189] [<ffffffffc0a9bab5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[17465.242894] [<ffffffffb2cd2022>] ? default_wake_function+0x12/0x20
[17465.243539] [<ffffffffb2cc7c4b>] ? __wake_up_common+0x5b/0x90
[17465.244169] [<ffffffffc0aa252c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[17465.244824] [<ffffffffb2ccc7f0>] ? finish_task_switch+0x50/0x170
[17465.245488] [<ffffffffc0aa1a30>] ? ptlrpc_register_service+0xf70/0xf70 [ptlrpc]
[17465.246251] [<ffffffffb2cbdf21>] kthread+0xd1/0xe0
[17465.246764] [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
[17465.247385] [<ffffffffb33255f7>] ret_from_fork_nospec_begin+0x21/0x21
[17465.248061] [<ffffffffb2cbde50>] ? insert_kthread_work+0x40/0x40
[17465.248687] Code: 49 8d 87 10 02 00 00 49 c7 87 90 00 00 00 00 00 00 00 ba 01 00 00 00 49 89 87 10 02 00 00 49 89 87 18 02 00 00 48 8b 03 48 8b 00 <48> 8b b8 10 02 00 00 e8 1a 57 df ff 4d 89 3f 48 c7 c6 f0 e8 b6
[17465.251906] RIP [<ffffffffc0a4263a>] ldlm_lock_create+0x1da/0xa40 [ptlrpc]
[17465.252662] RSP <ffff975b37e837b0>
[17465.253026] CR2: 0000000200000612
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 3.10.0-862.14.4.el7_lustre.x86_64 (jenkins@trevis-307-el7-x8664-1.trevis.whamcloud.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC) ) #1 SMP Fri Oct 12 14:51:33 UTC 2018
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.0-862.14.4.el7_lustre.x86_64 ro console=tty0 LANG=en_US.UTF-8 console=ttyS0,115200 net.ifnames=0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr disable_cpu_apicid=0 elfcorehdr=867700K
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009f7ff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009f800-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000021000000-0x0000000034f5cfff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000034fff800-0x0000000034ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007fffa000-0x000000007fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.4 present.
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] e820: last_pfn = 0x35000 max_arch_pfn = 0x400000000
[ 0.000000] PAT configuration [0-7]: WB WC UC- UC WB WP UC- UC
[ 0.000000] x2apic enabled by BIOS, switching to x2apic ops
[ 0.000000] found SMP MP-table at [mem 0x000f72e0-0x000f72ef] mapped at [ffffffffff2002e0]
[ 0.000000] Using GB pages for direct mapping
[ 0.000000] RAMDISK: [mem 0x321ea000-0x337fffff]

client log:

[root@trevis-60vm7 ~]# lfs migrate -m 2,3 /mnt/lustre/test/
[17823.455189] Lustre: 6089:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1541550002/real 1541550002] req@ffff92227859b300 x1616413745888192/t0(0) o36->lustre-MDT0000-mdc-ffff9222b782c800@10.9.6.157@tcp:12/10 lens 728/624 e 0 to 1 dl 1541550009 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[17823.458255] Lustre: lustre-MDT0000-mdc-ffff9222b782c800: Connection to lustre-MDT0000 (at 10.9.6.157@tcp) was lost; in progress operations using this service will wait for recovery to complete
[17854.248785] Lustre: 4779:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1541550034/real 1541550040] req@ffff922278599e00 x1616413745888288/t0(0) o400->lustre-MDT0001-mdc-ffff9222b782c800@10.9.6.157@tcp:12/10 lens 224/224 e 0 to 1 dl 1541550041 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[17854.251954] Lustre: lustre-MDT0001-mdc-ffff9222b782c800: Connection to lustre-MDT0001 (at 10.9.6.157@tcp) was lost; in progress operations using this service will wait for recovery to complete
[17854.253728] Lustre: Skipped 1 previous similar message
[17854.254295] LustreError: 166-1: MGC10.9.6.157@tcp: Connection to MGS (at 10.9.6.157@tcp) was lost; in progress operations using this service will fail
[17855.516636] Lustre: 4780:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1541550034/real 1541550034] req@ffff922278598600 x1616413745888320/t0(0) o400->lustre-MDT0003-mdc-ffff9222b782c800@10.9.6.158@tcp:12/10 lens 224/224 e 0 to 1 dl 1541550041 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[17855.519693] Lustre: 4780:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 1 previous similar message
[17855.520691] Lustre: lustre-MDT0003-mdc-ffff9222b782c800: Connection to lustre-MDT0003 (at 10.9.6.158@tcp) was lost; in progress operations using this service will wait for recovery to complete

6. After I remounted all MDTs (1-4), the client migration continued, but with errors:

[18614.098819] Lustre: lustre-MDT0003-mdc-ffff9222b782c800: Connection restored to 10.9.6.158@tcp (at 10.9.6.158@tcp)
[18633.134556] Lustre: Evicted from MGS (at 10.9.6.157@tcp) after server handle changed from 0xf72204ebf2d9e0fb to 0x91bae19fcc49671c
[18633.135986] Lustre: MGC10.9.6.157@tcp: Connection restored to 10.9.6.157@tcp (at 10.9.6.157@tcp)
[18634.125083] LustreError: 4778:0:(client.c:3023:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff922278599b00 x1616413745888112/t4295026907(4295026907) o101->lustre-MDT0000-mdc-ffff9222b782c800@10.9.6.157@tcp:12/10 lens 640/560 e 0 to 0 dl 1541550827 ref 2 fl Interpret:R/4/0 rc 301/301
[18634.541841] Lustre: lustre-MDT0001-mdc-ffff9222b782c800: Connection restored to 10.9.6.157@tcp (at 10.9.6.157@tcp)
/mnt/lustre/test//58239 migrate failed: Stale file handle (-116)
/mnt/lustre/test//71151 migrate failed: No data available (-61)

7. "ls /mnt/lustre/test" can still show the files. I then tried to rm the whole dir and files; the files were removed, but rm returned a "directory not empty" error, and "ls -al" shows "total 4444" with nothing under the dir:

[root@trevis-60vm7 lustre]# rm -rf test/
rm: cannot remove ‘test/’: Directory not empty
[root@trevis-60vm7 lustre]# ls test/
[root@trevis-60vm7 lustre]# ls -al test/
total 4444
drwxr-xr-x 2 root root 4538368 Nov 7 00:50 .
drwxr-xr-x 4 root root 4096 Nov 7 00:20 ..
[root@trevis-60vm7 lustre]# |
| Comments |
| Comment by Peter Jones [ 07/Nov/18 ] |
|
Lai, could you please advise? Thanks, Peter |
| Comment by Lai Siyao [ 07/Nov/18 ] |
|
The first crash is in mdt_getattr_name(), while the second is in mdt_reint_migrate_internal(). Sarah, have you run this kind of test for other DNE operations before, and did you see a similar crash? |
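For triage, the offsets in both traces can be mapped back to source lines on the crashed MDS, for example with gdb against modules that carry debug info. The module paths below are illustrative and may differ on the test nodes; only the symbols and offsets come from the traces above.

gdb /lib/modules/3.10.0-862.14.4.el7_lustre.x86_64/extra/obdclass.ko
(gdb) list *(lu_object_put+0x1d6)      # caller of the corrupted __list_add in the first trace
gdb /lib/modules/3.10.0-862.14.4.el7_lustre.x86_64/extra/ptlrpc.ko
(gdb) list *(ldlm_lock_create+0x1da)   # faulting instruction in the second trace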
| Comment by Sarah Liu [ 12/Nov/18 ] |
|
Lai, according to the DNE phase II test plan, the following failure scenarios were tested a long time ago, and I don't remember any ticket similar to this one.

Tests that do not involve migration:
- MDT0 is damaged
- DIR1 crosses 4 MDTs, one of those MDTs (not MDT0) is damaged
- DIR2 crosses 3 MDTs, the irrelevant MDT (not MDT0) is damaged

Besides the above, sanity_230c tested the migration-with-interruption case for a single-stripe dir |
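For reference, cross-MDT directories like the DIR1/DIR2 scenarios above are created with lfs setdirstripe; the stripe counts and starting indices below are only illustrative, not the exact values from the DNE phase II test plan:

lfs setdirstripe -c 4 -i 0 /mnt/lustre/DIR1    # directory striped across 4 MDTs
lfs setdirstripe -c 3 -i 1 /mnt/lustre/DIR2    # directory striped across 3 MDTs, not starting on MDT0
lfs getdirstripe /mnt/lustre/DIR1 /mnt/lustre/DIR2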