Lustre / LU-17953

sanity test_60a: MDS crash during umount


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Affects Version/s: Lustre 2.15.5
    • Severity: 3

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/375e5764-84b4-41ec-8cf2-b109216ea063

      test_60a failed with the following error:

      onyx-99vm1 crashed during sanity test_60a
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-b2_15/88 - 5.14.0-427.18.1.el9_4.x86_64
      servers: https://build.whamcloud.com/job/lustre-b2_15/88 - 4.18.0-513.24.1.el8_lustre.x86_64

      <<Please provide additional information about the failure here>>
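
      The cfs_race()/cfs_fail_race "id 1317" messages near the top of the console log below come from the libcfs fail-loc race pairing used by the llog_test module here: the first thread to reach the fail point sleeps until a second thread hits the same id and wakes it, forcing a particular interleaving of the two paths. A minimal userspace analogy of that pairing (pthreads; illustrative only, not the libcfs implementation) looks like this:

      /* Userspace analogy of cfs_race(): the first thread to arrive parks on a
       * condition variable, the second to arrive wakes it, so the two paths
       * proceed in a fixed order. Names are illustrative, not libcfs code. */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
      static int sleeper;                     /* 1 while a thread is parked */

      static void race_point(const char *who, int id)
      {
              pthread_mutex_lock(&lock);
              if (!sleeper) {
                      sleeper = 1;
                      printf("%s: cfs_race id %x sleeping\n", who, id);
                      while (sleeper)         /* woken by the partner thread */
                              pthread_cond_wait(&cond, &lock);
                      printf("%s: cfs_race id %x awake\n", who, id);
              } else {
                      sleeper = 0;
                      printf("%s: cfs_fail_race id %x waking\n", who, id);
                      pthread_cond_broadcast(&cond);
              }
              pthread_mutex_unlock(&lock);
      }

      static void *thread_fn(void *arg)
      {
              race_point(arg, 0x1317);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, thread_fn, "writer");
              pthread_create(&b, NULL, thread_fn, "racer");
              pthread_join(a, NULL);
              pthread_join(b, NULL);
              return 0;
      }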

      [ 4362.533695] LustreError: 171557:0:(libcfs_fail.h:169:cfs_race()) cfs_race id 1317 sleeping
      [ 4363.083782] LustreError: 171577:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 1317 waking
      [ 4363.085558] LustreError: 171557:0:(libcfs_fail.h:178:cfs_race()) cfs_fail_race id 1317 awake: rc=4450
      [ 4364.107775] LustreError: 171577:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 1317 waking
      [ 4364.109524] Lustre: 171577:0:(llog_test.c:1470:cat_check_old_cb()) seeing record at index 3 - [0x1:0x4be:0x0] in log [0xa:0x11:0x0]
      [ 4364.688147] LustreError: 171557:0:(libcfs_fail.h:180:cfs_race()) cfs_fail_race id 1317 waking
      [ 4365.708742] Lustre: 171557:0:(llog_test.c:2075:llog_test_10()) 10h: wrote 64767 records then 0 failed with ENOSPC
      [ 4365.710822] Lustre: 171557:0:(llog_test.c:2088:llog_test_10()) 10: put newly-created catalog
      [ 4366.910031] Lustre: DEBUG MARKER: /usr/sbin/lctl dk
      [ 4368.148192] Lustre: DEBUG MARKER: which llog_reader 2> /dev/null
      [ 4368.502516] Lustre: DEBUG MARKER: ls -d /usr/sbin/llog_reader
      [ 4369.709382] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      [ 4370.032161] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
      [ 4372.532113] Lustre: Failing over lustre-MDT0000
      [ 4375.225434] Lustre: lustre-MDT0000: Not available for connect from 10.240.29.251@tcp (stopping)
      [ 4378.342533] Lustre: lustre-MDT0000: Not available for connect from 10.240.29.252@tcp (stopping)
      [ 4387.945411] Lustre: 11292:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1718047804/real 1718047804]  req@000000009650aa81 x1801499268665600/t0(0) o400->lustre-OST0004-osc-MDT0000@10.240.29.253@tcp:28/4 lens 224/224 e 0 to 1 dl 1718047811 ref 1 fl Rpc:RXNQ/0/ffffffff rc 0/-1 job:'kworker/u4:0.0'
      [ 4387.955128] Lustre: lustre-OST0004-osc-MDT0000: Connection to lustre-OST0004 (at 10.240.29.253@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 4387.958383] Lustre: Skipped 2 previous similar messages
      [ 4387.959857] Lustre: lustre-MDT0000: Not available for connect from 10.240.29.251@tcp (stopping)
      [ 4387.961600] Lustre: Skipped 15 previous similar messages
      [ 4388.020779] Lustre: lustre-OST0004-osc-MDT0000: Connection restored to 10.240.29.253@tcp (at 10.240.29.253@tcp)
      [ 4388.271300] Lustre: MGS: Client 6a7488a2-5141-4026-8a48-15608255ca8e (at 10.240.29.251@tcp) reconnecting
      [ 4389.853159] LustreError: 172062:0:(lprocfs_jobstats.c:137:job_stat_exit()) should not have any items
      [ 4389.855119] LustreError: 172062:0:(lprocfs_jobstats.c:137:job_stat_exit()) Skipped 6 previous similar messages
      [ 4390.265347] LustreError: 172069:0:(client.c:1256:ptlrpc_import_delay_req()) @@@ IMP_CLOSED  req@000000007642e2d0 x1801499268667136/t0(0) o101->lustre-MDT0000-lwp-MDT0000@0@lo:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:QU/0/ffffffff rc 0/-1 job:'qsd_reint_0.lus.0'
      [ 4390.270035] LustreError: 172069:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-MDT0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x10000:0x0], rc:-5
      [ 4392.020862] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.240.29.251@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 4392.024378] LustreError: Skipped 8 previous similar messages
      [ 4393.271120] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.240.29.253@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 4397.062969] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.240.29.251@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 4397.066497] LustreError: Skipped 7 previous similar messages
      [ 4398.474282] Lustre: 172062:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1718047819/real 1718047819]  req@000000007bbd2132 x1801499268667584/t0(0) o251->MGC10.240.28.44@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1718047825 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:'umount.0'
      [ 4398.479856] Lustre: 172062:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      [ 4427.970012] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [umount:172062]
      [ 4427.971533] Modules linked in: obdecho(OE) ptlrpc_gss(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) dm_flakey dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev pcspkr virtio_balloon i2c_piix4 sunrpc ext4 mbcache jbd2 ata_generic ata_piix libata crc32c_intel serio_raw virtio_net net_failover virtio_blk failover [last unloaded: llog_test]
      [ 4427.996085] CPU: 0 PID: 172062 Comm: umount Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-513.24.1.el8_lustre.x86_64 #1
      [ 4427.998467] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 4427.999615] RIP: 0010:memset_erms+0x9/0x20
      [ 4428.147666] Code: 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66
      [ 4428.151228] RSP: 0018:ffffa9f9c133bb30 EFLAGS: 00010246 ORIG_RAX: ffffffffffffff13
      [ 4428.152710] RAX: ffff96b543ea525a RBX: ffff96b580f3fc00 RCX: 0000000000100000
      [ 4428.154121] RDX: 0000000000100000 RSI: 000000000000005a RDI: ffff96b579100000
      [ 4428.155527] RBP: ffff96b580f3fc28 R08: 0000000000000001 R09: ffff96b579100000
      [ 4428.156937] R10: ffff96b5787d5000 R11: 0000000000000001 R12: ffff96b580f3fc00
      [ 4428.158342] R13: ffff96b5787d5800 R14: dead000000000200 R15: dead000000000100
      [ 4428.171578] FS:  00007fc8771da080(0000) GS:ffff96b5ffc00000(0000) knlGS:0000000000000000
      [ 4428.173156] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 4428.174301] CR2: 00007f6943409050 CR3: 000000001bc80001 CR4: 00000000001706f0
      [ 4428.175710] Call Trace:
      [ 4428.252421]  <IRQ>
      [ 4428.254032]  ? watchdog_timer_fn.cold.10+0x46/0x9e
      [ 4428.278435]  ? watchdog+0x30/0x30
      [ 4428.279153]  ? __hrtimer_run_queues+0x101/0x280
      [ 4428.280620]  ? hrtimer_interrupt+0x100/0x220
      [ 4428.281501]  ? smp_apic_timer_interrupt+0x6a/0x130
      [ 4428.289409]  ? apic_timer_interrupt+0xf/0x20
      [ 4428.290302]  </IRQ>
      [ 4428.290773]  ? memset_erms+0x9/0x20
      [ 4428.291510]  ptlrpc_service_purge_all+0x422/0xa80 [ptlrpc]
      [ 4428.293182]  ptlrpc_unregister_service+0x422/0x940 [ptlrpc]
      [ 4428.294389]  ? kmem_cache_alloc_trace+0x142/0x280
      [ 4428.303771]  ? lprocfs_counter_add+0xd2/0x140 [obdclass]
      [ 4428.305122]  mds_stop_ptlrpc_service+0x69/0x1b0 [mdt]
      [ 4428.322721]  mds_device_fini+0x28/0xd0 [mdt]
      [ 4428.323661]  class_cleanup+0x6f5/0xc90 [obdclass]
      [ 4428.392742]  class_process_config+0x3ad/0x2080 [obdclass]
      [ 4428.393908]  ? class_manual_cleanup+0x191/0x780 [obdclass]
      [ 4428.395093]  ? __kmalloc+0x113/0x250
      [ 4428.402142]  class_manual_cleanup+0x456/0x780 [obdclass]
      [ 4428.403265]  server_put_super+0xc8b/0x1350 [obdclass]
      [ 4428.404575]  ? evict_inodes+0x160/0x1b0
      [ 4428.405889]  generic_shutdown_super+0x6c/0x110
      [ 4428.433075]  kill_anon_super+0x14/0x30
      [ 4428.433877]  deactivate_locked_super+0x34/0x70
      [ 4428.446805]  cleanup_mnt+0x3b/0x70
      [ 4428.448698]  task_work_run+0x8a/0xb0
      [ 4428.485783]  exit_to_usermode_loop+0xef/0x100
      [ 4428.542958]  do_syscall_64+0x19c/0x1b0
      [ 4428.543778]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
      [ 4428.560793] RIP: 0033:0x7fc876144e9b
      [ 4428.561572] Code: ff d0 48 89 c7 b8 3c 00 00 00 0f 05 48 8b 0d e4 4f 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd 4f 38 00 f7 d8 64 89 01 48
      [ 4428.565120] RSP: 002b:00007fff58347e08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 4428.566604] RAX: 0000000000000000 RBX: 0000564708b059c0 RCX: 00007fc876144e9b
      [ 4428.568005] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000564708b0b390
      [ 4428.569417] RBP: 0000000000000000 R08: 0000564708b0bf20 R09: 0000564708b00010
      [ 4428.570820] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564708b0b390
      [ 4428.572230] R13: 00007fc876fb6184 R14: 0000564708b05ba0 R15: 00000000ffffffff
      [ 4428.573821] Kernel panic - not syncing: softlockup: hung tasks
      [ 4428.574998] CPU: 0 PID: 172062 Comm: umount Kdump: loaded Tainted: G           OEL   --------- -  - 4.18.0-513.24.1.el8_lustre.x86_64 #1
      [ 4428.577378] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 4428.578518] Call Trace:
      [ 4428.582866]  <IRQ>
      [ 4428.583355]  dump_stack+0x41/0x60
      [ 4428.604195]  panic+0xe7/0x2ac
      [ 4428.606288]  watchdog_timer_fn.cold.10+0x85/0x9e
      [ 4428.622294]  ? watchdog+0x30/0x30
      [ 4428.623035]  __hrtimer_run_queues+0x101/0x280
      [ 4428.635497]  hrtimer_interrupt+0x100/0x220
      [ 4428.636677]  smp_apic_timer_interrupt+0x6a/0x130
      [ 4428.650956]  apic_timer_interrupt+0xf/0x20
      [ 4428.651834]  </IRQ>
      [ 4428.652312] RIP: 0010:memset_erms+0x9/0x20
      [ 4428.653162] Code: 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 66
      [ 4428.656714] RSP: 0018:ffffa9f9c133bb30 EFLAGS: 00010246 ORIG_RAX: ffffffffffffff13
      [ 4428.658216] RAX: ffff96b543ea525a RBX: ffff96b580f3fc00 RCX: 0000000000100000
      [ 4428.659631] RDX: 0000000000100000 RSI: 000000000000005a RDI: ffff96b579100000
      [ 4428.661036] RBP: ffff96b580f3fc28 R08: 0000000000000001 R09: ffff96b579100000
      [ 4428.677284] R10: ffff96b5787d5000 R11: 0000000000000001 R12: ffff96b580f3fc00
      [ 4428.678708] R13: ffff96b5787d5800 R14: dead000000000200 R15: dead000000000100
      [ 4428.680117]  ptlrpc_service_purge_all+0x422/0xa80 [ptlrpc]
      [ 4428.681308]  ptlrpc_unregister_service+0x422/0x940 [ptlrpc]
      [ 4428.682506]  ? kmem_cache_alloc_trace+0x142/0x280
      [ 4428.683465]  ? lprocfs_counter_add+0xd2/0x140 [obdclass]
      [ 4428.684603]  mds_stop_ptlrpc_service+0x69/0x1b0 [mdt]
      [ 4428.685667]  mds_device_fini+0x28/0xd0 [mdt]
      [ 4428.686591]  class_cleanup+0x6f5/0xc90 [obdclass]
      [ 4428.687609]  class_process_config+0x3ad/0x2080 [obdclass]
      [ 4428.688752]  ? class_manual_cleanup+0x191/0x780 [obdclass]
      [ 4428.689910]  ? __kmalloc+0x113/0x250
      [ 4428.690678]  class_manual_cleanup+0x456/0x780 [obdclass]
      [ 4428.691816]  server_put_super+0xc8b/0x1350 [obdclass]
      [ 4428.692893]  ? evict_inodes+0x160/0x1b0
      [ 4428.693699]  generic_shutdown_super+0x6c/0x110
      [ 4428.694611]  kill_anon_super+0x14/0x30
      [ 4428.695401]  deactivate_locked_super+0x34/0x70
      [ 4428.696311]  cleanup_mnt+0x3b/0x70
      [ 4428.697034]  task_work_run+0x8a/0xb0
      [ 4428.697791]  exit_to_usermode_loop+0xef/0x100
      [ 4428.698688]  do_syscall_64+0x19c/0x1b0
      [ 4428.699482]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
      [ 4428.700517] RIP: 0033:0x7fc876144e9b
      [ 4428.701276] Code: ff d0 48 89 c7 b8 3c 00 00 00 0f 05 48 8b 0d e4 4f 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd 4f 38 00 f7 d8 64 89 01 48
      [ 4428.704828] RSP: 002b:00007fff58347e08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 4428.706314] RAX: 0000000000000000 RBX: 0000564708b059c0 RCX: 00007fc876144e9b
      [ 4428.746088] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000564708b0b390
      [ 4428.747498] RBP: 0000000000000000 R08: 0000564708b0bf20 R09: 0000564708b00010
      [ 4428.748908] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564708b0b390
      [ 4428.750319] R13: 00007fc876fb6184 R14: 0000564708b05ba0 R15: 00000000ffffffff
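
      The panic itself is the soft-lockup watchdog firing: the umount task (PID 172062) stays on CPU 0 for more than 22 seconds inside memset_erms, called from ptlrpc_service_purge_all() while the MDT ptlrpc services are being torn down, and "softlockup: hung tasks" then panics the node. The register state (RSI=0x5a, RCX/RDX=0x100000) is consistent with memset() poisoning 1 MiB buffers one after another. As a kernel-style sketch of the general pattern only (hypothetical names, not the actual ptlrpc code), a cleanup loop over a long list of large buffers will trip the watchdog unless it yields periodically:

      /* Illustrative only: hypothetical structure and function names, not the
       * ptlrpc_service_purge_all() implementation. */
      #include <linux/list.h>
      #include <linux/sched.h>
      #include <linux/slab.h>
      #include <linux/string.h>

      struct reply_buf {                      /* hypothetical */
              struct list_head rb_list;
              void            *rb_data;       /* e.g. a 1 MiB reply buffer */
              size_t           rb_size;
      };

      static void purge_reply_bufs(struct list_head *bufs)
      {
              struct reply_buf *rb, *tmp;

              list_for_each_entry_safe(rb, tmp, bufs, rb_list) {
                      list_del(&rb->rb_list);
                      /* debug builds poison freed memory, e.g. with 0x5a */
                      memset(rb->rb_data, 0x5a, rb->rb_size);
                      kfree(rb->rb_data);
                      kfree(rb);
                      /* without an occasional cond_resched(), a long enough
                       * list keeps this CPU busy past the watchdog threshold
                       * and produces exactly this kind of soft lockup */
                      cond_resched();
              }
      }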
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      sanity test_60a - onyx-99vm1 crashed during sanity test_60a

            People

              Assignee: WC Triage
              Reporter: Maloo