Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-11556

conf-sanity test 32b crashes on MDT umount with “ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1”

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.1, Lustre 2.12.4
    • None
    • master clients with 2.11.0 or 2.10.5 servers
    • 3
    • 9223372036854775807

    Description

      conf-sanity test_32b crashes for master clients and 2.10.5 servers. Looking at the client test_log at https://testing.whamcloud.com/test_sets/b1b83d10-cedf-11e8-9238-52540065bddc, we see that the test crashes during umount of mdt1

      CMD: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight 14
      sed: -e expression #1, char 9: unknown option to `s'
      pdsh@onyx-37vm9: no remote hosts specified
      CMD: /usr/sbin/lctl get_param -n mdc.t32fs-MDT0000*.max_rpcs_in_flight 14
      sed: -e expression #1, char 9: unknown option to `s'
      pdsh@onyx-37vm9: no remote hosts specified
      CMD: onyx-37vm12 umount -d /tmp/t32/mnt/mdt
      CMD: onyx-37vm12 umount -d /tmp/t32/mnt/mdt1
      

      Looking at the MDS (vm12) console log, we see

      [23682.075589] Lustre: DEBUG MARKER: cat /tmp/t32/list2
      [23682.368852] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
      [23682.888492] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
      [23682.889529] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Skipped 7 previous similar messages
      [23688.214366] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n osd*.*.force_sync=1
      [23688.389563] LustreError: 15627:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
      [23688.557868] Lustre: DEBUG MARKER: test -f /tmp/t32/sha1sums
      [23688.885389] Lustre: DEBUG MARKER: cat /tmp/t32/sha1sums
      [23690.182337] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
      [23690.182337] 				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
      [23690.182337] 				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
      [23690.526084] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param t32fs-MDT0000.mdc.max_rpcs_in_flight='14'
      [23690.690187] Lustre: Modifying parameter t32fs-MDT0000-mdc.mdc.max_rpcs_in_flight in log t32fs-client
      [23690.691354] Lustre: Skipped 6 previous similar messages
      [23691.038484] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt
      [23691.209348] Lustre: Failing over t32fs-MDT0000
      [23691.377221] Lustre: server umount t32fs-MDT0000 complete
      [23691.678261] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt1
      [23691.844378] LustreError: 11-0: t32fs-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
      [23692.002188] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6989360[0x0, 1, [0x240000401:0x1:0x0] hash exist]{
      [23692.002188] 
      [23692.003646] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb69893b0mdt-object@ffff89bbb6989360( , writecount=0)
      [23692.003646] 
      [23692.005001] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bb9d517870mdd-object@ffff89bb9d517870(open_count=0, valid=0, cltime=0, flags=0)
      [23692.005001] 
      [23692.006540] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbe46fd5b0lod-object@ffff89bbe46fd5b0
      [23692.006540] 
      [23692.007729] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osp@ffff89bbe218f2d0osp-object@ffff89bbe218f280
      [23692.007729] 
      [23692.008942] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6989360
      [23692.008942] 
      [23692.009948] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6843510[0x0, 1, [0x280000401:0x1:0x0] hash exist]{
      [23692.009948] 
      [23692.011273] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb6843560mdt-object@ffff89bbb6843510( , writecount=0)
      [23692.011273] 
      [23692.012624] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bbe47c2e60mdd-object@ffff89bbe47c2e60(open_count=0, valid=0, cltime=0, flags=0)
      [23692.012624] 
      [23692.014163] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbf9a19b60lod-object@ffff89bbf9a19b60
      [23692.014163] 
      [23692.015351] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osd-ldiskfs@ffff89bbf93f8400osd-ldiskfs-object@ffff89bbf93f8400(i:ffff89bbdb2b9a38:25056/936018527)[plain]
      [23692.015351] 
      [23692.017030] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6843510
      [23692.017030] 
      [23692.025094] LustreError: 11816:0:(lu_object.c:1177:lu_de[    0.000000] Initializing cgroup subsys cpuset
      [    0.000000] Initializing cgroup subsys cpu
      

      Looking at the kernel crash log, we can see the call trace

      [23691.678261] Lustre: DEBUG MARKER: umount -d /tmp/t32/mnt/mdt1
      [23691.844378] LustreError: 11-0: t32fs-MDT0000-lwp-MDT0001: operation mds_disconnect to node 0@lo failed: rc = -107
      [23692.002188] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6989360[0x0, 1, [0x240000401:0x1:0x0] hash exist]{
      
      [23692.003646] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb69893b0mdt-object@ffff89bbb6989360( , writecount=0)
      
      [23692.005001] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bb9d517870mdd-object@ffff89bb9d517870(open_count=0, valid=0, cltime=0, flags=0)
      
      [23692.006540] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbe46fd5b0lod-object@ffff89bbe46fd5b0
      
      [23692.007729] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osp@ffff89bbe218f2d0osp-object@ffff89bbe218f280
      
      [23692.008942] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6989360
      
      [23692.009948] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) header@ffff89bbb6843510[0x0, 1, [0x280000401:0x1:0x0] hash exist]{
      
      [23692.011273] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdt@ffff89bbb6843560mdt-object@ffff89bbb6843510( , writecount=0)
      
      [23692.012624] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....mdd@ffff89bbe47c2e60mdd-object@ffff89bbe47c2e60(open_count=0, valid=0, cltime=0, flags=0)
      
      [23692.014163] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....lod@ffff89bbf9a19b60lod-object@ffff89bbf9a19b60
      
      [23692.015351] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) ....osd-ldiskfs@ffff89bbf93f8400osd-ldiskfs-object@ffff89bbf93f8400(i:ffff89bbdb2b9a38:25056/936018527)[plain]
      
      [23692.017030] LustreError: 11816:0:(osp_dev.c:1277:osp_device_free()) } header@ffff89bbb6843510
      
      [23692.025094] LustreError: 11816:0:(lu_object.c:1177:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
      [23692.025214] LustreError: 17697:0:(mdt_handler.c:4808:mdt_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: 
      [23692.025215] LustreError: 17697:0:(mdt_handler.c:4808:mdt_fini()) LBUG
      [23692.025217] Pid: 17697, comm: umount 3.10.0-862.9.1.el7_lustre.x86_64 #1 SMP Mon Aug 27 17:48:12 UTC 2018
      [23692.025217] Call Trace:
      [23692.025249]  [<ffffffffc09287cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [23692.025254]  [<ffffffffc092887c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [23692.025266]  [<ffffffffc13d2872>] mdt_device_fini+0x8f2/0x930 [mdt]
      [23692.025302]  [<ffffffffc0afc4b7>] class_cleanup+0x987/0xce0 [obdclass]
      [23692.025318]  [<ffffffffc0afe83f>] class_process_config+0x19bf/0x2420 [obdclass]
      [23692.025332]  [<ffffffffc0aff466>] class_manual_cleanup+0x1c6/0x710 [obdclass]
      [23692.025353]  [<ffffffffc0b2c6de>] server_put_super+0x8de/0xcd0 [obdclass]
      [23692.025376]  [<ffffffffa9c1debd>] generic_shutdown_super+0x6d/0x100
      [23692.025378]  [<ffffffffa9c1e2a2>] kill_anon_super+0x12/0x20
      [23692.025393]  [<ffffffffc0b01eb2>] lustre_kill_super+0x32/0x50 [obdclass]
      [23692.025395]  [<ffffffffa9c1e65e>] deactivate_locked_super+0x4e/0x70
      [23692.025397]  [<ffffffffa9c1ede6>] deactivate_super+0x46/0x60
      [23692.025402]  [<ffffffffa9c3cd4f>] cleanup_mnt+0x3f/0x80
      [23692.025404]  [<ffffffffa9c3cde2>] __cleanup_mnt+0x12/0x20
      [23692.025417]  [<ffffffffa9ab803b>] task_work_run+0xbb/0xe0
      [23692.025426]  [<ffffffffa9a2ac55>] do_notify_resume+0xa5/0xc0
      [23692.025440]  [<ffffffffaa120ad8>] int_signal+0x12/0x17
      [23692.025460]  [<ffffffffffffffff>] 0xffffffffffffffff
      [23692.025463] Kernel panic - not syncing: LBUG
      [23692.025469] CPU: 1 PID: 17697 Comm: umount Kdump: loaded Tainted: G           OE  ------------   3.10.0-862.9.1.el7_lustre.x86_64 #1
      [23692.025469] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [23692.025470] Call Trace:
      [23692.025478]  [<ffffffffaa10e84e>] dump_stack+0x19/0x1b
      [23692.025480]  [<ffffffffaa108b50>] panic+0xe8/0x21f
      [23692.025486]  [<ffffffffc09288cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [23692.025493]  [<ffffffffc13d2872>] mdt_device_fini+0x8f2/0x930 [mdt]
      [23692.025508]  [<ffffffffc0afc4b7>] class_cleanup+0x987/0xce0 [obdclass]
      [23692.025522]  [<ffffffffc0afe83f>] class_process_config+0x19bf/0x2420 [obdclass]
      [23692.025529]  [<ffffffffc0933bd7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
      [23692.025543]  [<ffffffffc0aff466>] class_manual_cleanup+0x1c6/0x710 [obdclass]
      [23692.025558]  [<ffffffffc0b2c6de>] server_put_super+0x8de/0xcd0 [obdclass]
      [23692.025561]  [<ffffffffa9c1debd>] generic_shutdown_super+0x6d/0x100
      [23692.025563]  [<ffffffffa9c1e2a2>] kill_anon_super+0x12/0x20
      [23692.025577]  [<ffffffffc0b01eb2>] lustre_kill_super+0x32/0x50 [obdclass]
      [23692.025578]  [<ffffffffa9c1e65e>] deactivate_locked_super+0x4e/0x70
      [23692.025580]  [<ffffffffa9c1ede6>] deactivate_super+0x46/0x60
      [23692.025581]  [<ffffffffa9c3cd4f>] cleanup_mnt+0x3f/0x80
      [23692.025583]  [<ffffffffa9c3cde2>] __cleanup_mnt+0x12/0x20
      [23692.025584]  [<ffffffffa9ab803b>] task_work_run+0xbb/0xe0
      [23692.025586]  [<ffffffffa9a2ac55>] do_notify_resume+0xa5/0xc0
      [23692.025588]  [<ffffffffaa120ad8>] int_signal+0x12/0x17
      

      We’ve seen conf-sanity test 32b crash for 2.11.0 servers at https://testing.whamcloud.com/test_sets/fd4e6636-d44e-11e8-9238-52540065bddc

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated: