Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14283

sanity-sec test 52 crashes due to ‘divide error: 0000 [#1] SMP PTI’

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • Lustre 2.14.0
    • 3
    • 9223372036854775807

    Description

      sanity-sec test_52 crashes due to ‘divide error: 0000 1 SMP PTI’. We’ve seen this test crash five times for review-dne-ssk or review-dne-selinux-ssk patch testing. Two recent crashes happened while testing RHEL8.3 clients and servers:
      https://testing.whamcloud.com/test_sets/13b5e7cf-b908-4799-9d40-9ba4b21dbf00
      https://testing.whamcloud.com/test_sets/e4c617c1-30e1-4c44-8454-3a2a36ff1189

      Looking at the client console log, we see

      [37329.207145] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,test_dummy_encryption,skpath=/tmp/test-framework-keys trevis-8vm4@tcp:/lustre /mnt/lustre
      [37331.900380] Lustre: 1600717:0:(keysetup.c:472:llcrypt_get_encryption_info()) inode 144115188193296385: setting policy filenames_encryption_mode to null
      [37331.901861] Lustre: 1600717:0:(keysetup.c:472:llcrypt_get_encryption_info()) Skipped 5 previous similar messages
      [37334.491809] divide error: 0000 [#1] SMP PTI
      [37334.492358] CPU: 1 PID: 1550060 Comm: ptlrpcd_00_00 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
      [37334.493661] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [37334.494315] RIP: 0010:osc_announce_cached.isra.24+0xd4/0x3b0 [osc]
      [37334.494970] Code: af d0 48 8b 43 50 48 39 ca 48 0f 42 d1 48 89 d1 48 c1 e1 0c f6 80 c5 02 00 00 10 74 1a 8b b3 f0 00 00 00 48 8d 44 32 ff 31 d2 <48> f7 f6 0f af 83 ec 00 00 00 48 01 c1 48 81 f9 ff ff ff 6f b8 ff
      [37334.496814] RSP: 0018:ffffa4c700833a78 EFLAGS: 00010246
      [37334.497361] RAX: 0000000000023fff RBX: ffff958f013085f0 RCX: 0000000024000000
      [37334.498104] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff958f013086e4
      [37334.498838] RBP: ffff958f1ba8c150 R08: 0000000000033c62 R09: ffff958f1ba8c248
      [37334.499564] R10: ffff958efd282880 R11: 0000000000002000 R12: ffff958f013086e4
      [37334.500297] R13: 0000000000000010 R14: 0000000000000002 R15: ffff958f013085f0
      [37334.501029] FS:  0000000000000000(0000) GS:ffff958f3fd00000(0000) knlGS:0000000000000000
      [37334.501848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [37334.502442] CR2: 00007fc4a2b46000 CR3: 000000005a20a002 CR4: 00000000000606e0
      [37334.503184] Call Trace:
      [37334.503498]  osc_brw_prep_request+0xf2e/0x15b0 [osc]
      [37334.504040]  osc_build_rpc+0xb82/0x10b0 [osc]
      [37334.504514]  osc_check_rpcs+0x1703/0x1880 [osc]
      [37334.505029]  ? __switch_to_asm+0x41/0x70
      [37334.505454]  ? __switch_to_asm+0x35/0x70
      [37334.505879]  ? __switch_to_asm+0x35/0x70
      [37334.506299]  ? __switch_to_asm+0x41/0x70
      [37334.506728]  osc_io_unplug0+0xc0/0x110 [osc]
      [37334.507194]  brw_queue_work+0x2e/0xc0 [osc]
      [37334.507801]  work_interpreter+0x32/0x110 [ptlrpc]
      [37334.508337]  ptlrpc_check_set+0x51a/0x21b0 [ptlrpc]
      [37334.508883]  ? finish_task_switch+0x77/0x2a0
      [37334.509368]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
      [37334.509889]  ptlrpcd+0x458/0x4c0 [ptlrpc]
      [37334.510332]  ? finish_wait+0x80/0x80
      [37334.510751]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
      [37334.511262]  kthread+0x112/0x130
      [37334.511622]  ? kthread_flush_work_fn+0x10/0x10
      [37334.512106]  ret_from_fork+0x35/0x40
      

      Attachments

        Activity

          [LU-14283] sanity-sec test 52 crashes due to ‘divide error: 0000 [#1] SMP PTI’
          pjones Peter Jones added a comment -

          Landed for 2.14

          pjones Peter Jones added a comment - Landed for 2.14

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41225/
          Subject: LU-14283 osc: avoid crash if ocd reset
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 1ec58a2e6ef98bacbb806f1b141ef38cdefe2226

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41225/ Subject: LU-14283 osc: avoid crash if ocd reset Project: fs/lustre-release Branch: master Current Patch Set: Commit: 1ec58a2e6ef98bacbb806f1b141ef38cdefe2226

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41256/
          Subject: LU-14283 obdclass: connect vs disconnect race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 4f689a30fb5d8707270dfc9df9329b32fec8440f

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41256/ Subject: LU-14283 obdclass: connect vs disconnect race Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4f689a30fb5d8707270dfc9df9329b32fec8440f

          Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41256
          Subject: LU-14283 obdclass: connect vs disconnect race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 43d0981915600fa6030ccc4d496d2824745633c7

          gerrit Gerrit Updater added a comment - Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41256 Subject: LU-14283 obdclass: connect vs disconnect race Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 43d0981915600fa6030ccc4d496d2824745633c7
          jhammond John Hammond added a comment -

          It seems very unlikely, but if if osd_connect_flags includes GRANT_PARAM but not GRANT then we could get here.

          jhammond John Hammond added a comment - It seems very unlikely, but if if osd_connect_flags includes GRANT_PARAM but not GRANT then we could get here.

          I did not find why we could hit this issue.

          Notice once @cl_max_extent_pages is inited ever, even reconnect won't clear it to be zero unless i miss something here, and there is a window
          it could be zero but that was protected by spinlock.

          so big possibility is it was zero initially...

          wshilong Wang Shilong (Inactive) added a comment - I did not find why we could hit this issue. Notice once @cl_max_extent_pages is inited ever, even reconnect won't clear it to be zero unless i miss something here, and there is a window it could be zero but that was protected by spinlock. so big possibility is it was zero initially...
          jhammond John Hammond added a comment -

          > The other question to ask is how can we end up in osc_announce_cached() if the OSC obd_device is not setup.

          I do not think this is possible.

          jhammond John Hammond added a comment - > The other question to ask is how can we end up in osc_announce_cached() if the OSC obd_device is not setup. I do not think this is possible.

          People

            wshilong Wang Shilong (Inactive)
            jamesanunez James Nunez (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: