  1. Lustre
  2. LU-13541

Application hang and kernel "divide error" in ll_readpage

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • None
    • None
    • 3

    Description

      While launching Slurm jobs on our cluster, some of the jobs hung quite early, in a "dd" command that copies a dataset to a local ext4 filesystem:

      ll_file_read_iter+0xa1/0x290 [lustre]
      new_sync_read+0x122/0x1b0
      __vfs_read+0x29/0x40
      vfs_read+0x8e/0x130
      ksys_read+0xa7/0xe0
      

      This happened on multiple nodes in the cluster, while other nodes worked fine. It seems to be correlated with a kernel "divide error" (division by zero?) in the kernel log of the affected nodes:

      [ 2171.682001] divide error: 0000 [#1] SMP NOPTI
      [ 2171.686888] CPU: 133 PID: 35015 Comm: python Tainted: P OE 5.3.0-24-generic #26~18.04.2-Ubuntu
      [ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]
      [ 2171.802801] Call Trace:
      [ 2171.805539] filemap_fault+0x9be/0x9f0
      [ 2171.830810] ll_fault+0xdb/0x710 [lustre]
      [ 2171.839869] __do_fault+0x57/0x117
      [ 2171.843668] __handle_mm_fault+0xda0/0x1230
      [ 2171.848344] handle_mm_fault+0xcb/0x210
      [ 2171.852634] __do_page_fault+0x2a1/0x4d0
      [ 2171.857018] do_page_fault+0x2c/0xe0
      [ 2171.861014] page_fault+0x34/0x40
      

      Judging by the timestamps, some of these errors were caused by these jobs, while others were not (probably caused by another, unrelated job). Either way, the bad state lingers and blocks anyone who wants to access this particular file. Other files seem fine, but this file is now poisoned.
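
      To check which nodes logged the error and when, the kernel log can be scanned for "divide error" lines and the RIP line that follows each one. The following is only a minimal sketch: the function name find_divide_errors is hypothetical, and the regular expressions assume the oops layout shown in the log above (a bracketed timestamp, then "divide error:", then a "RIP:" line naming the symbol and module).

```python
import re

# Both patterns assume the x86 oops layout seen in the kernel log above.
OOPS_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+divide error:")
RIP_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+RIP:\s+[0-9a-f]+:"
    r"(?P<sym>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def find_divide_errors(lines):
    """Return (timestamp, symbol, module) for each divide error found.

    The timestamp is taken from the "divide error" line; the symbol and
    module come from the first RIP line that follows it.
    """
    hits = []
    pending_ts = None
    for raw in lines:
        line = raw.strip()
        m = OOPS_RE.match(line)
        if m:
            pending_ts = float(m.group("ts"))
            continue
        m = RIP_RE.match(line)
        if m and pending_ts is not None:
            hits.append((pending_ts, m.group("sym"), m.group("mod")))
            pending_ts = None  # ignore later RIP lines of the same oops
    return hits


# Sample input: three lines copied from the kernel log in this report.
sample = [
    "[ 2171.682001] divide error: 0000 1 SMP NOPTI",
    "[ 2171.686888] CPU: 133 PID: 35015 Comm: python Tainted: P OE 5.3.0-24-generic #26~18.04.2-Ubuntu",
    "[ 2171.706858] RIP: 0010:ll_readpage+0x25d/0x730 [lustre]",
]
for ts, sym, mod in find_divide_errors(sample):
    print(ts, sym, mod)
```

      On a live node, the same function could be fed the lines of the dmesg output instead of the sample list; comparing the reported timestamps against job start times would help separate errors caused by these jobs from those caused by unrelated ones.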

            People

              wshilong Wang Shilong (Inactive)
              adilger Andreas Dilger