Lustre / LU-8396

ll_dirty_page_discard_warn dropping pages during re-connection.

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.8.0
    • None
    • Servers:
      Lustre 2.8.0
      kernel 3.10.0-327.3.1.el7_lustre.x86_64
      zfs backed OST

      Clients:
      Lustre 2.8.0
      kernel 2.6.32-573.12.1.el6_lustre.x86_64

      ipoib network
      Connected Mode, 65520 MTU

    Description

      During client I/O performed by a GridFTP transfer, some files end up shorter than they should be. The client log shows this sequence of messages:

      Jul 13 03:17:30 gridftp01 kernel: Lustre: 2937:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Jul 13 03:17:58 gridftp01 kernel: LustreError: 2902:0:(events.c:201:client_bulk_callback()) event type 2, status -5, desc ffff8807799da000
      Jul 13 03:17:58 gridftp01 kernel: Lustre: osg-OST0010-osc-ffff8802bb781000: Connection restored to 192.168.129.131@tcp8 (at 192.168.129.131@tcp8)
      Jul 13 03:17:58 gridftp01 kernel: LustreError: 11-0: osg-OST0010-osc-ffff8802bb781000: operation ost_read to node 192.168.129.131@tcp8 failed: rc = -107
      Jul 13 03:17:58 gridftp01 kernel: LustreError: 167-0: osg-OST0010-osc-ffff8802bb781000: This client was evicted by osg-OST0010; in progress operations using this service will fail.
      Jul 13 03:17:58 gridftp01 kernel: Lustre: 2937:0:(llite_lib.c:2626:ll_dirty_page_discard_warn()) osg: dirty page discard: 192.168.129.128@tcp8:/osg/fid: [0x200008124:0x23e:0x0]// may get corrupted (rc -108)

      Using that FID, you can usually trace the file and see that it is shorter than it should be. We know this because each file's size at creation is stored in an external database, and errors are reported when the file is read back: the checksum is incorrect and the file size is wrong.
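As a rough illustration of the tracing step above, the following sketch pulls FIDs out of "dirty page discard" kernel messages and builds an `lfs fid2path` command for each one. The function names and the `/mnt/osg` mount point are assumptions made for this example; only the log format shown in this report and the `lfs fid2path` command itself are taken as given.

```python
import re

# Matches the FID printed by ll_dirty_page_discard_warn(), e.g.
# "... dirty page discard: <nid>:/osg/fid: [0x200008124:0x23e:0x0]// may get corrupted"
DISCARD_RE = re.compile(
    r"dirty page discard: .*?fid: "
    r"(\[0x[0-9a-fA-F]+:0x[0-9a-fA-F]+:0x[0-9a-fA-F]+\])"
)


def extract_discarded_fids(lines):
    """Return the unique FIDs from dirty-page-discard messages, in order seen."""
    fids = []
    for line in lines:
        match = DISCARD_RE.search(line)
        if match and match.group(1) not in fids:
            fids.append(match.group(1))
    return fids


def fid2path_command(mount_point, fid):
    """Build the argv for `lfs fid2path`; brackets are stripped from the FID.

    mount_point is a hypothetical client mount such as /mnt/osg.
    """
    return ["lfs", "fid2path", mount_point, fid.strip("[]")]
```

The returned list can be handed to `subprocess.run` on a client that has the file system mounted, for example `subprocess.run(fid2path_command("/mnt/osg", fid), capture_output=True, text=True)`.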

      File sizes are in the 30-50 MB range.
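The size and checksum comparison described above could be reproduced outside the transfer tool with something like the sketch below. It is only an assumption about how such a check might look: the report does not say which checksum algorithm or database schema is in use, so MD5 and the `expected_size` / `expected_md5` parameters are placeholders.

```python
import hashlib
import os


def verify_file(path, expected_size, expected_md5, chunk_size=1 << 20):
    """Compare a file's size and MD5 with recorded values.

    Returns a list of human-readable problems; an empty list means the
    file matches. MD5 is an assumed algorithm, not one named in the report.
    """
    actual_size = os.path.getsize(path)

    # Read in chunks so 30-50 MB files are not loaded into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)

    problems = []
    if actual_size != expected_size:
        problems.append(f"size {actual_size} != expected {expected_size}")
    if digest.hexdigest() != expected_md5:
        problems.append("checksum mismatch")
    return problems
```

Running this over the paths resolved from the discarded FIDs would show whether every file named in a discard warning is actually truncated.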

      I've tried working back from the dirty page discard warning in the code to determine why this specific I/O is not retried after the reconnection, which happens almost immediately. We are still trying to figure out why the reconnections are happening, but it would be good to know why these I/Os don't complete properly.

          People

            wc-triage WC Triage
            karcaw Evan Felix