Lustre / LU-1455

Drop of the performance and 'scratch-OSTXXXX: Recovery'

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.2.0
    • None
    • 3
    • 3997

    Description

      Dear support, during normal usage of the file system we noticed a major drop in I/O performance (e.g. from 5Gb/s to 0.9Gb/s) and, in the logs, a large number of messages like these:

      May 31 10:17:39 weisshorn06 kernel: Lustre: scratch-OST001d: Client scratch-MDT0000-mdtlov_UUID (at 148.187.7.101@o2ib2) reconnecting
      May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(filter.c:2699:filter_connect_internal()) scratch-OST001d: Received MDS connection for group 0
      May 31 10:17:39 weisshorn06 kernel: Lustre: scratch-OST001d: received MDS connection from 148.187.7.101@o2ib2
      May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880f8b248800 - ffff880edff5b800
      May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(llog_net.c:168:llog_receptor_accept()) Skipped 1 previous similar message
      May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(filter.c:2555:filter_llog_connect()) scratch-OST001d: Recovery from log 0x140001f/0x0:b3e57b20

      May 31 10:21:32 weisshorn01 kernel: LustreError: 7733:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x5ef sub-object on OST idx 49/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 9518:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2560 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 4004:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x225b sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 3986:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x1257 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7357:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x76b0 sub-object on OST idx 49/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 9539:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x879 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7324:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0xf3ad sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 3988:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2d00 sub-object on OST idx 49/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7746:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x1220 sub-object on OST idx 49/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 9640:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x230f sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7328:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x5b4 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 9476:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2591 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 4062:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x28f3 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 9654:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2251 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7737:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x251c sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7347:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2bd9 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7337:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2560 sub-object on OST idx 49/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7725:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x23d8 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 3908:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x8d2 sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 9484:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x57b sub-object on OST idx 12/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 7403:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0xc7e sub-object on OST idx 49/38: rc = -5
      May 31 10:21:33 weisshorn01 kernel: LustreError: 9689:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x16c9b sub-object on OST idx 12/38: rc = -5
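
To see which OSTs the object-creation failures cluster on, the `lov_update_create_set()` messages above can be tallied per OST index. The following is only a minimal sketch: the function names are hypothetical, the sample lines are copied from the log above, and the mapping from index to target name assumes the index is printed in decimal while target names use a four-digit hexadecimal index. The number after the slash is captured but not interpreted.

```python
import errno
import re
from collections import Counter

# Matches e.g. "error creating fid 0x5ef sub-object on OST idx 49/38: rc = -5"
CREATE_ERR = re.compile(
    r"error creating fid (0x[0-9a-fA-F]+) sub-object on OST idx (\d+)/(\d+): rc = (-?\d+)"
)


def tally_create_errors(lines):
    """Count create errors per (OST index, return code)."""
    counts = Counter()
    for line in lines:
        m = CREATE_ERR.search(line)
        if m:
            counts[(int(m.group(2)), int(m.group(4)))] += 1
    return counts


def describe_rc(rc):
    """Translate a negative kernel return code into its errno name."""
    return errno.errorcode.get(-rc, "unknown")


def ost_name(idx, fsname="scratch"):
    """Assumption: decimal index in the message, hex index in the target name."""
    return f"{fsname}-OST{idx:04x}"


sample = [
    "May 31 10:21:32 weisshorn01 kernel: LustreError: 7733:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x5ef sub-object on OST idx 49/38: rc = -5",
    "May 31 10:21:33 weisshorn01 kernel: LustreError: 9518:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x2560 sub-object on OST idx 12/38: rc = -5",
    "May 31 10:21:33 weisshorn01 kernel: LustreError: 4004:0:(lov_request.c:560:lov_update_create_set()) error creating fid 0x225b sub-object on OST idx 12/38: rc = -5",
]

for (idx, rc), n in sorted(tally_create_errors(sample).items()):
    print(f"{ost_name(idx)} (idx {idx}): {n} error(s), rc = {rc} ({describe_rc(rc)})")
```

Run over the full syslog, a tally like this makes it easier to check whether the failures are limited to the two indices visible above (12 and 49) and to compare them with the OSTs that log reconnects.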

      May 31 10:21:49 weisshorn08 kernel: Lustre: scratch-OST0017: Client scratch-MDT0000-mdtlov_UUID (at 148.187.7.101@o2ib2) reconnecting
      May 31 10:21:49 weisshorn08 kernel: Lustre: Skipped 1 previous similar message
      May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2699:filter_connect_internal()) scratch-OST0017: Received MDS connection for group 0
      May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2699:filter_connect_internal()) Skipped 1 previous similar message
      May 31 10:21:49 weisshorn08 kernel: Lustre: scratch-OST0017: received MDS connection from 148.187.7.101@o2ib2
      May 31 10:21:49 weisshorn08 kernel: Lustre: Skipped 1 previous similar message
      May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff880fc7844800 - ffff880fc68b5000
      May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(llog_net.c:168:llog_receptor_accept()) Skipped 3 previous similar messages
      May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2555:filter_llog_connect()) scratch-OST0017: Recovery from log 0x1400019/0x0:b3e57b1a
      May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2555:filter_llog_connect()) Skipped 1 previous similar message
      May 31 10:21:49 weisshorn04 kernel: Lustre: scratch-OST0025: Client scratch-MDT0000-mdtlov_UUID (at 148.187.7.101@o2ib2) reconnecting
      May 31 10:21:49 weisshorn04 kernel: Lustre: 7284:0:(filter.c:2699:filter_connect_internal()) scratch-OST0025: Received MDS connection for group 0
      May 31 10:21:49 weisshorn04 kernel: Lustre: scratch-OST0025: received MDS connection from 148.187.7.101@o2ib2
      May 31 10:21:49 weisshorn04 kernel: Lustre: 7284:0:(llog_net.c:168:llog_receptor_accept()) changing the import ffff8802d1d4c800 - ffff880fbe6d3800
      May 31 10:21:49 weisshorn04 kernel: Lustre: 7284:0:(llog_net.c:168:llog_receptor_accept()) Skipped 1 previous similar message
      May 31 10:21:49 weisshorn04 kernel: Lustre: 7603:0:(filter.c:2555:filter_llog_connect()) scratch-OST0025: Recovery from log 0x1400027/0x0:b3e57b28
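
The "Recovery from log" messages can be pulled out in the same way to build a timeline of which OST, on which server, re-established its MDS connection and when. This is a rough sketch under the assumption that the syslog format matches the lines above; `extract_recoveries` is a hypothetical helper, not part of any Lustre tool.

```python
import re

# Captures timestamp, host, OST target name and the llog identifier.
RECOVERY = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: Lustre: "
    r".*?(\S+-OST[0-9a-fA-F]{4}): Recovery from log (\S+)"
)


def extract_recoveries(lines):
    """Return (timestamp, host, ost, log_id) for every recovery message."""
    events = []
    for line in lines:
        m = RECOVERY.search(line.strip())
        if m:
            events.append(m.groups())
    return events


sample_log = [
    "May 31 10:17:39 weisshorn06 kernel: Lustre: scratch-OST001d: received MDS connection from 148.187.7.101@o2ib2",
    "May 31 10:17:39 weisshorn06 kernel: Lustre: 7445:0:(filter.c:2555:filter_llog_connect()) scratch-OST001d: Recovery from log 0x140001f/0x0:b3e57b20",
    "May 31 10:21:49 weisshorn08 kernel: Lustre: 7332:0:(filter.c:2555:filter_llog_connect()) scratch-OST0017: Recovery from log 0x1400019/0x0:b3e57b1a",
    "May 31 10:21:49 weisshorn04 kernel: Lustre: 7603:0:(filter.c:2555:filter_llog_connect()) scratch-OST0025: Recovery from log 0x1400027/0x0:b3e57b28",
]

for when, host, ost, log_id in extract_recoveries(sample_log):
    print(f"{when}  {host}  {ost}  {log_id}")
```

Because syslog suppresses repeats ("Skipped N previous similar messages"), the resulting list is a lower bound on the number of reconnects, not an exact count.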

      We are running an IOR benchmark (writing an aggregate file of 512Gb). The first run was fine, but now both LTOP monitoring and the job output show that performance has dropped.
      A few users have also complained about similar behavior when running hdf5 jobs: the first runs were fine, but the following ones were really slow.

      The CPU load on the machines is very low.

          People

            Assignee: Oleg Drokin (green)
            Reporter: Fabio Verzelloni (fverzell)