Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Environment: centos7, x86_64, OPA, zfs, compression on
    • Severity: 3

    Description

      Hi,

      Client nodes hang and loop forever when codes go over group quota. Lustre is very verbose when this happens; the below is pretty typical. john50 is a client; the arkles are OSSes.

      Jun 23 13:22:42 john50 kernel: LNetError: 895:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.32@o2ib44, match 1603955949771616 length 1048576 too big: 1048208 left, 1048208 allowed
      Jun 23 13:22:42 arkle2 kernel: LustreError: 297785:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff8815b4044a00
      Jun 23 13:22:42 arkle2 kernel: LustreError: 297785:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff8815b4044a00
      Jun 23 13:22:42 arkle2 kernel: LustreError: 272882:0:(ldlm_lib.c:3253:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff8816bf52cc50 x1603955949771616/t0(0) o4->fe53a66a-b8c6-e1de-1353-a3b91bd42058@192.168.44.150@o2ib44:548/0 lens 608/448 e 0 to 0 dl 1529724168 ref 1 fl Interpret:/0/0 rc 0/0
      Jun 23 13:22:42 arkle2 kernel: LustreError: 272882:0:(ldlm_lib.c:3253:target_bulk_io()) Skipped 73 previous similar messages
      Jun 23 13:22:42 arkle2 kernel: Lustre: dagg-OST0002: Bulk IO write error with fe53a66a-b8c6-e1de-1353-a3b91bd42058 (at 192.168.44.150@o2ib44), client will retry: rc = -110
      Jun 23 13:22:42 arkle2 kernel: Lustre: Skipped 73 previous similar messages
      Jun 23 13:22:49 john50 kernel: Lustre: 906:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1529724162/real 1529724162]  req@ffff8817b7ab4b00 x1603955949771616/t0(0) o4->dagg-OST0002-osc-ffff882fcafb0800@192.168.44.32@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1529724169 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      Jun 23 13:22:49 john50 kernel: Lustre: 906:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
      Jun 23 13:22:49 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection to dagg-OST0002 (at 192.168.44.32@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
      Jun 23 13:22:49 arkle2 kernel: Lustre: dagg-OST0002: Client fe53a66a-b8c6-e1de-1353-a3b91bd42058 (at 192.168.44.150@o2ib44) reconnecting
      Jun 23 13:22:49 arkle2 kernel: Lustre: Skipped 72 previous similar messages
      Jun 23 13:22:49 arkle2 kernel: Lustre: dagg-OST0002: Connection restored to  (at 192.168.44.150@o2ib44)
      Jun 23 13:22:49 arkle2 kernel: Lustre: Skipped 59 previous similar messages
      Jun 23 13:22:49 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection restored to 192.168.44.32@o2ib44 (at 192.168.44.32@o2ib44)
      Jun 23 13:22:49 john50 kernel: LNetError: 893:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.32@o2ib44, match 1603955949773296 length 1048576 too big: 1048208 left, 1048208 allowed
      Jun 23 13:22:49 arkle2 kernel: LustreError: 297783:0:(events.c:449:server_bulk_callback()) event type 5, status -61, desc ffff881055ca0e00
      Jun 23 13:22:49 arkle2 kernel: LustreError: 297783:0:(events.c:449:server_bulk_callback()) event type 3, status -61, desc ffff881055ca0e00
      Jun 23 13:22:56 john50 kernel: Lustre: 906:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1529724169/real 1529724169]  req@ffff8817b7ab4b00 x1603955949771616/t0(0) o4->dagg-OST0002-osc-ffff882fcafb0800@192.168.44.32@o2ib44:6/4 lens 608/448 e 0 to 1 dl 1529724176 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      Jun 23 13:22:56 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection to dagg-OST0002 (at 192.168.44.32@o2ib44) was lost; in progress operations using this service will wait for recovery to complete
      Jun 23 13:22:56 john50 kernel: Lustre: dagg-OST0002-osc-ffff882fcafb0800: Connection restored to 192.168.44.32@o2ib44 (at 192.168.44.32@o2ib44)
      Jun 23 13:22:56 john50 kernel: LNetError: 895:0:(lib-ptl.c:190:lnet_try_match_md()) Matching packet from 12345-192.168.44.32@o2ib44, match 1603955949773424 length 1048576 too big: 1048208 left, 1048208 allowed
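
      If I'm reading the lnet_try_match_md lines right, the incoming message is a full 1 MiB but the buffer the client has posted for it is slightly smaller. A quick sanity check on the numbers in the log above:

      # length the OSS sends vs. the space the client has left/allowed
      echo $((1048576 - 1048208))   # = 368 bytes short of a full 1 MiB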
      

      The messages don't go away when the code exits, and the node stays (somewhat) broken afterwards; only rebooting the client seems to fix it.

      Reports from the users seem to indicate that it's not 100% repeatable, but it is reasonably close.

      Our OSTs are pretty plain and simple: raidz3 12+3 vdevs, with 4 of those making up one OST pool, 2M recordsize and compression on.

      [arkle2]root: zfs get all | grep /OST | egrep 'compression|record'
      arkle2-dagg-OST2-pool/OST2  recordsize            2M                                         local
      arkle2-dagg-OST2-pool/OST2  compression           lz4                                        inherited from arkle2-dagg-OST2-pool
      arkle2-dagg-OST3-pool/OST3  recordsize            2M                                         local
      arkle2-dagg-OST3-pool/OST3  compression           lz4                                        inherited from arkle2-dagg-OST3-pool
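
      For reference, those properties were just set the usual way on the pools/datasets; a sketch using one OST as an example (names as above):

      # compression is inherited from the pool-level dataset; recordsize is set locally on the OST dataset
      zfs set compression=lz4 arkle2-dagg-OST2-pool
      zfs set recordsize=2M arkle2-dagg-OST2-pool/OST2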
      

      cheers,
      robin

          Activity

            [LU-11093] clients hang when over quota
            pjones Peter Jones added a comment -

            Good news - thanks!

            scadmin SC Admin added a comment -

            Hi,

            We haven't seen this issue again, so we're presuming it was fixed by LU-10683.
            Thanks!

            cheers,
            robin

            scadmin SC Admin added a comment -

            OK, we're running that patch on the largest of the 4 filesystems now. We'll let you know if we see it again. Thanks!

            cheers,
            robin

            hongchao.zhang Hongchao Zhang added a comment -

            I have managed to reproduce the "BAD CHECKSUM ERROR" locally, but can't reproduce the "lnet_try_match_md" issue. It could be the same one, though, caused by some bug in the osd_zfs module.

            Could you please try the patch https://review.whamcloud.com/32788 in LU-10683?
            Thanks!

            scadmin SC Admin added a comment -

            Hi Hongchao,

            Thanks for looking at this.

            All clients have the same settings:

            $ lctl get_param osc.*.max_pages_per_rpc
            osc.apps-OST0000-osc-ffff8ad8dad4c000.max_pages_per_rpc=256
            osc.apps-OST0001-osc-ffff8ad8dad4c000.max_pages_per_rpc=256
            osc.dagg-OST0000-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0001-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0002-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0003-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0004-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0005-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0006-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0007-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0008-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST0009-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST000a-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST000b-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST000c-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST000d-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST000e-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.dagg-OST000f-osc-ffff8ac199cad800.max_pages_per_rpc=512
            osc.home-OST0000-osc-ffff8af01e3ff800.max_pages_per_rpc=256
            osc.home-OST0001-osc-ffff8af01e3ff800.max_pages_per_rpc=256
            osc.images-OST0000-osc-ffff8ad8da08d000.max_pages_per_rpc=256
            osc.images-OST0001-osc-ffff8ad8da08d000.max_pages_per_rpc=256
            

            And on the servers for the big filesystem (group quotas):

            obdfilter.dagg-OST0000.brw_size=2
            obdfilter.dagg-OST0001.brw_size=2
            obdfilter.dagg-OST0002.brw_size=2
            obdfilter.dagg-OST0003.brw_size=2
            obdfilter.dagg-OST0004.brw_size=2
            obdfilter.dagg-OST0005.brw_size=2
            obdfilter.dagg-OST0006.brw_size=2
            obdfilter.dagg-OST0007.brw_size=2
            obdfilter.dagg-OST0008.brw_size=2
            obdfilter.dagg-OST0009.brw_size=2
            obdfilter.dagg-OST000a.brw_size=2
            obdfilter.dagg-OST000b.brw_size=2
            obdfilter.dagg-OST000c.brw_size=2
            obdfilter.dagg-OST000d.brw_size=2
            obdfilter.dagg-OST000e.brw_size=2
            obdfilter.dagg-OST000f.brw_size=2
            

            And the small filesystems (only /home has user quotas; the rest have no quotas):

            obdfilter.apps-OST0000.brw_size=1
            obdfilter.apps-OST0001.brw_size=1
            obdfilter.home-OST0000.brw_size=1
            obdfilter.home-OST0001.brw_size=1
            obdfilter.images-OST0000.brw_size=1
            obdfilter.images-OST0001.brw_size=1
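
            Those line up with the client max_pages_per_rpc values above, assuming 4 KiB pages (just double-checking the arithmetic):

            # brw_size is in MiB, max_pages_per_rpc is in PAGE_SIZE pages
            echo $((2 * 1024 * 1024 / 4096))   # dagg: 2 MiB -> 512 pages
            echo $((1 * 1024 * 1024 / 4096))   # apps/home/images: 1 MiB -> 256 pages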
            

            It'll take a while for us to try to reproduce the problem artificially. I don't think I even have a user account with quotas, so I'll have to get one set up, etc.
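
            Presumably something along these lines once a test account/group with a quota exists (group name, limits, and paths below are made up):

            # give a hypothetical test group a small block hard limit on the dagg filesystem
            lfs setquota -g testgrp -b 0 -B 1G /mnt/dagg
            # then, as that group, write incompressible data past the limit (zeros would just compress away under lz4)
            dd if=/dev/urandom of=/mnt/dagg/testdir/overquota bs=1M count=2048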

            cheers,
            robin

            hongchao.zhang Hongchao Zhang added a comment - - edited

            Hi Robin,

            Is it possible to apply a debug patch at your site and collect some logs when this issue is triggered?
            I can't reproduce this issue locally, and it would be better to have more logs to trace this problem.
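
            Even without a debug patch, something like the following on an affected client around the time of the hang should capture useful information (only a sketch; adjust the debug mask as needed):

            # on the client, before reproducing the over-quota writes
            lctl set_param debug=+net
            lctl clear
            # ... reproduce the hang ...
            lctl dk /tmp/lustre-debug.log   # dump the kernel debug buffer to a file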

            By the way, what are the following values at your site?

            #at OST
            lctl get_param obdfilter.*.brw_size
            
            #at Client
            lctl get_param osc.*.max_pages_per_rpc
            

            Thanks,
            Hongchao


            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: scadmin SC Admin
              Votes: 0
              Watchers: 3
