Lustre / LU-11236

client MDT OST ENOTCONN loops


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      When an OST is deactivated, some MDT-to-OST operations return -ENOTCONN. In some cases this -ENOTCONN is passed back to the client as the Lustre message status, which the client ptlrpc code interprets as meaning that the client has been "abruptly disconnected" from the MDT. The client then reconnects and resends the operation, which fails the same way, causing a loop that lasts as long as the OST stays deactivated.

      o:~# bash $LUSTRE/tests/llmount.sh
      ...
      o:~# lctl set_param osp.lustre-OST0000-osc-MDT0000.active=0
      osp.lustre-OST0000-osc-MDT0000.active=0
      o:~# touch /mnt/lustre/f0
      o:~# chown sanity: /mnt/lustre/f0
      o:~# lctl set_param debug=+trace
      debug=+trace
      o:~# lctl set_param debug_mb=64
      debug_mb=64
      o:~# lctl clear
      o:~# sudo -u sanity chgrp gsanity0 /mnt/lustre/f0 &
      [1] 12744
      o:~# sleep 4
      o:~# lctl set_param osp.lustre-OST0000-osc-MDT0000.active=1
      osp.lustre-OST0000-osc-MDT0000.active=1
      o:~# wait
      [1]+  Done                    sudo -u sanity chgrp gsanity0 /mnt/lustre/f0
      o:~# lctl dk > /tmp/1.dk
      

      With lots of lines elided:

      00000100:00100000:0.0:1534175675.393208:0:12746:0:(client.c:1625:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc chgrp:7dc6380a-74bd-0c12-bf7e-6c11809b0af5:12746:1608699144600192:0@lo:36
      00000100:00100000:0.0:1534175675.393458:0:11330:0:(service.c:2129:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:7dc6380a-74bd-0c12-bf7e-6c11809b0af5+13:12746:x1608699144600192:12345-0@lo:36
      00000004:00000001:0.0:1534175675.393963:0:11330:0:(osp_dev.c:815:osp_sync()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
      00000004:00020000:0.0:1534175675.393967:0:11330:0:(lod_dev.c:1415:lod_sync()) lustre-MDT0000-mdtlov: can't sync ost 0: -107
      00000004:00000001:0.0:1534175675.429293:0:11330:0:(lod_dev.c:1422:lod_sync()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
      00000004:00000001:0.0:1534175675.429298:0:11330:0:(mdd_object.c:1174:mdd_attr_set()) Process leaving via out (rc=18446744073709551509 : -107 : 0xffffffffffffff95)
      00000100:00100000:0.0:1534175675.429886:0:11330:0:(service.c:2179:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt00_001:7dc6380a-74bd-0c12-bf7e-6c11809b0af5+8:12746:x1608699144600192:12345-0@lo:36 Request processed in 36428us (36621us total) trans 0 rc -107/-107
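Each "Process leaving" line above prints the same return code three ways: as an unsigned 64-bit decimal, as a signed decimal, and in hex. On Linux, errno 107 is ENOTCONN. The following is a minimal sketch for decoding the hex form by hand; `decode_rc` is a hypothetical helper name (not part of Lustre), and it assumes a full 16-digit value such as the ones in this log.

```shell
# decode_rc: hypothetical helper, not a Lustre tool.
# Interprets a 16-hex-digit value as a signed 64-bit integer (two's complement).
# The single edge case 8000000000000000 is not handled.
decode_rc() {
    # Strip an optional 0x prefix and normalise to lower case.
    hex=$(printf '%s' "${1#0x}" | tr 'A-F' 'a-f')
    case "$hex" in
        [89a-f]???????????????)
            # Sign bit set: invert every bit, add one, and negate the result.
            inv=$(printf '%s' "$hex" | tr '0123456789abcdef' 'fedcba9876543210')
            echo "-$(( 0x$inv + 1 ))"
            ;;
        *)
            # Sign bit clear (or fewer than 16 digits): plain non-negative value.
            echo "$(( 0x$hex ))"
            ;;
    esac
}
```

For example, `decode_rc ffffffffffffff95` decodes the value shown for osp_sync(), lod_sync() and mdd_attr_set().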
      
      Client:
      00000100:00000001:0.0:1534175675.430424:0:12746:0:(client.c:1284:ptlrpc_check_status()) Process leaving (rc=18446744073709551509 : -107 : ffffffffffffff95)
      00000100:00000001:0.0:1534175675.430429:0:12746:0:(recover.c:232:ptlrpc_request_handle_notconn()) Process entered
      00000100:00080000:0.0:1534175675.430430:0:12746:0:(recover.c:236:ptlrpc_request_handle_notconn()) import lustre-MDT0000-mdc-ffff8eb4e8bb6800 of lustre-MDT0000_UUID@192.168.122.131@tcp abruptly disconnected: reconnecting
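To get a rough idea of how many times the client went around the loop, the "abruptly disconnected" messages in the dump can be counted per import. This is a minimal sketch, assuming the dump is a text file such as the /tmp/1.dk written above and that these messages were captured in it; `count_notconn_reconnects` is a hypothetical helper name, not a Lustre utility.

```shell
# count_notconn_reconnects: hypothetical helper, not a Lustre tool.
# Prints "<count> <import name>" for each import that logged an
# "abruptly disconnected" message from ptlrpc_request_handle_notconn().
count_notconn_reconnects() {
    awk '/ptlrpc_request_handle_notconn/ && /abruptly disconnected/ {
             # The import name is the field right after the word "import".
             for (i = 1; i <= NF; i++)
                 if ($i == "import") { count[$(i + 1)]++; break }
         }
         END { for (name in count) print count[name], name }' "$1"
}
```

Usage: `count_notconn_reconnects /tmp/1.dk`. A count well above 1 for the MDT import during a single chgrp would be consistent with the reconnect-and-resend loop described in this issue.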
      

            People

              Assignee: WC Triage (wc-triage)
              Reporter: John Hammond (jhammond)