[LU-1503] Clients application IO errors and overloaded system messages Created: 11/Jun/12  Updated: 20/Jan/14  Resolved: 20/Jan/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Nicola Bianchi Assignee: Cliff White (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

----------------------------------------------------------------------------------------------------

    MDS HW:
    • Linux XXXX.admin.cscs.ch 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64
      Architecture: x86_64
      CPU op-mode(s): 32-bit, 64-bit
      Byte Order: Little Endian
      CPU(s): 16
      Vendor ID: AuthenticAMD
      CPU family: 16
      64GB RAM
      Interconnect: IB 40Gb/s

      MDT: LSI 5480 Pikes Peak
      SSDs: SLC
----------------------------------------------------------------------------------------------------

    OSS HW:
    • Architecture: x86_64
      CPU op-mode(s): 32-bit, 64-bit
      Byte Order: Little Endian
      CPU(s): 32
      Vendor ID: GenuineIntel
      CPU family: 6
      64GB RAM
      Interconnect: IB 40Gb/s

      OSTs ---> LSI 7900 SATA Disks

----------------------------------------------------------------------------------------------------

    Router nodes:
    • 12 Cray XE6 Service nodes as router nodes - IB 40Gb/s

----------------------------------------------------------------------------------------------------

    Clients:
    • ~ 1500 Cray XE6 nodes - Lustre 1.8.6

----------------------------------------------------------------------------------------------------

    LUSTRE Config:
    • 1 MDS + 1 failover (MDT on SSD array)
      12 OSSs - 6 OSTs per OSS (72 OSTs)

Lustre Servers ---> 2.2.51.0
Lustre Clients ---> 1.8.6 (~1500 nodes) / 2.2.51.0 (~20 nodes)
----------------------------------------------------------------------------------------------------


Attachments: File cluster.log     File cluster.log-2012-06-09_05     File cluster.log-2012-06-11_12     File craylog-2012-06-09.log     File craylog-2012-06-11.log     File debug_lustre     File drop_conn.log     PDF File ganglia-load-2012-06-09.pdf     PDF File ganglia-load-2012-06-11.pdf     File log1     File log_11_jul     File messages_router_node.log    
Severity: 1
Rank (Obsolete): 4042

 Description   

Dear Support,
we have experienced some problems with our Lustre FS.
Our users complained that at the following times their jobs were killed due to I/O errors:

Saturday 09 June 05:58
Monday 11 June 12:47

We collected the logs from the server and client sides, and we saw a lot of messages/errors that we have trouble "decoding".
Could you please help us understand why this problem arises?
Specifically, we don't understand whether it is really an overload problem related to the hardware or the configuration we use, or rather some congestion/bug issue...

Usually the FS seems to work correctly, but suddenly the logs fill up with these messages.

Regards
Nicola



 Comments   
Comment by Cliff White (Inactive) [ 11/Jun/12 ]

Do you know what the load factor was on the servers when the outages occurred? It appears you are having client evictions, due to server timeouts.

Comment by Nicola Bianchi [ 12/Jun/12 ]

Cliff,
it seems that there was more load than usual on the Lustre servers ...
It would be interesting to understand why, because the hardware behind it is pretty powerful (12 brand-new Sandy Bridge OSSs and 6 LSI 7900 controllers with 8Gbit FC interfaces).

Could it be some configuration problem?

Regards
Nicola

Comment by Fabio Verzelloni [ 12/Jun/12 ]

/modprobe.conf.local of router nodes on our Cray XE system:

---------------------------------------------------------
options dvsipc_lnet lnd_name=gni1
options qla2xxx ql2xfailover=0
options libcfs libcfs_panic_on_lbug=1
options lnet ip2nets="gni 172.27..; o2ib2 148.187.7.*"
options lnet routes="gni 148.187.6.[71-82]@o2ib2; o2ib2 [220,226,270,304,394,436,474,484,1364,1386,1476,1530]@gni"
options lnet check_routers_before_use=1 router_ping_timeout=5
options lnet dead_router_check_interval=60 live_router_check_interval=60
options kptllnd max_nodes=5000 credits=2048 timeout=250

# Enable MSI for Mellanox ConnectX HCAs
options mlx4_core msi_x=1
---------------------------------------------------------

modprobe.conf.local of a client that mounts the file system on our Cray XE:
---------------------------------------------------------
options lnet ip2nets="gni1,gni 172.27.. ;o2ib 148.187.6.;o2ib2 148.187.7."
options lnet routes="gni1 148.187.6.[142,140,141,146]@o2ib; o2ib [62,174,967,1110]@gni1; gni 148.187.7.[71-82]@o2ib2; o2ib2 [220,226,270,304,394,436,474,484,1364,1386,1476,1530]@gni"
options lnet check_routers_before_use=1 router_ping_timeout=5
options lnet dead_router_check_interval=60 live_router_check_interval=60
options qla2xxx ql2xlogintimeout=0
options ost oss_num_threads=256
options libcfs libcfs_panic_on_lbug=1
options dvsipc_lnet lnd_name=gni1
---------------------------------------------------------

modprobe.conf of a server:
---------------------------------------------------------
options lnet networks="o2ib2(ib0)"
options lnet routes="gni 148.187.7.[71-82]@o2ib2"
options lnet check_routers_before_use=1
options lnet router_ping_timeout=5
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
---------------------------------------------------------

Comment by Cliff White (Inactive) [ 12/Jun/12 ]

I don't think so; with configuration issues things tend to Not Work. I think this may be network-load related. Are you sure the network is healthy? What does the load look like on the router nodes?

Comment by Nicola Bianchi [ 13/Jun/12 ]

Cliff,
thanks for the quick answer.
From what we saw, the 12 router nodes seem to be pretty quiet, but we will monitor the situation.

For the InfiniBand network part we are checking that everything is working well; so far the common IB tests report good shape, but we see some strange behavior on clients that mount the FS directly over IB. In the logs we found messages like this:

[163598.503276] LustreError: 11-0: an error occurred while communicating with 148.187.7.106@o2ib2. The ost_connect operation failed with -16
[163598.503287] LustreError: Skipped 1 previous similar message

...and we found that common operations such as untarring a 2GB file sometimes hang in conjunction with the messages above.

Any advice is welcome.

Regards
Nicola

Comment by Cliff White (Inactive) [ 13/Jun/12 ]

The -EBUSY would indicate that the OST was in recovery, or disconnecting/reconnecting. You should not see that message in normal operation. Are there any 'slow IO' messages in your logs?
You need to check the logs for 148.187.7.106 and see what was happening on that server when the client reported the -EBUSY (-16).

I would suggest running lnet_selftest to verify your network health; there are example scripts in the Lustre Manual.

Comment by Nicola Bianchi [ 14/Jun/12 ]

Cliff,
in the logs for 148.187.7.106 (weisshorn06) we got this kind of messages:

-----------------------------------------------------------------------------------
Jun 14 09:50:47 weisshorn06 kernel: Lustre: scratch-OST0041: Client 1d3575c8-d2c3-3917-0d4b-ae0b2c9af8d8 (at 148.187.6.237@o2ib2) reconnecting
Jun 14 09:50:47 weisshorn06 kernel: Lustre: Skipped 5 previous similar messages
Jun 14 09:50:47 weisshorn06 kernel: Lustre: scratch-OST0041: Client 1d3575c8-d2c3-3917-0d4b-ae0b2c9af8d8 (at 148.187.6.237@o2ib2) refused reconnection, still busy with 11 active R
PCs
Jun 14 09:50:47 weisshorn06 kernel: Lustre: Skipped 1 previous similar message
Jun 14 09:50:47 weisshorn06 kernel: LustreError: 6370:0:(ldlm_lib.c:2725:target_bulk_io()) @@@ bulk GET failed: rc 107 req@ffff88101b3b9050 x1403838171590959/t0(0) o4>1d3575c8-
d2c3-3917-0d4b-ae0b2c9af8d8@148.187.6.237@o2ib2:0/0 lens 456/416 e 0 to 0 dl 1339660290 ref 1 fl Interpret:/0/0 rc 0/0
Jun 14 09:50:47 weisshorn06 kernel: Lustre: scratch-OST0041: Bulk IO write error with 1d3575c8-d2c3-3917-0d4b-ae0b2c9af8d8 (at 148.187.6.237@o2ib2), client will retry: rc -107
Jun 14 09:50:47 weisshorn06 kernel: LustreError: 6370:0:(ldlm_lib.c:2725:target_bulk_io()) Skipped 3 previous similar messages
Jun 14 09:50:47 weisshorn06 kernel: LustreError: 8029:0:(ldlm_lib.c:2725:target_bulk_io()) @@@ bulk GET failed: rc 107 req@ffff880c9942fc00 x1403838171590983/t0(0) o4>1d3575c8-
d2c3-3917-0d4b-ae0b2c9af8d8@148.187.6.237@o2ib2:0/0 lens 456/416 e 0 to 0 dl 1339660290 ref 1 fl Interpret:/0/0 rc 0/0
Jun 14 09:50:47 weisshorn06 kernel: LustreError: 8029:0:(ldlm_lib.c:2725:target_bulk_io()) Skipped 4 previous similar messages
Jun 14 09:51:11 weisshorn06 kernel: Lustre: scratch-OST0041: Client 1d3575c8-d2c3-3917-0d4b-ae0b2c9af8d8 (at 148.187.6.237@o2ib2) reconnecting
-----------------------------------------------------------------------------------

I attach also the complete log that contain all the messages from all the OSS/MDS.

Regards
Nicola

Comment by Nicola Bianchi [ 14/Jun/12 ]

Cliff,
about the lnet_selftest we run a batch script like in the manual and we got this:

-------------------------------------------------------------------------------------
SESSION: read/write TIMEOUT: 300 FORCE: No
148.187.7.[101-114]@o2ib2 are added to session
148.187.6.[6,7,8]@o2ib2 are added to session
148.187.6.[6,7,8]@o2ib2 are added to session
Test was added successfully
Test was added successfully
bulk_rw is running now
[LNet Rates of servers]
[R] Avg: 1395 RPC/s Min: 1 RPC/s Max: 5802 RPC/s
[W] Avg: 1555 RPC/s Min: 1 RPC/s Max: 6653 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 72.86 MB/s Min: 0.00 MB/s Max: 163.51 MB/s
[W] Avg: 163.19 MB/s Min: 0.00 MB/s Max: 855.02 MB/s
[LNet Rates of servers]
[R] Avg: 1450 RPC/s Min: 15 RPC/s Max: 5744 RPC/s
[W] Avg: 1610 RPC/s Min: 15 RPC/s Max: 6599 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 105.09 MB/s Min: 0.00 MB/s Max: 240.24 MB/s
[W] Avg: 163.03 MB/s Min: 0.00 MB/s Max: 858.62 MB/s
[LNet Rates of servers]
[R] Avg: 1459 RPC/s Min: 0 RPC/s Max: 5800 RPC/s
[W] Avg: 1619 RPC/s Min: 0 RPC/s Max: 6657 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 88.47 MB/s Min: 0.00 MB/s Max: 199.88 MB/s
[W] Avg: 163.60 MB/s Min: 0.00 MB/s Max: 858.06 MB/s
[LNet Rates of servers]
[R] Avg: 2946 RPC/s Min: 0 RPC/s Max: 7408 RPC/s
[W] Avg: 3104 RPC/s Min: 0 RPC/s Max: 8241 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 85.92 MB/s Min: 0.00 MB/s Max: 192.92 MB/s
[W] Avg: 163.32 MB/s Min: 0.00 MB/s Max: 838.31 MB/s
[LNet Rates of servers]
[R] Avg: 1374 RPC/s Min: 0 RPC/s Max: 5569 RPC/s
[W] Avg: 1534 RPC/s Min: 0 RPC/s Max: 6419 RPC/s
[LNet Bandwidth of servers]
[R] Avg: 70.01 MB/s Min: 0.00 MB/s Max: 162.32 MB/s
[W] Avg: 165.01 MB/s Min: 0.00 MB/s Max: 854.23 MB/s
session is ended
lustre_self_test.sh: line 17: 17805 Terminated lst stat servers
-------------------------------------------------------------------------------------

It is still difficult for us to understand the exact meaning of the results, but running the test multiple times we always got the same scores.
At a glance: we ran 3 clients (which from time to time lose their connections) against the 14 nodes of the Lustre FS.

Here the script we used:
-------------------------------------------------------------------------------------
#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 148.187.7.[101-114]@o2ib2
lst add_group readers 148.187.6.[6,7,8]@o2ib2
lst add_group writers 148.187.6.[6,7,8]@o2ib2
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --from writers --to servers \
brw write check=full size=4K
##start running
lst run bulk_rw
##display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
##tear down
lst end_session
-------------------------------------------------------------------------------------

Regards
Nicola

Comment by Cliff White (Inactive) [ 14/Jun/12 ]

Those results tell you how much I/O you can generate from three clients. How many clients do you normally run?
I would suggest doing a test with all your clients, and monitoring your network hardware for errors.
Also, please list the distribution and Lustre version for your clients and servers.
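
If it helps, one quick way to collect that (a minimal sketch, assuming pdsh reaches the weisshorn servers and that the standard lctl/proc entries are present) would be:
-------------------------------------------------------------------------------------
# on the servers (hostnames assumed):
pdsh -w weisshorn[01-14] 'cat /etc/redhat-release; lctl get_param -n version | head -1' | dshbak -c
# on a client (Cray login node or IB client):
lctl get_param -n version | head -1
-------------------------------------------------------------------------------------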

Comment by Cliff White (Inactive) [ 14/Jun/12 ]

From the router logs, it appears the gni side is not especially happy; there are timeouts and mis-routes:

Jun 11 12:29:48 nid00484 kernel: LNet: 12032:0:(gnilnd_conn.c:1872:kgnilnd_reaper_dgram_check()) GNILND_DGRAM_REQ datagram to 385@gni timed out @ 63s dgram 0xffff8803cf3e26c8 state GNILND_DGRAM_POSTED conn 0xffff880406d64400
Jun 11 12:29:48 nid00484 kernel: HWERR[2899]:0x0b11:SSID Detected Misrouted Packet:Info1=0x8001025000014281:Info2=0x0:Info3=0x2072
Jun 11 12:29:51 nid00484 kernel: LNet: could not send to 385@gni due to connection setup failure after 66 seconds
Jun 11 12:29:51 nid00484 kernel: LNet: Skipped 12 previous similar messages
Jun 11 12:29:51 nid00484 kernel: LNet: 12032:0:(gnilnd_conn.c:790:kgnilnd_process_dgram()) hardware timeout for connect to 385@gni after 0 seconds. Is node dead?
Jun 11 12:29:51 nid00484 kernel: HWERR[2900]:0x0b11:SSID Detected Misrouted Packet:Info1=0x8001025000014281:Info2=0x0:Info3=0x20cc
Jun 11 12:30:12 nid00484 kernel: HWERR[2901]:0x0b11:SSID Detected Misrouted Packet:Info1=0x8001025000014281:Info2=0x0:Info3=0x2126
Jun 11 12:30:34 nid00484 kernel: HWERR[2902]:0x0b11:SSID Detected Misrouted Packet:Info1=0x8001025000014281:Info2=0x0:Info3=0x2180
Jun 11 12:30:56 nid00484 kernel: LNet: could not send to 385@gni due to connection setup failure after 65 seconds

Can you identify which node uses '385@gni' as its address? Checking that node's logs might produce some useful information.
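
For reference, a minimal way to confirm the mapping (a sketch only; it assumes the number in 'N@gni' matches the Cray node id nidNNNNN and that the node is reachable):
-------------------------------------------------------------------------------------
# on the node in question, its LNet NID should confirm the match:
lctl list_nids                      # expected to print something like 385@gni
# then look at that node's entries in the collected console logs:
grep nid00385 craylog-2012-06-11.log | less
-------------------------------------------------------------------------------------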

Comment by Nicola Bianchi [ 15/Jun/12 ]

Cliff,
usually we run with about 1500 Cray XE6 nodes and ~20 standard Linux InfiniBand nodes.

Cheers
Nicola

Comment by Nicola Bianchi [ 15/Jun/12 ]

Cliff,
about the '385@gni':

this node failed (HW failure, nid00385 c4-0c0s0n1), and that is why you see the error.

Cheers
Nicola

Comment by Nicola Bianchi [ 18/Jun/12 ]

Cliff,
we noticed that when our 1500-node Cray machine has congestion problems (due to a HW problem, for instance), the Lustre servers suffer, as you can see in the logs.

Is there some timeout parameter that we can tune to mitigate the situation?

We know that on the Cray side the time needed to reconfigure the routing after an error should not be that long, but it seems that something in the filesystem is still hanging ...

Regards
Nicola

Comment by Cliff White (Inactive) [ 18/Jun/12 ]

I will investigate, but most timeouts in lustre are auto-tuned. I am talking to our engineers.

Comment by Liang Zhen (Inactive) [ 21/Jun/12 ]

I didn't get a chance to look into it, but it seems to be gnilnd, right? It's not in the Lustre source tree yet; where can we find the source code?

Doug, I've added you to the CC list; could you please look into it? I'm quite busy these next few days and have to attend a two-day meeting starting tomorrow.

Comment by Cliff White (Inactive) [ 22/Jun/12 ]

Hi Cliff,
for your information, on weisshorn02 you can find all the collected logs from the entire cluster in /var/log/cluster.log.
And if you run the command 'ltop' you can have a look at the online performance.

What do you think about these parameters to pass to the routers/clients/servers? Do they make any sense to you?

    Router:

options ko2iblnd timeout=100 peer_timeout=130
options ko2iblnd credits=2048 ntx=2048
options ko2iblnd peer_credits=126 concurrent_sends=63 peer_buffer_credits=128
options kgnilnd credits=2048 peer_health=1

options lnet check_routers_before_use=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
options lnet router_ping_timeout=50
options lnet large_router_buffers=1024 small_router_buffers=16384

    Server:

options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
options ko2iblnd peer_credits=126 concurrent_sends=63

options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
options lnet check_routers_before_use=1

    Client:

options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
options ko2iblnd peer_credits=126 concurrent_sends=63

options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
options lnet check_routers_before_use=1

Comment by Cliff White (Inactive) [ 22/Jun/12 ]

The area of concern for me is this:
#cat /proc/sys/lnet/peers

nid                      refs state  last   max   rtr   min    tx   min queue
....
148.187.7.71@o2ib2          6    up  9999     8     8     8     5  -177 1824
148.187.7.72@o2ib2          7    up  9999     8     8     8     5  -177 1824
148.187.7.73@o2ib2          5    up  9999     8     8     8     6  -175 1216
148.187.7.74@o2ib2          6    up  9999     8     8     8     5  -177 1824
148.187.7.75@o2ib2          6    up  9999     8     8     8     5  -178 1824
148.187.7.76@o2ib2          8    up  9999     8     8     8     4  -177 2432
148.187.7.77@o2ib2          6    up  9999     8     8     8     5  -177 1824
148.187.7.78@o2ib2          7    up  9999     8     8     8     5  -178 1824
148.187.7.79@o2ib2          7    up  9999     8     8     8     4  -176 2432
148.187.7.80@o2ib2          7    up  9999     8     8     8     4  -176 2432
148.187.7.81@o2ib2          6    up  9999     8     8     8     5  -178 1824
148.187.7.82@o2ib2          7    up  9999     8     8     8     5  -178 1824
...

Those are the gni routers, and that is not especially normal.
All of your options that have the word 'check' in them are good and normal.
We think the peer_credits value is too high; we would advise

options ko2iblnd peer_credits=16 concurrent_sends=16

The 'min' parameter, when negative, indicates the number of queued messages, and 'queue' indicates
the number of bytes queued from that peer. So that is an indication that things are backed up.
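
As an illustration only (assuming the column layout shown above, where the second 'min' is field 9 and 'queue' is field 10), a one-liner like this could be run on a server to flag peers that have been backing up:
-------------------------------------------------------------------------------------
awk 'NR > 1 && $9 < 0 { printf "%-28s min=%-6s queue=%s\n", $1, $9, $10 }' /proc/sys/lnet/peers
-------------------------------------------------------------------------------------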

Comment by Doug Oucharek (Inactive) [ 22/Jun/12 ]

I suspect that messages_router_node.log gives a good indication of what may be happening:

At:
12:28:23 - problems communicating with one node are reported.
12:28:28 - a hardware quiesce is reported. I'm not familiar with the gnilnd (or Gemini), but this implies to me that a hardware bit in Gemini has flipped telling software to back off... things are not good.
12:28:40 - All the gnilnd threads are now paused due to the hardware quiesce. At this point, timeouts and message drops are rampant. That would be due to everything being paused.
12:28:46 - All threads are back up and running again, allowing traffic to flow.

So, this is how I read what is going on (but a Gemini expert is needed to clarify):
The node which has a hardware fault somehow triggers back pressure on the gnilnd of the router. This triggers the hardware to complain, forcing the gnilnd to shut down all traffic on that interface for 6 seconds! In that time, all the IB queues which need to forward to that interface will back up, possibly causing many critical message drops. It is hard to say how long after the 6-second lockdown it takes for everything to recover.

Comment by Nicola Bianchi [ 25/Jun/12 ]

Dear Support,

this morning at Jun 25 11:51:18 the MDS (weisshorn01) crashed due to a kernel panic, just after the OSS weisshorn14 crashed for unknown reasons.
All logs are on weisshorn02:/var/log/cluster.log

Regards
Nicola

Comment by Cliff White (Inactive) [ 25/Jun/12 ]

Crash one:
Jun 4 10:10:57 weisshorn01 kernel: LustreError: 4365:0:(mdd_object.c:635:mdd_big_lmm_get()) ASSERTION( ma->ma_lmm_size > 0 ) failed:
Jun 4 10:10:57 weisshorn01 kernel: LustreError: 4365:0:(mdd_object.c:635:mdd_big_lmm_get()) LBUG
Jun 4 10:10:57 weisshorn01 kernel: Pid: 4365, comm: mdt_09
Jun 4 10:10:57 weisshorn01 kernel:
Jun 4 10:10:57 weisshorn01 kernel: Call Trace:
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa03c0915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa03c0e47>] lbug_with_loc+0x47/0xb0 [libcfs]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0bf29e3>] mdd_big_lmm_get+0x433/0x4f0 [mdd]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0bfb9a0>] ? mdd_get_md+0xa0/0x2d0 [mdd]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0bf34ee>] __mdd_lmm_get+0x1ce/0x2c0 [mdd]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0bf6c09>] mdd_attr_get_internal+0x249/0x770 [mdd]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0d1279f>] ? osd_object_read_lock+0x9f/0x140 [osd_ldiskfs]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0bf7188>] mdd_attr_get_internal_locked+0x58/0x80 [mdd]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0604af0>] ? ldlm_completion_ast+0x0/0x6d0 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c6b3c0>] ? mdt_blocking_ast+0x0/0x210 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0bf71ed>] mdd_attr_get+0x3d/0xa0 [mdd]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0d618bc>] cml_attr_get+0x6c/0x160 [cmm]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0625d84>] ? lustre_msg_get_opc+0x64/0xa0 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c77884>] mdt_getattr_internal+0x294/0xd00 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c7adf5>] mdt_getattr_name_lock+0xd25/0x1700 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa062607d>] ? lustre_msg_buf+0x5d/0x60 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa064b756>] ? __req_capsule_get+0x176/0x640 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0627d74>] ? lustre_msg_get_flags+0x34/0x70 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c7bcdd>] mdt_intent_getattr+0x2cd/0x4a0 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c79591>] mdt_intent_policy+0x2d1/0x600 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa05eae69>] ldlm_lock_enqueue+0x2f9/0x830 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa060c197>] ldlm_handle_enqueue0+0x427/0xda0 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c791d6>] mdt_enqueue+0x46/0x130 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c7184d>] mdt_handle_common+0x74d/0x1400 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0c725d5>] mdt_regular_handle+0x15/0x20 [mdt]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa06333c1>] ptlrpc_server_handle_request+0x3c1/0xcb0 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa03c14ce>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa03cbef9>] ? lc_watchdog_touch+0x79/0x110 [libcfs]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa062d462>] ? ptlrpc_wait_event+0xb2/0x2c0 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa06343cf>] ptlrpc_main+0x71f/0x1210 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0633cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jun 4 10:10:57 weisshorn01 kernel: [<ffffffffa0633cb0>] ? ptlrpc_main+0x0/0x1210 [ptlrpc]
Jun 4 10:12:28 weisshorn02 heartbeat: [3260]: WARN: node weisshorn01.admin.cscs.ch: is dead

Comment by Cliff White (Inactive) [ 25/Jun/12 ]

Second crash:

Jun 25 11:51:18 weisshorn01 kernel: LustreError: 3589:0:(ldlm_lock.c:831:ldlm_lock_decref_and_cancel()) ASSERTION( lock != ((void *)0) ) failed:
Jun 25 11:51:18 weisshorn01 kernel: LustreError: 3589:0:(ldlm_lock.c:831:ldlm_lock_decref_and_cancel()) LBUG
Jun 25 11:51:18 weisshorn01 kernel: Pid: 3589, comm: mgs_scratch_not
Jun 25 11:51:18 weisshorn01 kernel:
Jun 25 11:51:18 weisshorn01 kernel: Call Trace:
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa03bc915>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa03bce47>] lbug_with_loc+0x47/0xb0 [libcfs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa05e92e1>] ldlm_lock_decref_and_cancel+0x111/0x120 [ptlrpc]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b6448b>] mgs_completion_ast_ir+0xfb/0x110 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0600810>] ldlm_cli_enqueue_local+0x1f0/0x4d0 [ptlrpc]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b64390>] ? mgs_completion_ast_ir+0x0/0x110 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa05ff940>] ? ldlm_blocking_ast+0x0/0x130 [ptlrpc]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b6429f>] mgs_revoke_lock+0x13f/0x230 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa05ff940>] ? ldlm_blocking_ast+0x0/0x130 [ptlrpc]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b64390>] ? mgs_completion_ast_ir+0x0/0x110 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa03c65c1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b7a4ec>] mgs_ir_notify+0x11c/0x230 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffff81007966>] ? xen_timer_interrupt+0x16/0x1b0
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b7a3d0>] ? mgs_ir_notify+0x0/0x230 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffff8100c14a>] child_rip+0xa/0x20
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b7a3d0>] ? mgs_ir_notify+0x0/0x230 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffffa0b7a3d0>] ? mgs_ir_notify+0x0/0x230 [mgs]
Jun 25 11:51:18 weisshorn01 kernel: [<ffffffff8100c140>] ? child_rip+0x0/0x20
Jun 25 11:51:18 weisshorn01 kernel:
Jun 25 11:51:18 weisshorn01 kernel: Kernel panic - not syncing: LBUG
Jun 25 11:51:18 weisshorn01 kernel: Pid: 3589, comm: mgs_scratch_not Not tainted 2.6.32-220.7.1.el6_lustre.g9c8f747.x86_64 #1

Comment by Cliff White (Inactive) [ 25/Jun/12 ]

The first LBUG appears to be this: http://jira.whamcloud.com/browse/LU-1384

Comment by Cliff White (Inactive) [ 25/Jun/12 ]

It seems that you are running a patched version of 2.2 because Lustre-2.2.51-2.6.32_220.7.1.el6_lustre.g9c8f747.x86_64_gd2c1a39.x86_64 indicates a version of Lustre taken from gerrit rather than an official release. Which issue did this address? Are you comfortable with how to apply patches to your Lustre version so that you can also add the fix for LU-1384?

Comment by Nicola Bianchi [ 26/Jun/12 ]

Cliff,
my colleague told me that we are using the version Lustre-2.2.51-2.6.32_220.7.1.el6_lustre.g9c8f747.x86_64_gd2c1a39.x86_64 because this was the only one we were able to use on our Sandy Bridge OSSes. With other versions we experienced instability or were unable to boot the servers.

About the patches, I guess that with some procedure we should be able to get it done.
In any case, we will certainly have to wait at least for our next maintenance window.

Regards
Nicola

Comment by Nicola Bianchi [ 26/Jun/12 ]

Cliff,
regarding your suggestion for this parameter:

options ko2iblnd peer_credits=16 concurrent_sends=16

Currently we don't specify anything about that in our modprobe.conf; we use the configuration posted by Fabio Verzelloni on 12/Jun/12 3:44 AM.
The parameters posted on 22/Jun/12 10:09 AM, taken from an email from Fabio, are suggestions found in a 3rd-party manual... and we were wondering whether some of those options could improve our situation.

Regards
Nicola

Comment by Cliff White (Inactive) [ 26/Jun/12 ]

We really would like you to try: options ko2iblnd peer_credits=16 concurrent_sends=16
We think that may improve the situation.
The other options you have listed from the 3rd-party manual will not help, and setting the large values for peer_credits and concurrent_sends in that list will likely make things worse.

Please try our suggestion and report your results.
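
For what it's worth, a rough sketch of how that change could be applied on one node (file names and the reload sequence are assumptions; adapt them to your distribution, e.g. modprobe.conf.local on the Cray side):
-------------------------------------------------------------------------------------
# add the suggested setting (assumed file location)
echo 'options ko2iblnd peer_credits=16 concurrent_sends=16' >> /etc/modprobe.d/ko2iblnd.conf
# the options only take effect when ko2iblnd is reloaded, so on that node:
umount -a -t lustre      # stop the targets (server) or unmount the client
lustre_rmmod             # unload the lustre/lnet/ko2iblnd modules
modprobe lustre          # reload with the new options
# then remount the targets or the client filesystem
-------------------------------------------------------------------------------------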

Comment by Nicola Bianchi [ 26/Jun/12 ]

Cliff,

is it OK with you if we put the parameter you suggested into the configuration on the next maintenance day we have planned (4th July)?

Regards
Nicola

Comment by Nicola Bianchi [ 26/Jun/12 ]

Cliff,
right now we are experiencing performance issues on the filesystem.
Users are complaining about a drop in performance, and in fact I can see that a simple "ls /scratch/weisshorn" can take minutes to return.

In the logs there are a bunch of these messages:

---------------------------------------------------------------------------------------------------
Jun 26 13:50:59 weisshorn02 kernel: LustreError: 9825:0:(lov_obd.c:1068:lov_clear_orphans()) error in orphan recovery on OST idx 65/72: rc = -5
Jun 26 13:50:59 weisshorn02 kernel: LustreError: 9825:0:(mds_lov.c:884:__mds_lov_synchronize()) scratch-OST0041_UUID failed at mds_lov_clear_orphans: -5
Jun 26 13:50:59 weisshorn02 kernel: LustreError: 9825:0:(mds_lov.c:905:__mds_lov_synchronize()) scratch-OST0041_UUID sync failed -5, deactivating
Jun 26 13:51:01 weisshorn02 /usr/sbin/cerebrod[2756]: lmt_mysql: failed to connect to database
Jun 26 13:51:04 weisshorn11 kernel: Lustre: 5917:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1340711408/real 1340711408] req@ffff880807301400 x1405364552083621/t0(0) o250->MGC148.187.7.101@o2ib2@148.187.7.101@o2ib2:26/25 lens 368/512 e 0 to 1 dl 1340711464 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 26 13:51:04 weisshorn11 kernel: Lustre: 5917:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
Jun 26 13:51:07 weisshorn02 kernel: LustreError: 9835:0:(lov_obd.c:1068:lov_clear_orphans()) error in orphan recovery on OST idx 25/72: rc = -5
Jun 26 13:51:07 weisshorn02 kernel: LustreError: 9835:0:(mds_lov.c:884:__mds_lov_synchronize()) scratch-OST0019_UUID failed at mds_lov_clear_orphans: -5
Jun 26 13:51:07 weisshorn02 kernel: LustreError: 9835:0:(mds_lov.c:905:__mds_lov_synchronize()) scratch-OST0019_UUID sync failed -5, deactivating
Jun 26 13:51:13 weisshorn04 kernel: Lustre: 5655:0:(client.c:1762:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1340711417/real 1340711417] req@ffff880feec76000 x1405364532655945/t0(0) o250->MGC148.187.7.101@o2ib2@148.187.7.101@o2ib2:26/25 lens 368/512 e 0 to 1 dl 1340711473 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Jun 26 13:51:13 weisshorn04 kernel: Lustre: 5655:0:(client.c:1762:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Jun 26 13:51:16 weisshorn02 /usr/sbin/cerebrod[2756]: lmt_mysql: failed to connect to database
---------------------------------------------------------------------------------------------------

We don't see any issue with the network, on either the IB or the GNI side.

Machine load:
---------------------------------------------------------------------------------------------------
weisshorn05: 14:01:32 up 5 days, 5:41, 0 users, load average: 3.12, 3.61, 3.85
weisshorn14: 14:01:32 up 46 min, 0 users, load average: 0.02, 0.01, 0.00 <----- out of production
weisshorn03: 14:01:32 up 5 days, 5:41, 0 users, load average: 5.26, 4.41, 3.86
weisshorn07: 14:01:32 up 5 days, 5:41, 0 users, load average: 3.21, 3.41, 3.95
weisshorn04: 14:01:32 up 5 days, 5:41, 0 users, load average: 4.47, 5.05, 4.90
weisshorn06: 14:01:32 up 5 days, 5:41, 0 users, load average: 4.32, 4.04, 4.08
weisshorn08: 14:01:32 up 5 days, 5:40, 0 users, load average: 5.11, 4.73, 4.91
weisshorn01: 14:01:32 up 1 day, 1:53, 0 users, load average: 0.10, 0.03, 0.01 <---- out of production
weisshorn09: 14:01:32 up 5 days, 5:41, 0 users, load average: 3.83, 4.29, 4.68
weisshorn11: 14:01:32 up 5 days, 5:40, 0 users, load average: 3.29, 3.69, 3.73
weisshorn12: 14:01:32 up 5 days, 5:40, 0 users, load average: 3.37, 3.26, 3.06
weisshorn10: 14:01:32 up 5 days, 5:40, 0 users, load average: 4.26, 3.72, 3.36
weisshorn13: 14:01:32 up 5 days, 5:40, 0 users, load average: 9.32, 8.07, 7.24
weisshorn02: 14:01:32 up 5 days, 5:38, 5 users, load average: 0.26, 1.88, 2.58 <--- MDS
---------------------------------------------------------------------------------------------------

ltop screenshot:
---------------------------------------------------------------------------------------------------
Filesystem: scratch
Inodes: 185.812m total, 16.950m used ( 9%), 168.863m free
Space: 503.720t total, 296.873t used ( 59%), 206.847t free
Bytes/s: 0.323g read, 0.285g write, 2711 IOPS
MDops/s: 245 open, 262 close, 14 getattr, 6 setattr
0 link, 1 unlink, 0 mkdir, 0 rmdir
0 statfs, 3 rename, 0 getxattr
>OST S OSS Exp CR rMB/s wMB/s IOPS LOCKS LGR LCR %cpu %mem %spc
(12) F eisshorn13 1551 0 56 41 525 1156010 9 363 1 100 61
(6) F eisshorn03 1551 0 28 28 127 560595 7 2 0 100 59
(6) F eisshorn04 1551 0 37 36 110 614243 0 0 0 100 56
(6) F eisshorn05 1551 0 37 25 182 621734 0 0 0 100 61
(6) F eisshorn06 1551 0 22 21 374 597840 0 0 1 100 57
(6) F eisshorn07 1551 0 22 16 234 653836 2 0 0 100 60
(6) F eisshorn08 1551 0 24 31 103 713393 3 117 0 100 53
(6) F eisshorn09 1551 0 26 25 344 580579 2 98 1 100 55
(6) F eisshorn10 1551 0 28 27 445 595866 1 164 1 100 59
(6) F eisshorn11 1551 0 24 19 169 565911 4 4 0 100 62
(6) F eisshorn12 1551 0 27 24 98 696995 2 152 0 100 62
---------------------------------------------------------------------------------------------------

cat /proc/sys/lnet/peers
---------------------------------------------------------------------------------------------------
148.187.7.71@o2ib2 3 up 9999 8 8 8 8 -611 0
148.187.7.72@o2ib2 3 up 9999 8 8 8 8 -592 0
148.187.7.73@o2ib2 3 up 9999 8 8 8 8 -633 0
148.187.7.74@o2ib2 3 up 9999 8 8 8 8 -608 0
148.187.7.75@o2ib2 3 up 9999 8 8 8 8 -617 0
148.187.7.76@o2ib2 3 up 9999 8 8 8 8 -642 0
148.187.7.77@o2ib2 3 up 9999 8 8 8 8 -618 0
148.187.7.78@o2ib2 3 up 9999 8 8 8 8 -626 0
148.187.7.79@o2ib2 3 up 9999 8 8 8 8 -636 0
148.187.7.80@o2ib2 3 up 9999 8 8 8 8 -630 0
148.187.7.81@o2ib2 3 up 9999 8 8 8 8 -636 0
148.187.7.82@o2ib2 3 up 9999 8 8 8 8 -635 0
---------------------------------------------------------------------------------------------------

Regards
Nicola

Comment by Cliff White (Inactive) [ 26/Jun/12 ]

You are reporting issues every day - I would NOT wait for the maintenance window to change the configuration. You could do a rolling failover and set this up without halting the filesystem, simply failing over one node at a time.
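
For illustration, the rolling approach on the server side could look roughly like this (a sketch only; it assumes each OSS has a configured failover partner and uses placeholder device/mount names):
-------------------------------------------------------------------------------------
# 1. on ossA: unmount its OSTs so they can be taken over by the partner
umount /mnt/ostNN                          # repeat for each OST on ossA
# 2. on ossB (failover partner): mount ossA's OSTs so service continues
mount -t lustre /dev/<ostNN_device> /mnt/ostNN
# 3. on ossA: update the module options, reload, and take the OSTs back
lustre_rmmod && modprobe lustre
mount -t lustre /dev/<ostNN_device> /mnt/ostNN
# 4. repeat for the next OSS pair
-------------------------------------------------------------------------------------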

Comment by Cliff White (Inactive) [ 26/Jun/12 ]

The current issue may be related to LU-1247; I am consulting engineering. I think we need to point you to a newer release, and you should plan on a system upgrade on the 4th.

Comment by Nicola Bianchi [ 26/Jun/12 ]

Cliff,
do I have to put this line only on the servers? (OSS and MDS)

options ko2iblnd peer_credits=16 concurrent_sends=16

Regards
Nicola

Comment by Cliff White (Inactive) [ 26/Jun/12 ]

I am sorry, I was in error. The "options ko2iblnd peer_credits=16 concurrent_sends=16" needs to be set on all nodes, client and server. You would have to stop the clients for a moment to do this; the rolling failover idea could be used for the servers, but it is likely better to do them all at once.

Comment by Nicola Bianchi [ 26/Jun/12 ]

Cliff,
is it a real problem if I start by putting that on the servers only?

Nicola

Comment by Oleg Drokin [ 27/Jun/12 ]

The second crash is LU-1259; the patch is at http://review.whamcloud.com/2390
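
For reference, one way to pull that change into a local build tree (a sketch; the repository URL, project path and patchset number are assumptions, so check the review page for the patchset that matches your branch):
-------------------------------------------------------------------------------------
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/90/2390/1
git cherry-pick FETCH_HEAD
-------------------------------------------------------------------------------------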

Comment by Cliff White (Inactive) [ 28/Jun/12 ]

This came via email:
Cliff,
thanks a lot for the information.

So next Wednesday Fabio, who is in copy, will proceed with the downgrade to the 2.1.2 version. From this afternoon I will be away for 7 weeks, so he will take over this task ...

Fabio will certainly come back to you about the details and the schedule for this intervention.

------

Comment by Cliff White (Inactive) [ 29/Jun/12 ]

Okay, if you need help on the 4th, please email
dutymanager@whamcloud.com in the event of problems during the upgrade.
It is a US holiday, so some of us will not be available.

Comment by Cliff White (Inactive) [ 29/Jun/12 ]

And of course, the bits you need are available here: http://downloads.whamcloud.com/public/lustre/latest-maintenance-release/

Comment by Fabio Verzelloni [ 02/Jul/12 ]

I just tried on a test machine to recreate the situation that will happen on Wednesday, basically a downgrade from 2.2 --> 2.1.2, and I got the following error message:

console log

Jul 02 16:08 [root@wn47:~]# mount -t lustre /dev/vg_root/mds /mnt/lustre
mount.lustre: mount /dev/mapper/vg_root-mds at /mnt/lustre failed: Invalid argument
This may have multiple causes.
Are the mount options correct?
Check the syslog for more info.

messages

Jul 2 16:10:17 wn47 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Jul 2 16:10:17 wn47 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts:
Jul 2 16:10:17 wn47 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_fill_super: extents feature not enabled on this filesystem, use tune2fs.
Jul 2 16:10:17 wn47 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts:
Jul 2 16:10:17 wn47 kernel: Lustre: MGS MGS started
Jul 2 16:10:17 wn47 kernel: Lustre: 8536:0:(ldlm_lib.c:933:target_handle_connect()) MGS: connection from bd7d7986-674e-84d2-d489-d6f9028b0d52@0@lo t0 exp (null) cur 1341238217 last 0
Jul 2 16:10:17 wn47 kernel: Lustre: MGC10.10.65.47@tcp: Reactivating import
Jul 2 16:10:17 wn47 kernel: Lustre: Enabling ACL
Jul 2 16:10:17 wn47 kernel: Lustre: lustre-MDT0000: used disk, loading
Jul 2 16:10:17 wn47 kernel: LustreError: 8540:0:(mdt_recovery.c:409:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 200
Jul 2 16:10:17 wn47 kernel: LustreError: 8540:0:(obd_config.c:522:class_setup()) setup lustre-MDT0000 failed (-22)
Jul 2 16:10:17 wn47 kernel: LustreError: 8540:0:(obd_config.c:1363:class_config_llog_handler()) Err -22 on cfg command:
Jul 2 16:10:17 wn47 kernel: Lustre: cmd=cf003 0:lustre-MDT0000 1:lustre-MDT0000_UUID 2:0 3:lustre-MDT0000-mdtlov 4:f
Jul 2 16:10:17 wn47 kernel: LustreError: 15b-f: MGC10.10.65.47@tcp: The configuration from log 'lustre-MDT0000'failed from the MGS (-22). Make sure this client and the MGS are running compatible versions of Lustre.
Jul 2 16:10:17 wn47 kernel: LustreError: 15c-8: MGC10.10.65.47@tcp: The configuration from log 'lustre-MDT0000' failed (-22). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Jul 2 16:10:17 wn47 kernel: LustreError: 8458:0:(obd_mount.c:1192:server_start_targets()) failed to start server lustre-MDT0000: -22
Jul 2 16:10:17 wn47 kernel: LustreError: 8458:0:(obd_mount.c:1738:server_fill_super()) Unable to start targets: -22
Jul 2 16:10:17 wn47 kernel: LustreError: 8458:0:(obd_config.c:567:class_cleanup()) Device 3 not setup
Jul 2 16:10:17 wn47 kernel: LustreError: 8458:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jul 2 16:10:17 wn47 kernel: Lustre: MGS has stopped.
Jul 2 16:10:17 wn47 kernel: LustreError: 8458:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jul 2 16:10:23 wn47 kernel: Lustre: 8458:0:(client.c:1780:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1341238217/real 1341238217] req@ffff88041bdcac00 x1406387715309722/t0(0) o251->MGC10.10.65.47@tcp@0@lo:26/25 lens 192/192 e 0 to 1 dl 1341238223 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Jul 2 16:10:23 wn47 kernel: Lustre: server umount lustre-MDT0000 complete
Jul 2 16:10:23 wn47 kernel: LustreError: 8458:0:(obd_mount.c:2198:lustre_fill_super()) Unable to mount (-22)
Jul 2 16:10:23 wn47 kernel: LustreError: 8458:0:(obd_mount.c:2198:lustre_fill_super()) Skipped 1 previous similar message

And attached you can find the Lustre dk (debug log).

Fabio

Comment by Cliff White (Inactive) [ 02/Jul/12 ]

Please tell us exactly what you did, list all steps if possible.

  • Did you download the 2.1.2 release from our site?
  • Did you install the latest e2fsprogs, also from our site?
Comment by Johann Lombardi (Inactive) [ 03/Jul/12 ]

Hi Fabio,

As far as upgrade/downgrade is concerned, we only support the downgrade of a filesystem that was formatted with an "old" version of Lustre. I mean that you can format a filesystem with 2.x, upgrade it to 2.x+1 and then downgrade to 2.x. However, if you format with 2.x+1, we might enable new features (e.g. flex_bg is the first example I can think of) which won't be supported by prior releases. This has been a general rule with Lustre releases for a while.

In the present case, you can't mount the filesystem because of this message:

Jul 2 16:10:17 wn47 kernel: LustreError: 8540:0:(mdt_recovery.c:409:mdt_server_data_init()) lustre-MDT0000: unsupported incompat filesystem feature(s) 200

The incompat feature is actually multiple object index support which was added in 2.2 (see LU-822).
Because of this new feature, which is enabled by default for any new filesystem formatted with 2.2, you can't downgrade to 2.1.

HTH
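
In other words (a minimal sketch of the rule, reusing the device name from your test above): if a later downgrade to 2.1.x has to stay possible, the targets need to be formatted while running the 2.1.x tools, then upgraded, so that no 2.2-only on-disk features such as the multiple OI tables of LU-822 are ever enabled:
-------------------------------------------------------------------------------------
# with 2.1.x installed on the test machine:
mkfs.lustre --fsname=lustre --mgs --mdt /dev/vg_root/mds
# ... upgrade the packages to 2.2.x, mount and use the filesystem ...
# downgrading the packages back to 2.1.x is then expected to work, because the
# on-disk format was created by 2.1.x
-------------------------------------------------------------------------------------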

Comment by Colin McMurtrie [ 05/Jul/12 ]

Johann,

Unfortunately this latest piece of information (namely the fact that the filesystem needs to be reformatted because of incompatibilities between it and v2.1.2) came too late for us to adequately inform our user community. The filesystem contained nearly 400TB of user data and we must allow at least 1 week's advance notice so that users can move their crucial data to a more permanent filesystem. Consequently we will not be able to perform this downgrade until mid-July at the earliest. In fact we would prefer to wait until our next scheduled maintenance on Wednesday 8 August. This is a regrettable situation.

On the plus side we have unmounted the second Lustre filesystem from the 1500-node XE6 (which was an additional complication) and this seems to have made the Lustre logs less "chatty" (i.e. not logging as many problems). The XE6 is however not yet completely full of jobs, so we will keep a close eye on the situation over the coming days. Furthermore, one of the OSSes needed a motherboard swap last week, so it may have had a hardware fault that was also contributing to destabilising the filesystem.

Finally FYI we have ordered 4 additional OSSes of identical spec to the ones we have in production. When they arrive (likely later in July) we will build a preproduction test environment (we have some unused DS4800 storage to use at the backend).

Regards,

Colin

Comment by Cliff White (Inactive) [ 05/Jul/12 ]

Thanks, please let us know what we can do to assist.

Comment by Fabio Verzelloni [ 06/Jul/12 ]

I have two questions regarding the issue of compute nodes reconnecting between Lustre & Cray (gni interconnect); as shown in my latest attached file, we had this issue yesterday. The first question is: does Lustre 2.1.2 run the same kind of code for connecting to gni as Lustre 2.2?

If the gni code is the same, would downgrading really help, or would we experience the same issue?

The second question is: could some tuning on the router/client node side, maybe increasing the timeout in the modprobe options, help?

Right now we have timeout=250 on the router nodes, but there is no such setting on the client side.

Regards
Fabio

Comment by Cliff White (Inactive) [ 09/Jul/12 ]

The gnilnd code is supplied by Cray - have you consulted them on this issue? At the present time we are unable to access the gnilnd source, so it is a bit difficult for us to say anything on this issue. We believe you would be better off in general running our current maintenance release (2.1.2), as you are currently hitting two known bugs that are not present in that release. Given your current situation, 2.1.2 will in general be more stable. If you wish to continue with the feature release (2.2), you will need to patch your version of Lustre with the fixes for the bugs you have hit.

Comment by Cory Spitz [ 09/Jul/12 ]

Cray has been planning on pushing gnilnd upstream. We've opened LU-1419 to track that effort.

For this ticket though, I think that it would help to be clear about what components and Lustre version you are discussing. It's my guess that this customer is still running 1.8.x clients on their Cray mainframe, and not version 2.2 or 2.1.2. I believe that in the course of this ticket, when 2.x is referenced, it refers to the server version only.

BTW, the gnilnd has a feature to "quiesce" when the Cray high speed network is under a re-route for HW warmswap or link failure. If the Cray has gone quiescent, then you won't hear from its clients until the quiesce event is complete. During that time, messages to the Cray clients will probably even back up on the LNET routers such that all router buffers are consumed.
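
As a side note, a quick way to see whether the router buffers were in fact exhausted during such an event (illustrative only; it assumes the standard LNET proc entries on the router nodes) is to look at the 'min' columns, which record the low-water marks:
-------------------------------------------------------------------------------------
cat /proc/sys/lnet/buffers   # tiny/small/large router buffer pools; negative 'min' = pool ran dry
cat /proc/sys/lnet/nis       # per-interface credits, also with a 'min' low-water mark
-------------------------------------------------------------------------------------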

Comment by Fabio Verzelloni [ 12/Jul/12 ]

I've just uploaded 2 new log files related to a problem that occurred yesterday: a lot of clients from our XE6 system lost connectivity with Lustre.

I hope the logs help; log_11_jul is the one from the Lustre servers (weisshorn07), and drop_conn is the log from the Cray XE6, from 19:00 onwards, related to the error with weisshorn07.

Comment by Fabio Verzelloni [ 16/Jul/12 ]

Tomorrow we are going to apply some tuning on the server, router & client side; the following are the changes we are going to apply:

    Router:

options ko2iblnd timeout=100 peer_timeout=130
options ko2iblnd credits=2048 ntx=2048
options ko2iblnd peer_credits=126 concurrent_sends=63 peer_buffer_credits=128
options kgnilnd credits=2048 peer_health=1

options lnet check_routers_before_use=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
options lnet router_ping_timeout=50
options lnet large_router_buffers=1024 small_router_buffers=16384

    Server:

options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
options ko2iblnd credits=2048 ntx=2048
options ko2iblnd peer_credits=126 concurrent_sends=63

options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
options lnet check_routers_before_use=1

    Client:

LOGIN - COMPUTE

No need for ko2iblnd tuning on the ROSA login nodes because there are no IB cards.
#options ko2iblnd timeout=100 peer_timeout=0 keepalive=30
#options ko2iblnd credits=2048 ntx=2048
#options ko2iblnd peer_credits=126 concurrent_sends=63

options lnet avoid_asym_router_failure=1
options lnet dead_router_check_interval=60
options lnet live_router_check_interval=60
options lnet check_routers_before_use=1

    Tuning:

MDS

echo "100" > /proc/fs/lustre/lov/scratch-MDT0000-mdtlov/qos_threshold_rr
echo "256" > /proc/fs/lustre/mdt/scratch-MDT0000/mdt/threads_max
lctl set_param lov.*.stripecount=4
echo "240" > /proc/fs/lustre/mdt/scratch-MDT0000/identity_acquire_expire

OSS

pdsh -a -x weisshorn[01-02] "lctl set_param timeout=300"
pdsh -a -x weisshorn[01-02] "lctl set_param ldlm_timeout=230"
pdsh -a -x weisshorn[01-02] "lctl set_param at_min=230"

Can you please confirm that it is enough to tune qos_threshold only on the MDS, with no need for changes on the client side?
I noticed that the value on the clients is slightly different from the one on the server:

fverzell@rosa5:~> cat /proc/fs/lustre/lov/scratch-clilov-ffff8803fc233000/qos_threshold_rr
16%

[root@weisshorn01 ~]# cat /proc/fs/lustre/lov/scratch-MDT0000-mdtlov/qos_threshold_rr
17%

Given that the FS seems to be unbalanced, is lfs_migrate necessary to 'fix' the situation?

fverzell@rosa5:~> lfs df
UUID 1K-blocks Used Available Use% Mounted on
scratch-MDT0000_UUID 292202572 118544144 154174788 43% /scratch/weisshorn[MDT:0]
scratch-OST0000_UUID 7512024488 4689026804 2447151520 66% /scratch/weisshorn[OST:0]
scratch-OST0001_UUID 7512024488 4745725856 2390486948 67% /scratch/weisshorn[OST:1]
scratch-OST0002_UUID 7512024488 5075884136 2060329692 71% /scratch/weisshorn[OST:2]
scratch-OST0003_UUID 7512024488 4897558604 2238653176 69% /scratch/weisshorn[OST:3]
scratch-OST0004_UUID 7512024488 4454906816 2681307012 62% /scratch/weisshorn[OST:4]
scratch-OST0005_UUID 7512024488 4587438680 2548773100 64% /scratch/weisshorn[OST:5]
scratch-OST0006_UUID 7512024488 3992329488 3143885212 56% /scratch/weisshorn[OST:6]
scratch-OST0007_UUID 7512024488 4661958960 2474223000 65% /scratch/weisshorn[OST:7]
scratch-OST0008_UUID 7512024488 4167201580 2969012160 58% /scratch/weisshorn[OST:8]
scratch-OST0009_UUID 7512024488 4819927884 2316257380 68% /scratch/weisshorn[OST:9]
scratch-OST000a_UUID 7512024488 4754286656 2381901152 67% /scratch/weisshorn[OST:10]
scratch-OST000b_UUID 7512024488 4640072608 2496099572 65% /scratch/weisshorn[OST:11]
scratch-OST000c_UUID 7512024488 4742974012 2393214728 66% /scratch/weisshorn[OST:12]
scratch-OST000d_UUID 7512024488 2549614552 4586599524 36% /scratch/weisshorn[OST:13]
scratch-OST000e_UUID 7512024488 4827107008 2309104772 68% /scratch/weisshorn[OST:14]
scratch-OST000f_UUID 7512024488 4624081852 2512129928 65% /scratch/weisshorn[OST:15]
scratch-OST0010_UUID 7512024488 4579714656 2556475388 64% /scratch/weisshorn[OST:16]
scratch-OST0011_UUID 7512024488 4712685460 2423526320 66% /scratch/weisshorn[OST:17]
scratch-OST0012_UUID 7512024488 4467238040 2668940296 63% /scratch/weisshorn[OST:18]
scratch-OST0013_UUID 7512024488 3964966088 3171248460 56% /scratch/weisshorn[OST:19]
scratch-OST0014_UUID 7512024488 4467610568 2668572328 63% /scratch/weisshorn[OST:20]
scratch-OST0015_UUID 7512024488 4087648640 3048563952 57% /scratch/weisshorn[OST:21]
scratch-OST0016_UUID 7512024488 2768510532 4367704256 39% /scratch/weisshorn[OST:22]
scratch-OST0017_UUID 7512024488 4762759028 2373420232 67% /scratch/weisshorn[OST:23]
scratch-OST0018_UUID 7512024488 4818779532 2317404560 68% /scratch/weisshorn[OST:24]
scratch-OST0019_UUID 7512024488 2286805384 4849409388 32% /scratch/weisshorn[OST:25]
scratch-OST001a_UUID 7512024488 4654062568 2482120052 65% /scratch/weisshorn[OST:26]
scratch-OST001b_UUID 7512024488 4716531932 2419679848 66% /scratch/weisshorn[OST:27]
scratch-OST001c_UUID 7512024488 4829306016 2306855832 68% /scratch/weisshorn[OST:28]
scratch-OST001d_UUID 7512024488 4724954000 2411258804 66% /scratch/weisshorn[OST:29]
scratch-OST001e_UUID 7512024488 4734034796 2402147544 66% /scratch/weisshorn[OST:30]
scratch-OST001f_UUID 7512024488 4764160748 2372052056 67% /scratch/weisshorn[OST:31]
scratch-OST0020_UUID 7512024488 4749913620 2386283024 67% /scratch/weisshorn[OST:32]
scratch-OST0021_UUID 7512024488 4051468944 3084745524 57% /scratch/weisshorn[OST:33]
scratch-OST0022_UUID 7512024488 2519479924 4616734864 35% /scratch/weisshorn[OST:34]
scratch-OST0023_UUID 7512024488 5047050388 2089132448 71% /scratch/weisshorn[OST:35]
scratch-OST0024_UUID 7512024488 4777844024 2358343904 67% /scratch/weisshorn[OST:36]
scratch-OST0025_UUID 7512024488 2515885904 4620324508 35% /scratch/weisshorn[OST:37]
scratch-OST0026_UUID 7512024488 4612974860 2523237944 65% /scratch/weisshorn[OST:38]
scratch-OST0027_UUID 7512024488 4766933908 2369277872 67% /scratch/weisshorn[OST:39]
scratch-OST0028_UUID 7512024488 4732878372 2403333408 66% /scratch/weisshorn[OST:40]
scratch-OST0029_UUID 7512024488 4805835488 2330355492 67% /scratch/weisshorn[OST:41]
scratch-OST002a_UUID 7512024488 4623393332 2512795728 65% /scratch/weisshorn[OST:42]
scratch-OST002b_UUID 7512024488 4817231168 2318980612 68% /scratch/weisshorn[OST:43]
scratch-OST002c_UUID 7512024488 2429377284 4706837500 34% /scratch/weisshorn[OST:44]
scratch-OST002d_UUID 7512024488 4825606516 2310560384 68% /scratch/weisshorn[OST:45]
scratch-OST002e_UUID 7512024488 2666838228 4469376560 37% /scratch/weisshorn[OST:46]
scratch-OST002f_UUID 7512024488 3925127408 3211050572 55% /scratch/weisshorn[OST:47]
scratch-OST0030_UUID 7512024488 4727638656 2408540452 66% /scratch/weisshorn[OST:48]
scratch-OST0031_UUID 7512024488 2693940128 4442271700 38% /scratch/weisshorn[OST:49]
scratch-OST0032_UUID 7512024488 4764748760 2371463020 67% /scratch/weisshorn[OST:50]
scratch-OST0033_UUID 7512024488 4864564316 2271648488 68% /scratch/weisshorn[OST:51]
scratch-OST0034_UUID 7512024488 4625740760 2510443620 65% /scratch/weisshorn[OST:52]
scratch-OST0035_UUID 7512024488 4693269256 2442910852 66% /scratch/weisshorn[OST:53]
scratch-OST0036_UUID 7512024488 4544146008 2592030580 64% /scratch/weisshorn[OST:54]
scratch-OST0037_UUID 7512024488 4860051744 2276133072 68% /scratch/weisshorn[OST:55]
scratch-OST0038_UUID 7512024488 2854611264 4281603524 40% /scratch/weisshorn[OST:56]
scratch-OST0039_UUID 7512024488 4750769272 2385415720 67% /scratch/weisshorn[OST:57]
scratch-OST003a_UUID 7512024488 2387156708 4749058080 33% /scratch/weisshorn[OST:58]
scratch-OST003b_UUID 7512024488 4971879748 2164295048 70% /scratch/weisshorn[OST:59]
scratch-OST003c_UUID 7512024488 4768532532 2367679248 67% /scratch/weisshorn[OST:60]
scratch-OST003d_UUID 7512024488 2486278472 4649936292 35% /scratch/weisshorn[OST:61]
scratch-OST003e_UUID 7512024488 4799709104 2336473268 67% /scratch/weisshorn[OST:62]
scratch-OST003f_UUID 7512024488 4804003128 2332208652 67% /scratch/weisshorn[OST:63]
scratch-OST0040_UUID 7512024488 4737174800 2399036980 66% /scratch/weisshorn[OST:64]
scratch-OST0041_UUID 7512024488 4570719888 2565491892 64% /scratch/weisshorn[OST:65]
scratch-OST0042_UUID 7512024488 4680780720 2455398272 66% /scratch/weisshorn[OST:66]
scratch-OST0043_UUID 7512024488 4876523204 2259658024 68% /scratch/weisshorn[OST:67]
scratch-OST0044_UUID 7512024488 4016775868 3119437956 56% /scratch/weisshorn[OST:68]
scratch-OST0045_UUID 7512024488 4909041500 2227172328 69% /scratch/weisshorn[OST:69]
scratch-OST0046_UUID 7512024488 4894087388 2242097828 69% /scratch/weisshorn[OST:70]
scratch-OST0047_UUID 7512024488 4750555368 2385615184 67% /scratch/weisshorn[OST:71]

Thanks
Fabio

Comment by Fabio Verzelloni [ 16/Jul/12 ]

This is the second set of tuning options to apply:

    Router nodes:

options mlx4_core msi_x=1

options kgnilnd credits=4000 timeout=120

options qla2xxx ql2xlogintimeout=0
options ost oss_num_threads=256
options libcfs libcfs_panic_on_lbug=1

options ko2iblnd credits=1024 ntx=2048

options lnet small_router_buffers=16384
############################################

    Compute nodes:

options lnet check_routers_before_use=1
options lnet dead_router_check_interval=150
options lnet live_router_check_interval=150
options lnet router_ping_timeout=130
options kgnilnd credits=1024 timeout=120
############################################

    Servers:

options mlx4_core msi_x=1
options lnet networks=o2ib(ib0)

options lnet check_routers_before_use=1
options lnet router_ping_timeout=130
options lnet dead_router_check_interval=150
options lnet live_router_check_interval=150

options ko2iblnd credits=1024 ntx=2048 peer_credits=8

options libcfs libcfs_panic_on_lbug=1
############################################

Are there any pros & cons, at first glance, between the first and the second set of tuning options we want to try tomorrow?

Fabio

Comment by Cliff White (Inactive) [ 17/Jul/12 ]

The filesystem does not seem especially unbalanced; lfs_migrate is used when adding new storage. What exactly is your concern? Normal usage should balance the space out, depending on your mix of workloads. It looks like your average OST is 66% full, with a few at ~30%. That should balance out in normal usage.
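
If a rebalance were ever really wanted (for example if one OST approached full), an illustrative use of lfs find plus lfs_migrate, with placeholder paths and thresholds, would be:
-------------------------------------------------------------------------------------
# rewrite large files currently striped on the fullest OST so their objects are
# re-allocated according to current free space (run while the files are not in use)
lfs find /scratch/weisshorn -obd scratch-OST0002_UUID -size +1G -type f | lfs_migrate -y
-------------------------------------------------------------------------------------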

Comment by Cliff White (Inactive) [ 17/Jul/12 ]

And yes, QOS parameters are set on the MDS, as the MDS controls that allocation.
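
For completeness, a minimal sketch of checking/adjusting it on the MDS (using the parameter path that matches the proc entry you quoted):
-------------------------------------------------------------------------------------
lctl get_param lov.scratch-MDT0000-mdtlov.qos_threshold_rr
# e.g. 100 forces pure round-robin allocation regardless of space imbalance:
lctl set_param lov.scratch-MDT0000-mdtlov.qos_threshold_rr=100
-------------------------------------------------------------------------------------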

Comment by Cliff White (Inactive) [ 04/Sep/12 ]

Is there anything more we can do on this issue, or is it okay to close?
