  Lustre / LU-7054

ib_cm scaling issue when Lustre clients connect to OSS

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.8.0
    • Affects Version/s: Lustre 2.5.3
    • Environment: OFED3.5, MOFED241, and MOFED3.5
    • Severity: 3

    Description

      When a large number of Lustre clients (>3000) try connecting to an OSS/MDS at the same time, the ib_cm threads on the OSS/MDS are unable to service the incoming connections in time. Using ibdump we have seen server replies taking 30 seconds; by that time the clients have timed out the request and are retrying, which results in even more work for ib_cm.

      ib_cm is never able to catch up and usually requires a reboot of the server. Sometimes we have been able to recover by bringing the IB interface down, to give ib_cm time to catch up, and then bringing it back up.

      Most of the threads will be in 'D' state; here is an example stack trace:

      0xffff88062f3c0aa0     1655        2  0    1   D  0xffff88062f3c1140  ib_cm/1
      sp                ip                Function (args)
      0xffff880627237a90 0xffffffff81559b50 thread_return
      0xffff880627237b58 0xffffffff8155b30e __mutex_lock_slowpath+0x13e (0xffff88062f76d260)
      0xffff880627237bc8 0xffffffff8155b1ab mutex_lock+0x2b (0xffff88062f76d260)
      0xffff880627237be8 0xffffffffa043f23e [rdma_cm]cma_disable_callback+0x2e (0xffff88062f76d000, unknown)
      0xffff880627237c18 0xffffffffa044440f [rdma_cm]cma_req_handler+0x8f (0xffff880365eec200, 0xffff880494844698)
      0xffff880627237d28 0xffffffffa0393e37 [ib_cm]cm_process_work+0x27 (0xffff880365eec200, 0xffff880494844600)
      0xffff880627237d78 0xffffffffa0394aaa [ib_cm]cm_req_handler+0x6ba (0xffff880494844600)
      0xffff880627237de8 0xffffffffa0395735 [ib_cm]cm_work_handler+0x145 (0xffff880494844600)
      0xffff880627237e38 0xffffffff81093f30 worker_thread+0x170 (0xffffe8ffffc431c0)
      0xffff880627237ee8 0xffffffff8109a106 kthread+0x96 (0xffff880627ae5da8)
      0xffff880627237f48 0xffffffff8100c20a child_rip+0xa (unknown, unknown)
      

      Using systemtap I was able to get a trace of ib_cm; it shows that a great deal of time is spent in spin_lock_irq. See the attached file.

      Attachments

        1. load.pdf
          85 kB
        2. lustre-log.1445147654.68807.gz
          0.2 kB
        3. lustre-log.1445147717.68744.gz
          0.2 kB
        4. lustre-log.1445147754.68673.gz
          0.2 kB
        5. nbp8-os11.var.log.messages.oct.17.gz
          27 kB
        6. opensfs-HLDForSMPnodeaffinity-060415-1623-4.pdf
          564 kB
        7. read.pdf
          92 kB
        8. service104.+net+malloc.gz
          0.2 kB
        9. service115.+net.gz
          1.04 MB
        10. trace.ib_cm_1rack.out.gz
          759 kB
        11. write.pdf
          91 kB

        Issue Links

          Activity

            [LU-7054] ib_cm scaling issue when Lustre clients connect to OSS

            Uploaded 2 new debug logs and /var/log/messages to ftp:/uploads/LU7054/
            messages.gz
            sec.20151029.21.03.25.gz
            sec.20151029.21.05.55.gz

            At Oct 29 21:03:25 we got this in /var/log/messages

            Oct 29 21:03:25 nbp8-oss14 kernel: [1828390.200665] LustreError: 68632:0:(ldlm_lib.c:2715:target_bulk_io()) @@@ timeout on bulk PUT after 150+0s  req@ffff8812e7162c00 x1516263417841768/t0(0) o3->d07e5b2b-77f6-1c68-f1fd-e6e4f8f614d7@10.151.57.146@o2ib:0/0 lens 4568/432 e 1 to 0 dl 1446177826 ref 1 fl Interpret:/0/0 rc 0/0
            

            We dumped debug logs to sec.20151029.21.03.25.gz; it shows all the clients connecting.

            00000800:00000200:0.2:1446177579.541489:0:0:0:(o2iblnd_cb.c:3306:kiblnd_cq_completion()) conn[ffff8804e04ad400] (20)++
            ....
            00000800:00000200:1.0:1446177803.579278:0:11134:0:(o2iblnd_cb.c:993:kiblnd_check_sends()) conn[ffff880da89faa00] (31)--
            

            I don't understand why the debug logs don't show any other activity before this other than

            00000400 00000001 18.1 Thu Oct 29 20:59:39 PDT 2015 0 0 0 (watchdog.c 123 lcw_cb()) Process entered
            00000400 00000001 18.1 Thu Oct 29 20:59:39 PDT 2015 0 0 0 (watchdog.c 126 lcw_cb()) Process leaving
            

            Hoping you guys can make sense of it.

            mhanafi Mahmoud Hanafi added a comment

            Uploaded lustre debug dump to ftpsite:/uploads/LU7054/nbp9-oss16.ldebug.gz

            It shows a lock callback timer eviction at

            Oct 19 22:20:39 nbp9-oss16 kernel: LustreError: 11651:0:(ldlm_lockd.c:346:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.6.134@o2ib  ns: filter-nbp9-OST003f_UUID lock: ffff8800810877c0/0x15ed8b9f78013cb2 lrc: 3/0,0 mode: PW/PW res: [0x85f72a:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000000020 nid: 10.151.6.134@o2ib remote: 0x810eb6843fa5e1d1 expref: 9 pid: 11647 timeout: 5261835188 lvb_type: 0
            

            From the debug dump we see that 151 seconds earlier (22:18:08) the server was running in

            o2iblnd_cb.c 3306 kiblnd_cq_completion()) conn[ffff880f0c5d7000] (20)++
            

            for a long time.

            Is it possible that the callback request was never sent?

            mhanafi Mahmoud Hanafi added a comment

            Yesterday we had an OSS that lost its connection to the cluster; we couldn't find any IB-related issue. In the logs there are a few times where ldlm_cn threads dump call traces, for example:

            Oct 17 12:38:53 nbp8-oss11 kernel: LustreError: 58897:0:(ldlm_lockd.c:435:ldlm_add_waiting_lock()) ### not waiting on destroyed lock (bug 5653) ns: filter-nbp8-OST00a6_UUID lock: ffff8807b3a6cb40/0xec5cb59ce7507b6a lrc: 2/0,0 mode: --/PW res: [0x1bbc9af:0x0:0x0].0 rrc: 5 type: EXT [311853056->18446744073709551615] (req 311853056->18446744073709551615) flags: 0x74801000000020 nid: 10.151.13.98@o2ib remote: 0x1ee2c1eb7b3a5e70 expref: 5 pid: 17316 timeout: 5055271427 lvb_type: 0
            Oct 17 12:38:53 nbp8-oss11 kernel: Pid: 58897, comm: ldlm_cn00_026
            Oct 17 12:38:53 nbp8-oss11 kernel:
            Oct 17 12:38:53 nbp8-oss11 kernel: Call Trace:
            Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa054f895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa080986b>] ldlm_add_waiting_lock+0x1db/0x310 [ptlrpc]
            Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa080b068>] ldlm_server_completion_ast+0x598/0x770 [ptlrpc]
            Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa080aad0>] ? ldlm_server_completion_ast+0x0/0x770 [ptlrpc]
            Oct 17 12:38:53 nbp8-oss11 kernel: [<ffffffffa07de15c>] ldlm_work_cp_ast_lock+0xcc/0x200 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa081fb1c>] ptlrpc_set_wait+0x6c/0x860 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa081b28a>] ? ptlrpc_prep_set+0xfa/0x2f0 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa07de090>] ? ldlm_work_cp_ast_lock+0x0/0x200 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa07e10ab>] ldlm_run_ast_work+0x1bb/0x470 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa07e1475>] ldlm_reprocess_all+0x115/0x300 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa0802ff7>] ldlm_request_cancel+0x277/0x410 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa08032cd>] ldlm_handle_cancel+0x13d/0x240 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa08091c9>] ldlm_cancel_handler+0x1e9/0x500 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa08390c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa05618d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa0831a69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa083b89d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffffa083ada0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
            Oct 17 12:38:54 nbp8-oss11 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
            Oct 17 12:38:54 nbp8-oss11 kernel:
            Oct 17 12:57:00 nbp8-oss11 kernel: LNet: 3458:0:(o2iblnd_cb.c:1895:kiblnd_close_conn_locked()) Closing conn to 10.151.15.171@o2ib: error -116(waiting)
            

            and again at 22:22:37, and there are call trace dumps for ll_ost_io at 22:54:24.

            The major outage occurred at Oct 17 22:52:32.

            See the attached nbp8-os11.var.log.messages.oct.17.gz
            and 3 Lustre debug dumps:
            lustre-log.1445147654.68807.gz
            lustre-log.1445147717.68744.gz
            lustre-log.1445147754.68673.gz

            I have debug dumps for some of the clients, but we didn't have net debugging enabled on them.

            mhanafi Mahmoud Hanafi added a comment

            I was able to get some timing info for kiblnd_create_tx_pool. It can take on average 12-13 seconds to complete the call. Most of the time is spent inside the for loop, in:

                     LIBCFS_CPT_ALLOC(tx->tx_wrq, lnet_cpt_table(), ps->ps_cpt,
                             (1 + IBLND_MAX_RDMA_FRAGS) *
                             sizeof(*tx->tx_wrq));
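            For reference, a rough sketch of how that loop can be timed to confirm where the time goes (hypothetical instrumentation only, not the code we are running or a proposed patch; the LIBCFS_CPT_ALLOC call is the one quoted above, the 'start' local and the surrounding loop framing are illustrative):

                     /* Hypothetical timing around the tx_wrq allocation loop in
                      * kiblnd_create_tx_pool(); 'start' is an illustrative local. */
                     cfs_time_t start = cfs_time_current();

                     for (i = 0; i < size; i++) {
                             kib_tx_t *tx = &tpo->tpo_tx_descs[i];

                             /* one allocation per tx descriptor; under memory
                              * pressure these are the calls that become slow */
                             LIBCFS_CPT_ALLOC(tx->tx_wrq, lnet_cpt_table(), ps->ps_cpt,
                                              (1 + IBLND_MAX_RDMA_FRAGS) *
                                              sizeof(*tx->tx_wrq));
                             if (tx->tx_wrq == NULL)
                                     break;
                     }

                     CDEBUG(D_NET, "tx_wrq allocation loop took %lu HZ\n",
                            cfs_time_current() - start);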
            
            mhanafi Mahmoud Hanafi added a comment

            Outside of that bug, does it resolve your issues?

            simmonsja James A Simmons added a comment

            Small bug in patch http://review.whamcloud.com/16470

            CDEBUG(D_NET, "ps_pool_create took %lu HZ to complete",
                       cfs_time_current() - time_before);
            

            this should end with a newline.
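            That is, the format string should get a trailing "\n", along the lines of:

            CDEBUG(D_NET, "ps_pool_create took %lu HZ to complete\n",
                       cfs_time_current() - time_before);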

            mhanafi Mahmoud Hanafi added a comment

            When a pool is created, enough kernel pages are allocated to cover the transmits. The number of pages is determined by the size of the tx message and the size of the page. For example, if the maximum message size is 4K and the page size is 4K, we would allocate one page per message; and if we are allocating 256 tx descriptors, we'd allocate 256 pages.

            In kiblnd_map_tx_pool() each tx->tx_msg is set up to point to the page allocated for that tx. Then dma_map_single() is used to map this kernel page to a DMA address.

            The tx is then added to the pool's tx free list, ready for use.

            In the earlier discussion a question arose as to whether pinning memory is tied to a connection. As described above, that does not appear to be the case.

            However, allocating pages and DMA-mapping them can take increasingly long to complete as memory pressure grows on the system, as in the case here.
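            A minimal sketch of the sequence described above (illustrative only, simplified from kiblnd_map_tx_pool(); the type and field names follow the o2iblnd source but the 'pages', 'dev' and 'pool_size' parameters are placeholders, not a verbatim copy):

                     /* Simplified sketch of per-tx setup during pool mapping;
                      * not a verbatim copy of kiblnd_map_tx_pool(). */
                     for (i = 0; i < pool_size; i++) {
                             kib_tx_t *tx = &tpo->tpo_tx_descs[i];

                             /* point the tx message at the kernel page reserved
                              * for it when the pool was created */
                             tx->tx_msg = (kib_msg_t *)page_address(pages[i]);

                             /* map that page to a DMA address; this dma_map_single()
                              * step (and the page allocations before it) is what
                              * slows down as memory pressure grows */
                             tx->tx_msgaddr = dma_map_single(dev, tx->tx_msg,
                                                             IBLND_MSG_SIZE,
                                                             DMA_TO_DEVICE);

                             /* finally, put the tx on the pool's free list */
                             list_add(&tx->tx_list, &pool->po_free_list);
                     }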

            ashehata Amir Shehata (Inactive) added a comment

            I updated the b2_5 patch to add some more debug info:
            1. print the number of pools in the pool set when allocating new pools
            2. print the size of each pool
            3. when you do lctl list_nids, the number of CPTs is printed at the D_NET level.

            This should clarify how many times we're allocating pools over time. It will also give us some insight into which CPT the pools are getting associated with; see the sketch below.
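            Something along the following lines (illustrative only, not the literal patch text; 'npools', 'pool_size' and 'ncpts' stand in for whatever counters the real patch tracks):

            /* Illustrative debug prints only, not the literal patch */
            CDEBUG(D_NET, "poolset %s: allocating pool %d (size %d) on cpt %d\n",
                   ps->ps_name, npools, pool_size, ps->ps_cpt);
            CDEBUG(D_NET, "using %d CPTs\n", ncpts);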

            I have also been discussing internally the impact of Hyperthreading on performance. There is a tendency to think that it could hurt performance. Would it be possible to turn it off and find out whether there is any improvement?

            However, I do print the number of CPTs as indicated above, so that should tell us whether HT logical cores are being counted.

            I have also attached the HLD for the SMP node affinity for your reference. It describes how the CPT is implemented in the system.

            I'll continue looking at how memory is allocated and dma mapped and provide an explanation.

            ashehata Amir Shehata (Inactive) added a comment

            Jay,
            I updated Liang's original check in on b2_5.

            ashehata Amir Shehata (Inactive) added a comment
            jaylan Jay Lan (Inactive) added a comment - edited

            Amir,

            Your patch was generated from the master branch and caused a conflict in 2.5.3:

            <<<<<<< HEAD
            cfs_list_t *node;
            kib_pool_t *pool;
            int rc;
            =======
            struct list_head *node;
            kib_pool_t *pool;
            int rc;
            unsigned int interval = 1;
            cfs_time_t time_before;
            unsigned int trips = 0;
            >>>>>>> 9be6d5c... LU-7054 o2iblnd: less intense allocating retry

            I need to change "struct list_head" to "cfs_list_t".

            Also, interval and trips do not need to be 'unsigned int'; they are actually used as int in CDEBUG. A possible resolution is sketched below.
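            For reference, the resolved declarations on b2_5 might look like this (a sketch only, assuming the cfs_list_t API used on that branch):

            /* Sketch of the b2_5 conflict resolution: keep the legacy cfs_list_t
             * type and use plain int for the retry counters, since they are
             * only ever printed via CDEBUG. */
            cfs_list_t  *node;
            kib_pool_t  *pool;
            int          rc;
            int          interval = 1;
            cfs_time_t   time_before;
            int          trips = 0;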


            Amir Shehata (amir.shehata@intel.com) uploaded a new patch: http://review.whamcloud.com/16470
            Subject: LU-7054 o2iblnd: less intense allocating retry
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 8861aae7ebdcc564fa47cd84ace253e62bafef4e

            gerrit Gerrit Updater added a comment

            People

              Assignee: ashehata Amir Shehata (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 14
