[pve-devel] question/idea : managing big proxmox cluster (100nodes), get rid of corosync ?

Thomas Lamprecht t.lamprecht at proxmox.com
Wed Sep 21 12:55:21 CEST 2016


On 09/21/2016 10:51 AM, Alexandre DERUMIER wrote:
>>> Note that I have around 1000 VMs, so I don't know the impact on the number of messages/s.
> A simple tcpdump gives me an average of:
>
> udp/5404: 500 packets/s
> udp/5405: 1300 packets/s
>
> ----- Original Message -----
> From: "Alexandre Derumier" <aderumier at odiso.com>
> To: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 21 September 2016 09:57:42
> Subject: Re: [pve-devel] question/idea : managing big proxmox cluster (100nodes), get rid of corosync ?
>
>>> @Alexandre, you say that with 16 nodes the cluster is pretty much at its
>>> maximum. Can I get some more info from you, as I currently do not have
>>> the hardware to test this? :)
>>>
>>> Do you use IGMP snooping/queriers?
>>> On which network does corosync communicate, a dedicated one? And how
>>> fast is it?
>>> Do you also use redundant rings?
> I have a full 2x10 Gbit network through LACP (no redundant ring).
> Dedicated VLAN for the nodes, but sharing the same physical links (which are far from saturated).
> Cluster nodes are 2x 10-core 3.1 GHz Xeons, with SSDs for local storage.
> The MTU is currently 1500, but I'm planning to increase it to 9000, as that seems to allow more messages.
> I'm using IGMP snooping/queriers (multicast is stable).

OK, multicast traffic may still be hindered when it shares a network with
heavy users (e.g. VM storage), even if the network itself is not saturated.

A second totem ring through the redundant ring protocol (rrp) in passive
mode could boost the performance as it almost doubles the speed of the totem
protocol, plus it adds redundancy for quorum.
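
As a rough illustration of what a second ring in passive mode could look
like with corosync 2.x, the sketch below prints an example totem section.
The helper name print_rrp_fragment is hypothetical and both bindnetaddr
networks are made-up placeholders. The fragment is only printed, so nothing
touches the real corosync.conf.

```shell
#!/bin/sh
# Hypothetical helper: print an example totem section with two rings.
# 10.10.10.0 and 10.10.20.0 are placeholder networks, not values from
# this thread; adapt them to the two physical networks in use.
print_rrp_fragment() {
    cat <<'EOF'
totem {
  version: 2
  rrp_mode: passive
  interface {
    ringnumber: 0
    bindnetaddr: 10.10.10.0
  }
  interface {
    ringnumber: 1
    bindnetaddr: 10.10.20.0
  }
}
EOF
}

# Usage: print_rrp_fragment
```

Each node would additionally need a ring1_addr entry next to its ring0_addr
in the nodelist section.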

>
> And I'm seeing a lot of retransmits from time to time (around 5-10 s of retransmits), once or twice an hour :/

Hmm, that sounds a bit odd. Does it happen seemingly at random?

>
> so I'm really scared to increase the cluster size.
>
> Note that I have around 1000 VMs, so I don't know the impact on the number of messages/s.
>
> Question: do you think streaming all VM statistics could impact the number of messages/s?
>


Do you use anything that could trigger frequent writes or modifications on
/etc/pve?

Just running VMs normally does not modify anything. There are mostly just
reads, which should not cause any problems: they don't go over the wire and
are fast because the DB is held in RAM. Only modifications have to be sent
to the other nodes.

You could check whether
# inotifywait -e attrib,modify,create,delete,move -r -m /etc/pve/
generates a lot of output. Note that this only shows how the local node
modifies the pmxcfs, not all modifications made in the cluster.
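
If the raw output is hard to judge by eye, a small wrapper can count the
events per type over a fixed window. This is only a sketch: the function
names summarise_events and watch_changes are made up for illustration, and
the 60-second default window is an arbitrary choice. It relies on
inotifywait's --format option and on timeout from coreutils.

```shell
#!/bin/sh
# Read one event name per line on stdin and print "count EVENT" lines,
# most frequent first.
summarise_events() {
    sort | uniq -c | sort -rn
}

# Watch a directory for a fixed number of seconds and summarise the events.
# Usage: watch_changes /etc/pve 60
watch_changes() {
    dir=$1
    secs=${2:-60}
    # --format '%e' prints only the event name(s) for each change;
    # timeout stops the otherwise endless monitor mode (-m).
    timeout "$secs" inotifywait -q -m -r \
        -e attrib,modify,create,delete,move \
        --format '%e' "$dir" | summarise_events
}
```

Running watch_changes /etc/pve 60 on a node gives a per-minute picture of
what that node itself changes.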

FYI, the HA manager uses it regularly, but only in a 5-second cycle, so that
is not really heavy usage.

Can you also send me the output from
# corosync-cmapctl

This is quite a lot of data and contains IP addresses, so you may want to
send it to me directly.
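
Alternatively, the addresses could be masked before the output is shared.
The filter below is a minimal sketch: redact_ips is a hypothetical helper
name, it only matches things that look like IPv4 addresses, and it does not
handle IPv6.

```shell
#!/bin/sh
# Hypothetical helper: replace anything on stdin that looks like an
# IPv4 address (four dot-separated groups of 1-3 digits) with x.x.x.x.
redact_ips() {
    sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/x.x.x.x/g'
}

# Usage: corosync-cmapctl | redact_ips > cmap.txt
```

Since the pattern is purely textual, it would also mask other dotted
four-part numbers, so the result is worth a quick look before sending.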


>
>
>
>
> ----- Original Message -----
> From: "Thomas Lamprecht" <t.lamprecht at proxmox.com>
> To: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Wednesday, 21 September 2016 09:40:01
> Subject: Re: [pve-devel] question/idea : managing big proxmox cluster (100nodes), get rid of corosync ?
>
> On 09/21/2016 08:50 AM, Alexandre DERUMIER wrote:
>>>> Forgot to mention that consul supports multiple clusters and/or multi
>>>> center clusters out of the box.
>> Yes, I read the docs yesterday. It seems very interesting.
>>
>> Most of the work would probably be replacing pmxcfs with the consul KV store. I have seen some consul FUSE filesystem implementations,
>> but they don't have all pmxcfs features (symlinks, for example).
>>
>> Zookeeper seems to be lower level.
>>
>> Reading the sheepdog plugin (1500 LOC):
>>
>> https://github.com/sheepdog/sheepdog/blob/8772904509ce6b10c5edca4f497022686aecc18f/sheep/cluster/zookeeper.c
>> vs
>> https://github.com/sheepdog/sheepdog/blob/8772904509ce6b10c5edca4f497022686aecc18f/sheep/cluster/corosync.c
> Discussing and evaluating options is good, but instantly throwing
> everything away and switching to another - not necessarily better -
> cluster stack is maybe a bit of an overreaction. :) I also think that our
> current cluster stack, with corosync + pve-cluster (pmxcfs), is quite
> stable, and a lot of things depend on it.
>
> Also, corosync is very well tested software and works really well, at least
> with small to mid-size clusters (< 60 nodes, which I find quite an
> achievement for a cluster!). You also have to consider that quite some
> overhead, and thus the node limit, may come from the database used by
> pmxcfs: each transaction needs to be synced to disk to make everything
> reliable, and while this is quite optimized it makes things slower
> (placing the DB on really fast storage could help here).
>
> I, personally, would prefer to keep corosync and introduce a protocol which
> allows connecting multiple clusters (easier said than done, but still less
> change and work than adapting to another cluster stack, which is most
> surely not better, or has other drawbacks).
>
> Also taking a look at the corosync satellite approach sounds interesting.
>
> Connecting multiple clusters is also a different approach than a small
> cluster with a lot of satellite nodes per cluster node. I see the former as
> better, as it's more decentralized and seems to fit better in our current
> design. :)
>
>> Note that for scaling, zookeeper, consul, ... have some kind of master nodes for the quorum, plus client nodes (the same as corosync satellite).
>> I don't think it's technically possible to scale a full mesh of master nodes to a large number of nodes.
> No, with a full mesh you won't really overcome the limits and problems
> corosync has here; corosync already makes good use of the possibilities
> that multicast offers.
>
> @Alexandre, you say that with 16 nodes the cluster is pretty much at its
> maximum. Can I get some more info from you, as I currently do not have
> the hardware to test this? :)
>
> Do you use IGMP snooping/queriers?
> On which network does corosync communicate, a dedicated one? And how
> fast is it?
> Do you also use redundant rings?
>
>
>> ----- Original Message -----
>> From: "datanom.net" <mir at datanom.net>
>> To: "pve-devel" <pve-devel at pve.proxmox.com>
>> Sent: Wednesday, 21 September 2016 07:49:06
>> Subject: Re: [pve-devel] question/idea : managing big proxmox cluster (100nodes), get rid of corosync ?
>>
>> On Wed, 21 Sep 2016 01:45:18 +0200
>> Michael Rasmussen <mir at datanom.net> wrote:
>>
>>> https://github.com/hashicorp/consul
>>>
>> Forgot to mention that consul supports multiple clusters and/or multi
>> center clusters out of the box.
>>
>
> _______________________________________________
> pve-devel mailing list
> pve-devel at pve.proxmox.com
> http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel




