[PVE-User] (Very) basic question regarding PVE Ceph integration

Eneko Lacunza elacunza at binovo.es
Mon Dec 17 09:23:36 CET 2018


Hi,

On 16/12/18 at 17:16, Frank Thommen wrote:
>>> I understand that with the new PVE release PVE hosts (hypervisors)
>>> can be used as Ceph servers.  But it's not clear to me if (or when)
>>> that makes sense.  Do I really want to have Ceph MDS/OSD on the same
>>> hardware as my hypervisors?  Doesn't that a) accumulate multiple POFs
>>> on the same hardware and b) occupy computing resources (CPU, RAM) that
>>> I'd rather use for my VMs and containers?  Wouldn't I rather want to
>>> have a separate Ceph cluster?
>> The integration of Ceph services in PVE started with Proxmox VE 3.0.
>> With PVE 5.3 (current) we added CephFS services to PVE. So you can
>> run a hyper-converged Ceph setup with RBD/CephFS on the same servers
>> as your VM/CT.
>>
>> a) can you please be more specific about what you see as multiple
>> points of failure?
>
> not only do I run the hypervisor which controls containers and virtual 
> machines on the server, but also the file service which is used to 
> store the VM and container images.
I think you have fewer points of failure :-) because you'll have 3 points 
(nodes) of failure in a hyperconverged scenario and 6 in a separate 
virtualization/storage cluster scenario... it depends on how you look at it.
>
>> b) depends on the workload of your nodes. Modern server hardware has
>> enough power to run multiple services. It all comes down to having
>> enough resources for each domain (e.g. Ceph, KVM, CT, host).
>>
>> I recommend using a simple calculation for a start, just to get a
>> rough direction.
>>
>> In principle:
>>
>> ==CPU==
>> core='CPU with HT on'
>>
>> * reserve a core for each Ceph daemon
>>    (preferably on the same NUMA node as the network; higher frequency
>>    is better)
>> * one core for the network card (higher frequency = lower latency)
>> * rest of the cores for OS (incl. monitoring, backup, ...), KVM/CT usage
>> * don't overcommit
>>
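As a rough illustration of the CPU rule of thumb above (not an official
formula; the daemon counts and core numbers below are invented example
values), a quick back-of-the-envelope sketch in Python:

# Rough per-node CPU budget for a hyper-converged PVE/Ceph node.
# Every input here is an example assumption, not a recommendation.
logical_cores   = 32         # e.g. 16 cores / 32 threads with HT on
ceph_daemons    = 4 + 1 + 1  # e.g. 4 OSDs + 1 MON + 1 MDS on this node
network_reserve = 1          # one core for the NIC (lower latency)
os_reserve      = 2          # OS, monitoring, backup, ...

guest_cores = logical_cores - ceph_daemons - network_reserve - os_reserve
print(f"cores left for KVM/CT: {guest_cores}")  # don't overcommit beyond this
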
>> ==Memory==
>> * 1 GB per TB of used disk space on an OSD (more during recovery)
Note this is not true anymore with BlueStore, because you have to take 
cache space into account (1 GB for HDD and 3 GB for SSD OSDs, if I recall 
correctly), and also currently OSD processes aren't that good at RAM 
usage accounting... :)
>> * enough memory for KVM/CT
>> * free memory for OS, backup, monitoring, live migration
>> * don't overcommit
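Similarly, a minimal memory budget sketch along the lines of the rule
above plus the BlueStore note (disk sizes, per-OSD cache figures and
guest memory are again invented example numbers, not recommendations):

# Rough per-node memory budget; all inputs are example assumptions.
osd_used_tb        = [4, 4, 4, 4]  # used space per OSD in TB (4 HDD OSDs)
bluestore_cache_gb = 1             # ~1 GB per HDD OSD (roughly 3 GB for SSD OSDs)

osd_mem_gb  = sum(1 * tb for tb in osd_used_tb)      # ~1 GB per TB used
osd_mem_gb += bluestore_cache_gb * len(osd_used_tb)  # BlueStore cache on top

guest_mem_gb  = 64   # memory promised to KVM/CT
os_reserve_gb = 8    # OS, backup, monitoring, live-migration headroom

total_gb = osd_mem_gb + guest_mem_gb + os_reserve_gb
print(f"plan for at least {total_gb} GB of RAM on this node")  # leave headroom for recovery
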
>>
>> ==Disk==
>> * one OSD daemon per disk, even disk sizes throughout the cluster
>> * more disks, more hosts, better distribution
>>
>> ==Network==
>> * at least 10 GbE for storage traffic (the more the better),
>>    see our benchmark paper
>> https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
10 Gbit helps a lot with latency; small clusters can work perfectly with 
2x1 Gbit if they aren't latency-sensitive (we have been running a 
handful of those for some years now).

Cheers
Eneko

-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



