[PVE-User] (Very) basic question regarding PVE Ceph integration

Alwin Antreich a.antreich at proxmox.com
Mon Dec 17 10:22:18 CET 2018


Hello Eneko,

On Mon, Dec 17, 2018 at 09:23:36AM +0100, Eneko Lacunza wrote:
> Hi,
> 
> On 16/12/18 at 17:16, Frank Thommen wrote:
> > > > I understand that with the new PVE release PVE hosts (hypervisors)
> > > > can be used as Ceph servers.  But it's not clear to me if (or when)
> > > > that makes sense.  Do I really want to have Ceph MDS/OSD on the same
> > > > hardware as my hypervisors?  Doesn't that a) accumulate multiple
> > > > points of failure on the same hardware and b) occupy computing
> > > > resources (CPU, RAM) that I'd rather use for my VMs and containers?
> > > > Wouldn't I rather want to have a separate Ceph cluster?
> > > The integration of Ceph services in PVE started with Proxmox VE 3.0.
> > > With PVE 5.3 (current) we added CephFS services to PVE. So you can run
> > > a hyper-converged Ceph cluster with RBD/CephFS on the same servers as
> > > your VMs/CTs.
> > > 
> > > a) can you please be more specific about what you see as multiple
> > > points of failure?
> > 
> > Not only do I run the hypervisor that controls containers and virtual
> > machines on the server, but also the file service that is used to store
> > the VM and container images.
> I think you have fewer points of failure :-) because you'll have 3 points
> (nodes) of failure in a hyperconverged scenario and 6 in a separate
> virtualization/storage cluster scenario...  it depends on how you look at it.
> > 
> > > b) it depends on the workload of your nodes. Modern server hardware
> > > has enough power to run multiple services. It all comes down to having
> > > enough resources for each domain (e.g. Ceph, KVM, CT, host).
> > > 
> > > I recommend using a simple calculation at the start, just to get a
> > > rough direction.
> > > 
> > > In principle:
> > > 
> > > ==CPU==
> > > core='CPU with HT on'
> > > 
> > > * reserve a core for each Ceph daemon
> > >    (preferably on the same NUMA node as the network; higher frequency
> > >    is better)
> > > * one core for the network card (higher frequency = lower latency)
> > > * rest of the cores for OS (incl. monitoring, backup, ...), KVM/CT usage
> > > * don't overcommit
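
Just to make that core budget concrete, a minimal sketch (the numbers are
only an example, not a recommendation):

    # per-node core budget, "core" as defined above
    cores_total  = 16   # cores available on the node
    ceph_daemons = 5    # e.g. 4x OSD + 1x MON running on this node
    nic          = 1    # one core for the network card
    os_overhead  = 2    # OS, monitoring, backup, ...

    cores_for_guests = cores_total - ceph_daemons - nic - os_overhead
    print("cores left for KVM/CT:", cores_for_guests)  # -> 8
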
> > > 
> > > ==Memory==
> > > * 1 GB per TB of used disk space on an OSD (more during recovery)
> Note this is not true anymore with Bluestore, because you have to take
> cache space into account (1 GB for HDD and 3 GB for SSD OSDs, if I recall
> correctly), and also OSD processes currently aren't that good at
> accounting for their RAM use... :)
I want to add that the recommendations for Luminous still state that.
Also, it is more a rule of thumb than what the usage will actually be.
http://docs.ceph.com/docs/luminous/start/hardware-recommendations/

With 12.2.10 the bluestore_cache_* settings have been replaced by
osd_memory_target, which defaults to 4 GB.
http://docs.ceph.com/docs/master/releases/luminous/
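
If the 4 GB default doesn't fit a node (e.g. many OSDs in a box with
little RAM), the target can be lowered per OSD in ceph.conf; just as an
illustration (the value is in bytes, 2 GB here):

    [osd]
    osd_memory_target = 2147483648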

> > > * enough memory for KVM/CT
> > > * free memory for OS, backup, monitoring, live migration
> > > * don't overcommit
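
The same kind of rough sketch for the memory side, assuming the 4 GB
osd_memory_target mentioned above (again, example numbers only):

    # per-node RAM budget in GB
    ram_total   = 128
    num_osds    = 4
    osd_ram     = num_osds * 4   # ~4 GB per OSD (osd_memory_target)
    guest_ram   = 64             # sum of RAM assigned to KVM/CT
    os_headroom = 8              # OS, backup, monitoring, live migration

    spare = ram_total - osd_ram - guest_ram - os_headroom
    print("spare RAM (GB):", spare)  # -> 40; keep this well above zero
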
> > > 
> > > ==Disk==
> > > * one OSD daemon per disk, evenly sized disks throughout the cluster
> > > * more disks, more hosts, better distribution
> > > 
> > > ==Network==
> > > * at least 10 GbE for storage traffic (the more, the better),
> > >    see our benchmark paper
> > > https://forum.proxmox.com/threads/proxmox-ve-ceph-benchmark-2018-02.41761/
> 10 Gbit helps a lot with latency; small clusters can work perfectly well
> with 2x 1 Gbit if they aren't latency-sensitive (we have been running a
> handful of those for some years now).
> 
> Cheers
> Eneko
--
Cheers,
Alwin



