[pve-devel] 3 numa topology issues

Alexandre DERUMIER aderumier at odiso.com
Thu Jul 28 08:44:47 CEST 2016


I'm looking at the OpenStack implementation:

https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html

and it seems that they also check that the host NUMA nodes exist.


"hw:numa_nodes=NN - numa of NUMA nodes to expose to the guest.
The most common case will be that the admin only sets ‘hw:numa_nodes’ and then the flavor vCPUs and RAM will be divided equally across the NUMA nodes.
"

This is what we are doing with numa: 1 (we use sockets to know how many NUMA nodes we need).
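
For illustration, a minimal sketch of that equal split (made-up guest values, not the actual Memory.pm code), printing roughly the qemu arguments that would result:

# sketch only: a hypothetical guest with 2 sockets, 4 cores and 4096 MB
my ($sockets, $cores, $static_memory) = (2, 4, 4096);
my $numa_memory = $static_memory / $sockets;    # 2048 MB per virtual node

for (my $i = 0; $i < $sockets; $i++) {
    my $cpustart = $i * $cores;                 # node0: cpus 0-3, node1: cpus 4-7
    my $cpuend   = $cpustart + $cores - 1;
    print "-object memory-backend-ram,id=ram-node$i,size=${numa_memory}M"
        . " -numa node,nodeid=$i,cpus=$cpustart-$cpuend,memdev=ram-node$i\n";
}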


" So, given an example config:

vcpus=8
mem=4
hw:numa_nodes=2 - number of NUMA nodes to expose to the guest.
hw:numa_cpus.0=0,1,2,3,4,5
hw:numa_cpus.1=6,7
hw:numa_mem.0=3072
hw:numa_mem.1=1024
The scheduler will look for a host with 2 NUMA nodes with the ability to run 6 CPUs + 3 GB of RAM on one node, and 2 CPUS + 1 GB of RAM on another node. If a host has a single NUMA node with capability to run 8 CPUs and 4 GB of RAM it will not be considered a valid match.
"

So, if the host doesn't have enough NUMA nodes, the placement is invalid.
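
A minimal sketch of how such a check could look on our side (hypothetical helper, not existing code), assuming host nodes are numbered contiguously in sysfs:

# hypothetical helper: count host NUMA nodes in sysfs and reject a guest
# topology that needs more of them
sub host_numa_node_count {
    my $count = 0;
    $count++ while -d "/sys/devices/system/node/node$count";
    return $count;
}

my $guest_nodes = 2;                      # e.g. sockets, with numa: 1 set
my $host_nodes  = host_numa_node_count();
die "guest needs $guest_nodes NUMA nodes but host only has $host_nodes\n"
    if $guest_nodes > $host_nodes;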




----- Original Message -----
From: "aderumier" <aderumier at odiso.com>
To: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Wednesday, 27 July 2016 11:38:04
Subject: Re: [pve-devel] 3 numa topology issues

>>I believe we can simply remove this line since qemu allows it and just 
>>applies its default policy. Alternatively we can keep a counter and 
>>apply host-nodes manually, starting over at 0 when we run out of nodes, 
>>but that's no better than letting qemu do this. 

Well, I don't know how automatic NUMA balancing behaves on the host when, for example, a guest defines 2 NUMA nodes and the host has only 1 NUMA node.

I'll have more time next week to do a lot of tests.
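
(A minimal sketch, not PVE code, of a quick first check for such tests: whether kernel automatic NUMA balancing is enabled on the host and how many NUMA nodes it exposes.)

# sketch only: read the kernel knob and count the sysfs node directories
open(my $fh, '<', '/proc/sys/kernel/numa_balancing')
    or die "cannot read numa_balancing: $!\n";
chomp(my $balancing = <$fh>);
close($fh);

my $nodes = 0;
$nodes++ while -d "/sys/devices/system/node/node$nodes";

print "auto NUMA balancing: " . ($balancing ? "enabled" : "disabled")
    . ", host NUMA nodes: $nodes\n";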

----- Original Message -----
From: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
To: "aderumier" <aderumier at odiso.com>
Cc: "pve-devel" <pve-devel at pve.proxmox.com>
Sent: Wednesday, 27 July 2016 09:16:07
Subject: Re: 3 numa topology issues

> On July 26, 2016 at 2:18 PM Alexandre DERUMIER <aderumier at odiso.com> wrote: 
> 
> 
> > >>Issue #1: The above code currently does not honor our 'hostnodes' option 
> > >>and breaks when trying to use them together. 
> 
> Also I need to check how to allocate hugepages when hostnodes is defined with a range like "hostnodes:0-1".
> 
> 
> 
> 
> 
> >>Useless, yes, which is why I'm wondering whether this should be 
> >>supported/warned about/error... 
> 
> I think we could force "hostnodes" to be defined.
> I don't know if a lot of people already use the numaX option, but as we never exposed it in the GUI, I don't think it would break the setup of too many people.
> 
> 
> 
> 
> 
> ----- Original Message -----
> From: "Wolfgang Bumiller" <w.bumiller at proxmox.com>
> To: "aderumier" <aderumier at odiso.com>
> Cc: "pve-devel" <pve-devel at pve.proxmox.com>
> Sent: Tuesday, 26 July 2016 13:59:42
> Subject: Re: 3 numa topology issues
> 
> On Tue, Jul 26, 2016 at 01:35:50PM +0200, Alexandre DERUMIER wrote: 
> > Hi Wolfgang, 
> > 
> > I just came back from holiday.
> 
> Hope you had a good time :-) 
> 
> > 
> > 
> > 
> > >>Issue #1: The above code currently does not honor our 'hostnodes' option 
> > >>and breaks when trying to use them together. 
> > 
> > mmm indeed. I think this can be improved. I'll try to check that next week. 
> > 
> > 
> > 
> > >>Issue #2: We create one node per *virtual* socket, which means enabling 
> > >>hugepages with more virtual sockets than physical numa nodes will die 
> > >>with the error that the numa node doesn't exist. This should be fixable 
> > >>as far as I can tell, as nothing really prevents us from putting them on 
> > >>the same node? At least this used to work and I've already asked this 
> > >>question at some point. You said the host kernel will try to map them, 
> > >>yet it worked without issues before, so I'm still not sure about this. 
> > >>Here's the conversation snippet: 
> > 
> > You can create more virtual NUMA nodes than physical ones, but only if you don't define the "hostnodes" option.
> >
> > (From my point of view, it's totally useless, as the whole point of the numa option is to map virtual nodes to physical nodes and avoid memory access bottlenecks.)
> 
> Useless, yes, which is why I'm wondering whether this should be 
> supported/warned about/error... 
> 
> > 
> > If hostnodes is defined, you need to have the physical NUMA nodes available (a VM with 2 NUMA nodes needs a host with 2 NUMA nodes).
> >
> > With hugepages enabled, I have added a restriction that hostnodes must be defined, because you want to be sure that the memory is on the same node.
> > 
> > 
> > # hostnodes
> > my $hostnodelists = $numa->{hostnodes};
> > if (defined($hostnodelists)) {
> >     my $hostnodes;
> >     foreach my $hostnoderange (@$hostnodelists) {
> >         my ($start, $end) = @$hostnoderange;
> >         $hostnodes .= ',' if $hostnodes;
> >         $hostnodes .= $start;
> >         $hostnodes .= "-$end" if defined($end);
> >         $end //= $start;
> >         for (my $i = $start; $i <= $end; ++$i) {
> >             die "host NUMA node$i doesn't exist\n" if ! -d "/sys/devices/system/node/node$i/";
> >         }
> >     }
> >
> >     # policy
> >     my $policy = $numa->{policy};
> >     die "you need to define a policy for hostnode $hostnodes\n" if !$policy;
> >     $mem_object .= ",host-nodes=$hostnodes,policy=$policy";
> > } else {
> >     die "numa hostnodes need to be defined to use hugepages" if $conf->{hugepages};
> > }
> > 
> > 
> > >>Issue #3: Actually just an extension to #2: we currently cannot enable 
> > >>NUMA at all (even without hugepages) when there are more virtual sockets 
> > >>than physical numa nodes, and this used to work. The big question is 
> > >>now: does this even make sense? Or should we tell users not to do this? 
> > 
> > That's strange, it should work if you don't define the hugepages and hostnodes options (in numaX).
> 
> Actually this one was my own faulty configuration, sorry. 

Gotta take that back, here's the problem: 
sockets: 2 
numa: 1 
(no numaX defined) 

will go through Memory.pm's sub config: 

| if ($conf->{numa}) {
|
|     my $numa_totalmemory = undef;
|     for (my $i = 0; $i < $MAX_NUMA; $i++) {
|         next if !$conf->{"numa$i"};
(...)
|     }
|
|     # if no custom topology, we split memory and cores across numa nodes
|     if (!$numa_totalmemory) {
|
|         my $numa_memory = ($static_memory / $sockets);
|
|         for (my $i = 0; $i < $sockets; $i++) {
|             die "host NUMA node$i doesn't exist\n" if ! -d "/sys/devices/system/node/node$i/";

and dies there if no numa node exists. 

I believe we can simply remove this line since qemu allows it and just 
applies its default policy. Alternatively we can keep a counter and 
apply host-nodes manually, starting over at 0 when we run out of nodes, 
but that's no better than letting qemu do this. 
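
A rough sketch of that counter idea (hypothetical, not a patch; the size and policy=bind are just placeholders):

# pin each virtual node to a host node, wrapping around to 0 when the
# host runs out of nodes
my $sockets    = 2;     # virtual NUMA nodes we want
my $host_nodes = 1;     # nodes the host actually has (from /sys/devices/system/node)

for (my $i = 0; $i < $sockets; $i++) {
    my $hostnode = $i % $host_nodes;    # virtual node $i ends up on this host node
    print "memory-backend-ram,id=ram-node$i,size=2048M"
        . ",host-nodes=$hostnode,policy=bind\n";
}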
