Category Archives: VLDB

Linux large pages and non-uniform memory distribution

In my last post about large pages I promised a little more background information on how large pages and NUMA are related.

Background and some history about processor architecture

For quite some time now the CPUs you get from both AMD and Intel are NUMA, or more precisely: cache coherent NUMA (ccNUMA) CPUs. They all have their own “local” memory directly attached to them; in other words, the memory distribution is not uniform across all CPUs. This isn’t really new: Sequent pioneered this concept on x86 a long time ago, albeit in a different context. You really should read Scaling Oracle 8i by James Morle, which has a lot of excellent NUMA-related content in it, with contributions from Kevin Closson. It doesn’t matter that it reads “8i”; most of it is as relevant today as it was then.

So what is the big deal about the NUMA architecture anyway? To explain NUMA and why it is important to all of us, a little more background information is in order.

Some time ago, processor designers and architects of industry standard hardware could no longer ignore the fact that the front side bus (FSB) had become a bottleneck. There were two reasons for this: it was a) too slow and b) too much data had to travel over it. As one direct consequence, DRAM memory is now directly attached to the CPUs. AMD did this first with its Opteron processors in the AMD64 micro-architecture, followed by Intel with the Nehalem micro-architecture. By removing the requirement for data retrieved from DRAM to travel across a slow shared bus, latencies could be reduced.

Now imagine that every processor has a number of memory channels to which DDR3 (DDR4 could arrive soon!) SDRAM is attached. In a dual socket system, each socket is responsible for half the memory of the system. To allow one socket to access the other socket’s half of memory, some kind of interconnect between processors is needed. Intel has opted for the QuickPath Interconnect (QPI); AMD (and IBM for the p-Series) use HyperTransport. This is comparatively simple when you have few sockets: with up to 4 sockets, each socket can directly connect to every other one without any tricks. With 8 sockets it becomes more difficult. If every socket can directly communicate with all its peers, the system is said to be glue-less, which is beneficial. The last production glue-less eight-socket system Intel released was based on the Westmere architecture. Sandy Bridge (current until approximately Q3/2013) didn’t have an eight-way glue-less variant, and this is exactly why you get Westmere-EX in the X3-8, and not Sandy Bridge as in the X3-2.

Anyway, your system will have local and remote memory. Most of us are not going to notice this at all, since there is little point in enabling NUMA on systems with only two sockets. Oracle still recommends that you only enable NUMA on 8-way systems, and this is probably the reason the oracle-validated and preinstall RPMs add “numa=off” to the kernel command line in your GRUB boot loader.
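The resulting entry in /boot/grub/grub.conf looks roughly like the sketch below; the kernel version and root device are made-up placeholders, only the numa=off parameter at the end of the kernel line matters:

```
# /boot/grub/grub.conf -- example only; the kernel image and root device
# are placeholders, the relevant part is numa=off at the end of the line
title Oracle Linux Server
        root (hd0,0)
        kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/mapper/vg00-root rhgb quiet numa=off
        initrd /initramfs-2.6.32-358.el6.x86_64.img
```

You can check what the running kernel was actually booted with by looking at /proc/cmdline, and inspect the NUMA topology the OS presents with numactl --hardware.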



Introduce Partitioning to a large table

Recently I was asked to find a way to partition a very big table on very slow storage. “Very large table” translates into 600G, and “slow storage” translates into an HP MSA 1000. The maintenance window wasn’t big enough to do this in one go, and the insert activity on the table required the work to be done during less busy times.

DBMS_REDEFINITION wasn’t an option, so I decided on a multi-step approach: if I couldn’t neatly use dbms_redefinition, then at least I wanted the power of the partitioning commands. Here’s the proof of concept.
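For comparison, online redefinition of the same table would look roughly like the sketch below. This is only an illustration of the package, not what was done here; the interim table name INT_PART is hypothetical, and ROWID-based redefinition is assumed because the demo table has no primary key.

```sql
-- Hedged sketch only: redefine UNPART using a pre-created, partitioned
-- interim table INT_PART (hypothetical name), keyed on ROWID
BEGIN
  DBMS_REDEFINITION.CAN_REDEF_TABLE(
    uname        => USER,
    tname        => 'UNPART',
    options_flag => DBMS_REDEFINITION.CONS_USE_ROWID);

  DBMS_REDEFINITION.START_REDEF_TABLE(
    uname        => USER,
    orig_table   => 'UNPART',
    int_table    => 'INT_PART',
    options_flag => DBMS_REDEFINITION.CONS_USE_ROWID);

  -- copy dependent objects (indexes, grants, ...) here, then finish
  DBMS_REDEFINITION.FINISH_REDEF_TABLE(USER, 'UNPART', 'INT_PART');
END;
/
```

The appeal of this route is that the table stays available throughout; the drawback on slow storage is that the interim table still has to be fully populated, which is exactly the I/O cost that ruled it out here.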

As always, don’t simply copy and paste – this works for me, but that doesn’t mean it also works for you. Test, test, test before doing this in production.

First, create a “large” table as the source for testing. Actually it’s not large at all, but that’s beside the point…

create table unpart
tablespace users
as select * from dba_objects;

Table created.

09:53:05 SQL> select count(1) from unpart;


Now let’s put some more data into it:

begin
 for i in 1..5 loop
  insert into unpart select * from unpart;
 end loop;
end;

09:53:26 SQL> /

PL/SQL procedure successfully completed.

09:53:27 SQL> commit;

Commit complete.

09:53:30 SQL> select count(1) from unpart;


Slightly better. I now need to find the upper bound for the partition about to be created. From this point on you shouldn’t insert data into the table…

SQL> select max(object_id) from unpart;


Now create the partitioned table to hold the data after the move. The whole source table should fit into the first partition.

create table part
partition by range (object_id)
(
 partition p1 values less than (10459),
 partition p2 values less than (20000),
 partition p3 values less than (maxvalue)
)
as select * from unpart where 1 = 0;

SQL> alter table part enable row movement;

Table altered.

A few selects to prove that the table is empty:

SQL> select count(1) from part partition (p1);


SQL> select count(1) from part partition (p2)
  2  /


SQL> select count(1) from part partition (p3);


Now let’s exchange the table with the initial partition:

SQL> alter table part exchange partition p1 with table unpart;

Table altered.

SQL> select count(1) from part partition (p1)
  2  /


SQL> select count(1) from part partition (p2);


-- Now we can partition the rest!

09:56:22 SQL> alter table part split partition p1 at (5000) into (partition p_del, partition p1) 
           2  update global indexes parallel 4;

Table altered.

09:56:35 SQL> select count(1) from part partition (p_del);


09:56:46 SQL> select count(1) from part partition (p1);


09:56:50 SQL> select 157312+161376 from dual;


So we’ve got the initial number of rows, evenly split across the two partitions. This is great! Unfortunately there is a slight caveat with the split partition command – it’s very heavy on resources (remember the multi-step approach?). Each row in the source partition is copied to the destination partitions (new segments), causing a large number of I/Os.
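One way to see the new segments a split creates is to query the data dictionary; a minimal sketch, assuming the table lives in the current schema:

```sql
-- Sketch: list the segments behind each partition of PART, which makes
-- the new segments created by the split (and their sizes) visible
select partition_name,
       round(bytes/1024/1024) as mb
  from user_segments
 where segment_name = 'PART'
 order by partition_name;
```

Comparing this output before and after a split shows that the rows really do land in freshly allocated segments rather than staying in place.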

09:57:33 SQL> alter table part split partition p1 at (8500) into (partition p1, partition p1bis)
09:57:37   2   update global indexes parallel 4;

Dropping partitions is easy too:

SQL> alter table part drop partition p_del;

Table altered.