Performance in Big Data Land: Every CPU cycle matters

At Localytics we need to perform data aggregations at mind-blowing speeds. To do this, we run HP Vertica as a core component of our data platform.

Vertica is a cluster-based, column-oriented analytics platform that provides an ANSI SQL interface with many analytics capabilities.

My name is Konstantine Krutiy, and I lead the Vertica team at Localytics. The clusters I am responsible for contain all of our analytics data (which amount to more than 1 trillion data points). In August I presented at the HP Big Data Conference, and my session covered real-world methods for boosting query performance in Vertica. The talk was called "Extra performance out of thin air".

At the core of this talk is the concept that every CPU cycle matters. This post is the first in a series exploring this idea as it pertains to data platform performance.

Every CPU Cycle Matters

Let’s start with a simple math exercise. In our equation we have CPU power on one end and a big dataset on the other.

In terms of CPU power, the fastest enterprise-grade hardware runs at 3.7 GHz. Currently there is no numerical definition for “Big Data”, but my peers and I concede that you have “Big Data” once you cross 100 billion records.

Now we’re ready to calculate how much time we can save if we eliminate 1 CPU cycle from each iteration. In these calculations I assume that we run on one core and eliminate one CPU cycle from the operation, which will touch each record in the dataset.

We calculate one CPU cycle operation at 1 / 3,700,000,000 of a sec and multiply it by 100 Billion records.

1 / 3,700,000,000 sec X 100,000,000,000 = 27 sec

That is an oustanding result! You can clearly see that one CPU cycle can represent tens of seconds of processing time. Think about the potential here! One tiny tweak, one CPU cycle saved, and a great reduction of processing times. Let’s see how this tweak effects the real world.

Removing CPU Cycles in the Real World

Different data types will force Vertica to use a different number of CPU cycles to process a data point. To illustrate the concept above, we’ll choose a data type for numeric values (as numeric values generally have a very high cardinality so the database engine will need to touch many different values to produce the final result).

The art of choosing data types

In this exercise we have the task of creating a column to store dollar amounts from $0 to $99,999,999,999,999.9999. We will start by looking at the different ways that $395.17 can be stored in the database.

Here we will have at least 3 different options:

Store as Presentation elements Data for database generic data type Data type Vertica translates to
Money None $395.17 money numeric(18,4)
Numeric value Dollar sign 395.17 numeric numeric(37,15)
Int value (covert to cents) Dollar sign & decimal point 39517 int int

We created a billion-record data set in 3 different tables using these different types. The column data type was the only difference. We ran the same queries on each of those tables 5 times in a row.

Here are our raw findings:

![](/content/images/2015/10/image00.png)

In simpler terms (averaged, time in seconds):

![](/content/images/2015/10/image02.png)

Yes !!! We just saved 10 sec of processing time by choosing the best possible data type.

I am sure the front-end can easily add a dollar sign and divide by 100 when Integer used as data type in the database. Doing this should be faster than the 10 sec we just saved at the database level.

Follow these rules of thumb when making data type choices in Vertica:

Now I hope you’re on board with this idea that every CPU cycle matters.

Saving more than CPU Cycles by Controlling Locking Behavior

To build on this idea of savings out of nowhere (and give you a sense of what future blog posts will hold), here’s an example that saves:

If you are running Vertica version 6 and executing your SQL statements via JDBC driver with the default settings then you acquire a lock for every SQL statement. If all you do is SELECT then you do not need those extra locks. When you avoid locks you will see the benefits mentioned above.

In Vertica, locking behavior is controlled by the AUTOCOMMIT mode on session level.

If AUTOCOMMIT = ON (jdbc driver default), each statement is treated as a complete transaction. When a statement completes changes are automatically committed to the database.

When AUTOCOMMIT = OFF, the transaction continues until manually run COMMIT or ROLLBACK. The locks are kept on objects for transaction duration.

This Java snippet shows how to switch AUTOCOMMIT to OFF:

conn = DriverManager.getConnection("jdbc:vertica://DBHost:5433/MyDB", myProperties);

/* get the state of the auto commit parameter */
System.out.println("Autocommit state: " + conn.getAutoCommit());

/* Change the auto commit state to false */
conn.setAutoCommit(false);

It did not take a lot of time to implement the needed changes. Here is a screenshot from one of our monitoring solutions, which shows the impact on lock counts by setting AUTOCOMMIT to OFF.

![](/content/images/2015/10/image01.png)

This graph shows the number of locks acquired from Vertica. A lower number means we’re not spending CPU cycles acquiring them, and we’re eliminating the time spent waiting for them, handling them, and releasing them. The result of such a simple code change was quite dramatic and caused significant increase in query throughput.

Can I call you a TRUE BELIEVER now? Every CPU cycle matters.

Now, no blog post would be fun without some code!

if( you == "have comments" ) {
    log("then comment");
} else if( you == "TRUE BELIEVER" ) {
    log("reply to post 'EVERY CPU CYCLE MATTERS'");
} else if( you == "almost TRUE BELIEVER, give me more!") {
    log("reply to post 'almost EVERY CPU CYCLE MATTERS'");
} else {
    log("Goodbye! and SHARE");
}

The full slide deck from presentation on HP Big Data Conference can be found here:
http://www.slideshare.net/kkrutiy/extra-performance-out-of-thin-air

Hero image above comes from “CPU Flame Graphs” by Brendan Gregg, where he discusses this type of visualization for sampled stack traces