Category Archives: Advanced SQL

User Defined Aggregate Functions

Some databases provide the ability to write custom functions that can be accessed from SQL queries. Usually, this feature goes along with a stored procedure capability. A few databases also allow creating user defined aggregate functions.

What are user defined aggregate functions? Functions such as count, sum, min and max are all aggregate functions because they operate on multiple rows and produce a single result. There may be times when the default aggregate functions provided by the database are not sufficient. In that case, the “user defined aggregate functions” feature makes it possible to write your own aggregate function with whatever semantics you want.

It is important to understand the computational model of user defined aggregate functions. First, aggregates can be computed with or without a grouping. That is,

select count(1) from people where country = 'US'

gives the total count of people within the US. Here, there is only one aggregate computation. However,

select country,count(1) from people group by country

gives the total count of people for each country. Here, one aggregate is computed per country.

Which type of query your user defined aggregate function is used in should not matter, except for initialization. Before explaining what I mean by that, note that the aggregate computation does not happen on the entire set of rows in a group at once. I will illustrate this with our own user defined aggregate function called pcount, which does the same thing that count does.

So, the signature of pcount, when you define it as a function, will not be something like

int pcount(object[] group-of-rows);

but it would be something like

int pcount(object val);

The expectation is that the function keeps receiving the data of each row, one after the other, and at the end returns the result. So, the following things have to happen for user defined aggregate functions:

  1. Initialization
  2. Iteration
  3. Final Result

Now, since the function is repeatedly invoked over the list of rows within a group, where do you keep any intermediate state, such as the number of rows counted so far? For this reason, it is important that the stored procedure language offers some form of object orientation. With object orientation, it's possible to create an object (using new or whatever constructor mechanism the language provides), loop through the rows calling a member function of the object so that any intermediate state is captured within the object, and finally call a results function to get the final result. For a query that returns multiple groups (with a group by clause), the database may reuse the same object across groups or create one object per grouping; the implementation depends on the database. Depending on that, it may be required to write some initialization code so that state from the previous group's calculation does not compound.

So, our pcount function will simply be a member function of an object. It stores the count in a member variable and increments it each time the function is called (of course, it ignores null values and does not count them).
At the end, the member variable that keeps the tally of the count is simply returned.
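As a concrete sketch of this object model, SQLite (through Python's sqlite3 module) supports user defined aggregates with exactly the initialization/iteration/final-result shape described above. The people table, its columns and the sample rows below are just illustrations:

```python
import sqlite3

class PCount:
    """Aggregate object: sqlite3 creates one instance per group."""
    def __init__(self):
        self.count = 0           # initialization: intermediate state

    def step(self, val):         # iteration: called once per row
        if val is not None:      # ignore nulls, as count(col) does
            self.count += 1

    def finalize(self):          # final result for the group
        return self.count

conn = sqlite3.connect(":memory:")
conn.create_aggregate("pcount", 1, PCount)
conn.execute("CREATE TABLE people (name TEXT, country TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", "US"), ("Bob", "US"), (None, "US"), ("Eve", "IN")])
rows = conn.execute(
    "SELECT country, pcount(name) FROM people GROUP BY country").fetchall()
print(sorted(rows))  # [('IN', 1), ('US', 2)] -- the null name is not counted
```

Because sqlite3 instantiates a fresh PCount per group, no explicit re-initialization between groups is needed here; a database that reuses one object would need the reset logic discussed above.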

Note that, by default, aggregate functions don't get the context of the grouping. In our example, the function is not aware of which country the aggregate is being calculated for. In most cases this is not needed either; your count, sum or any other aggregate value wouldn't really depend on which group you are calculating the aggregate for. However, if desired, it should be possible to pass the group as yet another function parameter. Also, note that it's possible to pass multiple arguments to your aggregate function. Each parameter would typically be a non-grouping column of the query.


Filed under Advanced SQL

Multi-Table Insert

Today, I was trying to help a friend migrate data from one data model to another. The first data model had all the levels of the dimension in the same table, while the second data model had them in different tables. So, instead of writing some complex procedural logic, I tried using Oracle's Multi-Table Insert, which essentially has the following syntax:

insert all
when <condition> then into <table> values (…)
when <condition> then …
select … from … where … connect by … start with … order by level;

The data models in this specific architecture are auto-generated and have the foreign key constraints automatically created. As a result, I tried to order the data by the level of the dimension hierarchy so that the parent level is always inserted before the child level. However, the above SQL still gave a constraint violation error.

After a bit of research, it turns out that Oracle doesn't guarantee the order of inserts in spite of the explicit ordering in the select clause. Well, I fixed the problem in a quick and easy manner using a PL/SQL block that loops on the level from 1..n.
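The level-by-level idea is not specific to PL/SQL. Here is a sketch of the same approach in Python with SQLite; the region/country tables and the sample rows are made-up stand-ins for the real dimension hierarchy:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce parent-before-child
conn.execute("CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT, "
             "region_id INTEGER REFERENCES region(id))")

# Flat source rows: (level, id, name, parent_id). Note that a child row
# appears before its parent, as an unordered bulk insert might deliver it.
rows = [(2, 10, "US", 1),
        (1, 1, "Americas", None),
        (2, 11, "CA", 1)]

# Insert one hierarchy level at a time so parents always go in first,
# regardless of the order of the source rows.
for level in (1, 2):
    for lvl, rid, name, parent in rows:
        if lvl != level:
            continue
        if lvl == 1:
            conn.execute("INSERT INTO region VALUES (?, ?)", (rid, name))
        else:
            conn.execute("INSERT INTO country VALUES (?, ?, ?)",
                         (rid, name, parent))
```

Inserting the rows in their original order with foreign keys on would fail on the first child row; the outer loop over levels is what guarantees the constraint is satisfiable.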

Anyway, I think the Multi-Table Insert is a cool feature, but given that it was designed specifically with ETL in mind, it would be great if it could be enhanced to honor the ordering of the data.


Filed under Advanced SQL, data migration, ETL, Oracle