Some databases provide the ability to write custom functions that can be accessed from SQL queries. Usually, this feature goes along with a stored procedures capability. A few databases allows creating user defined aggregate functions.
What are user defined aggregate functions? Functions such as count, sum, min and max are all aggregate functions because they operate on multiple rows and provides a single result. There may be times when the default aggregate functions provided by the database are usually not sufficient. In that case, using the “user defined aggregate functions” feature, it’s possible to write your own aggregate function with whatever semantics you want.
It is important to understand the computational model of user defined aggregate functions. First, aggregates happen with or without a grouping. That is
select count(1) from people where country = 'US'
gives the total count of people within the US. Here, there is only one aggregate computation. However,
select country,count(1) from people group by country
gives the total count of people by each country. Here, the aggregates are computed for each country.
In which type of SQL your user defined aggregate function is used should not matter, except for initialization. Before understanding what I mean by that, first, the aggregate computation does not happen on the entire set of rows in a group at once. I will illustrate this with our own user defined aggregate function called pcount. pcount does the same thing that count does.
So, the signature of pcount when you define it as a function, will not be something like
int pcount(object[] group-of-rows);
but it would be something like
int pcount(object val);
So, the expectation is that you keep receiving the data of each row, one after the other and at the end you return the result. So, the following things have to happen for user defined aggregate functions.
- Initialization
- Iteration
- Final Result
Now, since the function is repeatedly iterated over the list of rows within a group, where do you keep any intermediate state such as the number of rows counted so far? For this reason, it is important that the stored procedures language offers some type of object orientation. With object orientation, it’s possible to create an object (using new or whatever constructor mechanism), then loop through the rows calling a member function of the object and any intermediate state can be captured within the object and finally, call a results function to get the final result. In case of a SQL that returns multiple groups (with a group by syntax), it is possible to use the same object or one object per grouping, the implementation may depend on the database. Depending on that, it may be required to write some initialization code so that the state from the previous group calculations are not compounded.
So, the pcount function of ours, will simply be part of an object as a member function and store the count value in a member variable and keeps incrementing the count each time the function is called (ofcourse, it would ignore the null values and not count them).
At the end, the member variable that is keeping the tally of the count is simply returned.
Note that, by default the aggregate functions don’t get the context of the grouping. In our example, the function is not aware for which country the aggregate is being calculated. In most cases this will not be needed either. That is, your count or sum or any other aggregate function value wouldn’t really depend on for which group you are calculating the aggregate. However, if desired, it should be possible as yet another function parameter. Also, note that it’s possible to pass multiple multiple arguments to your aggregate function. Each parameter would typically be a non-grouping column of the query.