Not many people, including a few reasonably smart ones, understand how exactly materialized views work and what makes certain types of materialized views fast refreshable. Not understanding how they don’t work is ok, but expecting them to work for every situation and trivializing the problem at hand to simply throwing a materialized view into the design is what prompts me to write about materialized views. So, let me try to explain them without using technical jargon.
First a few definitions.
Materialized Views are database objects that are like views in the sense that their definition is based on a SQL and also like tables in the sense that the result of the SQL is actually materialized (stored) and hence the name materialized views.
The next question is, if the SQL results are actually materialized, how are these results kept in sync when the underlying data changes? This is where, the idea of refreshing a materialized view comes into picture. Obviously the choice is either to keep updating the materialized data either instantaneously as the underlying data in the SQL is changed or to do it periodically. So, those that can be refreshed instantaneously are refreshed on commit. Otherwise, they have to be refreshed in deferred mode by explicitly calling some APIs.
Note though, only certain type of materialized views can be refreshed fast. We will get to the details soon. But suffice it to say, if a materialized view can’t be refreshed fast, then it has to be refreshed full and that means, such materialized views can’t be refreshed on commit.
Ok, now let’s get to the details. Any SQL query that can be maintained incrementally can be fast refreshable. Otherwise, it has to be refreshed full. Now I will give a real example to explain this.
Say a university professor is meeting each freshman student on the first day and trying to gather the following statistics. The min, max, average SAT scores along with the total number of students. Now, say that the prof had only a small sheet of paper just enough to write a handful of numbers while the number of students joining is in thousands. So, instead of writing down the scores, the prof choose to just maintain these metrics.
So, the first student came and the score is 2000. So, the min, max and avg are 2000 and the count is 1.
Second student score is 2100. The min remains 2000. max is 2100. avg is 2050. count is 2.
Third student score is 2200. The min remains 2000. max is 2200. avg is …
oops, how to find the average? It’s simple. Take the previous average of 2050 and the count of 2, that gives the total as 4100. Now add the 2200 and divide by 3. The result is 2100.
So, to find the average based on the 3rd student, it’s not necessary to go back to the first two students and get their scores. Just the current average can be used (along with the current count) to update the average based on the next available score.
Any piece of information that can be computed incrementally, based on the existing and computed information and the new data, is fast refreshable. Since in real world applications, it’s not always new data, but also updates and even deletes, it’s important to consider all the cases.
Going back to the example, the professor can keep on updating the metrics as new students are coming in and giving their SAT scores.
Now say, after some 257 students, one of the earlier students comes back says “Professor, sorry I made a mistake. My score is not 2000, but it’s 2010”. What then? How does the professor update his metrics? If the minimum by that time happens to be 1900, then there is no change required to the minimum. But what happens if the minimum happened to be 2000? Can the professor update it to 2010 since the student with the min score changed from 2000 to 2010? The answer is no. This is because, among the 257 students, if any of them had a score between 2000 and 2009, there is no way to immediately conclude that 2010 is the minimum score. What this means is, the professor is not in a position to immediately figure out the minimum score based on just the current minimum score and an update to one of the existing scores which happened to be the minimum. Thing is, with metrics like min and max, some times they can be derived based on the existing info (if the value being changed happened to be neither the current min or the current max) and sometimes it’s not possible. However, the count remains the same and the average can still be maintained accurately. All it needs is to figure out the total sum by multiplying the current average with the current count and then adding 10 (2010-2000) and then again dividing using the current count. Ofcourse, as there is a chance for losing precision when dividing and multiplying it’s important to keep the SUM and COUNT metrics rather than the AVG directly.
So, the above example should give an idea of what we mean by being able to manage a metric incrementally. When information is created, updated or deleted, if it’s possible to update the metric based on the current value of the metric and the delta changes, then it’s possible to fast refresh such metrics. This is one of the reasons why SQLs that contain transient values such as current time (sysdate) can’t be fast refreshed because as time passes, infact without any change to the underlying data, the metric is constantly changing. Outer joins also have similar issues.
Another of my favorite examples where it’s not possible to refresh fast is the count distinct. This is not possible because, going back to the earlier example of SAT scores, if the professor is keeping track of the count of the distinct scores, when a new score comes, how does he know if it’s a value that’s already been factored in or not unless the list of all the distinct values are also maintained? See, it’s not that hard to understand the fast refreshable capability of materialized views, if you assume that there is limited resources to keep track of the metric and if the metric can be tracked in various scenarios like create, update and delete without referring back to the past data, then it’s possible.