Monthly Archives: October 2007

CODE IS POETRY

I just noticed at the bottom of WordPress.org that they have “Code Is Poetry”. Well, that’s the very reason I have had this account for more than a year! Good to know it’s not just me who thinks good code is like a good poem!

Leave a comment

Filed under Coding, Programming, Wordpress

My Blog Is Now A PageRank 5 Blog

Just noticed that my PageRank went up from 4 to 5. This has been achieved after 100+ posts, 1.5 years, and 16,500+ visits. From this month on, I have been averaging 100 visits/day.

I wonder if it’s possible to move to a PageRank of 6 and how long it would take if I were ever to reach it! But then, who knows? With Google’s recent algorithm changes, some sites got their PageRank reduced. So, instead of climbing to 6, I may well go down in the future! But for now, this is a moment to enjoy!

Leave a comment

Filed under pagerank

LWP::UserAgent And Fetching UTF-8 Encoded XML (RSS/Atom Feeds)

LWP::UserAgent is supposed to be smart enough to parse the HTTP headers (by default) and figure out the encoding. However, some of the XML responses out there don’t pass the right header information. Instead, they rely on the encoding attribute of the XML declaration. For example:


<?xml version="1.0" encoding="utf-8"?>

It could be utf-8 or some other encoding. So, one way to deal with this is the following code:


use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
my $response = $ua->get($feed_url); # $feed_url holds the feed's address
my $content  = $response->content;
my $encoding = 'utf-8'; # assume this as the default
if ($content =~ /encoding="([^"]+)"/) {
    $encoding = $1;
}
$content = $response->decoded_content(charset => $encoding);

That’s pretty much it.

Leave a comment

Filed under Tech - Tips

Inline/Embedded XML in HTML

Ajax is cool, but there is no need to use it in scenarios where it isn’t needed. For example, if some static XML data is being rendered in the UI using JavaScript, there is no need to first load the page and then use the XMLHttpRequest object (or Prototype.js) to fetch the XML before using it.

Instead, it’s possible to simply embed the XML into the HTML document itself. What’s the benefit? It avoids an extra round-trip to the server. However, doing this required a bit of research into the typical differences between Firefox and IE. So, here is what I researched and did to make it work in both.

IE supports an xml tag that can be used to embed XML into an HTML page. In Firefox, the entire content can be embedded in the same xml tag, but an extra piece of code is needed to make it work there.

The syntax is:


<xml id='xmldata' style='display:none;'>
any piece of xml
</xml>

The style='display:none;' is needed in Firefox; otherwise, any content within this tag is going to be displayed. Finally, the following JavaScript is required:

var xml = document.getElementById('xmldata');
if (xml.documentElement == null)
    /* Required for Firefox; make sure there is no gap between the xml
       tag and the root node, or firstChild will be a text node */
    xml.documentElement = xml.firstChild;
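
Once that runs, the island can be consumed the same way in both browsers. Here is a small sketch following the pattern above; the items/item node names are purely hypothetical, just to have something to read:

/* Assumed island, for illustration only:
   <xml id="xmldata" style="display:none;"><items><item>first</item><item>second</item></items></xml> */
var root  = xml.documentElement;               // the items node
var nodes = root.getElementsByTagName('item');
for (var i = 0; i < nodes.length; i++) {
    alert(nodes[i].firstChild.nodeValue);      // "first", then "second"
}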

That’s it!

3 Comments

Filed under AJAX, javascript

Text Link Ads & Google

My SEO knowledge is limited, yet I noticed an important factor with Google: it seems to give weight to the most recent content. In fact, recency seems to carry a bit more weight than PageRank. For example, when a fresh page optimized for a given keyword is created, a site whose home page has a PageRank of 3 easily beats sites with PageRanks as high as 6 and 7. But as time passes, the page keeps getting demoted, eventually returning to the place its PageRank alone would put it.

So, essentially, Google’s search results seem to be based on

Freshness(page) + PR(page)

For the fresh pages where I observed this correlation, I also had a PageRank 4 blog discussing them, so it’s not clear whether the freshness is that of the page itself or of the link to the page. If it’s the latter, then it’s important to realize there is a lag, since Google takes time to find a new page, index it, and gradually promote it from its sandbox to the main search servers. That means if you place a text link ad on a website, its effect could be felt 10 to 30 days later.

Also, it’s unclear what the adverse impact on a page’s score is when an inlink is removed. That is, I don’t know whether Google takes note of a drop in inlinks and propagates it to the main index immediately, or waits for a while. If it drops immediately, then it’s better to extend the text link ad service beyond one month. Essentially, in that case, set aside a budget for 2 months and expect good results for only 30 to 50 days. So even though the text link ads turn out twice as expensive, if they are helping the site, they are worth it. I imagine a lot of affiliates can benefit from this behavior of Google’s.

BTW, I recently read that Google no longer gives credit to domain names containing ‘-’ that are created specifically for SEO. I wonder whether using an ‘_’ in the URL, as in the cordless power tools example, gets any extra credit or not.

Leave a comment

Filed under Google, text link ads

Fast Refreshable Materialized Views

Not many people, including a few reasonably smart ones, understand how exactly materialized views work and what makes certain types of them fast refreshable. Not understanding how they work is OK; what prompts me to write about them is people expecting them to work in every situation, trivializing the problem at hand to simply throwing a materialized view into the design. So, let me try to explain them without technical jargon.

First a few definitions.

Materialized views are database objects that are like views in the sense that they are defined by a SQL query, and like tables in the sense that the result of that query is actually materialized (stored), hence the name.

The next question is: if the query results are actually materialized, how are they kept in sync when the underlying data changes? This is where the idea of refreshing a materialized view comes into the picture. Obviously, the choice is either to update the materialized data instantaneously as the underlying data changes, or to do it periodically. Those that can be refreshed instantaneously are refreshed on commit; the rest have to be refreshed in deferred mode by explicitly calling refresh APIs.

Note, though, that only certain types of materialized views can be fast refreshed. We will get to the details soon, but suffice it to say that if a materialized view can’t be fast refreshed, it has to be fully refreshed, and that means such materialized views can’t be refreshed on commit.

OK, now let’s get to the details. Any SQL query whose result can be maintained incrementally is fast refreshable; otherwise, it has to be fully refreshed. A real-world example will explain this.

Say a university professor is meeting each freshman student on the first day and trying to gather the following statistics: the min, max, and average SAT scores, along with the total number of students. Now, say the professor has only a small sheet of paper, just enough to write a handful of numbers, while the incoming students number in the thousands. So, instead of writing down the scores, the professor chooses to maintain just these metrics.

The first student comes in and the score is 2000. So the min, max, and avg are all 2000, and the count is 1.
The second student’s score is 2100. The min remains 2000, the max is 2100, the avg is 2050, the count is 2.
The third student’s score is 2200. The min remains 2000, the max is 2200, the avg is…

Oops, how do we find the average? It’s simple. Take the previous average of 2050 times the count of 2, which gives a total of 4100. Now add 2200 and divide by 3. The result is 2100.

So, to find the average based on the 3rd student, it’s not necessary to go back to the first two students and get their scores. Just the current average can be used (along with the current count) to update the average based on the next available score.
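
To make the bookkeeping concrete, here is a small sketch of the insert case in JavaScript (just mirroring the professor’s sheet of paper, not any database’s actual refresh machinery):

// The professor's "small sheet of paper": a handful of running metrics.
var stats = { count: 0, sum: 0, min: Infinity, max: -Infinity };

// A new score needs only the current metrics and the new value --
// there is no need to revisit the earlier students.
function addScore(score) {
    stats.count += 1;
    stats.sum   += score;  // keep SUM and derive AVG from it
    if (score < stats.min) stats.min = score;
    if (score > stats.max) stats.max = score;
}

addScore(2000); addScore(2100); addScore(2200);
// stats.sum / stats.count is now 2100; min is 2000, max is 2200, count is 3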

Any piece of information that can be computed incrementally, from the already-computed information plus the new data, is fast refreshable. Since real-world applications see not just new data but also updates and even deletes, it’s important to consider all the cases.

Going back to the example, the professor can keep updating the metrics as new students come in and give their SAT scores.

Now say that after some 257 students, one of the earlier students comes back and says, “Professor, sorry, I made a mistake. My score is not 2000, it’s 2010.” What then? How does the professor update his metrics? If the minimum by that time happens to be 1900, then no change to the minimum is required. But what if the minimum happened to be 2000? Can the professor update it to 2010, since the student with the min score changed from 2000 to 2010? The answer is no: among the 257 students, if any of them had a score between 2000 and 2009, there is no way to immediately conclude that 2010 is the minimum. What this means is that the professor cannot figure out the new minimum from just the current minimum and an update to the one existing score that happened to be the minimum.

The thing is, with metrics like min and max, sometimes they can be derived from the existing info (if the value being changed happens to be neither the current min nor the current max) and sometimes they can’t. The count, however, remains the same, and the average can still be maintained accurately: figure out the total by multiplying the current average by the current count, add 10 (2010 − 2000), and divide by the current count again. Of course, since there is a chance of losing precision when multiplying and dividing, it’s important to keep the SUM and COUNT metrics rather than the AVG directly.
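
Continuing the sketch above, the update case shows exactly where incremental maintenance breaks down for min and max (allScores stands in for the underlying table, which by assumption is not on the professor’s sheet):

// Call after allScores has already been updated in place.
function updateScore(allScores, oldScore, newScore) {
    stats.sum += newScore - oldScore;  // SUM stays incremental; COUNT is unchanged

    if (oldScore === stats.min || oldScore === stats.max) {
        // The old value was the min or max: the new min/max could be any of
        // the other scores, so we must rescan the base data -- in effect,
        // a full refresh for these two metrics.
        stats.min = Math.min.apply(null, allScores);
        stats.max = Math.max.apply(null, allScores);
    } else {
        // Neither min nor max was disturbed: stay incremental.
        if (newScore < stats.min) stats.min = newScore;
        if (newScore > stats.max) stats.max = newScore;
    }
}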

So, the above example should give an idea of what it means to maintain a metric incrementally. When information is created, updated, or deleted, if the metric can be updated from its current value and the delta changes alone, then it can be fast refreshed. This is one of the reasons why queries that contain transient values such as the current time (sysdate) can’t be fast refreshed: as time passes, the metric keeps changing even without any change to the underlying data. Outer joins have similar issues.

Another of my favorite examples of something that can’t be fast refreshed is count distinct. Going back to the SAT scores: if the professor is keeping track of the count of distinct scores, when a new score comes in, how does he know whether it’s a value that has already been factored in, unless the list of all the distinct values is also maintained? See, the fast-refresh capability of materialized views isn’t that hard to understand: assume there are only limited resources for tracking the metric; if the metric can be maintained through creates, updates, and deletes without referring back to the past data, then fast refresh is possible.
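
A final sketch makes the point: maintaining the distinct count forces you to keep the full value-to-occurrences map, which is exactly the “past data” the small sheet of paper cannot hold:

var occurrences   = {};  // score -> number of students with that score
var distinctCount = 0;

function addDistinctScore(score) {
    if (!occurrences[score]) {   // first time this value is seen
        occurrences[score] = 0;
        distinctCount++;
    }
    occurrences[score]++;
}

function removeDistinctScore(score) {
    occurrences[score]--;
    if (occurrences[score] === 0) {  // that was the last occurrence
        delete occurrences[score];
        distinctCount--;
    }
}

Without the occurrences map, there is no way to tell, on an insert, whether the value is new, or, on a delete, whether it was the last of its kind.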

6 Comments

Filed under materialized views

Rescuing The Trapped Miners

I am happy that the recent trapped-miners incident in South Africa ended with no casualties. However, one thing that came to my mind is: when there are 3,200 people to be rescued, how is the order in which to rescue them decided? What if some people want to get out first while others are sicker? Remember the Titanic?

So, one possible solution is to periodically issue each miner a numbered card based on overall health. An older person or a person with a medical condition would get a smaller number (higher priority) than the others. This way, instead of people a mile below ground fighting over who should be sent up first, the order would already be decided. Of course, it’s quite possible that the specifics of an incident could invalidate the pre-determined prioritization. However, as the prioritization scheme becomes more accurate with real data, the exceptions should become fewer, and it’s hopefully a lot easier to resolve those few exceptions than to have everyone fighting over who comes out first.

There seem to be so many mining incidents throughout the world these days. Hopefully, more technology will be used to make mining safer, and protocols will be developed for what should be done, and how, when such an incident happens. The two topmost areas needing more research are fail-safe communication and auxiliary oxygen supply.

Leave a comment

Filed under Uncategorized