Category Archives: regular expressions

User Input Sanitizing Before Using In Regular Expression

Say you want to take the input from a user and filter any set of text (titles/descriptions) and show the reduced list to the user. Assuming the users are not sophisticated enough to write regular expressions, the requirement is simply to ensure that whatever the user typed in is available some where within the text.

A simple way to do this is just look for the exact substring and see if the index is greater than or equal to 0 (something like str.indexOf(input) >= 0).

Say, you want to return a match that is case-insensitive. Then the substring approach will not work. So, you need to jump to using regular expressions. In JavaScript, this would become,

var re = new RegExp(input,"i"); // the flag i indicates case insensitive
if(re.test(str)) { /* do-some-thing-here */ }

So far, so good. Now, what happens if the user types in some special characters that are typically used in regular expressions as some special control characters? For example,

‘(‘ and ‘)’ are used for grouping and variable capturing
‘[‘ and ‘]’ are used for character set
‘{‘ and ‘}’ are used to indicate the cardinality
‘\’ is escape character
‘*’ is used to indicate 0 or more
‘+’ is used to indicate 1 or more
‘?’ is used to indicate 0 or 1

In this scenario, if someone types in a string with any of the above characters, then the above javascript will fail. So, in order to fix this, the user input string should be first fixed to make it a valid regular expression. This can be done using

var rere = new RegExp("[({[^$*+?\\\]})]","g"); /* you need 2 '\' s to mean 1 '\' and another '\' to treat ']' as special character instead of the characters ending bracket */
var reinput = input.replace(rere,"\\$1"); /* replace the special characters with a \ before them */
var re = new RegExp(reinput,"i");
if(re.test(str)) { ... }

Now if you are generating the above JavaScript code in perl, it gets a bit more complicated. Why? Because, ‘\’s are themselves have escape semantics. Also, $ symbol has special meaning in perl. So, for each ‘\’ above, it would be doubled and also, $ would be escaped as well.

Yes, it gets confusing, but it’s doable. You can see this in action at Flat Panel Plasma/LCD HDTVs. It contains a search box on the top of the image cloud and lets the user to input a string such as Panasonic, Samsung, Sony etc or even product numbers with special characters like TH-42PZ700U to find the corresponding product on the image.

Leave a comment

Filed under javascript, perl, regular expressions, Tech - Tips