Unicode is an industry standard that assigns a unique number to every character, independent of platform or application. Outside Unicode, a single language may be represented by several different encoding systems. For instance, English needs several encodings to cover all of its letters, symbols and punctuation.
One of the major problems is:
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption. (http://unicode.org/standard/WhatIsUnicode.html)
This shows that the use of the Unicode system may expose end users, applications, operating systems and programming languages to serious threats. Unicode v5 is a large and complex standard: it provides code points, normalization, case mapping, categorization, escapings, conversion tables, binary properties, etc. Additionally, its conversion tables cover several code pages and charsets such as Shift_JIS, GB2312, Windows-1252, ISO-8859-1 and EBCDIC-037. Furthermore, the range U+0000 to U+007F is reserved for ASCII. Unicode v5.1 uses 21-bit scalar values, giving space for over 1,100,000 code points (U+0000 to U+10FFFF). For instance, the English character 'A' is assigned the code point U+0041.
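The figures above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper name `describe` and the constant `MAX_CODE_POINT` are illustrative and do not come from the Unicode standard itself.

```python
import unicodedata

# Highest code point in the Unicode code space (U+10FFFF).
MAX_CODE_POINT = 0x10FFFF


def describe(ch):
    """Return the U+XXXX label and the Unicode name of a single character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")


print(describe("A"))                # ('U+0041', 'LATIN CAPITAL LETTER A')
print(MAX_CODE_POINT + 1)           # 1114112 code points in total
print(MAX_CODE_POINT.bit_length())  # 21 bits are enough for any code point
print(ord("A") <= 0x7F)             # True: 'A' lies in the ASCII range
```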
Code points can be encoded using different numbers of bits:
UTF-8 (variable width 1-4 bytes)
UTF-16 (Endianness, variable width 2 or 4 bytes)
UTF-32 (Endianness, Fixed width 4 bytes, Fixed mapping)
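To make the width differences concrete, the following sketch encodes a few sample characters with Python's built-in codecs and counts the resulting bytes. The function name `encoded_widths` and the choice of sample characters are assumptions made for illustration. The big-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_widths(ch):
    """Return the number of bytes one character occupies in each encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# U+0041, U+00E9, U+20AC and U+1F600 need 1, 2, 3 and 4 bytes in UTF-8.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_widths(ch))

# Endianness changes the byte order of the same code unit.
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100
```

Only the character outside the Basic Multilingual Plane (U+1F600) takes 4 bytes in UTF-16, while every character takes 4 bytes in UTF-32.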
Given the above-mentioned properties of the Unicode system, the root causes of data encoding and transformation problems become apparent. Some of them are listed below:
-Visual Spoofing
-Best-fit mappings
-Normalization
-Overlong UTF-8
-Character Substitution
-Character Deletion
-Casing
-Buffer Overflows
-Controlling Syntax
-Charset Transformation
-Charset Mismatch
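Three of the items listed above (overlong UTF-8, normalization and casing) can be illustrated with Python's standard library. This is a hedged sketch rather than a complete test suite: the helper `is_valid_utf8` and the sample inputs are assumptions chosen for illustration, and other platforms or decoders may behave differently.

```python
import unicodedata


def is_valid_utf8(data):
    """Return True if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Overlong UTF-8: 0xC0 0xAF is a two-byte overlong form of '/' (U+002F).
# A strict decoder must reject it; a lenient one could let '/' slip past a filter.
print(is_valid_utf8(b"/"))         # True
print(is_valid_utf8(b"\xc0\xaf"))  # False

# Normalization: FULLWIDTH LESS-THAN SIGN (U+FF1C) becomes '<' under NFKC,
# so a filter that runs before normalization may miss it.
print(unicodedata.normalize("NFKC", "\uff1c") == "<")  # True

# Casing: case mapping can change string length or map into ASCII.
print("\u00df".upper())         # SS (one character becomes two)
print("\u212a".lower() == "k")  # True (KELVIN SIGN lowercases to ASCII 'k')
```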
Consider just one problem domain, visual spoofing: among the more than 1,100,000 available code points, many characters look alike within the same script or across multiple scripts. An example is given below:
Such problems are real threats. In a real-world attack on Internationalized Domain Names (IDN), look-alike characters can be used to spoof a genuine website. For example:
gobiz.com "is not" ɡobiz.com
The first letter of the first domain is the Latin character U+0067 (the ordinary small letter g), while the first letter of the second domain is the Latin character U+0261 (the small letter script g). Is there any visual difference? Some of the main attack vectors that leverage visual spoofing are:
-Non-Unicode attacks
-Problematic font-rendering
-Confusable characters
-Manipulating combining marks
-Syntax spoofing
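The domain example above can be reproduced in a few lines. The sketch below is only a naive check, assuming that any non-ASCII character in a domain name deserves a closer look; the function name `non_ascii_chars` is hypothetical, and a real confusable detector would need the Unicode Consortium's confusables data instead.

```python
import unicodedata


def non_ascii_chars(domain):
    """List (position, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(domain)
        if ord(ch) > 0x7F
    ]


genuine = "gobiz.com"       # starts with U+0067
spoofed = "\u0261obiz.com"  # starts with U+0261

print(genuine == spoofed)        # False, although the two may render alike
print(non_ascii_chars(genuine))  # []
print(non_ascii_chars(spoofed))  # [(0, 'U+0261', 'LATIN SMALL LETTER SCRIPT G')]
```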
Tools that can help identify such problems within web applications are:
Watcher
http://websecuritytool.codeplex.com/
-Passive web application auditing
Unibomber
http://www.casabasecurity.com/content/unibomber-tool-specialized-xss-testing
-XSS autopwn testing tool