|Popular Domain||Sample punycodes with lookalike resultant IDNs|
Internationalized Domain Names (IDNs) can be thought of as extensions of the traditional Latin-script, ASCII-encoded domains, such as example.com, that we are accustomed to. IDNs allow Unicode characters and thus a much wider array of characters from local scripts that use diacritics and ligatures, which cannot be directly rendered in ASCII.
The DNS "hostname rule" requires domains to be in ASCII before they are stored in the DNS. Therefore, an IDN such as apṗlê.com has to be represented as an ASCII string using punycode transcription, resulting in: xn--apl-hma7778a.com
|IDN||ASCII Encoding (Punycode Transcription)|
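The transcription described above can be tried out with Python's standard library. This is a minimal sketch, not a reference implementation: the helper names `to_ascii` and `to_unicode` are made up for illustration, and Python's built-in `idna` codec follows the older IDNA2003 rules (RFC 3490), so its output may differ from a stricter IDNA2008 implementation for some inputs.

```python
def to_ascii(domain: str) -> str:
    """Convert an IDN to its ASCII ("xn--") form, label by label.

    Assumption: uses Python's built-in "idna" codec, which implements
    IDNA2003 (RFC 3490) and Punycode (RFC 3492), not IDNA2008.
    """
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert an ASCII ("xn--") domain back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    print(to_ascii("bücher.example"))           # xn--bcher-kva.example
    print(to_unicode("xn--bcher-kva.example"))  # bücher.example
    print(to_ascii("example.com"))              # plain ASCII passes through unchanged
```

Only labels containing non-ASCII characters get the `xn--` prefix; plain ASCII labels such as `example` are left untouched.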
In the olden days of yore, looong before IDNs, the "LDH" (aka Letter-Digit-Hyphen) hostname convention reigned over the DNS and only permitted ... err ... letters, digits and hyphens within domains.
To support the world's major languages in their native writing (scripts), IDNs were put forward. Originally proposed in 1996, IDNs were formally introduced circa 2003 (christened "IDNA2003") after version 1.0 of the implementation guidelines was published. The latter was then revised in 2008 ("IDNA2008"), approved in 2010, and is still the current recommended implementation. However, IDNA2008 disallowed around 8,000 characters that used to be valid per IDNA2003, including all uppercase characters, full/half-width variants, symbols, and punctuation. Such teething issues, backward compatibility included, could have driven IDN owners up the wall, but the seemingly meagre worldwide adoption of IDNs at the time allowed a conflict-free transition.
To date, the scripts allowed stand at 23 by count, representing 37 languages (a script is a set of characters used to write one or multiple languages). The scripts include: Arabic, Armenian, Bengali, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Oriya, Sinhala, Tamil, Telugu, and Thai.
The represented languages include Arabic, Armenian, Assamese, Bangla, Belarusian, Bengali, Bulgarian, Chinese, Georgian, Greek, Gujarati, Hebrew, Hindi, Japanese, Kannada, Kashmiri, Kazakh, Korean, Lao, Macedonian, Malay, Malayalam, Mongolian, Oriya, Persian, Punjabi, Russian, Sanskrit, Santali, Serbian, Sindhi, Sinhalese, Tamil, Telugu, Thai, Ukrainian, and Urdu.
|Sample IDN TLDs||Sample Traditional Latin Script TLDs|
Fully localized IDNs, where the TLD part includes these additional scripts' characters as well, are an odder rarity in the wild. For instance: ベリサイン.コム. They are often supported only by registrars serving the script's native region or by mega-registrars whose clientele spans the globe. They are still a slightly perturbing spectacle that slowly turns into warm familiarity to a keen pair of eyes during their first encounter.
Permitting Unicode characters in IDNs allowed some clever flexibility and expressiveness that brought emoji domains. Owing to the usual sparse registrar implementation, at the time of publishing this, there are about ten TLDs that support emoji domain registration. They are: .uz, .cf, .ga, .gq, .ml, .tk, .st, .fm, .to, .kz and .ws.
Emoji domains are offered by a handful of registrars, all a search engine query away.
|Sample Emoji Domain||Punycode|
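An emoji label maps to punycode the same way any other non-ASCII label does. The sketch below uses Python's raw `punycode` codec and adds the `xn--` prefix by hand. The helper names and the domain `😀.example` are hypothetical, and the code deliberately skips the normalization and validity checks that a real IDNA implementation performs, so treat it as an illustration of the encoding step only.

```python
ACE_PREFIX = "xn--"  # marks a label as punycode-encoded


def label_to_ascii(label: str) -> str:
    """Punycode-encode one label; pure-ASCII labels are returned unchanged.

    Assumption: no IDNA normalization or validation is applied here.
    """
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def label_to_unicode(label: str) -> str:
    """Decode one "xn--" label back to Unicode; other labels pass through."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def domain_to_ascii(domain: str) -> str:
    return ".".join(label_to_ascii(part) for part in domain.split("."))


def domain_to_unicode(domain: str) -> str:
    return ".".join(label_to_unicode(part) for part in domain.split("."))


if __name__ == "__main__":
    encoded = domain_to_ascii("😀.example")  # hypothetical emoji domain
    print(encoded)                     # an "xn--" label followed by ".example"
    print(domain_to_unicode(encoded))  # round-trips to the original string
```

Whether a registry actually accepts a given emoji label is a policy question that this encoding step says nothing about.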
Vastly dwarfed in popularity by traditional Latin-script TLDs, most IDN TLDs still register well under five-figure counts, while a handful of others are yet to start accepting domain registrations. According to ICANN, there were just 1.67M IDN domains registered under gTLDs at the end of 2020, which was around 0.8% of total gTLD registrations. Unsurprisingly, the .com TLD tops the list at over 150 million domain registrations. Here's how the rest of the top 20 TLDs stack up:
|TLD||Approximate Domain Registrations|
The sporadic prevalence of IDNs is therefore inevitable, and it is further exacerbated by domain registrars' reluctance to adopt IDNs whose scripts are non-native to their target market. Thus, if you wish to register a Greek IDN, your chances of success taper off the further you stray from Greece as the locus.
Compared with the commonly (ab)used traditional Latin-script domain squatting and phishing attempts, IDNs offer a much larger attack surface thanks to accented and lookalike characters. The permutations are unnerving. Attackers can mix and match from a vexingly vast set of characters to drive both targeted and mass-scale trawling attacks. The cynic in you whispers "This is why we can't have nice things" and other cynical musings. Some more reasons why IDNs should be factored into your threat models: