Internationalized Domain Names (Punycode Domains) - Latent Threats?


Popular Domain Sample punycodes with lookalike resultant IDNs

IDN? What's that?

Internationalized Domain Names (IDNs) can be thought of as extensions of the traditional Latin-script, ASCII-encoded domains that we are accustomed to. IDNs allow Unicode characters and thus a much wider array of characters from local scripts, including those that use diacritics and ligatures, which cannot be represented directly in ASCII.

The DNS "hostname rule" requires domain names to be in ASCII before they are stored in the DNS. An IDN such as apṗlê.com is therefore converted to an ASCII string using Punycode transcription, resulting in:

Examples of IDN representations in ASCII
IDN ASCII Encoding (Punycode Transcription)
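
As a rough illustration of the transcription, the sketch below leans on Python's built-in `idna` codec. Two caveats: that codec follows the older IDNA2003 rules, and the helper names `to_ascii` and `to_unicode` are made up for this example.

```python
def to_ascii(domain: str) -> str:
    """Encode each label of a domain into its ASCII (Punycode) form."""
    # Python's built-in "idna" codec applies IDNA2003 rules label by label.
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Decode an ASCII (Punycode) domain back into its Unicode form."""
    return domain.encode("ascii").decode("idna")


idn = "apṗlê.com"
ascii_form = to_ascii(idn)
print(ascii_form)              # the non-ASCII label comes out prefixed with "xn--"
print(to_unicode(ascii_form))  # decodes back to the Unicode form
```

Labels that are already plain ASCII, such as "com", pass through unchanged; only labels containing non-ASCII characters gain the "xn--" prefix.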

History, Why & How They Were Introduced

In the olden days of yore, looong before IDNs, the "LDH" (aka Letter-Digit-Hyphen) hostname convention reigned over the DNS and only permitted ... err ... letters, digits and hyphens within domains.

To support the world's major languages in their native scripts, IDNs were put forward. Originally proposed in 1996, they were formally introduced circa 2003 (christened "IDNA2003") after version 1.0 of the implementation guidelines was published. The specification was then revised in 2008 ("IDNA2008"), approved in 2010, and is still the currently recommended implementation. However, IDNA2008 disallowed around 8,000 characters that were valid under IDNA2003, including all uppercase characters, full-width and half-width variants, symbols, and punctuation. Such teething issues, backward compatibility included, could have driven IDN owners up the wall, but the seemingly meagre worldwide adoption of IDNs at the time allowed a conflict-free transition.

To date, the scripts allowed stand at 23 by count, representing 37 languages (a script is a set of characters used to write one or multiple languages). The scripts include: Arabic, Armenian, Bengali, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Lao, Latin, Malayalam, Oriya, Sinhala, Tamil, Telugu, and Thai.

The represented languages include Arabic, Armenian, Assamese, Bangla, Belarusian, Bengali, Bulgarian, Chinese, Georgian, Greek, Gujarati, Hebrew, Hindi, Japanese, Kannada, Kashmiri, Kazakh, Korean, Lao, Macedonian, Malay, Malayalam, Mongolian, Oriya, Persian, Punjabi, Russian, Sanskrit, Santali, Serbian, Sindhi, Sinhalese, Tamil, Telugu, Thai, Ukrainian, and Urdu.

Sample IDN TLDs vs Sample Traditional Latin Script TLDs

Fully localized IDNs, where the TLD part also uses characters from these additional scripts, are an even rarer sight in the wild. For instance: ベリサイン.コム. They are often supported only by registrars serving the script's native region or by mega-registrars whose clientele spans the globe. On first encounter they are a slightly perturbing spectacle, one that slowly turns into warm familiarity for a keen pair of eyes.

Permitting Unicode characters in IDNs allowed some clever flexibility and expressiveness, which brought about emoji domains. Owing to the usual sparse registrar implementation, at the time of publishing this there are only eleven TLDs that support emoji domain registration. They are: .uz, .cf, .ga, .gq, .ml, .tk, .st, .fm, .to, .kz and .ws.

Emoji domains are offered by a handful of registrars — a search engine query away.

Sample Emoji Domain Punycode

Marketshare & Registrars Providing Them

Courtesy: ICANN

Vastly dwarfed in popularity by traditional Latin script TLDs, most IDN TLDs still have registration counts well under five figures, while a handful of others are yet to start accepting domain registrations. According to ICANN, there were just 1.67M IDN domains registered under gTLDs at the end of 2020, which was around 0.8% of total gTLD registrations. Unsurprisingly, the .com TLD tops the list with over 150 million domain registrations. Here's how the top 20 TLDs stack up:

Top 20 TLDs by domain count (stats sourced at the beginning of 2021)
TLD      Approximate Domain Registrations
.com     152.2M
.tk      25.9M
.de      15.2M
.net     13.2M
.cn      11.7M
.uk      10.4M
.org     10.3M
.nl      5.4M
.ru      4.9M
.ga      4.4M
.cf      4.1M
.info    4.0M
.br      3.9M
.ml      3.7M
.fr      3.6M
.eu      3.5M
.gq      3.3M
.it      3.0M
.au      3.0M
.xyz     2.9M

The patchy prevalence of IDNs is therefore inevitable, and it is further exacerbated by domain registrars' reluctance to adopt IDNs whose scripts are non-native to their target market. Thus, if you wish to register a Greek IDN, your chances of success taper off the further your registrar's market is from Greece.

Latent Threats?

Compared with the commonly (ab)used traditional Latin script domain squatting and phishing attempts, IDNs offer a much larger attack surface with their accented characters. The permutations are unnerving. Attackers can mix and match from a vexingly vast set of characters to drive both targeted and mass-scale trawling attacks. The cynic in you whispers "This is why we can't have nice things".
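
To get a feel for how quickly those permutations pile up, here is a minimal sketch that enumerates lookalike spellings of a label, the sort of list a brand owner might review for pre-emptive registration or monitoring. The `LOOKALIKES` table is hypothetical and deliberately tiny; real confusable sets are far larger, so real counts would be too.

```python
from itertools import product

# Hypothetical, deliberately tiny homoglyph table for illustration only.
LOOKALIKES = {
    "a": ["a", "à", "á", "â", "ä", "\u0430"],  # U+0430 is Cyrillic "а"
    "e": ["e", "è", "é", "ê", "ë", "\u0435"],  # U+0435 is Cyrillic "е"
    "l": ["l", "ĺ", "ļ", "ł"],
    "p": ["p", "ṗ", "\u0440"],                 # U+0440 is Cyrillic "р"
}


def lookalike_count(label: str) -> int:
    """Number of variants of `label` under the table, excluding the original."""
    total = 1
    for ch in label:
        total *= len(LOOKALIKES.get(ch, [ch]))
    return total - 1


def lookalikes(label: str):
    """Lazily yield every variant of `label` except the original itself."""
    pools = [LOOKALIKES.get(ch, [ch]) for ch in label]
    for combo in product(*pools):
        candidate = "".join(combo)
        if candidate != label:
            yield candidate


print(lookalike_count("ape"))    # 107
print(lookalike_count("apple"))  # 1295
```

Even with a handful of substitutes per letter, two extra characters push the count from 107 to 1,295, since each position multiplies the total.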

Some more reasons why IDNs should be factored into your threat models:
  • Some font families mangle diacritics around characters, while others have such poor support for ligatures that they omit them outright. So for those using non-standard/custom system and application fonts, opening links turns into a digital Russian roulette of sorts.
  • The novelty and minimal presence of IDNs in mainstream branding enables IDN-based phishing to fly under the radar. Additionally, most fraud analysis tools only do static URL analysis.
  • Emoji domains, both colour-wise and structure-wise, are not uniformly displayed across operating systems and applications' rendering engines. These inconsistencies bring about some element of brand erosion.
  • Permutation-wise, it goes without saying that the longer the domain name, the more squattable IDNs are possible. This applies to non-IDN Latin script domains as well.
  • Punycode links are long and often hard to read, and therefore fatiguing, so they slip through our sub-second visual defence attempts.
  • Plenty of protocols and applications can only handle Latin script characters. This may not be a direct security threat, but it is still a compatibility regression.
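
Since Punycode links are hard to judge by eye, one option is to decode them before reviewing. The sketch below does that and flags labels that mix scripts. It is a crude heuristic under stated assumptions: the script is guessed from the first word of each character's Unicode name, so legitimately mixed labels (Japanese, for example) and emoji will trip it, and the function names `scripts_in` and `review` are invented for this example.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Rough script guess per character, from the first word of its Unicode name."""
    found = set()
    for ch in label:
        if ch.isascii():
            if ch.isalpha():
                found.add("LATIN")
            continue  # digits and hyphens are treated as script-neutral
        name = unicodedata.name(ch, "UNKNOWN")
        found.add(name.split()[0])
    return found


def review(hostname: str) -> dict:
    """Decode a possibly Punycode hostname and summarise what it contains."""
    # Encoding first lets the function accept either Unicode or "xn--" input.
    unicode_form = hostname.encode("idna").decode("idna")
    labels = unicode_form.split(".")
    return {
        "unicode": unicode_form,
        "is_idn": any(not label.isascii() for label in labels),
        "mixed_script": any(len(scripts_in(label)) > 1 for label in labels),
    }


print(review("example.com"))
print(review("xn--bcher-kva.example"))
```

A label such as "аpple" with a Cyrillic first letter would come back with `mixed_script` set, whereas an accented but wholly Latin label would not, so this catches only one class of lookalike.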

IDN squatting vs traditional Latin script squatting

In summation:

  • Once again, increased character variations -> larger attack surface.
  • Non-uniform adoption by domain registrars globally -> information asymmetry, favouring attackers.
  • Inadequate threat modelling by domain owners -> free rein for attackers.
  • Deviations between IDNA2003 and IDNA2008 implementations -> more edge cases to consider (read: legacy software).
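
One well-known example of such a deviation, not drawn from the figures above, is the German "ß": IDNA2003 maps it to "ss", whereas IDNA2008 treats it as a valid character in its own right. The sketch below contrasts the two outcomes using only the standard library. Note that `raw_punycode` is a hypothetical helper that merely approximates the IDNA2008 result by encoding the label as typed; it performs no IDNA2008 validation.

```python
def idna2003(label: str) -> str:
    """ASCII form under IDNA2003, via Python's built-in "idna" codec."""
    return label.encode("idna").decode("ascii")


def raw_punycode(label: str) -> str:
    """Punycode of the label exactly as typed, with no IDNA2003 mapping applied."""
    return "xn--" + label.encode("punycode").decode("ascii")


label = "straße"
print(idna2003(label))      # strasse
print(raw_punycode(label))  # xn--strae-oqa
```

The same typed name can therefore resolve to two different registrations depending on which rules the software follows, which is exactly the kind of edge case legacy clients introduce.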


  • IDNs should generally be the secondary/tertiary domain rather than the primary one with which a business/brand identifies, unless:
    • The IDN is irreducibly and unmistakably simple enough to be rendered consistently across the target software platforms.
    • The target demographic or purpose requires the use of IDNs for succinctness in branding or conveying a message ... et cetera.
    • It is for brand protection. Pre-emptive purchases of squattable domains, closest in resemblance to yours, deny attackers much of their firepower.
  • Correspondingly, a winning approach has both proactive and reactive elements. If you are interested in a tool that does the monitoring for you, our Infringement Monitor catches both static and behavioural tricks employed in squatting, phishing campaigns and plagiarism attempts. Infringement severity is graded as a percentage, and an alert is sent to you when a set threshold is met. It suits any entity with an online presence that is looking to protect its efforts, be it a SaaS, a blog or a brand. It is also a worthwhile investment for anyone looking for simple domain monitoring tools. Try it out through our Android app.