Stephan Lonntorp
Oct 28, 2016

URL Transliteration for EPiServer CMS 10

A while back we built a website that had a Chinese language version, and we had a few issues with URL segments not looking very nice. It took me a while to figure out that what I was looking for is called transliteration. I implemented a really hacky way of modifying the URL segment that EPiServer produces, so that I could inject my transliterated page name instead of the page name in Chinese.

Now, in CMS 10, the old UrlSegment class has been removed, and instead we have IUrlSegmentGenerator, IUrlSegmentCreator and IUrlSegmentLocator. You can read more about that in the release note CMS-3824.

The default implementations of these interfaces are all internal, so it's still a bit hacky to extend, but I've implemented a transliterating UrlSegmentGenerator and swapped out the implementation, so you don't have to.

OK, so what is transliteration, and why is this important?

Let's say we have a page named "伤寒论 勘误" (I don't know what that means, it's just some Chinese text that I copied). The default UrlSegmentGenerator would produce the URL "-", since everything but alphanumeric characters is stripped out, so the only thing that remains is the whitespace character in the name, which becomes a hyphen.

Using transliteration, the Chinese characters are converted to their Latin-alphabet equivalents, so the same input string "伤寒论 勘误" is converted to "Shang Han Lun Kan Wu", and the transliterating UrlSegmentGenerator then produces the URL "shang-han-lun-kan-wu".
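To make the difference concrete, here is a minimal sketch in Python (not the actual EPiServer or EPi.UrlTransliterator code). The function names and the tiny `PINYIN` lookup table are hypothetical and cover only the example characters; a real implementation would use a full transliteration library. The stripping rule is an approximation of the default behaviour described above.

```python
import re

# Hypothetical lookup table covering only the characters in the example.
# A real implementation would delegate to a transliteration library.
PINYIN = {"伤": "Shang", "寒": "Han", "论": "Lun", "勘": "Kan", "误": "Wu"}


def naive_segment(name: str) -> str:
    """Approximate the default behaviour: whitespace becomes a hyphen,
    and everything except ASCII letters, digits and hyphens is dropped."""
    hyphenated = re.sub(r"\s+", "-", name.strip())
    return re.sub(r"[^A-Za-z0-9-]", "", hyphenated).lower()


def transliterate(name: str) -> str:
    """Replace each known character with its romanized form, as a separate word."""
    pieces = [f" {PINYIN[ch]} " if ch in PINYIN else ch for ch in name]
    # Collapse the extra spaces introduced around each romanized word.
    return " ".join("".join(pieces).split())


def transliterated_segment(name: str) -> str:
    """Transliterate first, then apply the same segment rules."""
    return naive_segment(transliterate(name))


page_name = "伤寒论 勘误"
print(naive_segment(page_name))           # -
print(transliterate(page_name))           # Shang Han Lun Kan Wu
print(transliterated_segment(page_name))  # shang-han-lun-kan-wu
```

The point of the sketch is the ordering: transliteration has to happen before the non-alphanumeric characters are stripped, otherwise there is nothing left to build a segment from.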

Granted, I don't know Chinese, so I can't verify that this is 100% correct. But I do know that "shang-han-lun-kan-wu" is a better representation than "-", since three pages in Chinese, in the same location, would have the URLs "-", "-1" and "-2" using the default generator.
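The "-", "-1", "-2" pattern can be illustrated with a small Python sketch. This is an assumption about how sibling collisions are resolved, inferred only from the example URLs above; `unique_segment` is a hypothetical helper, not an EPiServer API.

```python
def unique_segment(segment: str, taken: set) -> str:
    """Return the segment unchanged if it is free, otherwise append the
    first counter (1, 2, ...) that makes it unique among its siblings."""
    if segment not in taken:
        return segment
    counter = 1
    while f"{segment}{counter}" in taken:
        counter += 1
    return f"{segment}{counter}"


# Three Chinese-named pages in the same location all reduce to "-"
# when the non-alphanumeric characters are stripped.
taken = set()
results = []
for _ in range(3):
    segment = unique_segment("-", taken)
    taken.add(segment)
    results.append(segment)

print(results)  # ['-', '-1', '-2']
```

With transliterated names the base segments already differ from page to page, so the counter is only needed when two pages really do share a name.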

This approach should work for all languages, not just Chinese, but you'll have to test it for yourself. If you find any bugs, please let us know by sending a pull request.

The code is available at https://github.com/creunaab/EPi.UrlTransliterator, and a package with the same name should be available in the EPiServer NuGet feed shortly.


Comments

Oct 28, 2016 12:06 PM

Nice work!

Oct 28, 2016 03:08 PM

Yes, as you have pointed out, we have made it possible to change the default handling for URL segments.

We will officially support this from version 10.1.0 (no need to replace implementations in the container), where we have added encoding support as well (even though most browsers handle unencoded URLs with IRI characters, the recommendation is to encode such characters). It will be announced when we release 10.1.0.

Stephan Lonntorp Oct 28, 2016 03:25 PM

@Johan, care to elaborate? Have you implemented transliteration, or encoding? or both?

Oct 28, 2016 11:02 PM

There is a class, UrlSegmentOptions, registered as a singleton in the IoC container, where you can specify which regexp a URL segment should be validated against (this exists in CMS 10 as well), meaning you can, for example, specify a regexp that allows Unicode characters. So you can replace the default instance with your own instance.

What we have added in 10.1 is encoding, that is, IRI URLs get encoded. In CMS 10 those URLs will not be encoded (most browsers will handle them correctly anyway). In CMS 10.1 we have also opened up simple address to allow IRI characters.

Oct 28, 2016 11:07 PM

So, to clarify: you do not need to replace IUrlSegmentGenerator in the IoC container; you can instead set the regexp on UrlSegmentOptions.

Vincent Nov 1, 2016 12:54 AM

Nice work mate.

I can read Chinese, and I can confirm each Chinese character is transliterated to the appropriate Pinyin.

Stephan Lonntorp Nov 2, 2016 01:57 PM

@code monkey: Thanks!

@Johan: Being able to replace the regexp isn't really useful for transliteration, though. I use another library for transliteration, and AFAIK that has nothing to do with regular expressions. It's nice that you've made it configurable, but being able to change a regular expression really just caters to the use case of using regular expressions to generate URL segments.
