Using Antlr4 in Bloomreach Experience Manager

June 8, 2020

Embedding script in CMS content lets content creators use expressions that are evaluated when the content is served. The evaluated results can be formatted in a highly versatile manner, and embedded expressions can refer to data obtained from any type of source.

In this article, I will show how you can define a custom embedded scripting language and implement the infrastructure for using it in a CMS. The article specifically targets the Bloomreach CMS and web application platform, as does the hands-on example (see the link at the end of the text), but you can translate the general concepts explained here to any CMS.

By using a custom embedded scripting language, you can combine data from all kinds of sources. Moreover, the combinations are not limited by predefined algorithms in server-side code. The technique comes in handy in particular when you have no control over the (external) data source and requirements vary by the week.

Another reason for adopting this technique can be to capture certain key phrases, values and other text fragments in a central location and use placeholders in the content. This ensures that no variation in spelling occurs and allows for instantaneous site-wide changes should they be necessary.

The custom scripting language in question can be tailored to a specific purpose. For instance, you can create a language that is intended to manipulate dates and has built-in functions to find the start or end of a quarter given an arbitrary date, or that can add any number of days, months and years to a given date, with optional formatting. You would use it in this manner:

CMS source:
On ${* startDate | dateLong} the percentage was ${* percent.initial | 2.2f } and a mere six months later, on ${* startDate + 6m | dateShort } it had risen to ${* percent.firstStop | 2.2f }.

Rendered:
On May 13, 2019 the percentage was 1,23 and a mere six months later, on November 13, it had risen to 1,62.
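The date operations behind such expressions map naturally onto java.time. The following is a minimal sketch, not the implementation of the example project: the class name DateExpressions is hypothetical, and the patterns chosen for the dateLong and dateShort directives are assumptions derived from the rendered text in the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class DateExpressions {
    // Assumed meaning of the "dateLong" directive, e.g. "May 13, 2019".
    static final DateTimeFormatter DATE_LONG =
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH);

    // Assumed meaning of the "dateShort" directive, e.g. "November 13".
    static final DateTimeFormatter DATE_SHORT =
            DateTimeFormatter.ofPattern("MMMM d", Locale.ENGLISH);

    // First day of the calendar quarter that contains the given date.
    static LocalDate startOfQuarter(LocalDate date) {
        int firstMonth = ((date.getMonthValue() - 1) / 3) * 3 + 1;
        return LocalDate.of(date.getYear(), firstMonth, 1);
    }

    // Last day of the calendar quarter that contains the given date.
    static LocalDate endOfQuarter(LocalDate date) {
        return startOfQuarter(date).plusMonths(3).minusDays(1);
    }
}
```

An expression such as startDate + 6m would then come down to startDate.plusMonths(6), and the directive selects one of the formatters.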

Or you create a language that takes numbers from an external source and performs some arithmetic while neatly formatting the end result in a versatile way, like this example:

CMS source:
The growth in Q3 was ${* ((final - start) / (final + start)) * 100 | 4.2f } %

Rendered:
The growth in Q3 was 23,71 %

In these examples, each expression in the CMS source is followed by a formatting directive, separated from the expression by a pipe character. It is assumed that some external source provides the values used in the expressions. As you can see, the expressions are embedded in otherwise ordinary content. In general, content creators won’t be writing whole reams of code with your custom script language, but only short expressions.
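To illustrate how an expression and its formatting directive could be handled separately, here is a small sketch. It rests on two assumptions that the examples suggest but do not state: a directive such as 2.2f is passed on to java.util.Formatter with a leading percent sign, and the decimal comma in the rendered output comes from a Dutch locale. The class and method names are hypothetical.

```java
import java.util.Locale;

final class FormatDirective {
    // Splits "expression | directive" at the last pipe character.
    // Returns { expression, directive }; the directive is empty when no pipe is present.
    static String[] split(String script) {
        int pipe = script.lastIndexOf('|');
        if (pipe < 0) {
            return new String[] { script.trim(), "" };
        }
        return new String[] {
            script.substring(0, pipe).trim(),
            script.substring(pipe + 1).trim()
        };
    }

    // Assumption: a numeric directive like "2.2f" is a java.util.Formatter
    // conversion without its leading '%'.
    static String format(double value, String directive, Locale locale) {
        if (directive.isEmpty()) {
            return String.valueOf(value);
        }
        return String.format(locale, "%" + directive, value);
    }
}
```

With Locale.forLanguageTag("nl-NL"), formatting the value 1.23 with the directive 2.2f yields 1,23, which matches the rendered example.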

What do we need to make this work?
Before delving into the specifics of a custom embedded language, let me give an overview of what we need to make this work. We have these elements:

  • A rich text field in a document that contains HTML plus embedded expressions.
  • A tool for scanning HTML content and evaluating the expressions.
  • Data that is referred to in the expressions.
  • An interception that ties all of the above together when the document is rendered.

To start with the interception: you must find a method or callback in your application server where you can process the content before it is sent back to the web browser. In the Bloomreach platform, a ContentRewriter is best suited for this. The examples in the Bloomreach documentation show how you can change HTML before it is sent to the browser.

How to obtain external data from a source is beyond the scope of this article. However, what you need to do once you have obtained the data is create a system for resolving identifiers used in the expressions to values present in your data. This can be as easy as gathering all external data in a Map<String,Object>.
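As a sketch of such a resolving system, the class below looks up dotted identifiers like percent.initial in a Map<String,Object> whose values may themselves be nested maps. The class name IdentifierResolver and the convention that a dot descends into a nested map are assumptions made for this illustration.

```java
import java.util.Map;
import java.util.Optional;

final class IdentifierResolver {
    private final Map<String, Object> data;

    IdentifierResolver(Map<String, Object> data) {
        this.data = data;
    }

    // Resolves "a.b.c" by descending one nested map per dot-separated part.
    // Returns an empty Optional when any part of the path is missing.
    Optional<Object> resolve(String identifier) {
        Object current = data;
        for (String part : identifier.split("\\.")) {
            if (!(current instanceof Map)) {
                return Optional.empty();
            }
            current = ((Map<?, ?>) current).get(part);
            if (current == null) {
                return Optional.empty();
            }
        }
        return Optional.of(current);
    }
}
```

Returning an Optional leaves it to the caller to decide what an unknown identifier should render as, for instance an empty string or a visible warning for the content creator.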

The tool for scanning the rich text field is described below, and of course the script expressions themselves can be edited as normal content text in the CMS.

Markers
In order to start evaluating script expressions in the content, the first task at hand is to tell ordinary HTML apart from embedded expressions. For this, we will use markers, which are special combinations of normal characters.

Everything between an opening marker and its matching closing marker is evaluated as embedded script that follows the syntax rules of our language. When choosing the combination of characters that together form a marker, some considerations are that it must be easy to remember for your content creators, but also strange enough that it is not likely to ever crop up as a legitimate character sequence in ordinary content. In the examples I settled on ${* for the opening marker and } for closing.

Once we know how to look for embedded script, we can cook up a regular expression to run over the input HTML and isolate the script fragments. But as it turns out, there is a better way.
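For comparison, the regular expression approach could look roughly like the sketch below. It assumes the ${* and } markers from the examples; the class name MarkerScanner and the evaluator callback are hypothetical. Note that this pattern ends a match at the first closing brace, so an expression that itself contains } would be cut short, which is one of the weaknesses of scanning this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MarkerScanner {
    // Opening marker "${*", then as little as possible up to the first "}".
    private static final Pattern EMBEDDED =
            Pattern.compile("\\$\\{\\*(.*?)\\}", Pattern.DOTALL);

    // Returns the trimmed text of every embedded script fragment, in order.
    static List<String> findExpressions(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = EMBEDDED.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return found;
    }

    // Replaces every fragment, markers included, with the evaluator's result.
    // quoteReplacement prevents '$' or '\' in a result from being interpreted.
    static String rewrite(String html, Function<String, String> evaluator) {
        return EMBEDDED.matcher(html).replaceAll(
                match -> Matcher.quoteReplacement(evaluator.apply(match.group(1).trim())));
    }
}
```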

Another Tool for Language Recognition
Antlr4 is a tool that generates Java code for parsing a language, using a description of the syntax as its input. By writing down the syntax rules of our custom scripting language in a way that antlr4 can understand, we can make it cough up the Java source code that forms the heart of our scripting engine.

The nice thing about antlr4 is that it supports so-called island grammars, where the islands of meaningful code (our embedded expressions) are surrounded by a sea of irrelevant characters (the ordinary HTML). This is how we avoid having to write regular expressions to separate expressions from HTML.

Of course, we will need to fill in the parts where we want the expressions to actually do something, but that is pretty straightforward once you get the hang of it. For example, the generated code will determine that there is a plus sign between two values and call a method to handle this. We must override that handler, extract the values, add them together and return the result. The generated code only knows that a "plus sign between two values" event occurred; it does not know what such an event means.
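The override pattern can be shown in miniature. The classes below are hand-written stand-ins, not code generated by antlr4: AddContext plays the role of a parse tree node for "value + value", and ExpressionBaseVisitor plays the role of the generated base class whose handler attaches no meaning to the event. All names are hypothetical, and the operands are simplified to plain number literals, whereas a real visitor would evaluate each operand by visiting its subtree.

```java
// Stand-in for the parse tree node a generated parser would hand to the handler.
final class AddContext {
    final String leftText;
    final String rightText;

    AddContext(String leftText, String rightText) {
        this.leftText = leftText;
        this.rightText = rightText;
    }
}

// Stand-in for the generated base class: it is told about the event,
// but its default handler does not compute anything.
class ExpressionBaseVisitor<T> {
    T visitAdd(AddContext ctx) {
        return null;
    }
}

// Our override supplies the meaning: extract both values and add them.
final class EvaluatingVisitor extends ExpressionBaseVisitor<Double> {
    @Override
    Double visitAdd(AddContext ctx) {
        double left = Double.parseDouble(ctx.leftText);
        double right = Double.parseDouble(ctx.rightText);
        return left + right;
    }
}
```

Calling new EvaluatingVisitor().visitAdd(new AddContext("1.5", "2")) returns 3.5, while the base class would return null for the same node.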

Another advantage of using antlr4 is that you get intimate with the theoretical and practical aspects of defining the syntax rules for a computer language. Or perhaps this would count as a disadvantage, I’m not sure.

Hands on
Should you be interested in playing around with a Bloomreach web application that deals with embedded expressions in the content, I have created a small project with a very simple expression language that demonstrates an implementation of the above. It does basic arithmetic on numbers and uses a JSON data file as its external data source. You can find it, together with more technical notes, at:

https://gitlab.indivirtual.com/mario.pinkster/vermilion

Mario Pinkster

Java Developer at Sentia Consultancy