A high level architecture of our Consumer 360 platform

As described in this post, Anchormen is adding user centric functionality to its analytics SaaS solution in order to support personalisation and user engagement. The primary goal of the platform is to help our customers improve the experience they provide their users on their websites and apps. One example we are currently working on are recommendations based on behaviour tracked on all available websites and apps a customer has. This does not only improve recommendation quality but also makes it possible to recommend items from shop A within shop B (cross platform/shop recommendations). Besides challenges on cross browser / channel user identification and data integration challenges (likely topics for other blog posts) it also poses challenges from a scalability perspective. This post goes into the high level design of our Consumer 360 platform.

High-level design Consumer360 platform
High-level design Consumer360 platform


Receiving events

It all starts by tracking events generated by users within websites/ apps. Events include for example page views but can also be clicks, scrolls, ‘swipes’ etc. Our event model is very flexible and can represent almost anything. These events are send to our messaging system which buffers and optionally persists them. Currently this is done by Redis but we’re migrating to Kafka. Primary purpose of this component is to separate input from the rest of the system and smooth out bursts of events.


Streaming analysis

The events are read by a Spark streaming application which performs a couple of data cleaning and enriching steps. These include things like user identification/resolution, GeoIP lookup and date-time operations. The streaming application performs these steps within a few seconds for micro batches of events which means that data is available for other parts of the system. The event data is then stored within our customer specific indexes in Elasticsearch.

Events are also used to update results calculated by the batch layer. A simple example is removal of items from recommendation list which were just viewed by the user.


Storage

Elasticsearch is used as the primary database holding event data, data imports (CRM, ERP, etc.) as well as analysis results. Elasticsearch is fast, highly suitable for time series data and integrates well with Spark. In addition, we provide dashboards for our customers so they can inspect their data. Dashboards can be created to support website analytics as well as user profiling. HDFS is used to store raw data (landing zone) and results from batch processing like prediction models or user clusters. These models are typically used by the web service layer, for example to handle requests for recommendations for a specific user.


Batch processing

Spark batch applications periodically analyse the data and create results used by other parts of the system like insights, prediction models etc. Currently there is only one job which builds prediction models and calculates. Results are stored in a cache which webservices can use get results from.

We envision other jobs within the batch layer like robust user identification (improve the identification done by the streaming job), user segmentation and evaluate A/B tests being run by customers.


Web service

The web service layer enables websites and apps to access services on the platform like requesting recommendations for a certain user. The web service validates requests, fetches the required data and returns an appropriate result. In case of a recommendation this includes requesting the prediction model to provide predictions for the user, remove any items the user has recently viewed and return a properly mixed / sized recommendation set. It is up to our customers to show requested information on their websites and apps.


To conclude

The primary components of the platform are all open source and scale horizontally which makes it relatively easy to handle increasing loads (adding new customers, increase of website/app usage etc.). At the same time, we try to minimize the number of components in order to keep the platform manageable. All components are in place and we are adding analysis components to the platform.

Back to Big Data Services