We live in an era of information explosion. As the volume of information grows and artificial intelligence advances rapidly, Internet companies have an increasingly strong demand for personalized, intelligent information display. Typical applications include search lists, recommendation lists and advertisement display.

What many people do not realize is that behind seemingly simple personalized information display lies a large body of data, algorithm and engineering-architecture technology, enough to deter most Internet companies. At the core of that technology is learning to rank. Most existing articles on learning to rank lean either toward algorithms or toward engineering. The systematic introductions on the algorithm side often demand strong mathematical skills and read like academic papers, so the threshold is very high for readers without an algorithm background. Most engineering articles are coarse: they rarely go beyond Google's Two-Phase Scheme and are far from concrete enough to guide an actual implementation.

For teams whose online ranking architecture is built by system development engineers, this article explains the algorithm part with plain examples and analogies, hoping to help you understand and grasp the core concepts of learning to rank. If your team consists of algorithm engineers, you can skip the algorithm part. The architecture part describes the system running on the in-store dining business line, which can serve as a reference prototype for designing an online ranking system. Its ideas on service governance and layered design are useful for ensuring the high performance, high availability and maintainability of an online ranking architecture. Many of the concrete implementation schemes can also be borrowed directly, such as traffic bucketing, traffic classification, the feature model and the cascade model.

In short, this article has two main goals: to help development engineers understand the core concepts of learning-to-rank algorithms, and to provide a fine-grained reference architecture for online implementation.

Algorithm part

Machine learning draws on optimization theory, statistics, numerical computation and other fields. This is a big obstacle for system development engineers who want to learn its core concepts. Behind complex concepts, however, there is often a simple idea. This section uses plain examples and analogies to uncover some of the core concepts of machine learning and learning to rank.

Machine learning

What is machine learning?

Typical machine learning problems are shown in the following figure:

In-depth and shallow sequencing learning: algorithmic system development practice written to programmers

A machine learning model or algorithm (Model/Algorithm) takes observed feature values (Feature) and produces a prediction or target (Prediction/Target). This resembles evaluating a function: for a given X value (Feature), the model acts as the function and the prediction is the Y value. The core problem of machine learning is therefore how to obtain this predictive function.

Wikipedia defines machine learning as follows:

“Machine learning is a subset of artificial intelligence in the field of computer science that often uses statistical techniques to give computers the ability to learn with data, without being explicitly programmed.”

The essence of machine learning is to learn a predictive function from data. Human thinking and judgment is, in essence, also a kind of function evaluation. Learning from data or experience is nothing new for human beings. For example, by observing the length of the shadows cast by the sun, people invented the sundial and gained the ability to tell time and determine the solar terms. The ancient Egyptians devised their calendar by observing the rise and fall of the Nile.

Similarly, people invented the lunar calendar by observing the phases of the moon.

If a machine can learn from data like a human, in a sense, it will have a certain degree of “intelligence”. The two questions that need to be answered now are:

  • What is Intelligence?
  • How to make machines intelligent?

What is intelligence?

Before answering this question, let's first look at why the traditional programming model cannot be called “intelligent”. The traditional programming model is shown in the following figure. It generally goes through the following stages:

  • People observe data, summarize experience and turn it into knowledge.
  • People turn that knowledge into rules.
  • Engineers translate the rules into computer programs.

In this programming model, a computer program can handle any problem that the rules cover. For problems the rules do not cover, humans must think again and formulate new rules. The role of “intelligence” is therefore played mainly by humans, who are responsible for solving new problems, so a traditional program cannot itself be called “intelligent”.

Therefore, one of the core elements of “intelligence” is the ability to generalize, that is, to infer how to handle new cases from known ones.

How to make machines intelligent?

Before discussing this question, let's first review how humans acquire the ability to generalize. The basic process is as follows:

  • Teachers give students some problems and show them how to solve them. Students try to grasp the approach and to make their answers match the teacher's as closely as possible.
  • Students must pass examinations to prove that they can generalize. If they pass, they receive a diploma or a qualification certificate.
  • Once students become practitioners, they face and handle many new problems they have never seen before.

Machine learning experts drew inspiration from this human learning process: a machine acquires the ability to generalize through three stages, namely training, testing and inference. Each is introduced below.

Training phase

The training phase is shown in the following figure:

  • Humans give the machine learning model training samples (X, Y), where X represents the features and Y the target value. This is like a teacher teaching students to solve problems: X is the problem and Y is the standard answer.
  • The machine learning model tries to work out a way to solve the problems.
  • In the training phase, the goal of machine learning is to minimize the loss function. Likewise, students try to make their answers differ as little as possible from the teacher's standard answers, and to reason along the same lines.

Testing phase

The testing phase is shown in the following figure:

  • Humans give the trained model a completely different set of test samples (X, Y). This is like a student being handed an exam paper.
  • The model performs inference. This is like a student answering the exam questions.
  • The total loss on the test samples must be below a preset threshold. This is like a school requiring students to pass the exam.

Inference phase

The inference phase is shown in the following figure:

  • At this stage, the machine learning model only receives the feature values X and no target value. This is like being at work: people solve one problem after another without knowing what the correct results are.
  • In the inference phase, the goal of machine learning is to predict the target value.
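
The three phases can be sketched on a toy problem. The example below is only an illustration and not part of the system described in this article: it assumes a one-feature linear model y = w·x + b trained with plain gradient descent on mean squared error, and the class name `LinearModelSketch`, the learning rate, the epoch count and the pass threshold are all hypothetical choices.

```java
/** Toy one-feature linear model, used only to illustrate training, testing and inference. */
final class LinearModelSketch {
    private double w = 0.0; // slope, learned during training
    private double b = 0.0; // intercept, learned during training

    /** Inference: given a feature value x, return the predicted target. */
    double predict(double x) {
        return w * x + b;
    }

    /** Mean squared error of the current model on the samples (xs[i], ys[i]). */
    double loss(double[] xs, double[] ys) {
        double sum = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double diff = predict(xs[i]) - ys[i];
            sum += diff * diff;
        }
        return sum / xs.length;
    }

    /** Training: gradient descent steps that lower the loss on the training samples. */
    void train(double[] xs, double[] ys, int epochs, double learningRate) {
        int n = xs.length;
        for (int epoch = 0; epoch < epochs; epoch++) {
            double gradW = 0.0;
            double gradB = 0.0;
            for (int i = 0; i < n; i++) {
                double diff = predict(xs[i]) - ys[i];
                gradW += 2 * diff * xs[i] / n;
                gradB += 2 * diff / n;
            }
            w -= learningRate * gradW;
            b -= learningRate * gradB;
        }
    }

    public static void main(String[] args) {
        LinearModelSketch model = new LinearModelSketch();
        // Training phase: samples generated from y = 2x + 1 play the role of worked examples.
        model.train(new double[] {0, 1, 2, 3}, new double[] {1, 3, 5, 7}, 5000, 0.05);
        // Testing phase: unseen samples with known answers; the loss must be below a threshold.
        double testLoss = model.loss(new double[] {4, 5}, new double[] {9, 11});
        System.out.println("passes test: " + (testLoss < 1e-3));
        // Inference phase: only the feature value is known.
        System.out.println("prediction for x = 10: " + model.predict(10));
    }
}
```

The same split applies to ranking models: only the model family, the loss function and the shape of the target change.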

Learning to rank

What is learning to rank?

Wikipedia defines learning to rank as follows:

“Learning to rank is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. “relevant” or “not relevant”) for each item. The ranking model’s purpose is to rank, i.e. produce a permutation of items in new, unseen lists in a way which is “similar” to rankings in the training data in some sense.”

Learning to rank is the application of machine learning to information retrieval systems. Its goal is to build a ranking model that orders lists. Typical applications include search lists, recommendation lists and advertising lists.

The goal of list ranking is to order multiple items, which means that its target value is structured. Compared with single-valued regression and single-valued classification, a structured target requires solving two widely discussed problems:

  • List evaluation metrics
  • List training algorithms

List evaluation metrics

Taking a keyword search that returns a list of articles as an example, let us first analyze which challenges a list evaluation metric must address.

  • The first challenge is to define the degree of relevance between an article and the keyword, which determines the article's position in the list. The more relevant the article, the higher it should rank.
  • The second challenge is how to score the whole list when some articles are not in the right place. For example, suppose that for some keyword the correct order by relevance is documents 1, 2, 3, 4, 5. The challenge is how to compare the quality of the lists “2, 1, 3, 4, 5” and “1, 2, 5, 4, 3”.

Overall, list evaluation metrics have gone through three stages: Precision and Recall, Discounted Cumulative Gain (DCG) and Expected Reciprocal Rank (ERR). Each is explained below.

Precision and Recall (P-R)

This scheme measures the ranking quality of a list with two metrics, Precision and Recall. For a query keyword, every document is labeled as either relevant or irrelevant.

Precision is defined as follows:

$$\mathrm{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$

Recall is defined as follows:

$$\mathrm{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$

For example, suppose 200 articles are actually relevant to a query keyword. A ranking algorithm judges 100 articles to be relevant, and only 80 of those 100 really are. By the definitions above:

  • Precision = 80/100 = 0.8
  • Recall = 80/200 = 0.4
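
The arithmetic above can be written as a minimal sketch. The class and method names are hypothetical, and the counts are passed in directly rather than derived from labeled documents.

```java
/** Precision and recall for one query, computed from binary relevance counts. */
final class PrecisionRecall {
    /** Fraction of the returned documents that are truly relevant. */
    static double precision(int relevantReturned, int totalReturned) {
        return totalReturned == 0 ? 0.0 : (double) relevantReturned / totalReturned;
    }

    /** Fraction of all truly relevant documents that were returned. */
    static double recall(int relevantReturned, int totalRelevant) {
        return totalRelevant == 0 ? 0.0 : (double) relevantReturned / totalRelevant;
    }

    public static void main(String[] args) {
        // 200 relevant articles exist, 100 are returned, 80 of the returned ones are relevant.
        System.out.println(precision(80, 100)); // prints 0.8
        System.out.println(recall(80, 200));    // prints 0.4
    }
}
```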

Discounted Cumulative Gain (DCG)

P-R has two obvious disadvantages:

  • All articles are divided into just two classes, relevant and irrelevant, which is clearly too coarse.
  • Position is not taken into account.

DCG solves both problems. For a keyword, documents can be assigned multiple relevance grades, denoted rel1, rel2, …. A document's contribution to the score of the whole list is discounted logarithmically by its position: the lower the position, the stronger the discount. Under DCG, the score of the first p documents in a list is defined as follows:

$$\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

where $rel_i$ is the relevance grade of the document at position i. This is the commonly used exponential-gain form of DCG.

For a ranking engine, the result lists of different requests often differ in length, so DCG values are not directly comparable across requests when evaluating the overall performance of different engines. Industry therefore commonly uses Normalized DCG (nDCG). It assumes that a perfectly ordered list for the first p positions of a request is available; the score of this perfect list is called the Ideal DCG (IDCG). nDCG is the ratio of DCG to IDCG, so it is a value between 0 and 1.

nDCG is defined as follows:

$$\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}$$

IDCG is defined as follows:

$$\mathrm{IDCG}_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

Here $REL_p$ denotes the list of results ordered by relevance, up to position p, and $|REL_p|$ is its length.
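
As a rough sketch of how these metrics can be computed, the code below assumes the gain 2^rel − 1 with a log2(i + 1) discount, and it computes IDCG from the best ordering of the same list, which presumes the list already contains every relevant document. The class name `Ndcg` and the relevance grades 5 to 1 assigned to documents 1 to 5 are made up for illustration.

```java
import java.util.Arrays;

/** DCG and nDCG for one result list, given the relevance grade at each position. */
final class Ndcg {
    /** DCG over the first p positions; rels[i] is the grade of the document at position i + 1. */
    static double dcg(int[] rels, int p) {
        double sum = 0.0;
        for (int i = 0; i < Math.min(p, rels.length); i++) {
            double gain = Math.pow(2, rels[i]) - 1;
            double discount = Math.log(i + 2) / Math.log(2); // log2(position + 1)
            sum += gain / discount;
        }
        return sum;
    }

    /** nDCG = DCG / IDCG, where IDCG is the DCG of the same grades sorted in descending order. */
    static double ndcg(int[] rels, int p) {
        int[] ideal = rels.clone();
        Arrays.sort(ideal); // ascending order
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) { // reverse to descending order
            int tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double idcg = dcg(ideal, p);
        return idcg == 0.0 ? 0.0 : dcg(rels, p) / idcg;
    }

    public static void main(String[] args) {
        // Assumed grades: document 1 has grade 5, document 2 has grade 4, ..., document 5 has grade 1.
        int[] listA = {4, 5, 3, 2, 1}; // documents shown in the order 2, 1, 3, 4, 5
        int[] listB = {5, 4, 1, 2, 3}; // documents shown in the order 1, 2, 5, 4, 3
        System.out.println("nDCG of 2,1,3,4,5: " + ndcg(listA, 5));
        System.out.println("nDCG of 1,2,5,4,3: " + ndcg(listB, 5));
    }
}
```

Under these assumed grades, “1, 2, 5, 4, 3” scores higher than “2, 1, 3, 4, 5”: swapping the two most relevant documents at the top costs more than reshuffling the three documents at the bottom.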

Expected Reciprocal Rank (ERR)

Compared with DCG, ERR goes one step further: besides discounting by position and allowing multiple relevance grades (denoted R1, R2, R3, …), it also takes into account the relevance of all documents ranked before a given document. Suppose document A is highly relevant and ranked fifth. If the four documents above it are not very relevant, document A contributes a lot to the list. Conversely, if the first four documents are highly relevant and have already fully satisfied the user's search need, the user will not click on the fifth document at all, and document A contributes little to the list.

ERR is defined as follows:

$$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i)$$

where n is the length of the list and $R_i$ is the probability that the document at position i satisfies the user.
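
The cascade behavior of ERR can be sketched as follows. The code assumes the common mapping from an integer relevance grade g to a satisfaction probability R = (2^g − 1) / 2^gmax; that mapping, the class name `Err` and the sample grades are assumptions made for illustration.

```java
/** Expected Reciprocal Rank for one result list. */
final class Err {
    /** grades[i] is the relevance grade at position i + 1; maxGrade is the highest possible grade. */
    static double err(int[] grades, int maxGrade) {
        double total = 0.0;
        double reachProbability = 1.0; // probability that the user is still unsatisfied at this position
        for (int r = 0; r < grades.length; r++) {
            double satisfy = (Math.pow(2, grades[r]) - 1) / Math.pow(2, maxGrade);
            total += reachProbability * satisfy / (r + 1);
            reachProbability *= (1 - satisfy);
        }
        return total;
    }

    public static void main(String[] args) {
        // A highly relevant document in fifth place, below four irrelevant documents.
        System.out.println(err(new int[] {0, 0, 0, 0, 4}, 4));
        // The same document below four highly relevant documents adds almost nothing.
        System.out.println(err(new int[] {4, 4, 4, 4, 4}, 4) - err(new int[] {4, 4, 4, 4}, 4));
    }
}
```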

List Training Algorithms

Engineers who work on list ranking often hear the terms Pointwise, Pairwise and Listwise. What are they, and what principles lie behind them? We explain them one by one.

Continuing with the keyword-search example, the goal of a learning-to-rank algorithm is to order the list of articles for a given keyword. As an analogy, suppose a scholar wants to predict how students rank in each subject. The roles correspond to each other as follows:

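  • The ranking algorithm corresponds to the scholar.
  • A keyword corresponds to a subject.
  • An article (document) corresponds to a student.
  • Document features correspond to student attributes.
  • The order of the articles corresponds to the ranking of the students.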

First, we need to tell the scholar the attributes of each student, just as we need to tell the ranking algorithm the features of each document. For the target values, there are three ways to inform the scholar:

  • For each subject, we tell the scholar each student's score. By comparing the scores, the scholar can of course work out each student's final rank. This training method is called Pointwise. For a Pointwise algorithm, if the prediction target is a real value, it is a regression problem; if the target is a probability, as in CTR prediction, it is a classification problem.
  • For each subject, we tell the scholar the relative order of every pair of students. From these pairwise orders, the scholar can also work out each student's final rank. This training method is called Pairwise. A Pairwise algorithm aims to reduce the number of inverted pairs, so it is a binary classification problem.
  • For each subject, we directly tell the scholar the overall ranking of all students. This training method is called Listwise. The goal of a Listwise algorithm is often to optimize evaluation metrics such as nDCG and ERR directly.

On the surface these three methods may look like wordplay, but behind them lies constant exploration by engineers and scientists. Pointwise is the most intuitive approach: in advertising CTR prediction, for example, it is relatively easy to label each document with a click probability for training. An important branch of Pairwise algorithms is the Lambda family, including LambdaRank and LambdaMART. The core idea of the Lambda family is that the loss function is often hard to compute directly, while its gradient is easy to compute. In other words, it is hard to optimize list-level metrics such as nDCG and ERR directly, but easy to tell whether a given document should move up or down. Listwise algorithms often work best, but labeling all documents for every request is a huge challenge.
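
To make the Pairwise objective concrete, the sketch below counts inverted pairs in a predicted order. It is an illustration of the quantity a Pairwise method tries to reduce, not a training algorithm, and the class name and input encoding are hypothetical.

```java
/** Counts inverted pairs: pairs of documents shown in the wrong relative order. */
final class PairwiseInversions {
    /**
     * trueRanks[i] is the correct rank (1 = most relevant) of the document shown at position i.
     * A pair (i, j) with i < j is inverted when the document shown first should rank lower.
     */
    static int countInversions(int[] trueRanks) {
        int inversions = 0;
        for (int i = 0; i < trueRanks.length; i++) {
            for (int j = i + 1; j < trueRanks.length; j++) {
                if (trueRanks[i] > trueRanks[j]) {
                    inversions++;
                }
            }
        }
        return inversions;
    }

    public static void main(String[] args) {
        System.out.println(countInversions(new int[] {2, 1, 3, 4, 5})); // prints 1
        System.out.println(countInversions(new int[] {1, 2, 5, 4, 3})); // prints 3
    }
}
```

A plain inversion count ignores where in the list an inversion occurs: the single inversion in “2, 1, 3, 4, 5” sits at the very top, while the three in “1, 2, 5, 4, 3” sit at the bottom. This is one reason position-aware metrics such as nDCG and ERR are preferred for evaluation.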

Online Ranking Architecture

Typical information retrieval consists of two stages: the indexing stage and the query stage. The flow of these two stages and their relationship are shown in the following figure:

In the indexing stage, the Indexer reads the documents (Documents) and builds the index.

In the query stage, the index is read to perform recall, and the recalled results are passed to the Top-N Retriever for coarse ranking; the top N documents are then passed to the Reranker for fine ranking. This architecture of recall, coarse ranking and fine ranking was originally proposed by Google and is known as the “Two-Phase Scheme”.

Indexing belongs to the offline stage. This article focuses on the online ranking stage, that is, the query stage.

Three challenges

An online ranking architecture faces three main challenges: features, models and recall.

  • Feature challenges include feature addition, feature operators, feature normalization, feature discretization, feature acquisition and feature service governance.
  • Model challenges include completeness of the basic models, cascade models, composite targets, A/B experiment support and model hot loading.
  • Recall challenges include keyword recall, LBS recall, recommendation recall and coarse-ranking recall.

Each of the three major challenges contains many finer-grained ones, and solving each in isolation is clearly not a good idea. Because online ranking is such a widely used architecture, it is well suited to a domain-model approach. The three principles of domain-driven design (DDD) are domain focus, clear boundaries and continuous integration.

Based on the above analysis, we built three domain models for online ranking: recall governance, feature service governance and the layered online ranking model.

Recall governance

The classic Two-Phase Scheme architecture is shown in the following figure; its query stage comprises recall, coarse ranking and fine ranking. From the perspective of domain architecture design, however, coarse ranking is itself a kind of recall for fine ranking. Unlike traditional text search, O2O companies such as Meituan-Dianping must take geographical location and distance into account, so LBS-based recall is another kind of recall. Recommendation recall differs from search in that it is often based on collaborative filtering, such as User-Based CF and Item-Based CF.

In summary, recall generally falls into four categories:

  • Keyword recall, for which we use Elasticsearch.
  • Distance recall, for which we use a K-D tree.
  • Coarse-ranking recall.
  • Recommendation recall.

Feature Service Governance

Traditionally, feature services are classified into user features, query features and document features, as follows:

This classification reflects a purely business perspective and does not fit DDD's approach to domain architecture design. Because the number of features is huge, we cannot design a separate solution for every feature, but we can group features into categories and design a solution for each category. Each technical solution has to consider performance, availability, storage and other factors in a unified way. From a domain perspective, feature services can be divided into four categories:

  • List features. A single request returns the features of a list of entities, that is, multiple entities with multiple features each. For this category we suggest an in-memory list service that returns all features of all requested entities in one call. This avoids issuing many separate requests, which would multiply the request volume and could trigger a cascading failure.
  • Entity features. A single request returns multiple features of a single entity. Services that support multi-level key-value storage, such as Redis and Tair, are recommended.
  • Context features. These include the static recall score, the city, Query features and so on. They are placed directly in the request's memory.
  • Similarity features. Some features are obtained by computing similarities between an individual and a list, or between two lists. We recommend providing a separate in-memory computing service so that computing such features does not affect online ranking performance. In essence, this design moves the computation elsewhere.

Layered Model of Online Ranking

As shown in the figure below, a typical ranking process consists of six steps: Scene Dispatch, Traffic Distribution, Recall, Feature Retrieval, Prediction and Ranking.

Following the design principles of DDD, we designed the layered online ranking model below, which consists of Scene Dispatch, Model Distribution, Ranking, Feature Pipeline and Prediction Pipeline. We introduce them one by one.

Scene Dispatch

Scene dispatch generally means dispatching by business type. For Meituan-Dianping, this includes dispatching by platform, by list and by usage scenario, as shown in the following figure:

Model Distribution

The goal of model distribution is to route online traffic to different experimental models. Specifically, it serves three functions:

  • Provide online traffic for model iteration, and take care of collecting and validating online results.
  • Support A/B testing: keep the traffic of different models stable, independent and mutually exclusive, so that an observed effect can be attributed to exactly one model.
  • Keep experimental traffic orthogonal to the other layers.

How to define a unit of traffic is a basic question in model distribution. Typical units are visits, users and devices.

How can a unit of traffic be mapped stably to a particular model? Does traffic need to be tiered? These are the key questions for model distribution.

Traffic bucketing principle

The following steps assign traffic to a specific model:

  • Divide all traffic into N buckets.
  • Hash each unit of traffic into one bucket.
  • Give each model a quota, that is, each strategy model occupies a corresponding share of the buckets.
  • The quotas of all strategy models add up to 100%.
  • When a unit of traffic and a model fall into the same bucket, that model owns the traffic.

For example, as shown in the figure above, all traffic is divided into 32 buckets. Models A, B and C have quotas of 37.5%, 25% and 37.5% respectively, so they occupy 12, 8 and 12 buckets.

To keep models and traffic orthogonal, the hash keys of models and of traffic use different prefixes.
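
The bucketing steps can be sketched as follows. This is a minimal illustration under several assumptions that the article does not specify: models claim contiguous bucket ranges in list order, traffic keys are hashed with CRC32, and the `traffic:` prefix and all class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.CRC32;

/** Maps a unit of traffic (for example a user id) to the model that owns its bucket. */
final class TrafficBucketRouter {
    private final String[] bucketOwner; // bucketOwner[b] is the model that owns bucket b

    /** quotas maps model name to its number of buckets, in priority order; they must fill all buckets. */
    TrafficBucketRouter(int bucketCount, LinkedHashMap<String, Integer> quotas) {
        bucketOwner = new String[bucketCount];
        int next = 0;
        for (Map.Entry<String, Integer> quota : quotas.entrySet()) {
            for (int i = 0; i < quota.getValue(); i++) {
                if (next >= bucketCount) {
                    throw new IllegalArgumentException("quotas exceed 100% of the buckets");
                }
                bucketOwner[next++] = quota.getKey();
            }
        }
        if (next != bucketCount) {
            throw new IllegalArgumentException("quotas must add up to 100% of the buckets");
        }
    }

    /** Stable hash of prefix + key into [0, bucketCount). Different layers would use different prefixes. */
    static int bucketOf(String prefix, String key, int bucketCount) {
        CRC32 crc = new CRC32();
        crc.update((prefix + key).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % bucketCount);
    }

    String modelFor(String trafficKey) {
        return bucketOwner[bucketOf("traffic:", trafficKey, bucketOwner.length)];
    }

    int bucketsOwnedBy(String model) {
        int count = 0;
        for (String owner : bucketOwner) {
            if (owner.equals(model)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Integer> quotas = new LinkedHashMap<>();
        quotas.put("A", 12); // 37.5% of 32 buckets
        quotas.put("B", 8);  // 25%
        quotas.put("C", 12); // 37.5%
        TrafficBucketRouter router = new TrafficBucketRouter(32, quotas);
        System.out.println("user-42 is served by model " + router.modelFor("user-42"));
    }
}
```

Because the hash depends only on the key, the same user always lands in the same bucket, which keeps a model's traffic stable for the duration of an experiment.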

Traffic classification

Every team classifies model traffic differently. Here is one recommended classification:

  • Baseline traffic. This traffic serves as the reference against which other traffic is compared, to decide whether a new model beats the baseline; models that fall below the baseline should be taken offline quickly. The improvement of the main traffic over the baseline traffic is also an important measure of the algorithm team's contribution.
  • Experimental traffic. This traffic is used mainly for new experimental models. Its size needs care on two points: it must not be so large that it harms online results, and it must not be so small that the variance becomes too high to judge the effect correctly.
  • Potential traffic. If experimental traffic performs well over a certain period, it can be promoted to potential traffic. Potential traffic mainly addresses the problems caused by the large variance of experimental traffic.
  • Main traffic. There is only one main traffic: the one whose model has the best results in stable operation. If one potential traffic outperforms the other potential traffic and the main traffic over the long run, consider promoting it to main traffic.

During experiments, launching a new experimental model should not disturb the traffic of existing models. Users need an adaptation period for a new model, and results during the adaptation period are generally worse than in the stable period. If launching a new experiment changed the model served to the entire user population, the relative comparison between models would be statistically unaffected, but overall business results could suffer, which is costly.

To solve this problem, our traffic bucketing gives priority to the models at the front of the model list and places experimental models at the end of the list wherever possible. That way, frequently bringing experimental models online and offline does not affect the users of main and potential traffic. Of course, when a model's traffic is promoted, the model serving many users changes. This is not a problem: on the one hand, we want more users to get the better model; on the other hand, it would be unfair to keep some users on experimental traffic for a long time.


Ranking

The Ranking module is the container for the feature module and the prediction module. Its main responsibilities are:

  • Fetch all the features that the list entities need for prediction.
  • Hand the features to the prediction module for prediction.
  • Sort all list entities by their predicted values.
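
These three responsibilities can be sketched with two pluggable functions. The generic signature below is a hypothetical simplification: the feature pipeline is modeled as one function that returns features for the whole list, and the prediction pipeline as a function from features to a score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

/** Container that wires a feature pipeline and a prediction pipeline together. */
final class RankingSketch {
    static <E, F> List<E> rank(List<E> entities,
                               Function<List<E>, Map<E, F>> featurePipeline,
                               ToDoubleFunction<F> predictionPipeline) {
        // 1. Fetch the features of all list entities in one call.
        Map<E, F> features = featurePipeline.apply(entities);
        // 2. Hand the features to the prediction pipeline.
        Map<E, Double> scores = new HashMap<>();
        for (E entity : entities) {
            scores.put(entity, predictionPipeline.applyAsDouble(features.get(entity)));
        }
        // 3. Sort the entities by predicted value, highest first.
        List<E> ranked = new ArrayList<>(entities);
        ranked.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Double> price = new HashMap<>();
        price.put("dealA", 30.0);
        price.put("dealB", 10.0);
        price.put("dealC", 20.0);
        // Toy predictor: cheaper deals get a higher score.
        List<String> ranked = RankingSketch.<String, Double>rank(
                Arrays.asList("dealA", "dealB", "dealC"), ids -> price, p -> -p);
        System.out.println(ranked); // prints [dealB, dealC, dealA]
    }
}
```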

Feature Pipeline

The feature pipeline includes Feature Model, Expression, Atomic Feature, Feature Proxy and Feature Service, as shown in the following figure:

The feature pipeline has two core problems to solve:

  • Obtaining the feature value for a given feature name. This process is quite involved: it has to complete the chain feature name -> feature service -> feature class -> feature value.
  • Feature operators. The features a model uses are often the result of compound operations on several atomic features. Feature discretization and normalization are also handled by feature operators.

The complete feature acquisition process is shown in the following figure. The specific steps are:

  • The Ranking module reads all original feature names from the Feature Model.
  • The Ranking module passes all original feature names to the Feature Proxy.
  • The Feature Proxy calls the corresponding Feature Service based on the identifier in the feature name and returns the original feature values to the Ranking module.
  • The Ranking module turns the original features into composite features through expressions.
  • The Ranking module passes all features to the cascade model for further transformation.

Feature Model

We put all the information related to feature acquisition and feature operators into one class, called FeatureModel. It is defined as follows:

    import java.util.Set;

    // Holds the meta information for feature acquisition and feature operator calculation.
    // IExpression is the expression type described below; it is defined elsewhere.
    public class FeatureModel {
        // The real feature name used in Prediction
        private String featureName;
        // Expression that combines several atomic features into a composite feature
        private IExpression expression;
        // The feature names actually handed to the Feature Proxy to fetch feature values from the feature services
        private Set<String> originalFeatureNames;
        // Whether the feature needs to be transformed by the cascade model
        private boolean isTransformedFeature;
        // Whether this is a one-hot feature
        private boolean isOneHotIdFeature;
        // Different one-hot features often share the same original feature; this field identifies the original feature name
        private String oneHotIdKey;
        // Whether the feature needs normalization
        private boolean isNormalized;
    }


The purpose of an expression is to combine several original features into a new feature, or to apply an operator to a single original feature. We use prefix notation (Polish notation) to represent feature operator computations. For example, the prefix form of (5-6)*7 is * - 5 6 7.

Composite features use the following markers:

  • Composite feature prefix. To distinguish composite features from other feature types, we prefix them with “$”.
  • The elements of the expression are separated by “_”.
  • The operator prefix is “O”.
  • The constant prefix is “C”.
  • The variable prefix is “V”.

For example, the expression v1 + 14.2 + (2*(v2+v3)) is written as $O+_O+_Vv1_C14.2_O*_C2_O+_Vv2_Vv3.
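
A small evaluator shows how such an encoded expression can be read back. This is a sketch, not the article's IExpression implementation: it assumes every operator is binary, supports only +, -, * and /, and assumes variable names do not contain “_”. The class name `PrefixExpression` is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Evaluates composite feature expressions such as $O+_O+_Vv1_C14.2_O*_C2_O+_Vv2_Vv3. */
final class PrefixExpression {
    private final String[] tokens;
    private int pos = 0;

    private PrefixExpression(String encoded) {
        if (!encoded.startsWith("$")) {
            throw new IllegalArgumentException("composite features must start with $");
        }
        tokens = encoded.substring(1).split("_");
    }

    static double evaluate(String encoded, Map<String, Double> variables) {
        PrefixExpression parser = new PrefixExpression(encoded);
        double value = parser.next(variables);
        if (parser.pos != parser.tokens.length) {
            throw new IllegalArgumentException("unexpected trailing tokens");
        }
        return value;
    }

    /** Consumes one sub-expression in prefix order and returns its value. */
    private double next(Map<String, Double> variables) {
        if (pos >= tokens.length) {
            throw new IllegalArgumentException("expression ended too early");
        }
        String token = tokens[pos++];
        String body = token.substring(1);
        switch (token.charAt(0)) {
            case 'C': { // constant
                return Double.parseDouble(body);
            }
            case 'V': { // variable: look up the atomic feature value
                Double value = variables.get(body);
                if (value == null) {
                    throw new IllegalArgumentException("unknown variable: " + body);
                }
                return value;
            }
            case 'O': { // binary operator: its two operands follow
                double left = next(variables);
                double right = next(variables);
                switch (body) {
                    case "+": return left + right;
                    case "-": return left - right;
                    case "*": return left * right;
                    case "/": return left / right;
                    default: throw new IllegalArgumentException("unknown operator: " + body);
                }
            }
            default:
                throw new IllegalArgumentException("unknown token: " + token);
        }
    }

    public static void main(String[] args) {
        Map<String, Double> atomicFeatures = new HashMap<>();
        atomicFeatures.put("v1", 1.0);
        atomicFeatures.put("v2", 2.0);
        atomicFeatures.put("v3", 3.0);
        // v1 + 14.2 + (2 * (v2 + v3)) with the values above
        System.out.println(evaluate("$O+_O+_Vv1_C14.2_O*_C2_O+_Vv2_Vv3", atomicFeatures));
    }
}
```

Prefix notation needs no parentheses or precedence rules, which is why a single recursive method is enough to evaluate it.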

Atomic Feature

Atomic features (or original features) consist of two parts: a feature name and a feature value. Reading an atomic feature involves four entity classes:

  • A POJO stores the original feature values. For example, DealInfo holds all feature values associated with a Deal entity, including Price, maxMealPerson, minMealPerson and so on.
  • ScoringValue stores a feature value read from a POJO. There are three basic types of feature value: Quantity, Ordinal and Categorical.
  • ScoreEnum maps a feature name to its feature value. Each kind of atomic feature corresponds to one ScoreEnum class, and the corresponding ScoreEnum class is constructed from the feature name by reflection. ScoreEnum classes work together with POJOs to read feature values.
  • FeatureScoreEnumContainer stores all the features required by one algorithm model.

A typical example is shown in the following figure:

  • DealInfo is a POJO class.
  • DealInfoScoreEnum is a ScoreEnum base class. We define the concrete ScoreEnum classes DIAveMealPerson and DIPrice, which correspond to the average number of diners and the price respectively.
  • FeatureScoreEnumContainer stores all the features of one model.

Designing a complex system calls for making full use of language features and design patterns. We suggest three optimizations:

  • Defining a separate ScoreEnum class for each atomic feature can lead to an explosion in the number of classes. To avoid this, define the ScoreEnum base class as an enum type and make each concrete feature an enum constant that overrides the method declared by the enum.
  • FeatureScoreEnumContainer uses the Builder design pattern to convert the features required by an algorithm model into a set of ScoreEnum values.
  • ScoreEnum uses the Command pattern to read specific feature values from POJO classes.

Here is a brief introduction to the Command design pattern. Its core idea is that the requester only needs to obtain the result and does not care who provides it or how. The concrete provider accepts the request and is responsible for delivering the result to the requester.

In feature reading, the requester is the model, which only supplies a feature name (FeatureName) and does not care how the corresponding feature value is read. The concrete ScoreEnum classes are the providers: each one reads its feature value from the POJO, converts it into a ScoringValue, and hands it to the model.
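The three optimization points and the Command-style reading described above could be sketched in Java roughly as follows. The class names come from the text, but every field, method, and signature is an assumption made for illustration. In particular, computing DIAveMealPerson as the mean of minMealPerson and maxMealPerson is a guess, and `Enum.valueOf` stands in for the reflection-based lookup that the text mentions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** POJO holding the raw feature values of a deal (field names taken from the text). */
final class DealInfo {
    final double price;
    final int maxMealPerson;
    final int minMealPerson;

    DealInfo(double price, int maxMealPerson, int minMealPerson) {
        this.price = price;
        this.maxMealPerson = maxMealPerson;
        this.minMealPerson = minMealPerson;
    }
}

/** Value handed to the model; the text lists three basic kinds. */
final class ScoringValue {
    enum Kind { QUANTITY, ORDINAL, CATEGORICAL }

    final Kind kind;
    final double value;

    ScoringValue(Kind kind, double value) {
        this.kind = kind;
        this.value = value;
    }
}

/** One enum constant per atomic feature; each constant is a command that reads its own value. */
enum DealInfoScoreEnum {
    DIPrice {
        @Override
        ScoringValue read(DealInfo deal) {
            return new ScoringValue(ScoringValue.Kind.QUANTITY, deal.price);
        }
    },
    DIAveMealPerson {
        @Override
        ScoringValue read(DealInfo deal) {
            // Assumption: the average is the midpoint of the min and max party size.
            double average = (deal.maxMealPerson + deal.minMealPerson) / 2.0;
            return new ScoringValue(ScoringValue.Kind.QUANTITY, average);
        }
    };

    abstract ScoringValue read(DealInfo deal);
}

/** Holds the features one model needs; instances are created through a builder. */
final class FeatureScoreEnumContainer {
    private final List<DealInfoScoreEnum> features;

    private FeatureScoreEnumContainer(List<DealInfoScoreEnum> features) {
        this.features = new ArrayList<>(features);
    }

    static Builder builder() {
        return new Builder();
    }

    /** The model supplies only feature names; each enum constant knows how to read its value. */
    Map<String, ScoringValue> readAll(DealInfo deal) {
        Map<String, ScoringValue> values = new LinkedHashMap<>();
        for (DealInfoScoreEnum feature : features) {
            values.put(feature.name(), feature.read(deal));
        }
        return values;
    }

    static final class Builder {
        private final List<DealInfoScoreEnum> features = new ArrayList<>();

        Builder add(String featureName) {
            // valueOf throws IllegalArgumentException for an unknown feature name.
            features.add(DealInfoScoreEnum.valueOf(featureName));
            return this;
        }

        FeatureScoreEnumContainer build() {
            return new FeatureScoreEnumContainer(features);
        }
    }
}
```

With this layout, adding a feature means adding one enum constant rather than one class file, and a call such as `FeatureScoreEnumContainer.builder().add("DIPrice").build().readAll(deal)` never touches the POJO fields directly.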

Feature Proxy

The feature service proxy (FeatureProxy) is responsible for fetching features from remote services. The process works as follows:

  • Each feature or feature service has a FeatureProxy, which is responsible for sending requests to the feature service and obtaining POJO objects.
  • Every FeatureProxy is registered with the FeatureServiceContainer class.
  • For a given feature request, FeatureServiceContainer dispatches the work to the appropriate FeatureProxy according to the prefix of the FeatureName.
  • FeatureProxy reads a list of POJOs from the feature service based on the specified entity ID list and feature names. Only the values of the requested keys for the requested IDs are populated in the POJOs, which minimizes the cost of network reads.
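The registration and dispatch steps above might look roughly like the following Java sketch. The method signatures are assumptions, not the original API. For brevity the proxies return plain maps instead of POJO lists, and the prefix is resolved by the longest registered prefix that the feature name starts with, which is only one possible convention.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical proxy interface: fetches only the requested features for the given entity IDs. */
interface FeatureProxy {
    Map<Long, Map<String, Object>> fetch(List<Long> entityIds, List<String> featureNames);
}

/** Registry that routes each feature name to the proxy registered for its prefix. */
final class FeatureServiceContainer {
    private final Map<String, FeatureProxy> proxiesByPrefix = new HashMap<>();

    void register(String prefix, FeatureProxy proxy) {
        proxiesByPrefix.put(prefix, proxy);
    }

    Map<Long, Map<String, Object>> fetch(List<Long> entityIds, List<String> featureNames) {
        // Group feature names by prefix so that each proxy is called only once.
        Map<String, List<String>> namesByPrefix = new LinkedHashMap<>();
        for (String name : featureNames) {
            namesByPrefix.computeIfAbsent(prefixOf(name), k -> new ArrayList<>()).add(name);
        }
        // Merge the partial results of all proxies per entity ID.
        Map<Long, Map<String, Object>> merged = new HashMap<>();
        for (Map.Entry<String, List<String>> group : namesByPrefix.entrySet()) {
            FeatureProxy proxy = proxiesByPrefix.get(group.getKey());
            Map<Long, Map<String, Object>> part = proxy.fetch(entityIds, group.getValue());
            for (Map.Entry<Long, Map<String, Object>> row : part.entrySet()) {
                merged.computeIfAbsent(row.getKey(), k -> new HashMap<>()).putAll(row.getValue());
            }
        }
        return merged;
    }

    /** Assumption: the longest registered prefix that the feature name starts with wins. */
    private String prefixOf(String featureName) {
        String best = null;
        for (String prefix : proxiesByPrefix.keySet()) {
            if (featureName.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No FeatureProxy registered for " + featureName);
        }
        return best;
    }
}
```

Passing the feature names through to each proxy is what allows a proxy to request only the needed keys from its backing service.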

Prediction Pipeline

The prediction pipeline consists of Prediction, Cascade Model, Expression, Transform, Scoring, and Atomic Model.


Prediction

Prediction is essentially a wrapper around the model. It converts the features of each entity in the list into the input format the model requires and then asks the model to predict.

Cascade Model

Our cascade model is based on two observations:

  • XGBoost+LR, which follows Facebook's "Practical Lessons from Predicting Clicks on Ads at Facebook" [11], and the recently popular Wide & Deep model [13] both show that transforming some features, especially ID-type features, through a tree model or a neural network (NN) model and feeding the transformed values to the prediction model as features often yields better results.
  • Some training objectives are compound objectives: each sub-objective is predicted by a different model, and the final prediction is computed from the sub-objective predictions by a simple expression.

The following figure shows an example, which we explain from top to bottom:

  • The model takes features such as UserId, CityId, UserFeature, and POI.
  • The UserId and CityId features are transformed by a GBDT model and an NN model, respectively.
  • The transformed features, together with original features such as UserFeature and POI, are passed to the NN and LR models for Scoring.
  • The final prediction score is computed by the expression Prediction Score = α * NNScore + β * LRScore / (1 + γ), where α, β, and γ are preset values.
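The example above can be sketched as a small Java class that wires transform models, scoring models, and the final expression together. This is only an outline under stated assumptions: the class `CascadeModel`, its constructor parameters, and the use of `double[]` as the feature representation are all hypothetical, the individual models are passed in as plain functions, and the final expression is read with ordinary operator precedence exactly as it is written in the text.

```java
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

/** Hypothetical cascade: transform ID features, score with two models, combine by expression. */
final class CascadeModel {
    private final UnaryOperator<double[]> userIdTransform;  // e.g. a GBDT model
    private final UnaryOperator<double[]> cityIdTransform;  // e.g. an NN model
    private final ToDoubleFunction<double[]> nnScorer;
    private final ToDoubleFunction<double[]> lrScorer;
    private final double alpha;
    private final double beta;
    private final double gamma;

    CascadeModel(UnaryOperator<double[]> userIdTransform,
                 UnaryOperator<double[]> cityIdTransform,
                 ToDoubleFunction<double[]> nnScorer,
                 ToDoubleFunction<double[]> lrScorer,
                 double alpha, double beta, double gamma) {
        this.userIdTransform = userIdTransform;
        this.cityIdTransform = cityIdTransform;
        this.nnScorer = nnScorer;
        this.lrScorer = lrScorer;
        this.alpha = alpha;
        this.beta = beta;
        this.gamma = gamma;
    }

    double predict(double[] userId, double[] cityId, double[] originalFeatures) {
        // Transform stage: ID-type features go through their own models first.
        double[] input = concat(
                userIdTransform.apply(userId),
                cityIdTransform.apply(cityId),
                originalFeatures);
        // Scoring stage: both scoring models see transformed and original features.
        double nnScore = nnScorer.applyAsDouble(input);
        double lrScore = lrScorer.applyAsDouble(input);
        // Expression stage: alpha, beta and gamma are preset values.
        return alpha * nnScore + beta * lrScore / (1 + gamma);
    }

    private static double[] concat(double[]... parts) {
        int length = 0;
        for (double[] part : parts) {
            length += part.length;
        }
        double[] result = new double[length];
        int offset = 0;
        for (double[] part : parts) {
            System.arraycopy(part, 0, result, offset, part.length);
            offset += part.length;
        }
        return result;
    }
}
```

Keeping the final combination as a separate expression stage means the weights can be tuned without retraining any of the underlying models.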

Atomic Model

Here, an atomic model refers to an atomic computing topology, such as a linear model, a tree model, or a network model.

Common models such as Logistic Regression and Linear Regression are linear models. GBDT and Random Forest are both tree models. MLP, CNN and RNN are all network models.
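One way to picture an atomic model in code is a single small interface that every model kind implements. The following Java sketch assumes a hypothetical `AtomicModel` interface and uses Logistic Regression as the linear example; neither the interface nor its signature comes from the original system.

```java
/** Hypothetical common interface for atomic models (linear, tree, or network). */
interface AtomicModel {
    double predict(double[] features);
}

/** Linear atomic model: logistic regression, i.e. sigmoid(w . x + b). */
final class LogisticRegression implements AtomicModel {
    private final double[] weights;
    private final double bias;

    LogisticRegression(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    @Override
    public double predict(double[] features) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException("Expected " + weights.length + " features");
        }
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }
}
```

A tree model or a network model would implement the same interface, so a cascade can combine them without knowing which kind it is calling.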

Atomic models are defined here mainly to simplify engineering implementation. A model is treated as an atomic model for two reasons:

  • This model is often used as an independent prediction model.
  • The model has relatively complete implementation code.


Summary

This article summarizes the author's experience in building personalized information display for Meituan-Dianping's in-store dining business, covering both algorithm and architecture. The algorithm part uses everyday examples and analogies so that engineers without an algorithm background can understand the key concepts. The architecture part describes the online ranking architecture of the in-store dining business in detail.

To our knowledge, the ideas of feature governance and recall governance offer a new perspective that is very helpful when designing the architecture of a ranking system. This way of thinking also applies to building models in other fields. Compared with the classic Two-Phase Scheme architecture from Google, the layered online ranking model provides a finer-grained abstract prototype. The prototype addresses a series of classic ranking architecture problems, including traffic splitting, A/B testing, feature acquisition, feature operators, and cascade models. The prototype is also layered, with each layer focused on its own function, so it embodies the three design principles of DDD: focus on the domain, clear boundaries, and continuous integration.

About the author

Liu Ding has worked for Amazon and TripAdvisor. He joined Meituan in 2014, where he was responsible for the architecture of the recommendation system and the intelligent filtering system, for the architecture and launch of the advertising system, and for building the advertising operations platform. He is currently responsible for the algorithm strategy of the in-store dining business and promotes the application of AI across that business.

References:

[1] Gamma E, Helm R, Johnson R, et al. Design Patterns: Elements of Reusable Object-Oriented Software. China Machine Press, 2003.

[2] Wikipedia, Learning to rank.

[3] Wikipedia, Machine learning.

[4] Wikipedia, Precision and recall.

[5] Wikipedia, Discounted cumulative gain.

[6] Wikipedia, Domain-driven design.


[8] Wikipedia, k-d tree.

[9] Baidu Encyclopedia, Solar Calendar.

[10] Baidu Encyclopedia, Lunar Calendar.

[11] He X, Pan J, Jin O, et al. Practical Lessons from Predicting Clicks on Ads at Facebook.

[12] Chapelle O, Metzler D, Zhang Y, Grinspan P. Expected Reciprocal Rank for Graded Relevance.

[13] Cheng H-T, Koc L, et al. Wide & Deep Learning for Recommender Systems.