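Recommender System Recall Layer

Online serving pipeline of a recommender system: proxy -> candidate set -> recall layer -> coarse ranking -> fine ranking

Recall Methods

Single-strategy recall

Candidates are recalled with a single rule or a single simple model. This approach is simple and intuitive.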

// See the SimilarMovieFlow class for details
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public static List<Movie> candidateGenerator(Movie movie){
    // Key candidates by movie ID so that duplicates are dropped
    HashMap<Integer, Movie> candidateMap = new HashMap<>();
    // A movie can carry several genre tags
    for (String genre : movie.getGenres()){
        // Recall strategy: up to 100 movies of this genre, by rating
        List<Movie> oneCandidates = DataManager.getInstance().getMoviesByGenre(genre, 100, "rating");
        for (Movie candidate : oneCandidates){
            candidateMap.put(candidate.getMovieId(), candidate);
        }
    }
    // Remove the query movie itself (remove is a no-op if the key is absent)
    candidateMap.remove(movie.getMovieId());
    // Final candidate set
    return new ArrayList<>(candidateMap.values());
}
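
The call to getMoviesByGenre above implies some kind of genre-to-movie lookup. The following is only a minimal sketch of how such a lookup could work in memory; GenreIndexSketch, SimpleMovie, and topByGenre are hypothetical names for illustration, not the actual DataManager implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GenreIndexSketch {
    // Hypothetical minimal movie type; the real Movie class has more fields.
    static class SimpleMovie {
        final int movieId;
        final double rating;
        final List<String> genres;

        SimpleMovie(int movieId, double rating, List<String> genres) {
            this.movieId = movieId;
            this.rating = rating;
            this.genres = genres;
        }
    }

    // genre -> all movies tagged with that genre
    private final Map<String, List<SimpleMovie>> genreIndex = new HashMap<>();

    // Registers the movie under each of its genres.
    void add(SimpleMovie movie) {
        for (String genre : movie.genres) {
            genreIndex.computeIfAbsent(genre, g -> new ArrayList<>()).add(movie);
        }
    }

    // Returns at most `size` movies of the genre, highest rating first.
    List<SimpleMovie> topByGenre(String genre, int size) {
        return genreIndex.getOrDefault(genre, List.of()).stream()
                .sorted(Comparator.comparingDouble((SimpleMovie m) -> m.rating).reversed())
                .limit(size)
                .collect(Collectors.toList());
    }
}
```

Sorting on every call keeps the sketch short; a real index would more likely keep each genre list pre-sorted.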

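Multi-channel recall strategy
The result of a trade-off between computation speed and recall rate.

Three recall channels (genre, high rating, and latest releases) are combined into one multi-channel recall method.
(Further optimizations: multi-threaded parallel recall; building tag and feature indexes; caching frequently used recall sets.)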

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

public static List<Movie> multipleRetrievalCandidates(List<Movie> userHistory){
    HashSet<String> genres = new HashSet<>();
    // Collect the genres the user likes from the movies they have watched
    for (Movie movie : userHistory){
        genres.addAll(movie.getGenres());
    }
    // Channel 1: recall candidates from the genres the user likes
    HashMap<Integer, Movie> candidateMap = new HashMap<>();
    for (String genre : genres){
        List<Movie> oneCandidates = DataManager.getInstance().getMoviesByGenre(genre, 20, "rating");
        for (Movie candidate : oneCandidates){
            candidateMap.put(candidate.getMovieId(), candidate);
        }
    }
    // Channel 2: recall the 100 highest-rated movies overall
    List<Movie> highRatingCandidates = DataManager.getInstance().getMovies(100, "rating");
    for (Movie candidate : highRatingCandidates){
        candidateMap.put(candidate.getMovieId(), candidate);
    }
    // Channel 3: recall the 100 most recently released movies
    List<Movie> latestCandidates = DataManager.getInstance().getMovies(100, "releaseYear");
    for (Movie candidate : latestCandidates){
        candidateMap.put(candidate.getMovieId(), candidate);
    }
    // Remove movies the user has already watched
    for (Movie movie : userHistory){
        candidateMap.remove(movie.getMovieId());
    }
    // Build the final candidate set
    return new ArrayList<>(candidateMap.values());
}
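
The three channels above run one after another. As a rough illustration of the multi-threaded option mentioned earlier, the sketch below runs each channel on a thread pool and merges the results. It assumes each channel can be expressed as a supplier of movie IDs; ParallelRecallSketch and recall are hypothetical names, not part of the original project.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

class ParallelRecallSketch {
    // Each channel is a hypothetical supplier that returns candidate movie IDs.
    static List<Integer> recall(List<Supplier<List<Integer>>> channels, ExecutorService pool) {
        // Submit every channel to the pool so that they run concurrently.
        List<CompletableFuture<List<Integer>>> futures = new ArrayList<>();
        for (Supplier<List<Integer>> channel : channels) {
            futures.add(CompletableFuture.supplyAsync(channel, pool));
        }
        // Join in submission order; the set drops IDs returned by several channels.
        Set<Integer> merged = new LinkedHashSet<>();
        for (CompletableFuture<List<Integer>> future : futures) {
            merged.addAll(future.join());
        }
        return new ArrayList<>(merged);
    }
}
```

Joining in submission order keeps the merged list deterministic even though the channels may finish in any order. The caller owns the pool and is responsible for shutting it down.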

Elasticsearch Database

[Non-relational database] A search-engine database built on the Lucene core. It uses JSON to carry data and exposes a RESTful API.

Storage structure

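Node: a single server in the cluster
Cluster: one or more nodes
Index: analogous to a database
Type: analogous to a table (mapping types were deprecated in Elasticsearch 6.x and removed in later major versions)
Document: the basic unit of information that can be indexed; analogous to a row of data
Field: analogous to a column of data
Shard: when an index is created, the number of shards it is split into can be specified
Replica: each shard can have multiple replicas (backup copies)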
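
To make the shard and replica settings concrete, here is a minimal sketch that builds (but does not send) an index-creation request using the standard number_of_shards and number_of_replicas index settings. The class name, the base URL, and the index name passed in are assumptions for illustration only.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class CreateIndexSketch {
    // Builds the JSON body that sets the shard and replica counts for a new index.
    static String settingsBody(int shards, int replicas) {
        return "{\"settings\":{\"number_of_shards\":" + shards
                + ",\"number_of_replicas\":" + replicas + "}}";
    }

    // Builds a PUT <baseUrl>/<index> request; nothing is sent over the network here.
    static HttpRequest createIndexRequest(String baseUrl, String index, int shards, int replicas) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(shards, replicas)))
                .build();
    }
}
```

The request could then be sent with java.net.http.HttpClient against a running cluster, for example one assumed to listen on http://localhost:9200.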

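Technical pain points

The inverted index is the core algorithm.
ES supports rollover indices (named by date, by sequence number, or by date plus sequence number) and supports cleanup policies applied at write time.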
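
The inverted index mentioned above maps each term to the documents that contain it, so a term lookup does not need to scan every document. The sketch below is a toy version that assumes whitespace tokenization and lowercasing; Lucene's real analysis and storage are far more elaborate, and InvertedIndexSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted IDs of the documents containing that term (the posting list)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Splits the text on whitespace and records docId under every token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the IDs of all documents containing the term, or an empty set.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }
}
```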

Query DSL

GET /ad/phone/_search
{
    "query":{
        "match":{
            "name":"phone"
        }
    }
}

Match all documents

GET /ad/phone/_search
{
  "query": {
    "match_all": {}
  }
}

Compound query

GET /ad/phone/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "phone" } }
      ],
      "must_not": [
        { "match": { "color": "red" } }
      ],
      "should": [
        { "match": { "price": 5000 } }
      ],
      "filter": {
        "term": { "label": "phone" }
      }
    }
  }
}
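
The semantics of the bool query above can be approximated in memory: a document is a hit when every must and filter clause matches and no must_not clause matches, while should clauses only influence ranking when must or filter clauses are present. The sketch below is a simplification under stated assumptions: clauses are exact field comparisons rather than analyzed full-text matches, and the count of matched should clauses stands in for Elasticsearch's real relevance score. All names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class BoolQuerySketch {
    final List<Predicate<Map<String, Object>>> must;
    final List<Predicate<Map<String, Object>>> mustNot;
    final List<Predicate<Map<String, Object>>> should;
    final List<Predicate<Map<String, Object>>> filter;

    BoolQuerySketch(List<Predicate<Map<String, Object>>> must,
                    List<Predicate<Map<String, Object>>> mustNot,
                    List<Predicate<Map<String, Object>>> should,
                    List<Predicate<Map<String, Object>>> filter) {
        this.must = must;
        this.mustNot = mustNot;
        this.should = should;
        this.filter = filter;
    }

    // Simplified clause: true when the document's field equals the given value.
    static Predicate<Map<String, Object>> fieldEquals(String field, Object value) {
        return doc -> value.equals(doc.get(field));
    }

    // A hit needs all must and filter clauses to match and no must_not clause to match.
    boolean matches(Map<String, Object> doc) {
        return must.stream().allMatch(p -> p.test(doc))
                && filter.stream().allMatch(p -> p.test(doc))
                && mustNot.stream().noneMatch(p -> p.test(doc));
    }

    // Stand-in for scoring: how many should clauses the document satisfies.
    long shouldHits(Map<String, Object> doc) {
        return should.stream().filter(p -> p.test(doc)).count();
    }
}
```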

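Using ES as a recall engine

E-commerce entity recognition, formally named entity recognition (NER), identifies the semantic entities with specific meaning in a query. Query analysis then uses the recognition result to rewrite the query according to the weight of each entity type, so that the recalled documents match the intent of the query.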
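
As a rough illustration of the rewriting step described above, the sketch below tags each query token with an entity type and attaches a weight per type. The dictionary lookup is only a stand-in for a trained NER model, and the entity types, weights, and class name are all hypothetical values chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class NerRewriteSketch {
    // Hypothetical dictionary standing in for a trained NER model.
    static final Map<String, String> ENTITY_TYPES = Map.of(
            "apple", "BRAND",
            "phone", "CATEGORY",
            "red", "COLOR");

    // Hypothetical weight for each entity type.
    static final Map<String, Double> TYPE_WEIGHTS = Map.of(
            "BRAND", 3.0,
            "CATEGORY", 2.0,
            "COLOR", 0.5);

    // Weight used for tokens that are not recognized as any entity.
    static final double DEFAULT_WEIGHT = 1.0;

    // Returns the query tokens in order, each mapped to the weight of its entity type.
    static Map<String, Double> rewrite(String query) {
        Map<String, Double> weighted = new LinkedHashMap<>();
        for (String token : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String type = ENTITY_TYPES.get(token);
            double weight = (type == null)
                    ? DEFAULT_WEIGHT
                    : TYPE_WEIGHTS.getOrDefault(type, DEFAULT_WEIGHT);
            weighted.put(token, weight);
        }
        return weighted;
    }
}
```

The weighted terms could then be translated into per-clause boosts in a search query, so that matches on high-weight entity types contribute more to ranking.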

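Using ES for quality monitoring

Log management and analysis (log collection, storage, and query design), security metrics monitoring (the Elastic Stack product ecosystem), application performance monitoring, and public-opinion analysis of crawled web data.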

ES and machine learning

Anomaly detection: unsupervised learning
Data frame analytics: classification and regression

Sure's BLOG

Elasticsearch Database and Recall Components