xDeepFM replaces the FM part of DeepFM with an improved DCN-style component to learn feature-interaction information, while FiBiNET applies SENET to learn per-feature weights, going one step further than NFM and AFM. Before reading about these two models it helps to have a rough understanding of DeepFM, Deep&Cross, AFM, and NFM; if you are not familiar with them, see the links to the other model write-ups at the end of this post.

The code below uses dense inputs, which makes the model structure easier to follow. The sparse-input version and the complete code are at github.com/DSXiangLi/C…

xDeepFM

Model Structure

As its name suggests, xDeepFM, like DeepFM, has a Deep part and a Linear part; the difference is that the FM part DeepFM uses to learn second-order feature interactions is replaced by CIN (Compressed Interaction Network), which is itself a further improvement on the cross network of Deep&Cross (DCN). The overall model structure is shown below.

[Figure: xDeepFM overall architecture]

Let's focus on the CIN part. Keeping the paper's notation: there are m features, each feature embedding is D-dimensional, and the K-th CIN layer has $H_k$ units. The computation of the K-th CIN layer consists of 3 steps, corresponding to panels a-c in the figure:

[Figure: CIN computation, panels a-c]

  1. Pairwise element-wise products of vectors, time complexity $O(m * H_{k-1} * D)$. Take element-wise products between every vector of the input layer and every vector of the (K-1)-th layer's output, giving a matrix of $m * H_{k-1}$ D-dimensional vectors. If CIN had only one layer, this would be the same as the first step of FM, NFM and AFM: FM aggregates directly into a scalar, NFM does sum pooling along D, and AFM adds attention and does weighted pooling along D. Ignoring the batch dimension, the matrix shapes change as follows
$$z^k = x^0 \odot x^{k-1} = (D * m * 1) \odot (D * 1 * H_{k-1}) = D * m * H_{k-1}$$
  2. Feature map, space complexity $O(H_k * H_{k-1} * m)$, time complexity $O(H_k * H_{k-1} * m * D)$. $W^k \in R^{H_{k-1} * m * H_k}$ is the weight tensor of the K-th layer, which can be viewed as a CNN applied along the embedding dimension: each filter takes a weighted sum over all the pairwise product vectors to produce a $1 * D$ vector, and with $H_k$ channels in total the output is an $H_k * D$ matrix.
$$w^k \bullet z^k = (H_k * H_{k-1} * m) * (m * H_{k-1} * D) = H_k * D$$
  3. Sum pooling. CIN sum-pools each layer's output along the embedding dimension to get an $H_k * 1$ vector per layer, and the per-layer outputs are concatenated as the output of the CIN part.

The per-layer CIN computation is as above. In a T-layer CIN, each layer interacts the previous layer's output with the first layer's input to obtain interactions one order higher. Assuming every layer has the same width $H_k = H$, the overall time complexity of the CIN part is $O(TDmH^2)$, and the space complexity, coming from the per-layer filter weights, is $O(TmH^2)$.
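To make the shape flow concrete, below is a minimal NumPy sketch of one CIN layer; the toy sizes (m=4 features, D=3 embedding dims, H_{k-1}=5, H_k=2) and the einsum formulation are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# toy sizes (assumptions for illustration only)
m, D, H_prev, H_k = 4, 3, 5, 2

x0 = np.random.rand(D, m)            # input layer x^0: D * m
xk_prev = np.random.rand(D, H_prev)  # layer K-1 output x^{k-1}: D * H_{k-1}

# step 1: pairwise element-wise products along each embedding dimension
# z^k[d] = outer(x^0[d], x^{k-1}[d])  -> D * m * H_{k-1}
zk = np.einsum('dm,dh->dmh', x0, xk_prev)

# step 2: feature map, one filter per output unit, W^k: H_k * m * H_{k-1}
Wk = np.random.rand(H_k, m, H_prev)
xk = np.einsum('imh,dmh->id', Wk, zk)  # H_k * D

# step 3: sum pooling along the embedding dimension D
pk = xk.sum(axis=1)                    # H_k values, concatenated across layers

print(zk.shape, xk.shape, pk.shape)    # (3, 4, 5) (2, 3) (2,)
```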

CIN keeps DCN's arbitrary-order interactions and parameter sharing; the two main differences are

  • DCN is bit-wise while CIN is vector-wise. When DCN takes vector products it does not distinguish fields: it takes the outer product over the whole concatenated input (length m*D). CIN respects fields and takes products between pairs of field vectors; see the sketch after this list.
  • DCN uses a ResNet-style shortcut and, because of its polynomial form, only needs to output the final layer, whereas CIN pools every layer's output and concatenates them.
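As a rough illustration of the bit-wise vs vector-wise distinction (a toy sketch, not the full DCN/CIN layers): DCN crosses individual bits of the flattened m*D input, while CIN crosses whole field vectors.

```python
import numpy as np

m, D = 4, 3                       # toy sizes: 4 fields, 3-dim embeddings
emb = np.random.rand(m, D)

# DCN (bit-wise): flatten all fields into one vector of length m*D and
# cross every bit with every bit; field boundaries are ignored
x0 = emb.reshape(-1)              # (m*D,)
bitwise = np.outer(x0, x0)        # (m*D, m*D)

# CIN (vector-wise): keep each field's D-dim vector intact and take
# element-wise products between whole field vectors
vectorwise = np.einsum('id,jd->ijd', emb, emb)  # (m, m, D)

print(bitwise.shape, vectorwise.shape)          # (12, 12) (4, 4, 3)
```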

The design of CIN is quite clever. That said... complaint squad reporting in: CIN has higher time and space complexity than DCN, and it feels more prone to overfitting. As for the claim that vector-wise products are better than bit-wise products... well, at least bit-wise does not force all embedding dimensions to be the same; for vector-wise I honestly can't see the argument, so readers who do understand it are welcome to comment.

Code Implementation

import tensorflow as tf

# add_layer_summary, stack_dense_layer, build_features and @tf_estimator_model
# are helper functions from the repo linked above
def cross_op(xk, x0, layer_size_prev, layer_size_curr, layer, emb_size, field_size):
    # Hadamard product: (batch * D * HK-1 * 1) * (batch * D * 1 * H0) -> batch * D * HK-1 * H0
    zk = tf.matmul( tf.expand_dims(tf.transpose(xk, perm = (0, 2, 1)), 3),
                    tf.expand_dims(tf.transpose(x0, perm = (0, 2, 1)), 2))
    zk = tf.reshape(zk, [-1, emb_size, field_size * layer_size_prev]) # batch * D * HK-1 * H0 -> batch * D * (HK-1 * H0)
    add_layer_summary('zk_{}'.format(layer), zk)
    # Convolution with channel = HK: (batch * D * (HK-1*H0)) * ((HK-1*H0) * HK)-> batch * D * HK
    kernel = tf.get_variable(name = 'kernel{}'.format(layer),
                             shape = (field_size * layer_size_prev, layer_size_curr))
    xkk = tf.matmul(zk, kernel)
    xkk = tf.transpose(xkk, perm = [0,2,1]) # batch * HK * D
    add_layer_summary( 'Xk_{}'.format(layer), xkk )
    return xkk
def cin_layer(x0, cin_layer_size, emb_size, field_size):
    cin_output_list = []
    cin_layer_size.insert(0, field_size) # insert field dimension for input
    with tf.variable_scope('Cin_component'):
        xk = x0
        for layer in range(1, len(cin_layer_size)):
            with tf.variable_scope('Cin_layer{}'.format(layer)):
                # Do cross
                xk = cross_op(xk, x0, cin_layer_size[layer-1], cin_layer_size[layer],
                              layer, emb_size, field_size ) # batch * HK * D
                # sum pooling on dimension axis
                cin_output_list.append(tf.reduce_sum(xk, 2)) # batch * HK
    return tf.concat(cin_output_list, axis=1)
@tf_estimator_model
def model_fn_dense(features, labels, mode, params):
    dense_feature, sparse_feature = build_features()
    dense_input = tf.feature_column.input_layer(features, dense_feature)
    sparse_input = tf.feature_column.input_layer(features, sparse_feature)
    # Linear part
    with tf.variable_scope('Linear_component'):
        linear_output = tf.layers.dense( sparse_input, units=1 )
        add_layer_summary( 'linear_output', linear_output )
    # Deep part
    dense_output = stack_dense_layer( dense_input, params['hidden_units'],
                               params['dropout_rate'], params['batch_norm'],
                               mode, add_summary=True )
    # CIN part
    emb_size = dense_feature[0].variable_shape.as_list()[-1]
    field_size = len(dense_feature)
    embedding_matrix = tf.reshape(dense_input, [-1, field_size, emb_size]) # batch * field_size * emb_size
    add_layer_summary('embedding_matrix', embedding_matrix)
    cin_output = cin_layer(embedding_matrix, params['cin_layer_size'], emb_size, field_size)
    with tf.variable_scope('output'):
        y = tf.concat([dense_output, cin_output,linear_output], axis=1)
        y = tf.layers.dense(y, units= 1)
        add_layer_summary( 'output', y )
    return y
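For completeness, here is a hedged sketch of how the model_fn above could be wired into a TF 1.x estimator. The params keys follow the names read in the code; the concrete values and train_input_fn are assumptions, and @tf_estimator_model from the repo is assumed to wrap the returned logit into an EstimatorSpec.

```python
import tensorflow as tf

# hypothetical hyperparameters; the key names match those read in model_fn_dense
params = {
    'hidden_units': [128, 64],   # Deep part
    'dropout_rate': 0.1,
    'batch_norm': True,
    'cin_layer_size': [8, 8],    # H_k of each CIN layer
}

estimator = tf.estimator.Estimator(
    model_fn=model_fn_dense,     # decorated by @tf_estimator_model
    params=params,
    model_dir='./xdeepfm_ckpt'
)
# estimator.train(input_fn=lambda: train_input_fn(...), steps=1000)  # train_input_fn is hypothetical
```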

FiBiNET

Model Structure

Before reading FiBiNET it helps to first understand the Squeeze-and-Excitation Network; if interested, see the Squeeze-and-Excitation Networks post.

FiBiNET's main innovation is applying SENET to learn the importance of each feature and using those weights to produce a new, re-weighted embedding matrix. Before FiBiNET, models such as AFM, PNN, DCN and the xDeepFM above all learn interaction weights via attention or other weighting schemes only after the feature interaction step; FiBiNET keeps that part while also accounting for each feature's own importance already at the embedding stage. The model structure is as follows

[Figure: FiBiNET overall architecture]

The original embedding and the SENET-re-weighted embedding each go through a Bilinear-Interaction layer to learn second-order interaction features; the results are concatenated and then passed through an MLP to learn higher-order features. Keeping the paper's notation (ah, can we please standardize notation, I get confused reading my own comments): there are f features with k-dimensional embeddings.

SENET Layer

The SENET layer learns a weight for each feature and re-weights the embeddings accordingly, in the following 3 steps (a small numeric sketch follows the list):

[Figure: SENET layer: squeeze, excitation, re-weight]

  1. Squeeze. Compress the $f * k$ embedding matrix into $f * 1$. The pooling method is not fixed: the original SENET paper uses max pooling, while here mean pooling is used (as in the formula and code below); the right choice probably depends on what information the embeddings carry.
$$E = [e_1, \dots, e_f] \qquad Z = [z_1, \dots, z_f] \qquad z_i = F_{squeeze}(e_i) = \frac{1}{k}\sum_{t=1}^{k} e_i^{(t)}$$
  2. Excitation. The excitation step is a two-layer fully connected network that first reduces and then restores the dimension to filter out useless features; the reduction ratio is controlled by an extra hyperparameter $r$, with the first layer's weight $W_1 \in R^{f * f/r}$ and the second layer's weight $W_2 \in R^{f/r * f}$. The larger r is, the stronger the compression and the more concentrated the final weights; the smaller r is, the more spread out they are.
$$A = \sigma_2(W_2 \sigma_1(W_1 Z))$$
  3. Re-weight. The last step uses the per-feature weights obtained from the excitation step to re-weight the embeddings, producing the new embedding matrix.
$$E_{new} = F_{reweight}(A, E) = [a_1 \cdot e_1, \dots, a_f \cdot e_f]$$
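Here is a minimal NumPy sketch of the three SENET steps; the toy sizes (f=4 fields, k=3 embedding dims, r=2) are assumptions, and ReLU stands in for both σ1 and σ2, matching the SENET_layer code below.

```python
import numpy as np

f, k, r = 4, 3, 2                 # toy sizes: fields, emb dim, reduction ratio
E = np.random.rand(f, k)          # original embedding matrix

# 1. Squeeze: mean pooling along the embedding dimension -> one scalar per field
Z = E.mean(axis=1)                # (f,)

# 2. Excitation: two dense layers, reduce to f/r then restore to f
relu = lambda x: np.maximum(x, 0.0)
W1 = np.random.rand(f, f // r)
W2 = np.random.rand(f // r, f)
A = relu(relu(Z @ W1) @ W2)       # (f,) per-field importance weights

# 3. Re-weight: scale each field's embedding by its weight
E_new = E * A[:, None]            # (f, k)

print(Z.shape, A.shape, E_new.shape)
```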

In tests on the income dataset, with r=2 about 46% of the embedding feature weights are 0, so SENET filters out features that are useless for the target before the interaction step, increasing the weight of the effective features.

[Figure: SENET feature weights on the income dataset]

Bilinear-Interaction Layer

The authors argue that neither the inner product nor the element-wise product is expressive enough to capture feature interaction information, so they introduce an additional weight matrix W and compute interactions as follows

$$p_{ij} = v_i \cdot W \odot v_j$$

[Figure: the three Bilinear-Interaction weight-sharing options]

There are three choices for W: all feature interactions can share one weight matrix (Field-All), each feature can share one weight matrix across its interactions with the other features (Field-Each), or each feature pair can have its own weight matrix (Field-Interaction). Which works best probably has to be tried case by case, though the usual logic of fewer data, fewer parameters still applies.
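To make the parameter trade-off concrete, here is a quick count of bilinear weight parameters for the three options, with f fields and k-dimensional embeddings (the toy numbers are assumptions):

```python
f, k = 20, 16   # toy sizes: 20 fields, 16-dim embeddings

field_all = k * k                               # one shared k*k matrix
field_each = f * k * k                          # one k*k matrix per field
field_interaction = f * (f - 1) // 2 * k * k    # one k*k matrix per field pair

print(field_all, field_each, field_interaction)  # 256 5120 48640
```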

After the original embedding and the re-weighted embedding pass through the Bilinear-Interaction layer to learn interaction features, the results are concatenated into the shallow layer and then fed through fully connected layers to learn higher-order interactions. The rest is routine, so I won't go into further detail.

I won't bother complaining that FiBiNET could also be dropped into a wide&deep framework to capture low-order information and arbitrary high-order information; the more useful takeaway is to put FiBiNET's SENET-based feature weighting idea into your own toolbox.

Code Implementation

def Bilinear_layer(embedding_matrix, field_size, emb_size, type, name):
    # Bilinear_layer: combine inner and element-wise product
    interaction_list = []
    with tf.variable_scope('BI_interaction_{}'.format(name)):
        if type == 'field_all':
            weight = tf.get_variable( shape=(emb_size, emb_size), initializer=tf.truncated_normal_initializer(),
                                      name='Bilinear_weight_{}'.format(name) )
        for i in range(field_size):
            if type == 'field_each':
                weight = tf.get_variable( shape=(emb_size, emb_size), initializer=tf.truncated_normal_initializer(),
                                          name='Bilinear_weight_{}_{}'.format(i, name) )
            for j in range(i+1, field_size):
                if type == 'field_interaction':
                    weight = tf.get_variable( shape=(emb_size, emb_size), initializer=tf.truncated_normal_initializer(),
                                          name='Bilinear_weight_{}_{}_{}'.format(i,j, name) )
                vi = tf.gather(embedding_matrix, indices = i, axis =1, batch_dims =0, name ='v{}'.format(i)) # batch * emb_size
                vj = tf.gather(embedding_matrix, indices = j, axis =1, batch_dims =0, name ='v{}'.format(j)) # batch * emb_size
                pij = tf.multiply(tf.matmul(vi, weight), vj) # bilinear: (vi \cdot W) \odot vj, matching the formula above
                interaction_list.append(pij)
        combination = tf.stack(interaction_list, axis =1 ) # batch * (field_size * (field_size-1)/2) * emb_size
        combination = tf.reshape(combination, shape = [-1, int(emb_size * (field_size * (field_size-1) /2)) ]) # batch * ~
        add_layer_summary( 'bilinear_output', combination )
    return combination
def SENET_layer(embedding_matrix, field_size, emb_size, pool_op, ratio):
    with tf.variable_scope('SENET_layer'):
        # squeeze embedding to a scalar for each field
        with tf.variable_scope('pooling'):
            if pool_op == 'max':
                z = tf.reduce_max(embedding_matrix, axis=2) # batch * field_size * emb_size -> batch * field_size
            else:
                z = tf.reduce_mean(embedding_matrix, axis=2)
            add_layer_summary('pooling scaler', z)
        # excitation learn the weight of each field from above scaler
        with tf.variable_scope('excitation'):
            z1 = tf.layers.dense(z, units = field_size//ratio, activation = 'relu')
            a = tf.layers.dense(z1, units= field_size, activation = 'relu') # batch * field_size
            add_layer_summary('excitation weight', a )
        # re-weight embedding with weight
        with tf.variable_scope('reweight'):
            senet_embedding = tf.multiply(embedding_matrix, tf.expand_dims(a, axis = -1)) # (batch * field * emb) * ( batch * field * 1)
            add_layer_summary('senet_embedding', senet_embedding) # batch * field_size * emb_size
        return senet_embedding
@tf_estimator_model
def model_fn_dense(features, labels, mode, params):
    dense_feature, sparse_feature = build_features()
    dense_input = tf.feature_column.input_layer(features, dense_feature)
    sparse_input = tf.feature_column.input_layer(features, sparse_feature)
    # Linear part
    with tf.variable_scope('Linear_component'):
        linear_output = tf.layers.dense( sparse_input, units=1 )
        add_layer_summary( 'linear_output', linear_output )
    field_size = len(dense_feature)
    emb_size = dense_feature[0].variable_shape.as_list()[-1]
    embedding_matrix = tf.reshape(dense_input, [-1, field_size, emb_size])
    # SENET_layer to get new embedding matrix
    senet_embedding_matrix = SENET_layer(embedding_matrix, field_size, emb_size,
                                         pool_op = params['pool_op'], ratio= params['senet_ratio'])
    # combination layer & BI_interaction
    BI_org = Bilinear_layer(embedding_matrix, field_size, emb_size, type = params['bilinear_type'], name = 'org')
    BI_senet = Bilinear_layer(senet_embedding_matrix, field_size, emb_size, type = params['bilinear_type'], name = 'senet')
    combination_layer = tf.concat([BI_org, BI_senet] , axis =1)
    # Deep part
    dense_output = stack_dense_layer(combination_layer, params['hidden_units'],
                               params['dropout_rate'], params['batch_norm'],
                               mode, add_summary=True )
    with tf.variable_scope('output'):
        y = dense_output + linear_output
        add_layer_summary( 'output', y )
    return y

Ref

  1. Jianxun Lian, 2018, xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems
  2. Tongwen Huang, 2019, FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction
  3. Jie Hu, 2017, Squeeze-and-Excitation Networks
  4. zhuanlan.zhihu.com/p/72931811
  5. zhuanlan.zhihu.com/p/79659557
  6. zhuanlan.zhihu.com/p/57162373
  7. github.com/qiaoguan/de…