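Tencent Algorithm Competition | [Semi-Final Top-Ranker Sharing (2)] An Ace Optimization Guide for Climbing the Leaderboard Without the Stress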

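The semi-final round of the 2020 Tencent Advertising Algorithm Competition has ended, and the final defense will take place on August 3 at 14:00 at the Tencent Binhai Building in Shenzhen. For details on the final and to reserve a seat for the live stream, see:
"The Final Is Here! The Top Ten Teams Gather for the Last Battle!"
While the external track was fiercely contested, Tencent also worked with 码客 to open an internal track for employees. The 大雄 (Daxiong) team, which took second place on the internal semi-final leaderboard, was invited to this sharing session to explain how they approached the problem. Their strategy throughout the competition shows strong time management and plenty of hands-on experience. How do you reduce the training load without giving up model quality? Here is what they had to say.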
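01 Problem Statement
This year's Tencent Advertising Algorithm Competition is about user profiling: predict a user's age and gender from the user's ad-click behavior and the information attached to the clicked ads.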
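02 Data Fields

  • time: day-level timestamp; nunique: 91
  • user_id: randomly assigned ids from 1 to N; nunique: about 4 million (400w)
  • creative_id: id of the ad creative the user clicked; nunique: 2618159
  • click_times: number of times the user clicked that creative on that day; nunique: 54
  • ad_id: id of the ad the creative belongs to (one ad may contain several displayable creatives); nunique: 2379475
  • product_id: id of the product promoted by the ad; nunique: 34111
  • product_category: category id of the product promoted by the ad; nunique: 18
  • advertiser_id: id of the advertiser; nunique: 52861
  • industry: id of the advertiser's industry; nunique: 326
  • age: user age bucket [1-10]
  • gender: user gender [1,2]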
03 Model Input
The final solution uses only five id sequences as model input:
'creative_id'
'ad_id'
'advertiser_id'
'product_id'
'industry'
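As a minimal sketch of what "id sequence" means here, the snippet below groups a click log by user and orders each user's clicks by day. The row layout, the toy values, and the helper name `build_sequences` are assumptions made for illustration; the article does not show how the sequences were actually built.

```python
from collections import defaultdict

# Hypothetical click-log rows: (time, user_id, creative_id, click_times).
# The values are made up for illustration only.
click_log = [
    (3, 1, 101, 1),
    (1, 1, 205, 2),
    (2, 2, 101, 1),
    (1, 2, 307, 1),
    (2, 1, 101, 1),
]


def build_sequences(rows):
    """Group clicks by user and order each user's clicks by day."""
    per_user = defaultdict(list)
    for time, user_id, creative_id, click_times in rows:
        per_user[user_id].append((time, creative_id, click_times))

    sequences = {}
    for user_id, clicks in per_user.items():
        clicks.sort(key=lambda c: c[0])  # stable sort: same-day clicks keep log order
        sequences[user_id] = {
            "creative_id": [c[1] for c in clicks],
            "click_times": [c[2] for c in clicks],
        }
    return sequences


sequences = build_sequences(click_log)
print(sequences[1]["creative_id"])
```

The same grouping would be repeated for the other four id columns, so that every user ends up with five aligned sequences of equal length.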
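Because I could only compete outside working hours, I gave up on feature engineering and simply left jobs running every day while I tuned the input and the architecture. The final solution is not complicated; as long as you keep the time cost of trial and error under control, I believe anyone can reach a good result. Below I share what I learned while tuning these two parts, the input and the architecture.
The model input directly sets the ceiling for the model. After trying many options, I found that three things affect the input most directly: which tokens are treated as valid, how the word2vec vectors are generated, and how the input is shuffled. The choice of valid tokens determines both whether training is effective and how much memory the embedding matrix consumes; since the organizers did not provide TI-ONE, this did a lot to ease the memory shortage. Here, ids that appear only once are treated as uninformative. Merging them with the ids that are not shared by the training and test sets into a single id greatly reduces the training load, and in my experience it does not hurt model quality.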
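    # `train`, `test`, `col` and `val_cnt` (value counts of the column) are defined elsewhere.
    train_ids = set(train[col].unique())
    test_ids = set(test[col].unique())
    differ = train_ids.symmetric_difference(test_ids)  # ids that appear in only one of train/test
    # ids that appear in both; the original used `and`, which just returns the second set
    common = train_ids & test_ids

    id_map = {}
    for v in val_cnt[val_cnt == 1].index:  # ids seen only once are merged into a single id
        id_map[v] = 0
    for v in differ:  # ids not shared by train and test are merged into the same id
        id_map[v] = 0
    for i, v in enumerate(common):  # shared ids get consecutive indices starting at 1
        id_map[v] = i + 1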
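The merging rule can be tried on toy data. The sketch below is a self-contained variant under stated assumptions: occurrence counts are taken over train and test combined, shared ids are sorted so the mapping is reproducible, and the helper name `build_id_map` and its `min_count` parameter are hypothetical. With combined counts, an id present in both splits always occurs at least twice, so the "seen only once" rule only adds something when the counts come from a single split or when the threshold is raised.

```python
from collections import Counter


def build_id_map(train_ids, test_ids, min_count=2):
    """Map shared, sufficiently frequent ids to 1..K and every other id to 0."""
    counts = Counter(train_ids) + Counter(test_ids)
    shared = set(train_ids) & set(test_ids)
    # Sorting makes the assignment deterministic (iterating a set is not).
    keep = sorted(v for v in shared if counts[v] >= min_count)

    id_map = {v: 0 for v in counts}  # default: merged "unknown" id
    for i, v in enumerate(keep):
        id_map[v] = i + 1
    return id_map


# Toy ids, made up for illustration.
train_ids = [10, 10, 11, 12, 13]
test_ids = [10, 12, 14]

id_map = build_id_map(train_ids, test_ids)
print(id_map)
```

Raising `min_count` shrinks the vocabulary further, which is one way to trade embedding-matrix memory against coverage.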

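The final word2vec setup uses skip-gram, with the key parameters min_count=1, size=256, and window=10. The best size and window do not generalize well across settings, so it is worth running a few values and comparing.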
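    import numpy as np
    from gensim import models
    from tqdm import tqdm

    # gensim 3.x argument names (gensim 4 renamed size/iter to vector_size/epochs
    # and index2word to index_to_key). `list_d` holds the id sequences as lists of strings.
    model = models.Word2Vec(list_d, sg=1, min_count=1, size=256, window=10, workers=48, iter=10)

    # Build the embedding matrix so that row i holds the vector of id i.
    We = []
    if '0' in model.wv:
        for i in tqdm(range(len(model.wv.index2word))):
            We.append(model.wv[str(i)].reshape((1, -1)))
    else:
        # id 0 never occurs: reserve an all-zero row for it.
        # The original used np.zeros((1, 128)), which does not match size=256.
        We.append(np.zeros((1, 256)))
        for i in tqdm(range(len(model.wv.index2word))):
            We.append(model.wv[str(i + 1)].reshape((1, -1)))
    We = np.vstack(We)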
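The point of the loop above is index alignment: row i of the weight matrix must hold the vector of id i, so the integer sequences can index the embedding layer directly. A minimal sketch of that alignment, using a plain dict as a stand-in for the trained word vectors (the dict, the helper name `build_embedding_matrix`, and the toy sizes are assumptions for illustration):

```python
import numpy as np


def build_embedding_matrix(vectors, vocab_size, dim):
    """Row i holds the vector for token str(i); missing tokens stay all-zero."""
    weights = np.zeros((vocab_size, dim), dtype=np.float32)
    for i in range(vocab_size):
        vec = vectors.get(str(i))
        if vec is not None:
            weights[i] = vec
    return weights


# Stand-in for trained word vectors: token "0" is absent on purpose.
toy_vectors = {"1": np.ones(4), "2": np.full(4, 2.0)}
toy_weights = build_embedding_matrix(toy_vectors, vocab_size=3, dim=4)
print(toy_weights)
```

Pre-allocating the matrix also avoids holding a long Python list of small arrays in memory before stacking them.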

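For input construction I tried forward order, reverse order, random shuffling, and repeating ids by click_times. When ids are repeated by click_times, sequence_length should be increased accordingly; using the 95th percentile of the sequence lengths is enough.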
    import numpy as np
    import pandas as pd
    from tqdm import tqdm
    from keras.utils import Sequence

    # `config` and `click_times` (per-user click counts aligned with each id list) are defined elsewhere.
    for col in tqdm(['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry', 'product_category']):
        list_d = pd.read_pickle('./idlist/{}_list.pkl'.format(col))
        We = np.load('./w2v_256_10/{}_embedding_weight.npy'.format(col))
        We = np.vstack([We, np.zeros(config.embeddingSize)])  # extra all-zero row used for padding
        list_d = list(list_d)
        for i in range(len(list_d)):
            # Repeat each id click_times times.
            ret = []
            for j in range(len(list_d[i])):
                ret += [list_d[i][j]] * click_times[i][j]
            list_d[i] = ret
            # Truncate long sequences, pad short ones with the index of the zero row.
            if len(list_d[i]) > config.sequenceLength:
                list_d[i] = list_d[i][:config.sequenceLength]
            else:
                list_d[i] += [len(We) - 1] * (config.sequenceLength - len(list_d[i]))
        list_d = np.array(list_d)
        list_d = list_d.astype(np.int32)  # reduce memory usage


    class DataSequence(Sequence):
        def __init__(self, xs, y, batch_size=128, shuffle=True):
            self.xs = xs
            self.y = y
            self.batch_size = batch_size
            self.size = xs[0].shape[0]
            self.shuffle = shuffle
            if self.shuffle:
                # Shuffle the samples; restoring the same RNG state before each
                # shuffle keeps all inputs and the labels aligned.
                state = np.random.get_state()
                for x in self.xs:
                    np.random.set_state(state)
                    np.random.shuffle(x)
                np.random.set_state(state)
                np.random.shuffle(self.y)

        def __len__(self):
            return int(np.ceil(self.size / float(self.batch_size)))

        def __getitem__(self, idx):
            batch_idx = np.arange(idx * self.batch_size, min((idx + 1) * self.batch_size, self.size))
            batch_xs = [x[batch_idx] for x in self.xs]
            batch_y = self.y[batch_idx]
            # Augmentation: with probability 0.8, shuffle the positions inside a sample,
            # using the same permutation for every id column.
            if self.shuffle:
                x = []
                for i in range(len(batch_xs)):
                    x.append(batch_xs[i].copy())
                for i in range(len(x[0])):
                    p = np.random.rand()
                    if p < 0.8:
                        state = np.random.get_state()
                        for j in range(len(batch_xs)):
                            np.random.set_state(state)
                            np.random.shuffle(x[j][i])
                batch_xs = x
            return batch_xs, batch_y
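The three steps described above (repeat ids by click_times, pick the sequence length from the length distribution, then pad or truncate) can be sketched on toy data as follows, together with a permutation-based way to shuffle several id columns in sync. The helper names, the toy lists, and `pad_id` are assumptions for illustration, and "95% of the sequence length" is read here as the 95th percentile of the expanded lengths.

```python
import numpy as np


def expand_by_clicks(ids, click_times):
    """Repeat each id as many times as it was clicked."""
    return np.repeat(np.asarray(ids), np.asarray(click_times)).tolist()


def pad_or_truncate(seq, length, pad_id):
    """Cut long sequences and right-pad short ones with pad_id."""
    seq = seq[:length]
    return seq + [pad_id] * (length - len(seq))


def shuffle_in_sync(columns, rng):
    """Apply one random permutation to every column so positions stay aligned."""
    perm = rng.permutation(len(columns[0]))
    return [np.asarray(col)[perm] for col in columns]


# Toy data (hypothetical): three users' id sequences and per-click counts.
id_lists = [[5, 7], [3], [2, 2, 9, 4]]
click_lists = [[1, 2], [1], [1, 1, 3, 1]]

expanded = [expand_by_clicks(i, c) for i, c in zip(id_lists, click_lists)]
lengths = [len(s) for s in expanded]
seq_len = int(np.percentile(lengths, 95))  # length that covers roughly 95% of users
pad_id = 10  # assumed index of the all-zero padding row
batch = np.array([pad_or_truncate(s, seq_len, pad_id) for s in expanded], dtype=np.int32)

rng = np.random.default_rng(0)
creative, ad = shuffle_in_sync([[11, 12, 13, 14], [21, 22, 23, 24]], rng)
print(batch)
```

Sharing one permutation across columns gives the same alignment guarantee as resetting the RNG state before each shuffle, but makes the intent explicit.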

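04 Model Architecture
On the architecture side I tried LSTM and CNN_Inception structures. The CNN eventually reached about 1.47. A transformer combined with LSTM also worked well, but I never managed to tune it past the pure LSTM. You can also use a transformer alone; my results with it were poor, but if you are interested, the open-source implementations from CyberZHG and Hugging Face are a good starting point. My impression on this dataset is that **fewer heads beat more heads, and more layers beat fewer layers.** One option is to keep only Q and K and drop the dense layers, which gives a slimmed-down multi-head attention. In the end I implemented the model framework in both keras and torch (competing solo, I had to find some way to get diversity for the final blend). The model structures are as follows: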
    ## LSTM keras-version
    import gc
    import numpy as np
    from keras.layers import (Input, Embedding, Concatenate, SpatialDropout1D, Bidirectional,
                              CuDNNLSTM, GlobalMaxPooling1D, Dropout, Dense)
    from keras.models import Model

    def LSTM(config, n_cls=10):
        cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
        n_in = len(cols)
        inputs = []
        outputs = []
        # Frozen embedding lookups, one per id column, initialised from word2vec.
        for i in range(n_in):
            We = np.load('./w2v_256_10/{}_embedding_weight.npy'.format(cols[i]))
            We = np.vstack([We, np.zeros(config.embeddingSize)])  # last row = padding
            inp = Input(shape=(config.sequenceLength,), dtype="int32")
            x = Embedding(We.shape[0], We.shape[1], weights=[We], trainable=False)(inp)
            inputs.append(inp)
            outputs.append(x)
            del We
            gc.collect()
        embedding_model = Model(inputs, outputs)

        # Trainable part: takes the embedded sequences as input.
        inputs = []
        for i in range(n_in):
            inp = Input(shape=(config.sequenceLength, config.embeddingSize,))
            inputs.append(inp)
        all_input = Concatenate()(inputs)
        all_input = SpatialDropout1D(0.2)(all_input)
        lstm1 = Bidirectional(CuDNNLSTM(256, return_sequences=True))(all_input)
        lstm2 = Bidirectional(CuDNNLSTM(256, return_sequences=True))(lstm1)
        # Max-pool over time for both LSTM layers and concatenate.
        pool_1 = GlobalMaxPooling1D()(lstm1)
        pool_2 = GlobalMaxPooling1D()(lstm2)
        pool = Concatenate()([pool_1, pool_2])
        pool = Dropout(0.2)(pool)
        outputs = Dense(n_cls, activation='softmax')(pool)
        lstm_model = Model(inputs, outputs)

        model = Model(embedding_model.inputs, lstm_model(embedding_model.outputs))
        return model, lstm_model


    ## LSTM Torch-version
    import gc
    from collections import OrderedDict
    import numpy as np
    import torch as t
    from torch import nn

    class LSTM(nn.Module):
        def __init__(self, n_cls=10):  # n_cls was undefined in the original; made a parameter
            super(LSTM, self).__init__()
            emb_outputs = []
            cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
            n_in = len(cols)
            # First embedding set: 256-dim word2vec vectors, frozen.
            for i in range(n_in):
                We = np.load('./w2v_256_120/{}_embedding_weight.npy'.format(cols[i]))
                We = np.vstack([We, np.zeros(256)])
                embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1],
                                     padding_idx=len(We) - 1, _weight=t.FloatTensor(We))
                for p in embed.parameters():
                    p.requires_grad = False
                emb_outputs.append(embed)
            # Second embedding set: 128-dim word2vec vectors, frozen.
            for i in range(n_in):
                We = np.load('./w2v_128_60/{}_embedding_weight.npy'.format(cols[i]))
                We = np.vstack([We, np.zeros(128)])
                embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1],
                                     padding_idx=len(We) - 1, _weight=t.FloatTensor(We))
                for p in embed.parameters():
                    p.requires_grad = False
                emb_outputs.append(embed)
                del We
                gc.collect()
            self.encoders = nn.ModuleList(emb_outputs)
            self.emb_drop = nn.Dropout(p=0.2)
            self.lstm = nn.LSTM(input_size=(256 + 128) * 5, hidden_size=384, num_layers=2, bias=True,
                                batch_first=True, dropout=0.2, bidirectional=True)
            self.max_pool = nn.MaxPool1d(kernel_size=2, stride=2)
            self.fc = nn.Sequential(nn.Linear(384, n_cls))
            self.fc_drop = nn.Dropout(p=0.2)

        def forward(self, xs):
            # Each id column is embedded twice (256-dim and 128-dim) and concatenated.
            inp = [self.encoders[i](x) for i, x in enumerate(xs)] + \
                  [self.encoders[i + 5](x) for i, x in enumerate(xs)]
            x = t.cat(inp, 2)
            x = self.emb_drop(x)
            x = self.lstm(x)[0]        # (batch, seq_len, 768)
            x = self.max_pool(x)       # pools the feature dimension: 768 -> 384
            x = t.max(x, dim=1)[0]     # max over time: (batch, 384)
            x = self.fc_drop(x)
            logits = self.fc(x)
            return logits


    ## CNN_Inception Torch-version
    class Inception(nn.Module):
        def __init__(self, cin, co, relu=True, norm=True):
            super(Inception, self).__init__()
            assert (co % 4 == 0)
            cos = [co // 4] * 4
            self.activa = nn.Sequential()
            if norm:
                self.activa.add_module('norm', nn.BatchNorm1d(co))
            if relu:
                self.activa.add_module('relu', nn.ReLU(True))
            self.branch1 = nn.Sequential(OrderedDict([
                ('conv1', nn.Conv1d(cin, cos[0], 1, stride=1)),
            ]))
            self.branch2 = nn.Sequential(OrderedDict([
                ('conv1', nn.Conv1d(cin, cos[1], 1)),
                ('norm1', nn.BatchNorm1d(cos[1])),
                ('relu1', nn.ReLU(inplace=True)),
                ('conv3', nn.Conv1d(cos[1], cos[1], 3, stride=1, padding=1)),
            ]))
            self.branch3 = nn.Sequential(OrderedDict([
                ('conv1', nn.Conv1d(cin, cos[2], 3, padding=1)),
                ('norm1', nn.BatchNorm1d(cos[2])),
                ('relu1', nn.ReLU(inplace=True)),
                ('conv3', nn.Conv1d(cos[2], cos[2], 5, stride=1, padding=2)),
            ]))
            self.branch4 = nn.Sequential(OrderedDict([
                # ('pool', nn.MaxPool1d(2)),
                ('conv3', nn.Conv1d(cin, cos[3], 3, stride=1, padding=1)),
            ]))

        def forward(self, x):
            branch1 = self.branch1(x)
            branch2 = self.branch2(x)
            branch3 = self.branch3(x)
            branch4 = self.branch4(x)
            # Concatenate the four branches along the channel dimension.
            result = self.activa(t.cat((branch1, branch2, branch3, branch4), 1))
            return result


    class CNN(nn.Module):
        def __init__(self, n_cls=10):  # n_cls was undefined in the original; made a parameter
            super(CNN, self).__init__()
            emb_outputs = []
            cols = ['creative_id', 'ad_id', 'advertiser_id', 'product_id', 'industry']
            n_in = len(cols)
            for i in range(n_in):
                We = np.load('./w2v_256_120/{}_embedding_weight.npy'.format(cols[i]))
                We = np.vstack([We, np.zeros(256)])
                embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1],
                                     padding_idx=len(We) - 1, _weight=t.FloatTensor(We))
                for p in embed.parameters():
                    p.requires_grad = False
                emb_outputs.append(embed)
            for i in range(n_in):
                We = np.load('./w2v_128_60/{}_embedding_weight.npy'.format(cols[i]))
                We = np.vstack([We, np.zeros(128)])
                embed = nn.Embedding(num_embeddings=We.shape[0], embedding_dim=We.shape[1],
                                     padding_idx=len(We) - 1, _weight=t.FloatTensor(We))
                for p in embed.parameters():
                    p.requires_grad = False
                emb_outputs.append(embed)
                del We
                gc.collect()
            self.encoders = nn.ModuleList(emb_outputs)
            self.emb_drop = nn.Dropout(p=0.2)
            self.embed_conv = nn.Sequential(
                Inception(1920, 1024),  # 1920 = (256 + 128) * 5 input channels
                Inception(1024, 1024),
                # nn.MaxPool1d(opt.title_seq_len)
            )
            self.fc = nn.Sequential(
                # The original declared nn.Linear(1024 * 2, 1024), which does not match
                # the 1024-dim pooled feature produced in forward().
                nn.Linear(1024, 1024),
                nn.BatchNorm1d(1024),
                nn.ReLU(inplace=True),
                nn.Dropout(p=0.2),
                nn.Linear(1024, n_cls)
            )

        def forward(self, xs):
            inp = [self.encoders[i](x) for i, x in enumerate(xs)] + \
                  [self.encoders[i + 5](x) for i, x in enumerate(xs)]
            x = t.cat(inp, 2)
            x = self.emb_drop(x)
            x = self.embed_conv(x.permute(0, 2, 1))   # Conv1d expects (batch, channels, seq_len)
            x = t.max(x.permute(0, 2, 1), dim=1)[0]   # max over time: (batch, 1024)
            logits = self.fc(x)
            return logits
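The slimmed-down multi-head attention mentioned earlier (keep only Q and K, drop the dense layers) is not shown in the article. One possible reading, sketched here in NumPy purely as an assumption: learn projections for the queries and keys only, use the unprojected input as the values, and skip the output dense layer. The function name, shapes, and random weights below are illustrative and not the author's implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def slim_multi_head_attention(x, w_q, w_k, n_heads):
    """Self-attention with Q/K projections only.

    x: (seq_len, d_model); w_q, w_k: (d_model, d_model).
    Values are the unprojected inputs and there is no output dense layer.
    """
    seq_len, d_model = x.shape
    assert d_model % n_heads == 0
    d_head = d_model // n_heads

    # Split the feature dimension into heads: (n_heads, seq_len, d_head).
    q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    attn = softmax(scores, axis=-1)
    out = attn @ v  # (n_heads, seq_len, d_head)

    # Merge the heads back into (seq_len, d_model).
    return out.transpose(1, 0, 2).reshape(seq_len, d_model), attn


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
w_q = rng.normal(size=(8, 8))
w_k = rng.normal(size=(8, 8))
out, attn = slim_multi_head_attention(x, w_q, w_k, n_heads=2)
print(out.shape, attn.shape)
```

Dropping the value and output projections removes half of the attention block's weight matrices, which fits the article's theme of keeping training cheap; whether it helps accuracy would have to be checked experimentally.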

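05 Results
The best single model scored about 1.475x on semi-final leaderboard A. After a round of blending, the score on leaderboard B reached 1.479952, just short of 1.48. That put me second on the internal leaderboard and fourteenth on the external one, which is a bit of a pity.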
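The article does not say how the blend was done. As a minimal sketch, assuming each model outputs a class-probability matrix and that a plain (optionally weighted) average is used, the blending step could look like the following. `blend`, `accuracy`, and the toy arrays are illustrative names and data, not the author's code.

```python
import numpy as np


def blend(prob_list, weights=None):
    """Weighted average of class-probability matrices of shape (n_users, n_classes)."""
    probs = np.stack(prob_list)  # (n_models, n_users, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, probs, axes=1)  # (n_users, n_classes)


def accuracy(probs, labels):
    """Share of users whose most probable class matches the label."""
    return float((probs.argmax(axis=1) == labels).mean())


# Toy predictions from two hypothetical models on three users, two classes.
model_a = np.array([[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]])
model_b = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
labels = np.array([0, 1, 1])

blended = blend([model_a, model_b])
print(accuracy(model_a, labels), accuracy(model_b, labels), accuracy(blended, labels))
```

In this toy case each model gets one user wrong but on different users, so the average recovers both; that complementarity is the usual reason for blending models built in different frameworks.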
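Thanks to the 大雄 team for sharing; their efficient style is truly impressive. Every team has its own strengths, and we look forward to seeing what the finalists bring to the final!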