Pinterest has introduced PinnerFormer, a state-of-the-art sequence modeling approach for learning user representations that power personalized recommendations on its platform. PinnerFormer aims to predict users' long-term engagement with Pins based on their recent actions, enabling Pinterest to surface the most relevant and engaging content to over 400 million monthly users.
Let's clarify the input and output shapes of the PinnerFormer model step by step:
To summarize:
- The input consists of the user action sequence, positive Pin embeddings, and negative Pin embeddings.
- The output includes the user embeddings for each position in the user's action sequence.
- The Pin embeddings are learned separately by the Pin embedding layer and are not directly part of the model's output but are used during training and inference.
In the PinnerFormer model, the selection of training examples is guided by the training objective, which aims to predict positive user engagement. The model considers three forms of positive engagement: Repins, Closeups, and Clicks. Instead of learning task-specific heads for each engagement type, the model learns a single embedding in a multi-task manner, enabling it to effectively retrieve different types of positive engagement.
The authors explore four training objectives for selecting pairs of user embeddings (u_i) and pin embeddings (p_i):
During training, the model selects pairs of user embeddings (u_i) and pin embeddings (p_i) based on the chosen training objective. For example:
- In the Next Action Prediction objective, the model pairs the user embedding (u_i) with the embedding of the next positive action (p_i) in the user's sequence.
- In the All Action Prediction objective, the model pairs the final user embedding (e_1) with the embeddings of all positive actions the user will take in the next K days.
- In the Dense All Action Prediction objective, the model pairs user embeddings at randomly selected indices (e_s_i) with the embeddings of randomly selected positive actions from the next K days.
The choice of training objective determines how the user embeddings and pin embeddings are paired for training. The authors aim to learn a single embedding that can effectively retrieve different types of positive engagement, without explicitly weighting different engagement types differently in the loss computation.
By selecting training examples based on these objectives, the PinnerFormer model learns to predict positive user engagement over a longer time horizon, capturing the user's interests and preferences beyond just the next immediate action. The Dense All Action Prediction objective, in particular, provides a dense signal for training by pairing user embeddings at different positions with positive actions from the future time window.
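The pairing rules for the three objectives described above can be sketched in a few lines of Python. This is a simplified illustration rather than Pinterest's implementation: the function names are hypothetical, positions are 0-based in chronological order (so the "final" embedding is at index `seq_len - 1`), and a single future window of positives is shared by all positions.

```python
import random


def next_action_pairs(seq_len, future_positives):
    """Pair the final position with the first future positive action."""
    if not future_positives:
        return []
    return [(seq_len - 1, future_positives[0])]


def all_action_pairs(seq_len, future_positives):
    """Pair the final position with every positive action in the K-day window."""
    return [(seq_len - 1, pin) for pin in future_positives]


def dense_all_action_pairs(seq_len, future_positives, num_positions, rng):
    """Pair randomly chosen positions with randomly chosen future positives."""
    if not future_positives:
        return []
    positions = rng.sample(range(seq_len), k=min(num_positions, seq_len))
    return [(pos, rng.choice(future_positives)) for pos in sorted(positions)]


# Hypothetical example: 8 past actions, 3 positive actions in the next K days.
future = ["pin_a", "pin_b", "pin_c"]
rng = random.Random(0)
next_pairs = next_action_pairs(8, future)
all_pairs = all_action_pairs(8, future)
dense_pairs = dense_all_action_pairs(8, future, num_positions=4, rng=rng)
```

Each returned tuple is a `(position, pin_id)` pair: the user embedding at that position is trained to score the given Pin highly. The dense variant yields one training signal per sampled position instead of one per sequence.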
PinnerFormer leverages a rich set of features to capture various aspects of user behavior and the properties of interacted Pins. These features can be categorized into those that require learning and those that are directly encoded.
Action Type Embedding: The type of action performed by the user (e.g., click, save, share) is encoded using a learnable embedding table. The model learns appropriate embeddings for each action type during training.
Surface Embedding: The surface or context where the action occurred (e.g., home feed, search results) is encoded using a learnable embedding table. The model learns appropriate embeddings for each surface during training.
PinSage Embedding: Each user action is associated with a 256-dimensional PinSage embedding, which represents the Pin involved in the action. These embeddings are pre-computed and fixed, and do not need to be learned by PinnerFormer.
Action Duration: The duration of the user's action is directly encoded as a single scalar value using a logarithmic transformation: log(duration). This feature does not involve any learnable parameters.
Timestamp: The raw absolute timestamp of when the action occurred is directly included as a feature, without any learning required.
Time since Latest Action: The time elapsed since the user's most recent action is computed and directly included as a feature, without involving any learnable parameters.
Time Gap between Actions: The time gap between consecutive actions in the user's sequence is computed and directly included as a feature, without requiring any learning.
Periodic Time Encoding: For each time-related feature (timestamp, time since latest action, and time gap between actions), a periodic encoding is applied using sine and cosine transformations with fixed periods. The periods are fixed, but the phase shifts used in the encoding are learnable parameters.
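As a rough illustration of the periodic encoding, the sketch below maps one time value to sine and cosine features. The specific periods (hour, day, week) and the zero phase values are assumptions made for the example; in the model the phases would be trained parameters.

```python
import math

HOUR, DAY, WEEK = 3600.0, 86400.0, 604800.0  # assumed periods, in seconds


def periodic_time_encoding(t_seconds, periods, phases):
    """Return [sin, cos] features for each fixed period, shifted by a phase."""
    if len(periods) != len(phases):
        raise ValueError("need exactly one phase per period")
    encoding = []
    for period, phase in zip(periods, phases):
        angle = 2.0 * math.pi * t_seconds / period + phase
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


periods = [HOUR, DAY, WEEK]
phases = [0.0, 0.0, 0.0]  # placeholders for values that would be learned
enc = periodic_time_encoding(6 * HOUR, periods, phases)
```

With these settings, a value of six hours sits a quarter of the way through the daily cycle, so the daily sine feature is close to 1 and the daily cosine feature is close to 0.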
All these features are concatenated into a single input vector, which is then fed into the PinnerFormer model to learn user representations. The combination of learnable embeddings and directly encoded features allows PinnerFormer to effectively capture user preferences and behavior patterns.
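To make the concatenation concrete, here is a minimal sketch that assembles the input vector for one action. The vocabularies, the 4-dimensional size of the learnable embeddings, and the dictionary layout of an action are all hypothetical; the periodic time encodings are left out for brevity, and the duration is assumed to be strictly positive so that its logarithm is defined.

```python
import math
import random

rng = random.Random(0)

# Hypothetical vocabularies and sizes, chosen only for illustration.
ACTION_TYPES = ["click", "save", "share"]
SURFACES = ["home_feed", "search_results"]
SMALL_DIM = 4        # assumed size of the learnable embeddings
PINSAGE_DIM = 256    # PinSage embedding size stated in the text


def init_table(keys, dim):
    """Randomly initialised embedding table; training would update it."""
    return {key: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for key in keys}


action_table = init_table(ACTION_TYPES, SMALL_DIM)
surface_table = init_table(SURFACES, SMALL_DIM)


def build_action_vector(action, latest_ts, prev_ts):
    """Concatenate learned lookups with directly encoded features."""
    time_features = [
        action["timestamp"],               # raw absolute timestamp
        latest_ts - action["timestamp"],   # time since the latest action
        action["timestamp"] - prev_ts,     # gap to the previous action
    ]
    return (
        action_table[action["type"]]
        + surface_table[action["surface"]]
        + action["pinsage"]
        + [math.log(action["duration"])]
        + time_features
    )


action = {
    "type": "save",
    "surface": "home_feed",
    "pinsage": [0.0] * PINSAGE_DIM,
    "duration": 12.0,
    "timestamp": 1_700_000_000.0,
}
vector = build_action_vector(
    action, latest_ts=1_700_000_600.0, prev_ts=1_699_999_900.0
)
```

In this toy layout the resulting vector has 4 + 4 + 256 + 1 + 3 = 268 entries; only the two lookup tables would receive gradient updates.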
The loss function in PinnerFormer is designed to learn user embeddings and Pin embeddings that capture the user's preferences and the characteristics of the Pins. The goal is to optimize the embeddings such that the user embeddings are similar to the embeddings of the Pins the user is likely to engage with.
PinnerFormer uses a variant of the softmax loss, specifically the sampled softmax loss with a log-Q correction. Here's how it works:
The loss function can be formulated as follows:
loss = -log(exp(s(u, p)) / (exp(s(u, p)) + Σ exp(s(u, n) - log(Q(n)))))
Where:
- u is the user embedding (a row from the matrix E)
- p is the embedding of a positive Pin
- n is the embedding of a negative Pin
- s(u, p) and s(u, n) are the similarity scores between the user embedding and the positive/negative Pin embeddings (e.g., dot product)
- Q(n) is the probability of the negative Pin n being sampled
During training, the model learns to maximize the similarity between the user embedding and the embeddings of the positive Pins while minimizing the similarity with the negative Pins. The log-Q correction term helps to adjust for the sampling bias and gives more importance to less frequently sampled negative examples.
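The formula above can be evaluated directly. The sketch below is a plain-Python illustration for a single (user, positive) pair, assuming dot-product similarity; the function names are hypothetical, and a log-sum-exp step is added for numerical stability.

```python
import math


def dot(a, b):
    """Dot-product similarity s(u, p)."""
    return sum(x * y for x, y in zip(a, b))


def sampled_softmax_loss(user, positive, negatives, neg_sample_probs):
    """-log(exp(s(u,p)) / (exp(s(u,p)) + sum_n exp(s(u,n) - log Q(n))))."""
    pos_logit = dot(user, positive)
    neg_logits = [
        dot(user, n) - math.log(q)  # log-Q correction on each negative
        for n, q in zip(negatives, neg_sample_probs)
    ]
    logits = [pos_logit] + neg_logits
    # Subtract the max before exponentiating to avoid overflow.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_denominator - pos_logit


user = [1.0, 0.0]
positive = [1.0, 0.0]
negative = [1.0, 0.0]  # scores exactly as well as the positive

loss_uncorrected = sampled_softmax_loss(user, positive, [negative], [1.0])
loss_rare_negative = sampled_softmax_loss(user, positive, [negative], [0.5])
```

Here the negative scores the same as the positive. With Q(n) = 1 the loss is log 2; halving Q(n) raises it to log 3, which shows how the correction gives more weight to negatives that are sampled less often.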
By minimizing this loss function, PinnerFormer learns user embeddings and Pin embeddings that capture the user's preferences and the characteristics of the Pins. The learned embeddings can then be used for personalized recommendation by comparing the similarity between the user embedding and the Pin embeddings.
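As a small illustration of that last step, the following sketch ranks a handful of Pins by dot-product similarity to a user embedding. The Pin IDs and vectors are made up, and the exhaustive sort is only for clarity; a system at Pinterest's scale would presumably rely on an approximate nearest-neighbor index instead.

```python
def dot(a, b):
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def top_k_pins(user_embedding, pin_embeddings, k):
    """Return the IDs of the k Pins most similar to the user embedding."""
    ranked = sorted(
        pin_embeddings.items(),
        key=lambda item: dot(user_embedding, item[1]),
        reverse=True,
    )
    return [pin_id for pin_id, _ in ranked[:k]]


# Hypothetical 2-dimensional embeddings.
pins = {
    "pin_garden": [0.9, 0.1],
    "pin_recipe": [0.1, 0.9],
    "pin_travel": [0.5, 0.5],
}
user = [1.0, 0.0]
recommended = top_k_pins(user, pins, k=2)
```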
A key innovation in PinnerFormer is its focus on predicting long-term user engagement over a multi-day horizon, rather than just the next immediate action. This is achieved through a novel dense all-action loss that aims to predict a variety of future user actions on Pins, such as clicks, close-ups, and repins. By training on this diverse set of engagement signals in a multi-task manner, PinnerFormer learns rich user representations that capture evolving user interests over extended time periods.
Offline experiments show that this long-term engagement objective, combined with the dense all-action loss and multi-task training, significantly improves the quality of the learned user embeddings compared to traditional next-action prediction approaches. PinnerFormer can surface Pins that align with a user's interests over a 2-week horizon, even when the embeddings are only updated once per day.
Another important aspect of PinnerFormer is its design for offline, batch inference rather than real-time updates. User embeddings are updated daily based on the past day's actions, then served to power various recommendation and ranking systems across Pinterest.
This offline inference setup greatly simplifies the deployment of PinnerFormer in Pinterest's infrastructure. It avoids the need for streaming pipelines to constantly update user embeddings in real-time with each new action, as well as the challenges of managing mutable embedding states. This enables the use of larger, more expressive PinnerFormer models while keeping infrastructure cost and complexity under control.
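One way to picture the daily batch job is as a snapshot refresh. The sketch below is an assumption about how such a job could be organized, not a description of Pinterest's pipeline: `daily_refresh` and `toy_encoder` are hypothetical names, users with no recent activity simply keep their previous embedding, and the encoder is a stand-in for the trained model.

```python
def daily_refresh(previous_snapshot, active_user_sequences, encode_user):
    """Build today's embedding snapshot from yesterday's plus active users."""
    snapshot = dict(previous_snapshot)  # inactive users keep their old embedding
    for user_id, actions in active_user_sequences.items():
        snapshot[user_id] = encode_user(actions)
    return snapshot


def toy_encoder(actions):
    """Stand-in for the trained model: encodes a sequence as its length."""
    return [float(len(actions))]


yesterday = {"user_1": [3.0], "user_2": [5.0]}
active_today = {"user_2": ["a", "b"], "user_3": ["c"]}
today = daily_refresh(yesterday, active_today, toy_encoder)
```

Because each run produces a complete, immutable snapshot, downstream systems only need to read the latest one, and there is no per-user state to mutate between runs.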
Interestingly, Pinterest's experiments demonstrate that PinnerFormer's long-term engagement objective also makes it more robust to the staleness that comes with daily batch updates. The performance gap between daily updated PinnerFormer embeddings and fully real-time embeddings is much smaller than the corresponding gap for sequence models trained on next-action prediction.
PinnerFormer is trained using a variant of the sampled softmax loss, which contrasts the similarity between a user's embedding and their future engaged Pins' embeddings (positive examples) against randomly sampled negative Pin embeddings.
A key aspect of the training process is the use of "in-batch negatives". In addition to random negatives, the positive examples from other users within the same training batch are used as negatives for the current user. This serves several purposes:
However, since popular Pins are more likely to appear as in-batch negatives, the model applies a log-Q correction term to adjust for this sampling bias. The overall loss combines both in-batch and random negatives to balance informative negative sampling with broad coverage of the Pin corpus.
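The sketch below shows one possible reading of how in-batch and random negatives could be combined: for each user, one sampled-softmax term is computed against other users' positives and another against random negatives, and the two are added. The sampling probabilities (`in_batch_q`, `random_q`), the toy embeddings, and the choice to skip in-batch negatives that match the user's own positive are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_loss(user, positive, negatives, probs):
    """Sampled softmax with a log-Q correction on each negative."""
    pos_logit = dot(user, positive)
    logits = [pos_logit] + [
        dot(user, n) - math.log(q) for n, q in zip(negatives, probs)
    ]
    max_logit = max(logits)
    return max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    ) - pos_logit


def batch_loss(batch, random_negatives, in_batch_q, random_q):
    """batch: list of (user_emb, positive_pin_id, positive_emb) tuples."""
    total = 0.0
    for i, (user, pos_id, pos_emb) in enumerate(batch):
        # Other users' positives act as negatives, unless it is the same Pin.
        others = [
            (emb, in_batch_q[pid])
            for j, (_, pid, emb) in enumerate(batch)
            if j != i and pid != pos_id
        ]
        in_batch_term = softmax_loss(
            user, pos_emb, [e for e, _ in others], [q for _, q in others]
        )
        random_term = softmax_loss(
            user, pos_emb, random_negatives, [random_q] * len(random_negatives)
        )
        total += in_batch_term + random_term
    return total / len(batch)


batch = [
    ([1.0, 0.0], "pin_a", [1.0, 0.0]),
    ([0.0, 1.0], "pin_b", [0.0, 1.0]),
    ([1.0, 1.0], "pin_a", [1.0, 0.0]),
]
random_negatives = [[-1.0, 0.0], [0.0, -1.0]]
in_batch_q = {"pin_a": 0.02, "pin_b": 0.01}  # assumed: popular Pins appear more often
random_q = 0.001                              # assumed: uniform over 1,000 Pins
loss = batch_loss(batch, random_negatives, in_batch_q, random_q)
```

In-batch negatives come for free from the batch itself, while the random negatives cover Pins that rarely show up as another user's positive.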
This sampled softmax loss with in-batch negatives enables PinnerFormer to efficiently learn high-quality user embeddings that capture long-term engagement likelihood over a large set of Pins.
PinnerFormer has been successfully deployed in production to power multiple recommendation systems at Pinterest, including the home feed and ads ranking. It serves as a single, unified user representation that captures users' interests across various surfaces.
The introduction of PinnerFormer has driven substantial gains in Pinterest's key engagement and retention metrics. For example, using PinnerFormer in the home feed pin ranking model led to a 1% increase in total time spent on the platform and a 0.12% lift in weekly active users. This demonstrates how learning high-quality user representations that capture long-term interests can meaningfully improve the relevance and quality of recommendations.
PinnerFormer showcases the impact that a well-designed sequence modeling approach can have on enhancing user representations and recommendations in a large-scale visual discovery platform like Pinterest. The focus on long-term engagement prediction, enabled by the dense all-action loss and offline batch inference design, allows PinnerFormer to learn rich and robust user embeddings that drive gains in key product metrics.
As recommendation systems continue to advance, PinnerFormer underscores the importance of thinking beyond next-action prediction and designing approaches that can effectively capture users' long-term and multi-faceted interests. Its success at Pinterest serves as a valuable case study for leveraging sequence modeling to enhance user understanding and deliver more relevant, engaging experiences.
PinnerFormer: Sequence Modeling for User Representation at Pinterest
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest
Created 2024-03-11T17:27:14-07:00, updated 2024-03-12T16:33:06-07:00