PinnerFormer: Sequence Modeling for User Representation at Pinterest

Pinterest has introduced PinnerFormer, a state-of-the-art sequence modeling approach for learning user representations that power personalized recommendations on its platform. PinnerFormer aims to predict users' long-term engagement with Pins based on their recent actions, enabling Pinterest to surface the most relevant and engaging content to more than 400 million monthly active users.

Input and Output Shapes

Let's clarify the input and output shapes of the PinnerFormer model step by step:

Input

  1. User Action Sequence: a matrix of shape (M, Din), where M is the number of actions in the user's sequence and Din is the dimensionality of each action's input feature vector.

  2. Positive Pin Embeddings: a matrix of shape (P, D) holding the embeddings of the P Pins the user positively engages with in the prediction window.

  3. Negative Pin Embeddings: a matrix of shape (N, D) holding the embeddings of N sampled negative Pins.

Output

  1. User Embeddings: a matrix of shape (M, D), with one D-dimensional embedding for each position in the user's action sequence.

  2. Pin Embeddings: D-dimensional embeddings produced by a separate Pin embedding layer; as noted below, they are used during training and inference rather than being a direct output of the transformer.
Training Process

  1. The input user action sequence (M, Din) is passed through the transformer layers to generate the user embeddings (M, D).
  2. The positive and negative Pin embeddings (P, D) and (N, D) are used to compute the sampled softmax loss with log-Q correction.
  3. The loss is backpropagated to update the parameters of the transformer layers and the Pin embedding layer.
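
As a rough illustration of step 1, here is a minimal PyTorch sketch of the encoder's shape flow; the layer choices and sizes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the paper's actual configuration.
M, D_IN, D = 256, 512, 256

# A stand-in for PinnerFormer's transformer backbone. The real model also
# applies a causal mask so each position only attends to earlier actions.
encoder_layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
model = nn.Sequential(
    nn.Linear(D_IN, D),                      # project input features to model width
    nn.TransformerEncoder(encoder_layer, num_layers=4),
)

actions = torch.randn(1, M, D_IN)            # one user's (M, Din) action sequence, batched
user_embs = model(actions)                   # (1, M, D): one embedding per position
```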

Inference

  1. The input user action sequence (M, Din) is passed through the trained transformer layers to generate the user embeddings (M, D).
  2. The user embeddings can be used for downstream tasks such as personalized recommendation by comparing them with the learned Pin embeddings.
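
As a hedged sketch, a downstream retrieval step could look like the following; the function name and top-k size are illustrative:

```python
import torch

def recommend(user_emb, pin_embs, k=10):
    """Score one user embedding against precomputed Pin embeddings.

    user_emb: (D,) e.g. the embedding at the final sequence position
    pin_embs: (num_pins, D) Pin embeddings from the Pin embedding layer
    """
    scores = pin_embs @ user_emb          # (num_pins,) dot-product similarity
    return torch.topk(scores, k).indices  # top-k candidate Pins for the user
```

At Pinterest's scale this brute-force scan would be replaced by approximate nearest-neighbor search, but the scoring is the same dot product used during training.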

To summarize:

  - The input consists of the user action sequence, positive Pin embeddings, and negative Pin embeddings.
  - The output includes the user embeddings for each position in the user's action sequence.
  - The Pin embeddings are learned separately by the Pin embedding layer and are not directly part of the model's output, but are used during training and inference.

Selecting Training Examples

In the PinnerFormer model, the selection of training examples is guided by the training objective, which aims to predict positive user engagement. The model considers three forms of positive engagement: Repins, Closeups, and Clicks. Instead of learning task-specific heads for each engagement type, the model learns a single embedding in a multi-task manner, enabling it to effectively retrieve different types of positive engagement.

The authors explore four training objectives for selecting pairs of user embeddings (u_i) and Pin embeddings (p_i):

  1. Next Action Prediction: a user embedding is paired with the embedding of the next positive action in the sequence.

  2. All Action Prediction: the final user embedding is paired with all positive actions the user takes over the next K days.

  3. Dense All Action Prediction: user embeddings at randomly selected positions are paired with randomly selected positive actions from the following K days.

  4. SASRec: the objective from the SASRec baseline, in which the embedding at every position predicts the immediately following action.

During training, the model selects pairs of user embeddings (u_i) and Pin embeddings (p_i) based on the chosen training objective. For example:

  - In the Next Action Prediction objective, the model pairs the user embedding (u_i) with the embedding of the next positive action (p_i) in the user's sequence.
  - In the All Action Prediction objective, the model pairs the final user embedding (e_1) with the embeddings of all positive actions the user will take in the next K days.
  - In the Dense All Action Prediction objective, the model pairs user embeddings at randomly selected indices (e_s_i) with the embeddings of randomly selected positive actions from the next K days.

The choice of training objective determines how the user embeddings and Pin embeddings are paired for training. The authors aim to learn a single embedding that can effectively retrieve different types of positive engagement, without explicitly assigning different weights to the engagement types in the loss computation.

By selecting training examples based on these objectives, the PinnerFormer model learns to predict positive user engagement over a longer time horizon, capturing the user's interests and preferences beyond just the next immediate action. The Dense All Action Prediction objective, in particular, provides a dense signal for training by pairing user embeddings at different positions with positive actions from the future time window.
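
As a concrete illustration of the Dense All Action pairing, here is a minimal sketch; the helper name and sampling procedure are assumptions rather than the paper's exact method:

```python
import random

def dense_all_action_pairs(num_positions, future_positive_ids, num_pairs):
    """Sample (sequence index, future positive action) training pairs.

    num_positions:       M, the length of the user's action sequence
    future_positive_ids: IDs of the user's positive actions in the next K days
    num_pairs:           number of pairs to draw for this user
    """
    return [
        (random.randrange(num_positions),       # random index s_i -> user embedding e_{s_i}
         random.choice(future_positive_ids))    # random future positive -> Pin embedding p_i
        for _ in range(num_pairs)
    ]
```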

Features Used in PinnerFormer

PinnerFormer leverages a rich set of features to capture various aspects of user behavior and the properties of interacted Pins. These features can be categorized into those that require learning and those that are directly encoded.

Features that require learning

  1. Action Type Embedding: The type of action performed by the user (e.g., click, save, share) is encoded using a learnable embedding table. The model learns appropriate embeddings for each action type during training.

  2. Surface Embedding: The surface or context where the action occurred (e.g., home feed, search results) is encoded using a learnable embedding table. The model learns appropriate embeddings for each surface during training.
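
In PyTorch terms, these two learnable features might be embedded roughly as follows; the vocabulary sizes and embedding width are made-up illustrative values:

```python
import torch.nn as nn

NUM_ACTION_TYPES = 16   # illustrative vocabulary size (click, repin, closeup, ...)
NUM_SURFACES = 8        # illustrative vocabulary size (home feed, search, ...)
EMB_DIM = 32            # illustrative embedding width

action_type_emb = nn.Embedding(NUM_ACTION_TYPES, EMB_DIM)  # learned during training
surface_emb = nn.Embedding(NUM_SURFACES, EMB_DIM)          # learned during training
```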

Features that do not require learning

  1. PinSage Embedding: Each user action is associated with a 256-dimensional PinSage embedding, which represents the Pin involved in the action. These embeddings are pre-computed and fixed, and do not need to be learned by PinnerFormer.

  2. Action Duration: The duration of the user's action is directly encoded as a single scalar value using a logarithmic transformation: log(duration). This feature does not involve any learnable parameters.

  3. Timestamp: The raw absolute timestamp of when the action occurred is directly included as a feature, without any learning required.

  4. Time since Latest Action: The time elapsed since the user's most recent action is computed and directly included as a feature, without involving any learnable parameters.

  5. Time Gap between Actions: The time gap between consecutive actions in the user's sequence is computed and directly included as a feature, without requiring any learning.

  6. Periodic Time Encoding: For each time-related feature (timestamp, time since latest action, and time gap between actions), a periodic encoding is applied using sine and cosine transformations with fixed periods. This encoding is a partial exception in this category: the periods are fixed, but the phase shifts used in the encoding are learnable parameters.
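
A minimal sketch of the log-duration encoding (item 2) and the periodic time encoding (item 6); the specific periods are assumptions:

```python
import math
import torch
import torch.nn as nn

class PeriodicTimeEncoding(nn.Module):
    """Sine/cosine features with fixed periods and learnable phase shifts."""

    def __init__(self, periods=(3600.0, 86400.0, 604800.0)):  # illustrative periods, in seconds
        super().__init__()
        self.register_buffer("freqs", 2 * math.pi / torch.tensor(periods))  # fixed
        self.phase = nn.Parameter(torch.zeros(len(periods)))                # learnable

    def forward(self, t):
        # t: (...,) time feature -> (..., 2 * num_periods) periodic encoding
        angles = t.unsqueeze(-1) * self.freqs + self.phase
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def encode_duration(duration):
    # Direct, non-learned encoding of action duration as log(duration).
    return torch.log(duration)
```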

All these features are concatenated into a single input vector, which is then fed into the PinnerFormer model to learn user representations. The combination of learnable embeddings and directly encoded features allows PinnerFormer to effectively capture user preferences and behavior patterns.

Loss Function in PinnerFormer

The loss function in PinnerFormer is designed to learn user embeddings and Pin embeddings that capture the user's preferences and the characteristics of the Pins. The goal is to optimize the embeddings such that the user embeddings are similar to the embeddings of the Pins the user is likely to engage with.

PinnerFormer uses a variant of the softmax loss, specifically the sampled softmax loss with a log-Q correction. Here's how it works:

  - Positive and Negative Examples: the positives are the Pins a user actually engages with in the prediction window; the negatives are Pins sampled from the corpus that the user did not engage with.

  - Sampled Softmax Loss: computing a full softmax over the entire Pin corpus is intractable, so the denominator is approximated using only the sampled negatives.

  - Log-Q Correction: because some Pins are sampled as negatives more often than others, each negative's score is corrected by subtracting the log of its sampling probability, Q(n).

The loss function can be formulated as follows:

loss = -log(exp(s(u, p)) / (exp(s(u, p)) + Σ exp(s(u, n) - log(Q(n)))))

Where:

  - u is the user embedding (a row of the user embedding matrix E)
  - p is the embedding of a positive Pin
  - n is the embedding of a negative Pin
  - s(u, p) and s(u, n) are the similarity scores between the user embedding and the positive/negative Pin embeddings (e.g., dot products)
  - Q(n) is the probability of the negative Pin n being sampled

During training, the model learns to maximize the similarity between the user embedding and the embeddings of the positive Pins while minimizing the similarity with the negative Pins. The log-Q correction term adjusts for sampling bias by down-weighting frequently sampled (popular) negatives relative to rarely sampled ones, so the model is not unduly penalized for scoring popular Pins highly.
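
A direct translation of this formula into a hedged PyTorch sketch (shapes and names are illustrative):

```python
import torch

def sampled_softmax_loss(u, pos, neg, neg_logq):
    """Sampled softmax with log-Q correction, mirroring the formula above.

    u:        (D,) user embedding
    pos:      (P, D) positive Pin embeddings
    neg:      (N, D) sampled negative Pin embeddings
    neg_logq: (N,) log Q(n) for each negative
    """
    pos_scores = pos @ u                        # s(u, p) for every positive
    neg_scores = neg @ u - neg_logq             # s(u, n) - log(Q(n))
    # Each positive is contrasted against the shared set of corrected negatives.
    logits = torch.cat(
        [pos_scores.unsqueeze(1),
         neg_scores.unsqueeze(0).expand(pos_scores.size(0), -1)],
        dim=1,
    )                                           # (P, 1 + N)
    return (torch.logsumexp(logits, dim=1) - pos_scores).mean()
```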

By minimizing this loss function, PinnerFormer learns user embeddings and Pin embeddings that capture the user's preferences and the characteristics of the Pins. The learned embeddings can then be used for personalized recommendation by comparing the similarity between the user embedding and the Pin embeddings.

Capturing Long-Term User Interests

A key innovation in PinnerFormer is its focus on predicting long-term user engagement over a multi-day horizon, rather than just the next immediate action. This is achieved through a novel dense all-action loss that aims to predict a variety of future user actions on Pins, such as clicks, closeups, and repins. By training on this diverse set of engagement signals in a multi-task manner, PinnerFormer learns rich user representations that capture evolving user interests over extended time periods.

Offline experiments show that this long-term engagement objective, combined with the dense all-action loss and multi-task training, significantly improves the quality of the learned user embeddings compared to traditional next-action prediction approaches. PinnerFormer can surface Pins that align with a user's interests over a 2-week horizon, even when the embeddings are only updated once per day.

Offline Inference for Simplified Deployment

Another important aspect of PinnerFormer is its design for offline, batch inference rather than real-time updates. User embeddings are updated daily based on the past day's actions, then served to power various recommendation and ranking systems across Pinterest.

This offline inference setup greatly simplifies the deployment of PinnerFormer in Pinterest's infrastructure. It avoids the need for streaming pipelines to constantly update user embeddings in real-time with each new action, as well as the challenges of managing mutable embedding states. This enables the use of larger, more expressive PinnerFormer models while keeping infrastructure cost and complexity under control.

Interestingly, Pinterest's experiments demonstrate that PinnerFormer's long-term engagement objective also makes it more robust to the staleness that comes with daily batch updates. The gap between daily-updated PinnerFormer embeddings and fully real-time embeddings is much smaller than the corresponding gap for sequence models trained on next-action prediction.

Training with Sampled Softmax Loss and In-Batch Negatives

PinnerFormer is trained using a variant of the sampled softmax loss, which contrasts the similarity between a user's embedding and their future engaged Pins' embeddings (positive examples) against randomly sampled negative Pin embeddings.

A key aspect of the training process is the use of "in-batch negatives". In addition to random negatives, the positive examples from other users within the same training batch are used as negatives for the current user. This serves several purposes:

  1. It allows for more efficient use of computation by leveraging embeddings already present in the batch.
  2. In-batch negatives tend to be harder examples, as they represent actual user-Pin engagements, forcing the model to learn more discriminative user embeddings.
  3. Contrasting against in-batch negatives within each batch encourages user embeddings to be more similar to their own positives and dissimilar to other users' positives, capturing user-specific preferences.

However, since popular Pins are more likely to appear as in-batch negatives, the model applies a log-Q correction term to adjust for this sampling bias. The overall loss combines both in-batch and random negatives to balance informative negative sampling with broad coverage of the Pin corpus.
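
A minimal sketch of an in-batch-negative softmax with log-Q correction, following the common formulation in which every logit is corrected by the Pin's estimated sampling probability; this is an illustration, not necessarily Pinterest's exact implementation:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_embs, pos_embs, pin_probs):
    """In-batch negatives with log-Q correction.

    user_embs: (B, D) one embedding per user in the batch
    pos_embs:  (B, D) each user's positive Pin embedding
    pin_probs: (B,) estimated probability of each Pin appearing in a batch
    """
    scores = user_embs @ pos_embs.T              # (B, B): row i scores user i vs all positives
    scores = scores - torch.log(pin_probs)       # log-Q correction; popular Pins appear more often
    labels = torch.arange(user_embs.size(0))     # the diagonal holds each user's own positive
    return F.cross_entropy(scores, labels)
```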

This sampled softmax loss with in-batch negatives enables PinnerFormer to efficiently learn high-quality user embeddings that capture long-term engagement likelihood over a large set of Pins.

Impact on Pinterest's Recommendations

PinnerFormer has been successfully deployed in production to power multiple recommendation systems at Pinterest, including the home feed and ads ranking. It serves as a single, unified user representation that captures users' interests across various surfaces.

The introduction of PinnerFormer has driven substantial gains in Pinterest's key engagement and retention metrics. For example, using PinnerFormer in the home feed pin ranking model led to a 1% increase in total time spent on the platform and a 0.12% lift in weekly active users. This demonstrates how learning high-quality user representations that capture long-term interests can meaningfully improve the relevance and quality of recommendations.

Conclusion

PinnerFormer showcases the impact that a well-designed sequence modeling approach can have on enhancing user representations and recommendations in a large-scale visual discovery platform like Pinterest. The focus on long-term engagement prediction, enabled by the dense all-action loss and offline batch inference design, allows PinnerFormer to learn rich and robust user embeddings that drive gains in key product metrics.

As recommendation systems continue to advance, PinnerFormer underscores the importance of thinking beyond next-action prediction and designing approaches that can effectively capture users' long-term and multi-faceted interests. Its success at Pinterest serves as a valuable case study for leveraging sequence modeling to enhance user understanding and deliver more relevant, engaging experiences.
