Bestofmedia’s brands Tom’s Guide and Tom’s Hardware have ever-growing communities, and our sites now have millions of web pages, both editorial and forum content. At such a scale, it becomes difficult to keep a synthetic view of our content and to understand the trends in our communities.
Data visualization is becoming a key component of modern data analysis and mining (see, in French, a post by Franck Ghitalla, a dataviz expert), acting as an intuitive, informative summary of huge bags of raw data. This post explains how data visualization can help tackle those problems.
We decided to work on our own data visualization project, called “Content Mapping”, with the following objectives in mind:
- gain insights on our content:
  - quickly visualize web content
  - make sure our editorial lines match the community’s needs
- drive analyses on:
  - related content clustering
  - tag distribution / tag recommendation
  - user profiles: personalization, finding similar users based on the pages they visited, their profiles, etc., and recommending relevant content
  - traffic distribution: identify trending categories vs. deprecated topics
Content Mapping: A Non-Linear Dimensionality Reduction Approach
Web documents are described by the textual content they contain. Standard text mining approaches propose feature extraction techniques to put documents into a form that pattern analysis tools can process. A bag-of-words representation describes a document as a vector of word counts; a tag representation describes it as the set of tags present in the text; and so on. All of these build document representations in a high-dimensional space. For example, with 100k words in our vocabulary, an editorial article and a forum post are each a single point in a 100k-dimensional space in the bag-of-words representation.
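To make the bag-of-words idea concrete, here is a minimal sketch using scikit-learn’s `CountVectorizer`. The three toy documents are made up for illustration and are not taken from our sites; the point is only that each document becomes one row whose length is the vocabulary size.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: hypothetical stand-ins for an article and two forum posts.
docs = [
    "best graphics card for gaming",
    "linux driver for my graphics card",
    "which linux distribution for gaming",
]

# Each document becomes a vector of word counts over the whole vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

n_docs, vocab_size = X.shape
print(n_docs, "documents in a", vocab_size, "dimensional space")

# Look up the count of one word in one document.
graphics_col = vectorizer.vocabulary_["graphics"]
print("count of 'graphics' in doc 0:", X[0, graphics_col])
```

With a real vocabulary of 100k words, the same matrix would simply have 100k columns, almost all of them zero for any given document.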
Dimensionality reduction techniques project the high-dimensional documents into a very low-dimensional latent space (2D or 3D for visualization), where analyses are much easier and visualization becomes more intuitive. More complex pattern analysis techniques such as classification and clustering also usually perform better in low-dimensional spaces (the well-known curse of dimensionality).
All the non-linear techniques we considered (as opposed to linear mappings such as PCA or LDA) tend to preserve neighbourhood information in the embedded space: two documents that are “similar” in the initial high-dimensional space will be close in the new space.
We tested four popular dimensionality reduction techniques:
- Locally Linear Embedding (LLE)
- Self-Organizing Maps (SOM)
- Multidimensional Scaling (MDS)
- Isomap
Content Mapping with Isomap
We obtained the best visualization results with Isomap (although we found that LLE gave better results for classification purposes, but with more than 3 dimensions). While MDS uses the Euclidean distance to measure the proximity of points in the original space, Isomap uses the geodesic distance, defined as the sum of edge weights along the shortest path between two nodes of a neighbourhood graph (computed using Dijkstra’s algorithm, for example).
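The difference between the two distances is easy to see on a toy example. The sketch below is an illustration only, not our production code: it places points along a half circle, builds a nearest-neighbour graph, and compares the straight-line distance between the two ends with the shortest-path distance through the graph. The choice of 50 points and 2 neighbours is arbitrary.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

# Toy data: 50 points along a half circle of radius 1
# (a curved 1-D manifold embedded in 2-D).
theta = np.linspace(0.0, np.pi, 50)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# Neighbourhood graph: each point is linked to its 2 nearest neighbours,
# with edge weights equal to the Euclidean distances.
graph = kneighbors_graph(X, n_neighbors=2, mode="distance")

# Geodesic distance = length of the shortest path in the graph (Dijkstra).
geodesic = shortest_path(graph, method="D", directed=False)

euclidean_end_to_end = np.linalg.norm(X[0] - X[-1])  # straight line across
geodesic_end_to_end = geodesic[0, -1]                # path along the arc

print("Euclidean:", euclidean_end_to_end)
print("Geodesic: ", geodesic_end_to_end)
```

The Euclidean distance cuts straight across the half circle (length 2), while the geodesic distance follows the arc and comes out close to pi. MDS would treat the two ends as fairly close; Isomap sees them as the two most distant points.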
We tested two representations for our web documents, bags of words and tags; both have their strengths and drawbacks. Bags of words are to text what pixel representations are to images: they contain the full raw information of the text. However, they tend to be very noisy on user-generated content, as the full vocabulary can be very large, with many word variants and typos. Tags provide a concise view of a document, capturing only the main topics present in the text. We kept the tag representation because, as explained later, users and tags themselves can also be mapped onto the same content map. Tags are thus used as a pivotal representation between content and users.
We took a uniformly distributed subsample of 10k documents from our sites and built the non-linear Isomap model using the Python implementation in the great scikit-learn library.
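For readers who want to try this, fitting an Isomap model with scikit-learn takes only a few lines. The sketch below uses a small random matrix as a placeholder for the real document-by-tag matrix, and the value of `n_neighbors` is an assumption chosen for the example, not the setting we used.

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.RandomState(0)

# Placeholder for the document-by-feature matrix: 300 "documents" with
# 50 features each (the real input would be 10k documents by tag features).
X = rng.rand(300, 50)

# Project to 2 dimensions for visualization.
# n_neighbors=10 is an illustrative value, not a tuned one.
model = Isomap(n_neighbors=10, n_components=2)
embedding = model.fit_transform(X)

print(embedding.shape)  # one 2-D point per document
```

Each row of `embedding` is the position of one document on the map and can be passed directly to a scatter plot.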
Experiments & Results
In the following figure (left), each point corresponds to a list of tags, projected onto a 2-dimensional map. We also add a heatmap measuring the local density of points in each region, using a simple Gaussian kernel density estimation (right).
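A density heatmap of this kind can be sketched with SciPy’s `gaussian_kde`. The 2-D points below are synthetic (two made-up clusters standing in for projected documents), and the grid bounds and resolution are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(0)

# Synthetic 2-D map coordinates: a large cluster and a smaller one.
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(400, 2)),
    rng.normal(loc=(4.0, 4.0), scale=0.5, size=(100, 2)),
])

# gaussian_kde expects an array of shape (n_dims, n_points).
kde = gaussian_kde(points.T)

# Evaluate the density on a regular grid covering the map.
xs = np.linspace(-2, 6, 81)
ys = np.linspace(-2, 6, 81)
grid_x, grid_y = np.meshgrid(xs, ys)
density = kde(np.vstack([grid_x.ravel(), grid_y.ravel()])).reshape(grid_x.shape)

# Location of the densest grid cell (rows index y, columns index x).
peak_row, peak_col = np.unravel_index(density.argmax(), density.shape)
peak = (xs[peak_col], ys[peak_row])
print("densest region around", peak)
```

The `density` array can be drawn as a heatmap, and the densest grid cell gives a single representative location for a whole set of points.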
The nice property of the resulting map is that related “topics” are clustered together. Here is the mapping of the categories of all our French sites. (To map a category, we project all the documents it contains and look for the maximum density.)
We observe that software/games/programming documents are clustered in the center of the map (the densest red region). Hardware-related content is clustered on the left-hand side of the map, and everything related to mobility is at the bottom right.
By projecting each tag individually, we can see where it is centered on the map and which tags are related to it. By plotting the distribution of the documents that have a given tag, we can also instantly get an idea of the spread of the tag on the map. It intuitively reflects whether a tag is a core tag widely discussed in our forums or a very localized, specific tag. It could be used, for example, to detect the emergence of a new category in our forums.
We can do the same with each of our users. For that, we model a user by her tag profile: the list of tags of the pages she viewed/edited. It can be used, for example, to identify experts or more generic users. We can also use a user’s map to recommend personalized content to her.
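As a rough sketch of how a user could be placed on the map, the example below builds a tag profile from the pages a user viewed and projects it with an already fitted Isomap model. Everything here is hypothetical: the tag list, the viewed pages, the random document vectors used to fit the model, and the choice of normalized tag counts as the profile encoding.

```python
from collections import Counter

import numpy as np
from sklearn.manifold import Isomap

# Hypothetical tag vocabulary.
tags = ["linux", "ubuntu", "debian", "gpu", "cpu", "android", "iphone", "gaming"]

def tag_profile(tag_list):
    """Turn a list of tags into a normalized count vector over `tags`."""
    counts = Counter(tag_list)
    vec = np.array([counts[t] for t in tags], dtype=float)
    total = vec.sum()
    return vec / total if total > 0 else vec

# Placeholder document vectors (random tag weights) used to fit the map.
rng = np.random.RandomState(0)
doc_vectors = rng.dirichlet(np.ones(len(tags)), size=300)
model = Isomap(n_neighbors=10, n_components=2).fit(doc_vectors)

# Tags of the pages a hypothetical user viewed.
viewed_pages = [["linux", "ubuntu"], ["linux", "debian"], ["linux", "gpu"]]
profile = tag_profile([t for page in viewed_pages for t in page])

# Project the user onto the same map as the documents.
user_point = model.transform(profile.reshape(1, -1))
print(profile, user_point)
```

Because users and documents share the tag representation, the same fitted model places both on one map, which is what makes tags a convenient pivot between the two.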
Here is for example the map of a user specialized in Linux distributions:
Last but not least, we can also plot the temporal evolution of the traffic on the map. For that, we project the distribution of the tags at each time frame. For example, in the following animation we can clearly see the emergence of mobile topics in our content. The bottom-right mobile area was “turned off” in early 2011 and becomes more active in Q4 2012. (Click on the figure to see the map evolution over quarters.)
We can also plot many other signals/KPIs on the map, such as active pages vs. crawled pages, revenue, page views per visit, freshness, etc. All we need to plot a signal on the map is a tag decomposition of that signal.
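The idea of a tag decomposition can be illustrated as follows. In this sketch, each tag has a position on the map and a signal value (here, page views) for a given quarter; the signal is spread over a coarse grid by weighting each tag’s position with its value. The tag positions, the page view numbers and the 2x2 grid are all invented for the example and are not measurements from our sites.

```python
import numpy as np

# Hypothetical 2-D map positions of four tags.
tag_positions = {
    "gpu": (-2.0, 0.5),
    "linux": (0.3, 0.4),
    "android": (2.0, -1.5),
    "iphone": (2.2, -1.2),
}

# Hypothetical tag decomposition of a signal (page views) for two quarters.
page_views = {
    "2011-Q1": {"gpu": 500, "linux": 800, "android": 50, "iphone": 50},
    "2012-Q4": {"gpu": 400, "linux": 700, "android": 600, "iphone": 300},
}

def signal_grid(signal_by_tag, bins=2, extent=((-3, 3), (-3, 3))):
    """Spread a per-tag signal over a grid of the map, normalized to sum to 1."""
    xs = [tag_positions[t][0] for t in signal_by_tag]
    ys = [tag_positions[t][1] for t in signal_by_tag]
    weights = [signal_by_tag[t] for t in signal_by_tag]
    grid, _, _ = np.histogram2d(xs, ys, bins=bins, range=extent, weights=weights)
    return grid / grid.sum()

early = signal_grid(page_views["2011-Q1"])
late = signal_grid(page_views["2012-Q4"])

# grid[1, 0] is the bottom-right cell: x in [0, 3], y in [-3, 0).
print("bottom-right share, early:", early[1, 0])
print("bottom-right share, late: ", late[1, 0])
```

Computing one such grid per time frame and displaying the grids in sequence gives an animation of how a signal moves across the map; a finer grid or a weighted kernel density estimate would give a smoother picture.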
Our tag-based content map gives us a powerful tool to understand our content and our visitors/users. It can directly help us make the right decisions to turn content into business and keep our sites focused on our users’ evolving interests.