Content Mapping: A Unified Visualization of Content, Users and Tags

Bestofmedia’s brands Tom’s Guide and Tom’s Hardware have ever-growing communities, and our sites now count millions of web pages, both editorial and forum content. At such a scale, it becomes difficult to keep a synthetic view of our content and to understand the trends in our communities.
Data visualization is becoming a key component of modern data analysis and mining ([1], and, in French, a post by Franck Ghitalla, a dataviz expert: [2]), acting as an intuitive summary of huge bags of raw data. This post explains how data visualization can help tackle those problems.

We decided to work on our own data visualization project, called “Content Mapping”, with the following objectives in mind:

  • gain insights into our content:
    • quickly visualize web content
    • make sure our editorial lines match the community needs
  • drive analyses of:
    • related content clustering
    • tag distribution / tag recommendation
    • user profiles: personalization, finding similar users based on the pages they visited and their profiles, recommending relevant content
    • traffic distribution: identifying trending categories vs. declining topics

Content Mapping: A Non Linear Dimensionality Reduction Approach

Web documents are described by the textual content they contain. Standard text mining approaches propose feature extraction techniques to put documents into a form that can be processed by pattern analysis tools. Bag-of-words represents a document as a vector of word counts; the tag representation considers a document as the set of tags present in the text; and so on. All these approaches represent documents in a high-dimensional space. For example, with 100k words in our vocabulary, an editorial article or a forum post is a single point in a 100,000-dimensional space in the bag-of-words representation.
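As a quick illustration, here is a minimal bag-of-words example with scikit-learn; the two documents are made up:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["my laptop fan is too loud", "best graphics card for gaming"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of word counts

# Each document is one point in a |vocabulary|-dimensional space.
print(X.shape)  # (2, vocabulary size)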

Dimensionality reduction techniques project the high-dimensional documents into a very low-dimensional latent space (2D or 3D for visualization) where analyses are much easier and visualization becomes more intuitive. More complex pattern analysis techniques like classification, clustering, etc. also usually work better in small-dimensional spaces (the famous curse of dimensionality [3]).
All the techniques we considered are non-linear (as opposed to linear mappings such as PCA or LDA) and tend to preserve neighbourhood information in the embedded space: two documents that are “similar” in the initial high-dimensional space will be close in the new space.

We tested four popular dimensionality reduction techniques:

  • Locally Linear Embedding (LLE)[4]
  • Self-Organizing-Maps (SOM)[5]
  • Multi-dimensional Scaling (MDS)[6]
  • Isomap [7]

Content Mapping with Isomap

We obtained the best visualization results with Isomap (although we found that LLE gave better results for classification purposes, but with more than 3 dimensions). While MDS uses the Euclidean distance to measure the proximity of points in the original space, Isomap uses the geodesic distance, defined as the sum of edge weights along the shortest path between two nodes of a neighbourhood graph (computed using Dijkstra’s algorithm, for example).
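To make the geodesic distance concrete, here is a hedged sketch of how it can be computed on a k-nearest-neighbour graph with scikit-learn and SciPy; this mirrors what Isomap does internally, and the data and parameters below are purely illustrative:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Illustrative document matrix: 100 documents, 50 features.
X = np.random.rand(100, 50)

# 1. Connect each point to its k nearest neighbours,
#    weighting edges by the Euclidean distance.
knn_graph = kneighbors_graph(X, n_neighbors=10, mode='distance')

# 2. The geodesic distance between two documents is the length of the
#    shortest path in that graph (Dijkstra's algorithm).
geodesic_distances = shortest_path(knn_graph, method='D', directed=False)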

We tested two representations for our web documents, bags of words and tags; both have their strengths and drawbacks. Bags of words are to text what the pixel representation is to images: they contain the full raw information present in the text. However, they tend to be very noisy on user-generated content, as the full vocabulary can be huge, with many word variants and typos. Tags provide a concise view of a document, catching only the main topics present in the text. We kept this representation because, as explained later, users and tags themselves can also be mapped onto the same content map. Tags are thus used as a pivotal representation between content and users.

We took a subsample of 10k uniformly distributed documents from our sites and built the non-linear Isomap model using the Python implementation in the great scikit-learn library [8].
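A minimal sketch of this step, assuming a hypothetical helper that returns the tag string of each sampled document; the parameter values are illustrative, not necessarily the ones we used:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import Isomap

# Hypothetical helper: one tag string per sampled document,
# e.g. "linux distribution suse".
tag_strings = load_sampled_tag_strings()

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tag_strings).toarray()

# Project the ~10k documents onto a 2-D map.
isomap = Isomap(n_neighbors=10, n_components=2)
embedding = isomap.fit_transform(X)  # shape: (n_docs, 2)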

Experiments & Results

In the following figure (left), each point corresponds to a list of tags, projected onto a 2-dimensional map. We also add a heatmap measuring the local density of points in each region, using simple Gaussian kernel density estimation (right).

[Figures: full map, points only (left); density heatmap (right)]
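The density layer can be obtained with a simple Gaussian kernel density estimate over the 2-D embedding. A minimal sketch with SciPy and matplotlib, assuming the embedding array from the previous sketch (the grid resolution is illustrative):

import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# embedding: (n_docs, 2) array produced by the Isomap sketch above.
kde = gaussian_kde(embedding.T)

# Evaluate the density on a regular grid covering the map.
xmin, ymin = embedding.min(axis=0)
xmax, ymax = embedding.max(axis=0)
xx, yy = np.mgrid[xmin:xmax:200j, ymin:ymax:200j]
density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

plt.imshow(density.T, origin='lower', extent=[xmin, xmax, ymin, ymax])
plt.scatter(embedding[:, 0], embedding[:, 1], s=1, c='black')
plt.show()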

The nice property of the resulting map is that related “topics” are clustered together. Here is the mapping of the categories of all our French sites (to map a category, we project all the documents it contains and look for the region of maximum density).
[Figure: category labels projected onto the map]

We observe that software/games/programming documents are clustered in the center of the map (the densest, red region). Hardware-related content is clustered on the left-hand side of the map, and everything related to mobility is at the bottom right.

By projecting each tag individually, we can see where it is centered on the map and what its related tags are. By plotting the distribution of documents carrying a given tag, we can also instantly get an idea of the spread of the tag on the map. It intuitively reflects whether a tag is a core tag widely discussed in our forums or a very localized, specific one. This could be used, for example, to detect the emergence of a new category in our forums.
[Figures: tag maps for “Système d'exploitation” and “Mobile”]

We can do the same with each of our users. For that we model a user with her tag profile: the list of tags of the pages she viewed or edited. It can, for example, be used to distinguish experts from more generic users. We can also use a user’s map to recommend personalized content to her.
Here is, for example, the map of a user specialized in Linux distributions:

[Figure: map of a user specialized in Linux distributions]
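For the curious, here is roughly how such a user projection can be computed, reusing the vectorizer and the Isomap model from the earlier sketch; the helper collecting a user’s tags is hypothetical:

# Hypothetical helper: the tags of the pages the user viewed/edited.
user_tags = get_user_tag_profile(user_id)  # e.g. ["linux", "suse", "debian"]

# Reuse the vectorizer fitted on the documents so the dimensions match.
user_vector = vectorizer.transform([" ".join(user_tags)]).toarray()

# Isomap supports out-of-sample projection of new points.
user_xy = isomap.transform(user_vector)  # (1, 2) position on the map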

Last but not least, we can also plot the temporal evolution of the traffic on the map. For that we project the distribution of the tags at each time frame. For example, in the following animation we can clearly see the emergence of mobile topics in our content: the bottom-right mobile area was “turned off” in early 2011 and becomes much more active in Q4 2012. (Click on the figure to see the map evolution over quarters.)
[Animation: map evolution over quarters]

We can also plot many other signals/KPIs on the map, such as active pages vs. crawled pages, revenue, page views per visit, freshness, etc. All we need to plot a signal on the map is a tag decomposition of that signal.

Conclusions

Our tag-based content map gives us a powerful tool to understand our content and our visitors/users. It can directly help us take the right decisions to turn content into business and keep our sites focused on users’ evolving interests.


How we do Python

We use more and more Python in our development here at Bestofmedia. We love it for its flexibility: it allows rapid prototyping, is very easy to learn, and is blazing fast for serving specific web services. You can find below how we mainly use it.




iframes and javascript cross-domain security

This is surely one of the problems that a frontend developer has met at least once in his life: “javascript cross-domain”. We have all experienced such an issue, but did you know that you can run into it through iframes as well?

Did you say “cross-domain security” ?

Yes! The eternal “Permission denied” when you want to manipulate a page hosted on another domain in javascript :)

However, nothing is impossible :) … or almost, as long as you know what you are doing.

The hack

In this post I will talk about a trick using an intermediary iframe called a “proxy”. The method seems complicated at first but, in the end, if you think “proxy”, everything becomes very easy.

Let’s go! Let’s set the stage.

Scenario

“As a frontend developer, I need to open a modal widget that contains an iframe. This widget can manipulate my page (the background color, for example) and close my modal.”

Note:

  • img.tomsguide.com is used for domainA
  • img.bestofmicro.com is used for domainB

Test 1: We will first perform a test to check the error “Permission denied”

When I try to access the parent page using “parent.” or “top.”, I get a javascript error.

domainA:

This page includes the modal widget and the close function.


<div id="modalOverlay"></div>
<div id="modalContainer"><iframe src="http://img.bestofmicro.com/demo/domainB.html" frameborder="0"></iframe></div>


// Close the modal by hiding the overlay and the container
function closeModal() {
    document.getElementById('modalOverlay').style.display = 'none';
    document.getElementById('modalContainer').style.display = 'none';
}

domainB:

Here are the 2 links that trigger the error.


<ul>
<li><a href="#" onclick="javascript:parent.document.getElementsByTagName('body')[0].style.background = 'red'; return false;">using parent.</a></li>
<li><a href="#" onclick="javascript:top.closeModal(); return false;">using top.</a></li>
</ul>

Test 2: We will then do a second test through the iframe proxy

domainB:

  • the new links:

<ul>
<li><a href="#" onclick="gotoAction('color:red');">red</a></li>
<li><a href="#" onclick="gotoAction('color:blue');">blue</a></li>
<li><a href="#" onclick="gotoAction('close');">close modal</a></li>
</ul>

  • the iframe proxy:
<iframe src="http://img.tomsguide.com/demo/proxyA.html" id="iframeProxy" width="0" height="0"></iframe>
  • the javascript function:

// Call this function to do the job through the proxy.
function gotoAction(action) {

    var iframeProxy = document.getElementById('iframeProxy');

    // Step 1: put the parameters you want to pass to the iframe proxy in the hash
    var src = iframeProxy.src.split('#');
    iframeProxy.src = src[0] + '#' + action;

    // Step 2: change the size of the iframe proxy to trigger its resize event (see proxyA.html)
    iframeProxy.width = parseInt(iframeProxy.width, 10) + 1;

    return false;
}

proxyA:


// This event is fired when you change my window size
window.onresize = function(){

    // Retrieve the hash of my url, ex: #color:red
    var hash = document.location.hash;

    if (hash != '') {

        // Split on # to extract "color:red"
        var parts = hash.split('#');

        // Split on : to extract "color" and "red"
        var params = parts[1].split(':');

        switch (params[0]) {

            case 'color':
                // Change my parent body background color
                window.top.document.getElementsByTagName('body')[0].style.background = params[1];
                break;

            case 'close':
                // Call my parent function closeModal
                window.top.closeModal();
                break;

            default:
                break;
        }
    }
};

Here is a picture of what exactly happens:

Note 1: The page proxyA MUST be hosted on the same domain as domainA

Note 2: It is also possible to pass parameters via the querystring (?action=color&value=red) with window.onload, but I chose to pair the hash with the resize event to trigger my actions without reloading the iframe proxy.

Demo & download

Sources:

  • http://softwareas.com/cross-domain-communication-with-iframes
  • http://ternarylabs.com/2011/03/27/secure-cross-domain-iframe-communication/
  • http://pipwerks.com/2008/11/30/iframes-and-cross-domain-security-part-2/

Chrome extensions: one code to rule them all

And now for something completely different!

Why?

BestOf Media is all about its communities and their participation in the forums. A user creates a topic, and members (or passers-by) answer the question or argue about the topic. To see new content on a topic where he contributed, a user has to either refresh the topic page, refresh the contributions page, or refresh the main site home page and then look up the status of the corresponding threads. That’s a lot of refreshing, a lot of voluntary user actions. So came the idea of a tool that would check the state of the user’s contributions and notify him of new activity.

What?

The challenge here is that our group has multiple sites, each with its own community. Each of those sites uses its own variation of a portal implementation (from two main frameworks) to display the site’s forum. The goal was to give all our members the same user experience with a tool slightly adapted to each targeted site. Hence we were trying to create, for a start, 5 extensions, all of them using the same code base. Each extension differs from the others only by its name, its targeted site and its visual identity. We also needed to cover 3 languages for those 5 extensions: English, German and French.

How?

Chrome (or Chromium) offers a simple yet quite potent browser extension framework. It uses HTML, CSS and JavaScript APIs, turning Chrome into a platform. The scheme of the tool is to query a page on a regular basis, parse it for data (which threads have new content? does the user have private messages? how many? and so on) and then notify the member of new things to be read. Now that we have set the stage, let’s talk about the bowels of the tool.

Deep inside

The framework eases the development of localized extensions: it is only a matter of putting the message files in the correct directories, as stated in the official documentation. That was one problem solved. The next point was to create a code base that could be replicated with minimal changes to fit each site’s specificities. Ant is used to build the extensions (that is, copy the common code base to a target folder, along with the site-specific data, and then zip that target folder). Here is the folder structure we used:

[Figure: project folder structure]

Here, the src/template-drap4chrome folder is the common code base, and every other folder in src contains the specific parts for the corresponding site: some specific UI parts, the extension manifest file and the main script. The manifest file describes the extension, giving it a name and a description. It also declares which specific permissions are required; these are presented to the user upon extension installation, so he knows what the extension is up to. The manifest is also where the extension states which web site is to be accessed.

The main script is the background page that sets up the correct run environment for the extension. Here, it creates the extension’s badge (the icon appearing in the extensions bar) and initializes the regular refresh process. It also declares which site-specific implementation of the extension is to be used. With the resources now separated, the business logic (mainly regular expressions used to parse the page and find useful data) is put in its own container: a hierarchy of JavaScript classes defining how to handle the page.

Action!

You can see those extensions in action for Tom’s Hardware, Tom’s Hardware UK, Tom’s Hardware France and Tom’s Guide France, using either Chrome or its open source sibling Chromium.


A little Redis storage strategies benchmark

TL;DR: don’t over-think when working with Redis.

Goal

At BestOf Media, we keep a lot of data about our pages. Here, we needed a fast, flexible way to store some structured data. We end up with 500k records per site, roughly 3M records for all our sites, each being an average of 4 kB of data (based on the storage solution we are currently using).

Solutions

We tested two storage schemas. The first is a naive schema: the JSON-serialized data is stored as the value of a Redis record, the key being a document ID. This is our solution 1.

A variant applies zlib string compression to the value; this is solution 2.

The other schema maps the document ID to a list of tuples (expressionId, urlId, contentTypeId). Then exprId is mapped to the expression text, urlId to the URL, and contentTypeId to the content type. This is solution 3.
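To make the exploded schema more concrete, here is a hedged sketch of what the writes could look like with redis-py; this is not the actual explodedBlocks indexer, and the key names and id allocation are made up:

import redis

r = redis.Redis(db=3)


def get_or_create_id(pipe, kind, value, cache):
    """Reuse the id of an already-seen value so identical strings are stored only once."""
    if value not in cache:
        new_id = len(cache) + 1
        cache[value] = new_id
        pipe.set('%s:%d' % (kind, new_id), value)
    return cache[value]


def index_exploded(records):
    """records: iterable of (doc_id, [(expression, url, content_type), ...]) pairs."""
    expr_cache, url_cache, ctype_cache = {}, {}, {}
    pipe = r.pipeline(transaction=False)  # periodic flushing omitted for brevity
    for doc_id, tuples in records:
        for expression, url, content_type in tuples:
            expr_id = get_or_create_id(pipe, 'expr', expression, expr_cache)
            url_id = get_or_create_id(pipe, 'url', url, url_cache)
            ctype_id = get_or_create_id(pipe, 'ctype', content_type, ctype_cache)
            # The document maps to its list of (exprId, urlId, contentTypeId) tuples.
            pipe.rpush('doc:%s' % doc_id, '%d:%d:%d' % (expr_id, url_id, ctype_id))
    pipe.execute()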

Loading

As of today, those data are stored in our main repository. It is great for searching documents, but does not perform so well when you want it to behave as a database. Today, loading all the data for one of our sites takes about 5 hours.

The loading test puts those data in Redis, with 100-record-long pipelines and with 500-record-long ones. All the code used in this post is Python, using the redis-py client API.

Redis is flushed between each test.
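For illustration, here is a hedged sketch of what the naive and zlib indexers do with redis-py pipelines; this is not the actual Indexer.py, and the record format is made up:

import json
import zlib

import redis

r = redis.Redis(db=1)


def index_blocks(records, use_zlib=False, pipeline_length=100):
    """records: iterable of (doc_id, block_dict) pairs."""
    pipe = r.pipeline(transaction=False)
    for i, (doc_id, block) in enumerate(records, start=1):
        value = json.dumps(block)
        if use_zlib:
            value = zlib.compress(value.encode('utf-8'))  # solution 2
        pipe.set(doc_id, value)  # naive schema: one key per document
        if i % pipeline_length == 0:
            pipe.execute()  # flush the pipeline every N records
    pipe.execute()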

The tests

Both the indexers take 4 parameters:

  1. the data file
  2. whether or not you want to use zlib
  3. the redis DB to use
  4. the pipeline length

Solution 1

With 100 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv false 1 100
Done 454500 lines
real	3m7.373s
user	2m54.579s
sys	0m2.608s
$ redis-cli info | grep used_memory
used_memory:1384828936
used_memory_human:1.29G
used_memory_rss:1422528512
used_memory_peak:1384845632
used_memory_peak_human:1.29G

With 500 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv false 1 500
Done 454500 lines
real	3m5.957s
user	2m53.483s
sys	0m2.508s
$ redis-cli info | grep used_memory
used_memory:1384837688
used_memory_human:1.29G
used_memory_rss:1422614528
used_memory_peak:1384845632
used_memory_peak_human:1.29G

Solution 2

With 100 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv true 1 100
Done 454500 lines
real	3m29.069s
user	3m19.928s
sys	0m1.688s
$ redis-cli info | grep used_memory
used_memory:434327168
used_memory_human:414.21M
used_memory_rss:450281472
used_memory_peak:1384845632
used_memory_peak_human:1.29G

With 500 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv true 1 500
Done 454500 lines
real	3m15.490s
user	3m12.196s
sys	0m0.912s
$ redis-cli info | grep used_memory
used_memory:434318784
used_memory_human:414.20M
used_memory_rss:447762432
used_memory_peak:1384845632
used_memory_peak_human:1.29G

Solution 3

With 100 records pipeline:

$ time python explodedBlocks/Indexer.py blocks.csv false 1 100
Done 454500 lines (1302542 expressions, 160445 urls, 3 services)
real	5m43.288s
user	5m34.313s
sys	0m2.112s
$ redis-cli info | grep used_memory
used_memory:480364184
used_memory_human:458.11M
used_memory_rss:517025792
used_memory_peak:1384845632
used_memory_peak_human:1.29G

With 500 records pipeline:

$ time python explodedBlocks/Indexer.py blocks.csv false 1 500
Done 454500 lines (1302542 expressions, 160445 urls, 3 services)
real	5m55.540s
user	5m42.821s
sys	0m2.740s
$ redis-cli info | grep used_memory
used_memory:480369728
used_memory_human:458.12M
used_memory_rss:496918528
used_memory_peak:1384845632
used_memory_peak_human:1.29G

Some conclusions and space considerations

Version                       Indexing Duration   Index Size
Naive (100 pipe)              3'7"                1.29 GB
Naive (500 pipe)              3'6"                1.29 GB
Naive w/ zlib (100 pipe)      3'30"               414.2 MB
Naive w/ zlib (500 pipe)      3'15"               414.2 MB
Exploded (100 pipe)           5'43"               458.1 MB
Exploded (500 pipe)           5'55"               458.1 MB
Exploded w/ zlib (100 pipe)   9'3"                468.1 MB

The writing process is blazing fast with the naive version. Even better, compressing the blocks is harmless in terms of indexing time, and helps divide the size by 3.

Let’s see how each solution behaves with respect to storage size. The source files are generated using:

for i in 2 11 101 1001 10001 50001 100001 250001 ; do head -n$i blocks.csv > blocks.$i.csv; done

Each file is indexed on db 1 after a FLUSHALL.

Blocks (size in MB)   1     10    100   1000  10000  50000   100000  250000  454584
naive                 0.71  0.74  0.97  3.58  29.91  145.96  291.35  726.15  1320.96
naive w/ zlib         0.71  0.71  0.8   1.61  9.9    46.28   91.9    227.74  414.2
exploded              0.71  0.76  1.19  6.08  37.22  107.99  164.39  300.34  458.1

Let’s see it graphically (the size is reported on a log scale):

[Figure: DB size depending on the number of blocks; mind the log scale.]

Here we see that the exploded version grows more slowly than the zlib one, even though it takes more space for smaller block sets.

Query performances

Let’s see how the 3 solutions behave regarding query performance. Here we use a small set of URLs (150). We do 100 queries on each, in random order, and read the quantiles.

We are working on a fully loaded Redis store: each of the three solutions is loaded in its own Redis db. We only test pure repository speed: reading the block from Redis and formatting the data so it is ready to be used by further business logic.

$ redis-cli info | grep used_memory
used_memory:2298063072
used_memory_human:2.14G
used_memory_rss:2356830208
used_memory_peak:2298054472
used_memory_peak_human:2.14G
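The timing loop itself is straightforward; here is a hedged sketch for the zlib variant, with quantiles computed with NumPy (the helper returning the 150 test document IDs is hypothetical):

import json
import random
import time
import zlib

import numpy as np
import redis

r = redis.Redis(db=2)
doc_ids = load_sample_doc_ids()  # hypothetical: the 150 test URLs

queries = doc_ids * 100  # 100 queries per URL...
random.shuffle(queries)  # ...in random order

timings = []
for doc_id in queries:
    start = time.time()
    raw = r.get(doc_id)
    block = json.loads(zlib.decompress(raw))  # read + uncompress + parse
    timings.append((time.time() - start) * 1000)  # milliseconds

print(np.percentile(timings, [50, 90, 99]))  # response-time quantiles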
[Figure: solutions' response time in ms]

Response times are in ms. Each solution is fast, the zlib one being the fastest.

Conclusion

Size

The exploded version seems to grow more slowly than the zlib one, but it only becomes lighter at around 1M blocks of data for a single brand. If we index two sites of 0.5M blocks each, the size will most likely be twice that of a 0.5M base, because the indirection is efficient only inside a pack of blocks.

On the other hand, if we ever start to build cross-site data blocks, the size will be less than that of the naive zlib version.

Speed

The zlib version is the fastest one: we only query the docId, get the compressed block, and uncompress it. It is even faster than querying the raw block, because of the smaller network traffic.

The exploded version is clearly a LOT slower, but stays under 0.5 ms per request for the vast majority of queries.

Bottom line

Do not over-think your data schema when storing to Redis: it is faster than you may think, and simple software compression may help you contain your data growth.

If you want to optimize data weight, you want to read Memory optimization from the Redis documentation.


Responsive Design as mobile strategy

Nowadays, you can’t have a website without thinking about the mobile strategy you need. This is even more true when you are a high-audience player in the web’s communities and news space.
I’ll tell you a secret: the mobile audience is growing. Every day. OK, it’s not really a secret for you, but I can share some of our stats: over the last 3 months, users coming to our sites from a mobile device represented between 5% and 5.4% of all our visitors, and the proportion of smartphones (iOS and Android platforms) has exploded and keeps growing constantly.
We used to serve our users a mobile version of our site the old-fashioned way: we called it “mobile”, as it was a simplified version of our site, like Tom’s Hardware mobile. But we were never very satisfied with the result. The lack of updates on this version and, we have to admit it, the not-so-sexy display left us a little frustrated.
We are currently working on a full redesign of our sites (as you may have noticed, this year we have already migrated our Forums). So the opportunity came up to make the current version of the site respond and adapt to the end user’s device, with all the cool things that can now be handled with the new technologies we intend to use.
By combining the power of CSS3 media queries and the ability to detect the user’s device through the client User-Agent, we intend to modify and enhance the current (mobile) user experience.
So we combine two approaches:

The first one is Responsive Design. It is based on the idea that the look and feel and the navigation should adapt themselves to the end user’s environment: screen size, platform and orientation.

The second one, Server-Side Components, says, in a nutshell, that the server should be able to adapt the code sent to the end user based on some criteria.
We chose to rely on the User-Agent information for this, in order to decide whether or not to send some components of the pages. As with any mobile strategy, the main challenge is to send the user only the data he needs. This option lets us send only some components of the page.
Our pages are split into components that we choose to display or not, and thus to send or not the corresponding HTML data.

This server-side component strategy is still a work in progress, so I’m not able to show you anything yet. But we will soon.
The first strategy is easier to show: as a demo is always better than words, let’s see it for real with the Demo page.

This post is the first about the mobile initiative at Bestofmedia; we will keep you informed of any new feature available on the site…

Resources:


PHP refactoring in legacy code

The story we’ll tell here is a true story. It turned out to be challenging and helped the team keep testing its beliefs in XP, iterative development and code quality.

Product elevator statement

Imagine a well-“legacied” project you don’t know.

  • The product is a web forum with millions of messages.
  • We want to rebuild the categorization mechanism (messages are “categorized”, meaning they are assigned to the category that best describes their content).
  • Mission: fix all the bugs.
  • “Short delay” and “no regression” are the watchwords.
  • Only a few people share the knowledge of the categories system to be refactored.
  • Numerous bugs (needless to mention that several generations of developers contributed to the project).
  • 20 committers.

Background

From the team’s point of view, here are the goals we anticipated we needed to achieve:

  • Understand the expected behavior of the categorization mechanism
  • Bring no regression to the current behavior
  • Replace the old mechanism with a new one

The first decision we took was to use Git on that project. We won’t explain that choice in detail here (with 20 committers, we wanted to avoid working in a dedicated branch for weeks and to keep committing to the HEAD/trunk of the project…). It has already been discussed here.

Refactoring strategy

As a team, we decided to do the refactoring as follows:

  1. With the Product Owner, write BDD scenarii describing how the categories mechanism works
  2. Switch on the “Test Harness” by automating (implementing) the BDD scenarii
  3. Encapsulate ALL calls to the old categories mechanism behind an API (adding Unit Tests to that new API as well)
  4. Based on the API contract, build the new mechanism relying on a new categories data model

1. Write BDD scenarii to describe the categorization behavior

The first two weeks were spent “extracting” all the possible knowledge about the product from the Product Owner and translating it into BDD scenarii.
Example:

Given I am a visitor
When I go to url "http://www.infos-du-net.com.sf/forum/"
Then below the meta-category "Multimédia", I have the following sub-categories with content
| cat name                 | decrypted url                                      |
| Image et son             | http://www.infos-du-net.com.sf/forum/forum-20.html |
| Appareils photo, cameras | http://www.infos-du-net.com.sf/forum/forum-47.html |
| Consoles                 | http://www.infos-du-net.com.sf/forum/forum-29.html |

At the end of this step:

  • 100 BDD scenarii written
  • Shared knowledge of the expected application behavior

2. Switch on the “Test Harness”

We used Behat (PHP based) to implement the scenarii.

Some of the scenarii written with the Product Owner describe behavior involving integration with third-party systems. They were not implemented, because such tests, seen as “integration tests”, were considered complicated and hard to maintain. We preferred to invest in Unit Tests by Contract (as well explained by JBrains).

Some scenarii were implemented but not automated, because they described a behavior that highlights a bug, or described the future behavior. They were RED at the time of implementation and would go GREEN by the end of the project.

At the end of this step:

  • The “Test Harness” is switched on !
  • Thanks to the Continuous Integration Platform, we are able to frequently test the categories behavior and ensure we will not break anything during the refactoring.

3. Encapsulate old categorization mechanism behind an API

Example of code BEFORE encapsulation (the old DAO was FrmCategoryTable):


public function executeIndex(sfWebRequest $request) {
    $categoryList = FrmCategoryTable::getForumList($idSite, $culture, $user);
}

In order to test more easily and to avoid disturbing the other committers, we encapsulated all calls to the old category mechanism behind a new API.

We keep the calls to the old category mechanism, but we isolate them into a dedicated API.

Example of code AFTER encapsulation (the new API is CategoryProvider):


public function executeIndex(sfWebRequest $request) {
    $categoryList = $this->categoryProvider->getAllCategories($culture, $brand, $country, ICategoryProvider::SERVICE_FORUM, $user);
}

Code that implements the new API (still delegating to the old mechanism):


class CategoryProvider implements ICategoryProvider {
    public function getAllCategories($culture, $brand, $country, $service, $user) {
        // Still delegates to the old DAO, now isolated behind the API
        $categoryList = FrmCategoryTable::getForumList($siteId, $culture, $user, $categoryLevel);
        return $categoryList;
    }
}

At the end of this step:

  • The old mechanism is isolated behind an API
  • The “Test Harness” is still switched on !

4. Based on the API contract, build the new mechanism relying on the new categories data model

During the encapsulation step we created the API that is the CONTRACT of our categories mechanism.
At this point we chose to start the implementation of the new API directly. It was probably not the best choice, because for several days the new behavior was only partly implemented. We should have worked on another implementation of the API based on the CONTRACT we had extracted from the previous step.

Only once that was done should we have switched from one implementation of the API to the other.

Code that implements the new API (relying on the new DAO CatBrandAndCountryTable):


class CategoryProvider implements ICategoryProvider {
    public function getAllCategories($culture, $brand, $country, $service, $user) {
        // New implementation relying on the new categories data model
        $categoryList = CatBrandAndCountryTable::getInstance()
            ->getAllCategories($culture, $brand, $country, $service, $user);
        return $categoryList;
    }
}

At the end of this step:

  • The new mechanism is plugged (new DAO CatBrandAndCountryTable)
  • The “Test Harness” is still switched on !

Conclusion

  • Quite a big system was refactored without service interruption
  • No merge conflicts, because we always committed to the trunk/HEAD
  • No project conflicts, because we isolated the pieces of code that were to be refactored
  • Writing the BDD scenarii WITH THE Product Owner helped implement the right behavior and share the knowledge.

Conference Agile Grenoble 2011 – PHP symfony in an Agile environment

The Agile Grenoble 2011 conference took place on the 24th of November 2011. On this occasion, some of our engineers gave a presentation. Below is the summary. For more details, have a look at the slides (in French) and/or download the technical resources.

Having millions of friends, comparing millions of offers or publishing millions of news items are as many different applications written in PHP. Often criticized, sometimes called a “language for dummies”, it remains the first choice for web solutions. If you know some good tools and some good methods, PHP is modular, testable and easy to deploy. Through technical examples based on real-life projects, experience how to play “PHP symfony in an agile environment” (in French: “symphonie pour PHP industrialisé en agilité majeure”).

Menu:

  • Build a boilerplate for a PHP project, aiming not to throw anything away at the end
  • Get back control of your frontend project and adopt a refactoring strategy built on tests
  • “Not only working software, but also well-crafted software” – Manifesto for Software Craftsmanship

Pretested commits – why does it matter to us?

The problem

Our CI was frequently red, and that creates work. During a two-week period we measured that 54% of the builds resulted in failure. More than half of the failures followed a previous failure, which means that, by a low estimate, 25% of the commits were made while the CI was red.

Ok but wherein lies the problem?

  • It’s more difficult to fix problems when they’ve been there for a while.
  • As time passes, it becomes unlikely that the guy who broke the build will be the guy who fixes it. And it’s definitely more difficult to fix other people’s bugs.
  • The fact that I commit while the CI is red means that I won’t get the necessary feedback. I can easily introduce a new error that doesn’t get caught until the first one is fixed.
  • When a developer updates his code he also gets other people’s bugs, so when he discovers a problem he can’t be sure it came from his own modifications.
  • It is a reason to maintain extra branches (one for development, one for patches to production).

So, indeed it creates work.

Analysis

So let’s just establish that everyone always runs the tests before publishing their work, right?
- Tried it, doesn’t work.
Well, why not?
- Because it’s always tempting not to.
Why?
- Because it’s difficult to be sure whether it’s your bug or someone else’s.
Why?
- Because other developers commit untested code.
Oops! Vicious circle.
But that’s not the only reason. What’s more?
- It takes time to run the tests, and it blocks the development environment.
Nasty. Any more?
- It takes discipline to run all the tests before every “publication” of my code, and discipline is a limited resource. Someone will run out of it, and then it will get a lot more tempting for the others to skip it.
Oops! Vicious circle again.

So establishing a run-your-tests-locally-before-committing-or-else-shame-on-you culture isn’t going to work for very long.

Solution

For about two years, a workflow called pretested commit, delayed commit, private build or stable build has been emerging. It’s even a feature of the TeamCity CI. What’s so cool about it is that it’s not a countermeasure to the problem “developers commit untested code”: it eliminates the problem altogether by removing the root cause, “running tests blocks developers” (the fancy Japanese term is poka-yoke). The basic workflow is that all tests are run before the commit to the shared development branch actually happens; let’s call it the stable branch. This ensures that the latest version of the stable branch always contains code that passes the tests. It also means that whenever you update your code you don’t get bugs from the others. In fact, if there are any bugs, they’re all yours!

The way we chose to implement the commit barrier was to first migrate the whole project from SVN to Git and then use the Jenkins Git plugin to configure the following workflow.

Say a developer wants to start a new feature.

  • He starts by checking out a new feature branch
  • He commits some modifications locally (since we’re using Git)
  • He does some more work and commits again
  • Then he pushes his branch to the team repository at refs/heads/merge-requests/<my-name>/<branch-name>
  • Jenkins takes the branch and merges it with the stable branch
  • If the code doesn’t merge cleanly, nothing is done to the stable branch and the committer is notified by mail
  • If any tests fail, then again nothing is done to the stable branch and the committer is notified by mail
  • If the build succeeds, the stable branch is updated* with this latest stable version.
    (*: in Git, branches are like post-its that you can move around, so “updating the stable branch” just means “moving the post-it named stable to the just-tested commit”)

I must add that this is what fit our reality best; there are many ways of doing it. For instance, if you have a really fast build that runs in isolation, you might want something simpler.

So we get an always-stable branch (with respect to our automated tests, that is). No more useless work created by untested commits.

Of course, there are some things that don’t fit into this pretested commit feature. We have some long-running tests that are not convenient to include. Our build time for a successful build is currently 30 minutes, which is already long. It monopolizes a shared resource, so we can’t run two builds in parallel. If we increased the test suite to encompass later stages like deployment and smoke tests on pre-production platforms, it would take longer and would lengthen the queue to get your merge request built.

Another interesting fact is that the faster the test suite is, the more tests we can move into the pre-commit build, and the less costly any mistake is. What a clear connection between a fast test suite and productivity!

Results & learnings

The build is still red sometimes, but that just means that someone monopolized the CI for a couple of minutes. So it’s not much of an issue.

A change like this, involving 3 teams, 2 projects and 15+ developers, is not so easy to get going. It takes enough analysis and measurement to have a majority agree that there is a problem worth solving and that there are good enough solutions to it. For instance, the fact that the CI is red is not a problem in itself. The real problem is the extra work created by the mechanisms described at the beginning of this post.

Still, sometimes the commit barrier can be annoying, since it’s now more difficult to supply a patch: it actually has to pass the tests! No shit. This is of course good for the majority of modifications and for the developers as a group. But in some edge cases (like an i18n fix) it can be an annoyance to the individual developer.

The actual setup is surprisingly easy with Git. We used the Jenkins Git plugin, but it would only take a couple of lines of shell to do almost the same thing.

Probably the biggest difficulty was modifying the deployment scripts in a safe manner. The modifications were done by a developer, and the testing and the switch were done by ops. Don’t split such a task between two groups unless you have utterly fluid communication between them.

Next steps

Theoretically, we’re now able to always deploy from the stable branch. Even patches can go in there, because the development branch is always in a fit state: no more production branches. In practice we’re not quite ready, but it is our next reachable step.

We also need to work on reducing the build time. As the application grows we will add tests, so if we don’t speed them up our build time will grow.


Surfing the Wavelets : Multi-Scale Innovation

The nice thing about wavelets is that they allow you to study signals at multiple resolutions. This is exactly the same idea we nurture with Innovation. Let’s take the tour…

Today is the kick-off of the in-house PhD we set up on web buzz mechanics. When you think about what you can do in a 3-year PhD program, you might think that it’s not the kind of schedule that fits a web startup-like company. A PhD can be compared to running a marathon: it will take a fair amount of time and pain to reach the finish line, and you had better keep a steady rhythm to post a good time. Gaining knowledge and expertise in a challenging field, developing new algorithms, experimenting with real-world data, feeding back into theory, on and on…

[Figure: long-distance running]

On the other side, a Lean-Agile company like BoM (see here or there) is committed to embracing change and meeting challenges every day or so. For sure, our R&D teams keep a rhythm that is a little more upbeat than that of a PhD. Daily activities resemble those of a beehive: standups, short iterations, delivering projects, testing (first, of course!), lots of interactions, taking up new projects as soon as they hit the ground.

[Figure: Scrum board]

So is there really a contradiction? I don’t think so. Lean has always insisted on the importance of building deep knowledge. And this is exactly what a PhD is for: developing theory(ies) in a given field, based on experiments, scientific evaluation and confrontation with leading thinkers. Of course, it does not mean we stop being Agile or reactive because we take the time to build such deep knowledge. On the contrary, we manage both scales and organize things so that each one can benefit from the other.

A first example is standups. Participating in standups allows PhD candidates to confront the ideas they may have about any featured topic with the ground reality faced by practitioners. I’m convinced it can avoid a few perils when taking ideas from the lab to the live testing ground.

A reverse example is the building of a roadmap. Frontline engineers have a number of ideas about what should be done to fix the most urgent problems and capitalize on the company’s technical assets. Scientists have a list of brand-new, cool models and algorithms they want to play with. Managers may have longer-term ideas for the company and a prioritized list of high-level goals. And PhDs may have what it takes to feed everyone with facts and theory that get to the bottom of things, which everyone will digest and use to make the whole business model evolve towards something more robust in the long term.

Indeed, this is really what this kind of organization is all about: addressing short-term challenges with the ferocity of the lion and building for the future with the wisdom of the elephant.

 

 
