Chrome extensions: one code to rule them all

And now for something completely different!

Why?

BestOf Media is all about its communities and their participation in the forums. A user creates a topic, and members (or passers-by) answer the question or argue about the topic. To see new content on a topic they contributed to, users have to refresh the topic page, the contributions page or the main site home page, and then look up the status of the corresponding threads. That’s a lot of refreshing and a lot of deliberate user action. Hence the idea of a tool that would check the state of the user’s contributions and notify them of new activity.

What?

The challenge here is that our group has multiple sites, each with its own community. Each of those sites uses its own variation of a portal implementation (from two main frameworks) to display the site’s forum. The goal was to give all our members the same user experience with a tool slightly adapted to the targeted site. Hence we set out to create, for a start, 5 extensions, all of them using the same code base. Each extension differs from the others only by its name, its targeted site and its visual identity. We also need to cover 3 languages for those 5 extensions: English, German and French.

How?

Chrome (or Chromium) offers a simple yet quite potent browser extension framework. It uses HTML, CSS and JavaScript APIs, turning Chrome into a platform. The tool queries a page on a regular basis, parses it for data (which threads have new content? does the user have private messages? how many? and so on) and then notifies the member of new things to read. Now that we have set the stage, let’s talk about the bowels of the tool.

Deep inside

The framework eases the development of localized extensions. It is only a matter of putting the message files in the correct directories, as stated in the official documentation. That was one problem solved. Now the point is to create a code base that can be replicated with minimal changes to fit every site’s specifics. Ant is used to build the extensions (that is, copy the common code base to a target folder along with the site-specific data, and then zip that target folder). Here is the folder structure we used:

Project folder structure

Here, the src/template-drap4chrome folder is the common code base, and every other folder in src contains the site-specific part: some specific UI parts, the extension manifest file and the main script. The manifest file describes the extension, giving it a name and a description. It also declares which permissions are required. These are presented to the user when the extension is installed, so they know what the extension is up to. The manifest is also where the extension states which web site it will access. The main script is the background page that sets up the run environment for the extension. Here, it creates the extension’s badge (the icon appearing in the extensions bar) and initializes the regular refresh process. It also declares which site-specific implementation of the extension is to be used.

Now that the resources are separated, the business logic (mainly regular expressions used to parse the page and find useful data) goes into its own container. We did this with a hierarchy of JavaScript classes defining how to handle the page.
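A hierarchy of parsing classes of that kind could look roughly like the sketch below. The real extensions are written in JavaScript; Python is used here purely for illustration, and the class names, the HTML markup and the regular expressions are all assumptions, not the patterns actually used on any of our sites.

```python
import re


class ForumPageParser:
    """Base parser: shared logic, with patterns meant to be overridden per site."""

    # Hypothetical patterns; each real site needs its own markup-specific ones.
    unread_thread_pattern = re.compile(r'<a class="unread" href="([^"]+)">')
    private_messages_pattern = re.compile(r'(\d+) new private messages?')

    def unread_threads(self, html):
        """Return the URLs of threads flagged as having new content."""
        return self.unread_thread_pattern.findall(html)

    def private_message_count(self, html):
        """Return the number of unread private messages, 0 if none is announced."""
        match = self.private_messages_pattern.search(html)
        return int(match.group(1)) if match else 0

    def badge_text(self, html):
        """Text for the extension badge: total of new items, empty when nothing is new."""
        total = len(self.unread_threads(html)) + self.private_message_count(html)
        return str(total) if total else ""


class FrenchForumPageParser(ForumPageParser):
    """Site-specific variant: only the pattern that differs is overridden."""

    private_messages_pattern = re.compile(r'(\d+) nouveaux? messages? privés?')
```

The point of the design is that a new site only needs a small subclass, while the refresh and notification logic stays in the common code base.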

Action!

You can see those extensions in action for Tom’s Hardware, Tom’s Hardware UK, Tom’s Hardware France and Tom’s Guide France, using either Chrome or its open source sibling Chromium.

Posted in General, Product

A little Redis storage strategies benchmark

TL;DR: don’t over-think when working with Redis.

Goal

At BestOf Media, we keep a lot of data about our pages. Here, we needed a fast, flexible way to store some structured data. We end up with 500k records per site, making roughly 3M records for all our sites, each averaging 4kB of data (based on the storage solution we are currently using).

Solutions

We tested two schemas for storage. The first is a naive schema: the JSON-serialized data is stored as the value of a Redis record, the key being a document ID. This is solution 1.

Variant: the same schema with zlib string compression. This is solution 2.

Another schema maps the document ID to a list of tuples. The tuples are (expressionId, urlId, contentTypeId). Then exprId is mapped to the expression text, urlId to the url, and contentTypeId to the contentType. This is solution 3.
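To make the three layouts concrete, here is a minimal sketch that builds the key/value pairs each solution would write. It is not the code of our indexers: the key prefixes (`doc:`, `expr:`, `url:`, `type:`), the shape of an entry and the `IdTable` helper are assumptions made for the example.

```python
import json
import zlib


class IdTable:
    """Hands out a small integer ID per distinct string (hypothetical helper)."""

    def __init__(self):
        self.ids = {}

    def id_for(self, text):
        # Reuse the known ID, or allocate the next one.
        if text not in self.ids:
            self.ids[text] = len(self.ids) + 1
        return self.ids[text]


def naive_records(doc_id, entries, compress=False):
    """Solutions 1 and 2: one key per document holding the whole JSON block."""
    payload = json.dumps(entries).encode("utf-8")
    if compress:
        payload = zlib.compress(payload)  # solution 2
    return {"doc:%s" % doc_id: payload}


def exploded_records(doc_id, entries, expr_ids, url_ids, type_ids):
    """Solution 3: the document maps to ID tuples; each text is stored once."""
    records = {}
    tuples = []
    for entry in entries:
        expr_id = expr_ids.id_for(entry["expression"])
        url_id = url_ids.id_for(entry["url"])
        type_id = type_ids.id_for(entry["contentType"])
        records["expr:%d" % expr_id] = entry["expression"]
        records["url:%d" % url_id] = entry["url"]
        records["type:%d" % type_id] = entry["contentType"]
        tuples.append((expr_id, url_id, type_id))
    records["doc:%s" % doc_id] = tuples
    return records
```

Sharing the `IdTable` instances across documents is what lets the exploded layout store a repeated URL or expression only once.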

Loading

As of today, this data is stored in our main repository. That repository is great at searching for documents, but does not perform so well when you want it to behave as a DB. Today, loading all the data for one of our sites takes about 5 hours.

The loading test puts this data in Redis, once with pipelines of 100 records and once with pipelines of 500 records. All the code executed in this piece is Python, using the redis-py client API.

Redis is flushed between each test.
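For readers unfamiliar with pipelining, the batching idea can be sketched as follows. This is an illustration, not our indexer: `load_in_batches` is a hypothetical name, and the function only assumes a client exposing `pipeline()`, whose result has `set()` and `execute()`, as redis-py clients do.

```python
def load_in_batches(client, records, pipeline_length=100):
    """Write (key, value) pairs, sending them to the server in batches.

    Commands are queued on the pipeline and only sent on execute(),
    so each batch costs a single round trip.
    """
    pipe = client.pipeline()
    pending = 0
    total = 0
    for key, value in records:
        pipe.set(key, value)
        pending += 1
        total += 1
        if pending >= pipeline_length:
            pipe.execute()  # flush a full batch
            pending = 0
    if pending:
        pipe.execute()  # flush the last, partial batch
    return total
```

With redis-py, the client would typically be something like `redis.Redis(db=1)`, and the fourth parameter of the indexers corresponds to `pipeline_length` here.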

The tests

Both indexers take 4 parameters:

  1. the data file
  2. whether or not you want to use zlib
  3. the redis DB to use
  4. the pipeline length

Solution 1

With 100 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv false 1 100
Done 454500 lines
real	3m7.373s
user	2m54.579s
sys	0m2.608s
$ redis-cli info | grep used_memory
used_memory:1384828936
used_memory_human:1.29G
used_memory_rss:1422528512
used_memory_peak:1384845632
used_memory_peak_human:1.29G

With 500 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv false 1 500
Done 454500 lines
real	3m5.957s
user	2m53.483s
sys	0m2.508s
$ redis-cli info | grep used_memory
used_memory:1384837688
used_memory_human:1.29G
used_memory_rss:1422614528
used_memory_peak:1384845632
used_memory_peak_human:1.29G

Solution 2

With 100 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv true 1 100
Done 454500 lines
real	3m29.069s
user	3m19.928s
sys	0m1.688s
$ redis-cli info | grep used_memory
used_memory:434327168
used_memory_human:414.21M
used_memory_rss:450281472
used_memory_peak:1384845632
used_memory_peak_human:1.29G

With 500 records pipeline:

$ time python jsonBlocks/Indexer.py blocks.csv true 1 500
Done 454500 lines
real	3m15.490s
user	3m12.196s
sys	0m0.912s
$ redis-cli info | grep used_memory
used_memory:434318784
used_memory_human:414.20M
used_memory_rss:447762432
used_memory_peak:1384845632
used_memory_peak_human:1.29G

Solution 3

With 100 records pipeline:

$ time python explodedBlocks/Indexer.py blocks.csv false 1 100
Done 454500 lines (1302542 expressions, 160445 urls, 3 services)
real	5m43.288s
user	5m34.313s
sys	0m2.112s
$ redis-cli info | grep used_memory
used_memory:480364184
used_memory_human:458.11M
used_memory_rss:517025792
used_memory_peak:1384845632
used_memory_peak_human:1.29G

With 500 records pipeline:

$ time python explodedBlocks/Indexer.py blocks.csv false 1 500
Done 454500 lines (1302542 expressions, 160445 urls, 3 services)
real	5m55.540s
user	5m42.821s
sys	0m2.740s
$ redis-cli info | grep used_memory
used_memory:480369728
used_memory_human:458.12M
used_memory_rss:496918528
used_memory_peak:1384845632
used_memory_peak_human:1.29G

Some conclusions and space considerations

| Version                     | Indexing Duration | Index Size |
| Naive (100 pipe)            | 3’7″              | 1.29G      |
| Naive (500 pipe)            | 3’6″              | 1.29G      |
| Naive w/ zlib (100 pipe)    | 3’30″             | 414.2M     |
| Naive w/ zlib (500 pipe)    | 3’15″             | 414.2M     |
| Exploded (100 pipe)         | 5’43″             | 458.1M     |
| Exploded (500 pipe)         | 5’55″             | 458.1M     |
| Exploded w/ zlib (100 pipe) | 9’3″              | 468.1M     |

Writing is blazing fast with the naive version. Even better, compressing the blocks costs little in terms of indexing time, and divides the size by about 3.
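The "divided by 3" figure can be checked against the `used_memory` values reported above for the 100-record pipeline runs. The short script below only redoes that division; the constant and function names are ours.

```python
# used_memory values (bytes) reported by redis-cli for the 100-record pipeline runs
NAIVE_BYTES = 1384828936
NAIVE_ZLIB_BYTES = 434327168
EXPLODED_BYTES = 480364184


def size_ratio(baseline_bytes, candidate_bytes):
    """How many times smaller the candidate is than the baseline."""
    return baseline_bytes / candidate_bytes


zlib_ratio = size_ratio(NAIVE_BYTES, NAIVE_ZLIB_BYTES)
exploded_ratio = size_ratio(NAIVE_BYTES, EXPLODED_BYTES)

print("naive w/ zlib: %.2f times smaller than naive" % zlib_ratio)
print("exploded:      %.2f times smaller than naive" % exploded_ratio)
```

Both ratios come out close to 3, the zlib one slightly above and the exploded one slightly below.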

Let’s see how each solution behaves with regard to storage size. The source files are generated using:

for i in 2 11 101 1001 10001 50001 100001 250001 ; do head -n$i blocks.csv > blocks.$i.csv; done

Each file is indexed on db 1 after a FLUSHALL.

| Blocks (sizes in MB) | 1    | 10   | 100  | 1000 | 10000 | 50000  | 100000 | 250000 | 454584  |
| naive                | 0.71 | 0.74 | 0.97 | 3.58 | 29.91 | 145.96 | 291.35 | 726.15 | 1320.96 |
| naive w/ zlib        | 0.71 | 0.71 | 0.8  | 1.61 | 9.9   | 46.28  | 91.9   | 227.74 | 414.2   |
| exploded             | 0.71 | 0.76 | 1.19 | 6.08 | 37.22 | 107.99 | 164.39 | 300.34 | 458.1   |

Let’s see it graphically (the size is reported on a log scale):

The DB size, depending on the number of blocks (mind the log scale).

Here we see the exploded version grows more slowly than the zlib one, even though it takes more space for smaller block sets.

Query performances

Let’s see how the 3 solutions behave with regard to query performance. Here, we will use a small set of URLs (150). We will run 100 queries on each, in random order, and read the quantiles.

We are working on a fully loaded Redis store: each of the three solutions is loaded in its own Redis db. We only test pure repository speed: reading the block from Redis and formatting the data so it is ready to be used by further business logic.
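The measurement protocol (every URL queried the same number of times, in random order, then quantiles) can be sketched as below. This is not the benchmark script itself: `time_queries`, `summarize` and the `fetch` callable are hypothetical names, and `fetch` stands for whichever read path is under test.

```python
import random
import statistics
import time


def time_queries(fetch, urls, repeats=100, seed=None):
    """Call fetch(url) `repeats` times per URL, in random order; return timings in ms."""
    rng = random.Random(seed)
    plan = [url for url in urls for _ in range(repeats)]
    rng.shuffle(plan)
    timings_ms = []
    for url in plan:
        start = time.perf_counter()
        fetch(url)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms):
    """Return the median, 90th and 99th percentile of the timings."""
    cuts = statistics.quantiles(timings_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}
```

Shuffling the whole plan, rather than looping URL by URL, keeps one slow URL or a warm-up effect from clustering in one part of the run.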

$ redis-cli info | grep used_memory
used_memory:2298063072
used_memory_human:2.14G
used_memory_rss:2356830208
used_memory_peak:2298054472
used_memory_peak_human:2.14G

Solutions' response time in ms

Response times are in ms. Each solution is fast, the zlib one being the fastest.

Conclusion

Size

The exploded version seems to grow more slowly than the zlib one. However, it only becomes the lighter option at around 1M blocks of data for a single brand. If we index two sites of 0.5M blocks each, the total size will quite likely be twice that of a single 0.5M-block site. This is because indirection is efficient only inside one pack of blocks.

On the other hand, if we ever start to build cross-site data blocks, the exploded size will be smaller than that of the naive zlib version.

Speed

The zlib version is the fastest one: we only query the docId, fetch the compressed block, and decompress it. It is even faster than querying for the raw block, because of the smaller network traffic.

The exploded version is clearly a LOT slower, but stays under 0.5ms per request for the vast majority of requests.
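One way to picture the difference is to count lookups on each read path. The sketch below uses a counting dictionary as a stand-in for Redis; the key names and the `CountingStore` helper are assumptions for the example, and a real implementation could group the exploded lookups into fewer round trips.

```python
import json
import zlib


class CountingStore(dict):
    """Dict stand-in for Redis that counts lookups (hypothetical helper)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.lookups = 0

    def __getitem__(self, key):
        self.lookups += 1
        return super().__getitem__(key)


def read_naive_zlib(store, doc_id):
    """One lookup, then decompress and decode."""
    packed = store["doc:%s" % doc_id]
    return json.loads(zlib.decompress(packed).decode("utf-8"))


def read_exploded(store, doc_id):
    """One lookup for the tuples, then three more per tuple to resolve the IDs."""
    entries = []
    for expr_id, url_id, type_id in store["doc:%s" % doc_id]:
        entries.append({
            "expression": store["expr:%d" % expr_id],
            "url": store["url:%d" % url_id],
            "contentType": store["type:%d" % type_id],
        })
    return entries
```

In this model a block of N tuples costs 1 lookup with the zlib layout and 1 + 3N with the exploded one.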

Bottom line

Do not overthink your data schema when storing to Redis: it is faster than you may think, and simple software compression may help you contain your data growth.

If you want to optimize data weight, you may want to read Memory optimization from the Redis documentation.

Posted in General, Techno

Responsive Design as mobile strategy

Nowadays, you can’t run a website without thinking about the mobile strategy you need. This is even more true when you are a high-audience player among the community and news sites of the web.
I’ll tell you a secret. Mobile audience is growing. Every day. OK, it’s not really a secret for you, but I can share some of our stats: in the last 3 months, users coming to our sites from a mobile device represented between 5 and 5.4% of all our visitors, and the proportion of smartphones (iOS and Android platforms) has exploded and keeps growing.
Until now, we served our users a mobile version of our site the old-fashioned way: what we called mobile was a simplified version of the site, like Tomshardware mobile. But we were never very satisfied with the result. The lack of updates on this version and, we have to admit it, its not-so-sexy display left us a little frustrated.
We are currently working on a full redesign of our site (as you may have noticed, we already migrated our Forums this year). This is our opportunity to make the site respond and adapt to the end user’s device, with all the cool things that the new technologies we intend to use can now handle.
By combining the power of CSS3 media queries with the ability to detect and classify users through the client User Agent, we intend to modify and enhance the current (mobile) user experience.
So we combine two approaches:

The first one is a more Responsive Design. It is based on the idea that the look and feel and the navigation should adapt themselves to the end user’s environment, based on screen size, platform and orientation.

The second one, Server Side Components, says, in a nutshell, that the server should be able to adapt the code sent to the end user based on some parameters and criteria.
We chose to rely on the User Agent information to decide whether or not to send some components of the pages. For a mobile strategy, the main challenge is to send the user only the data they need. This option allows us to send only some components of the page.
Our pages are split into components; we choose which ones to display and thus whether or not to send the corresponding HTML data.
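As a rough idea of what such a server-side selection can look like, here is a small sketch. Everything in it is hypothetical: the component names, the subset kept for phones and the User Agent markers are placeholders, not our actual configuration, and real User Agent detection needs a much longer list.

```python
# Assumed substrings that mark a smartphone User-Agent; a real list would be longer.
MOBILE_MARKERS = ("iphone", "ipod", "android")

# Hypothetical page components, in display order.
PAGE_COMPONENTS = ["header", "article", "comments", "related", "sidebar", "footer"]

# Hypothetical subset worth sending to a phone.
MOBILE_COMPONENTS = {"header", "article", "comments", "footer"}


def is_mobile(user_agent):
    """Very rough device detection based on the User-Agent header."""
    agent = (user_agent or "").lower()
    return any(marker in agent for marker in MOBILE_MARKERS)


def components_for(user_agent):
    """Return the components to render and send, preserving display order."""
    if is_mobile(user_agent):
        return [name for name in PAGE_COMPONENTS if name in MOBILE_COMPONENTS]
    return list(PAGE_COMPONENTS)
```

Components that are filtered out are never rendered, so their HTML is never sent, which is the whole point compared with hiding them in CSS.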

This Server Side Components strategy is still in progress, so I’m not able to show you anything yet. But we will soon.
The first strategy is easier to show: as a demo is always better than words, see it for real on the Demo page.

This post is the first about the mobile initiative at Bestofmedia; we will keep you informed of any new feature available on the site…

Posted in General, Innovation, Techno

PHP refactoring in legacy code

The story we’ll tell is a true one. It proved challenging and helped the team keep testing its beliefs in XP, iterative development and code quality.

Product elevator statement

Imagine a well legac”ied” project you don’t know.

  • Product is a web forum with millions of messages.
  • We want to rebuild the categorization mechanism (messages are “categorized” meaning they are assigned to a category that best describes their content).
  • Mission : fix all bugs
  • “Short delay” and “no regression” are the words.
  • Only a few people share the knowledge of the categories system to be refactored.
  • Numerous bugs (needless to say, several generations of developers contributed to the project).
  • 20 committers.

Background

From the team’s point of view, here are the goals we anticipated we needed to achieve:

  • Understand the expected behavior of the categorization mechanism
  • Bring no regression to the current behavior
  • Replace the old mechanism by a new one

The first decision we made was to use Git for that project. We won’t explain that choice in detail (20 committers, and we wanted to avoid working in a dedicated branch for weeks and instead commit to the HEAD trunk of the project…). It has already been discussed here.

Refactoring strategy

As a team, we decided to do the refactoring as follows:

  1. With the Product Owner, write BDD scenarii describing how the categories mechanism works
  2. Switch on the “Test Harness” by automating (implementing) the BDD scenarii
  3. Encapsulate ALL calls to the old categories mechanism behind an API (adding Unit Tests to that new API as well)
  4. Based on the API contract, build the new mechanism relying on a new categories data model

1. Write BDD scenarii to describe the categorization behavior

The first two weeks were spent “extracting” all possible knowledge about the product from the Product Owner and translating it into BDD scenarii.
Example:

Given I am a visitor
When I go to url "http://www.infos-du-net.com.sf/forum/"
Then below the meta-category "Multimédia", I have the following sub-categories with content
| cat name                 | decrypted url                                      |
| Image et son             | http://www.infos-du-net.com.sf/forum/forum-20.html |
| Appareils photo, cameras | http://www.infos-du-net.com.sf/forum/forum-47.html |
| Consoles                 | http://www.infos-du-net.com.sf/forum/forum-29.html |

At the end of this step:

  • 100 BDD scenarii written
  • Shared knowledge of the expected application behavior

2. Switch on the “Test Harness”

We used Behat (PHP based) to implement the scenarii.

Some of the scenarii written with the Product Owner describe a behavior involving integration with third-party systems. They were not implemented because such “integration tests” were seen as complicated and hard to maintain. We preferred to invest in Unit Tests by Contract (as well explained by JBrains).

Some scenarii were implemented but not automated, because they describe a behavior that highlights a bug or a behavior yet to come. They were RED at the time of implementation and would go GREEN by the end of the project.

At the end of this step:

  • The “Test Harness” is switched on !
  • Thanks to the Continuous Integration Platform, we are able to frequently test the categories behavior and ensure we will not break anything during the refactoring.

3. Encapsulate old categorization mechanism behind an API

Example of code BEFORE encapsulation (old DAO was FrmCategoryTable)


public function executeIndex(sfWebRequest $request) {

    // Direct call to the old DAO from the controller
    $categoryList = FrmCategoryTable::getForumList($idSite, $culture, $user);

}

In order to test better and avoid interfering with other committers, we’ve encapsulated all calls to the old category mechanism behind a new API.

We keep the calls to the old category mechanism, but we isolate them into a dedicated API.

Example of code AFTER encapsulation (new API is categoryProvider)


public function executeIndex(sfWebRequest $request) {

    // The controller now only knows the API
    $categoryList = $this->categoryProvider->getAllCategories($culture, $brand, $country, ICategoryProvider::SERVICE_FORUM, $user);

}

Code that implements the new API (at this step it still delegates to the old DAO)


class CategoryProvider implements ICategoryProvider {
    public function getAllCategories($culture, $brand, $country, $service, $user) {

        // The old DAO is still called, but only from inside the API
        $categoryList = FrmCategoryTable::getForumList(
            $siteId, $culture, $user, $categoryLevel);

        return $categoryList;
    }
}

At the end of this step:

  • The old mechanism is isolated behind an API
  • The “Test Harness” is still switched on !

4. Based on the API contract, build the new mechanism relying on the new categories data model

During the encapsulation step we’ve created the API that is the CONTRACT of our categories mechanism.
At this point we chose to start implementing the new mechanism directly inside the existing implementation of the API. It was probably not the best choice, because for several days the new behavior was only partly implemented. We should have worked on another implementation of the API, based on the CONTRACT we had extracted in the previous step.

Only once that was done should we have switched from one implementation of the API to the other.
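The approach we wish we had taken can be sketched as two implementations of one contract plus a single switch point. The sketch is in Python rather than PHP and uses hypothetical names throughout (`LegacyCategoryProvider`, `NewCategoryProvider`, `make_provider`, the lookup callables and the `(brand, country)` to site id mapping are all assumptions).

```python
from abc import ABC, abstractmethod


class CategoryProviderContract(ABC):
    """The CONTRACT: what every implementation must offer."""

    @abstractmethod
    def get_all_categories(self, culture, brand, country, service, user):
        ...


class LegacyCategoryProvider(CategoryProviderContract):
    """Adapts the contract to the old mechanism (stand-in for the old DAO)."""

    def __init__(self, legacy_lookup, site_ids):
        self._lookup = legacy_lookup
        self._site_ids = site_ids  # assumed mapping: (brand, country) -> site id

    def get_all_categories(self, culture, brand, country, service, user):
        site_id = self._site_ids[(brand, country)]
        return self._lookup(site_id, culture, user)


class NewCategoryProvider(CategoryProviderContract):
    """Built separately against the same contract (stand-in for the new DAO)."""

    def __init__(self, new_lookup):
        self._lookup = new_lookup

    def get_all_categories(self, culture, brand, country, service, user):
        return self._lookup(culture, brand, country, service, user)


def make_provider(use_new, legacy_lookup, new_lookup, site_ids):
    """The single switch point: callers never know which one they got."""
    if use_new:
        return NewCategoryProvider(new_lookup)
    return LegacyCategoryProvider(legacy_lookup, site_ids)
```

The new implementation can stay unfinished for days without affecting anyone, because the flag is flipped only once both pass the same contract tests.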

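Code that implements the new API (now backed by the new DAO)


class CategoryProvider implements ICategoryProvider {
    public function getAllCategories($culture, $brand, $country, $service, $user) {
        // Same contract as before; the call now goes to the new DAO
        $categoryList =
            CatBrandAndCountryTable::getInstance()
                ->getAllCategories($culture, $brand, $country, $service, $user);
        return $categoryList;
    }
}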

At the end of this step:

  • The new mechanism is plugged in (new DAO CatBrandAndCountryTable)
  • The “Test Harness” is still switched on !

Conclusion

  • Quite a big system was refactored without service interruption
  • No merge conflicts because we always committed to the trunk/HEAD
  • No project conflicts because we isolated the pieces of code that were to be refactored
  • The writing of BDD scenarii WITH THE Product Owner helped implementing the right behavior and sharing the knowledge.
Posted in General

Conference Agile Grenoble 2011 – PHP symfony in an Agile environment

Conference Agile Grenoble 2011 took place on the 24th of November 2011. On this occasion, some of our engineers gave a presentation. Below is the summary. For more details, have a look at the slides (in French) and/or download the technical resources.

Having millions of friends, comparing millions of offers or publishing millions of news items: all of these are applications written in PHP. Often criticized, sometimes called “language for dummies”, it remains the first choice for web solutions. If you know some good tools and some good methods, PHP is modular, testable and easy to deploy. Through technical examples based on real-life projects, experience how to play the “PHP symfony in an agile environment” (in French “symphonie pour PHP industrialisé en agilité majeure”).

Menu:

  • Build a boilerplate for a PHP project, aiming not to throw anything away at the end
  • Get back in control of your frontend project and have a refactoring strategy built on tests
  • “Not only working software, but also well-crafted software” – Manifesto for Software Craftsmanship
Posted in Agile, CI, Continuous Delivery, Continuous Improvement, Software Engineering, TDD, Test Automation, XP

Pretested commits – why does it matter to us?

The problem

Our CI was frequently red, and that creates work. During a two-week period we measured that 54% of the builds resulted in failure. More than half of the failures followed a previous failure. That means that, by a conservative estimate, 25% of the commits were made while the CI was red.

Ok but wherein lies the problem?

  • It’s more difficult to fix the problems when they’ve been there for a while
  • As time passes, it becomes less likely that the guy who broke the build will be the guy to fix it. And it’s definitely more difficult to fix other people’s bugs.
  • The fact that I commit while the CI is red, means that I won’t get the necessary feedback. I can easily introduce a new error that doesn’t get caught until the first one is fixed.
  • When a developer updates his code he also gets other people’s bugs, so when he discovers a problem he can’t be sure it came from his modifications.
  • It is a reason to maintain extra branches (one for development, one for patches to production)

So, indeed it creates work.

Analysis

So let’s just establish that everyone always runs the tests before publishing their work, right?
- Tried it, doesn’t work.
Why not?
- Because it’s always tempting not to.
Why?
- Because it’s difficult to be sure whether it’s your bug or someone else’s.
Why?
- Because other developers commit untested code.
Oops! Vicious circle.
But that’s not the only reason, what’s more?
- It takes time to run the tests and it blocks the development environment.
Nasty. Any more?
- It takes discipline to run all the tests before every “publication” of my code and discipline is a limited resource. Someone will run out of it and then it will get a lot more tempting for the others to skip it.
Oops! vicious circle again.

So establishing a run-your-tests-locally-before-committing-or-else-shame-on-you culture isn’t going to work for very long.

Solution

For about two years, a workflow called pretested commit, delayed commit, private build or stable build has been emerging. It’s even a feature of TeamCity CI. What’s so cool about this is that it’s not a countermeasure to the problem “developers commit untested code”; it eliminates the problem altogether by removing the root cause, “running tests blocks developers” (the fancy Japanese term is poka yoke). The basic workflow is that all tests are run before the commit to the shared development branch (let’s call it the stable branch) actually happens. This ensures that the latest version of the stable branch always contains code that passes the tests. It also means that whenever you update your code you don’t get bugs from the others. In fact, if there are any bugs, they’re all yours!

The way we chose to implement the commit barrier was to first migrate the whole project from SVN to Git and then use the Jenkins Git plugin to configure the following workflow.

Say a developer wants to start a new feature.

  • He starts by checking out a new feature branch
  • he commits some modifications locally (since we’re using git)
  • he does some more work and commits again
  • then he pushes his branch to the team repository refs/heads/merge-requests/<my-name>/<branch-name>
  • Jenkins takes the branch and merges it with the stable branch
  • if the code doesn’t merge cleanly nothing is done to the stable branch and the committer is notified by mail.
  • if any tests fail, then again nothing is done to the stable branch and the committer is notified by mail.
  • if the build succeeds, the stable branch is updated* with this latest stable version.
    (* : in git, branches are like post-its that you can move around, so “updating the stable branch” just means “move the post-it named stable to the just-tested commit”)
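The decision logic of that barrier is small enough to sketch. The function below is an illustration, not our Jenkins configuration: `pretested_commit` and its four callables are hypothetical names, injected so the logic can be read (and tried) without a Git repository or a CI server.

```python
def pretested_commit(branch, merge, run_tests, move_stable, notify):
    """Gate a pushed branch: stable only moves if merge and tests both succeed.

    merge(branch)        -> merged commit id, or None if it does not merge cleanly
    run_tests(commit)    -> True when the whole suite is green
    move_stable(commit)  -> moves the "stable" post-it to that commit
    notify(branch, why)  -> tells the committer what went wrong
    """
    merged_commit = merge(branch)
    if merged_commit is None:
        notify(branch, "merge conflict")
        return False
    if not run_tests(merged_commit):
        notify(branch, "tests failed")
        return False
    move_stable(merged_commit)
    return True
```

With Git, `merge` would typically wrap a `git merge` of the pushed branch into a checkout of stable, and `move_stable` a push of the tested commit to the stable ref; in every failure path the stable branch is left untouched.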

I must add that this is what fit our reality best; there are many ways of doing it. For instance, if you’ve got a really fast build that runs in isolation, you might want something simpler.

So we get an always stable branch. That is, with respect to our automated tests. No more useless work created by untested commits.

Of course there are some things that don’t get into this pretested commit feature. We have some long-running tests that are not convenient to include. Our build time, for a successful build, is currently 30 minutes, which is already long. It monopolizes a shared resource, so we can’t run two of them in parallel. If we extended the test suite to encompass later stages, like deployment and smoke tests on pre-production platforms, it would take longer and lengthen the queue to get your merge request built.

Another interesting fact is that the faster the test suite is, the more tests we can move into the pre-commit build and the less costly any mistake is. What a clear connection between a fast test suite and productivity!

Results & learnings

The build is still red sometimes, but that just means that someone monopolized the CI for a couple of minutes. So it’s not much of an issue.

A change like this, involving 3 teams, 2 projects and 15+ developers is not so easy to get going. It takes enough analysis and measurements to have a majority agree that there is a problem worth solving and that there are good enough solutions to it. For instance the fact that the CI is red is not a problem in itself. The real problem is the extra work that is created by the mechanisms described in the beginning of this post.

Still, sometimes the commit barrier can be annoying, since it’s now more difficult to supply a patch: it actually has to pass the tests! No shit. This is of course good for the majority of modifications and for the developers as a group. But in some edge cases (like an i18n fix) it can be an annoyance to the individual developer.

The actual setup is surprisingly easy with Git. We used the Jenkins Git plugin but it’d take a couple of lines of shell to do almost the same.

Probably the biggest difficulty was modifying the deployment scripts in a safe manner. The modifications were done by a developer, and the testing and the switch were done by ops. Don’t split such a task between two groups unless you have utterly fluid communication between them.

Next steps

Theoretically we’re now able to always deploy from the stable branch. Even patches can go in here because the development branch is always in a fit state, no more production branches. In practice we’re not quite ready but it is our next reachable step.

We also need to work on reducing the build time. As the application grows we will add tests, so if we don’t speed them up our build time will grow.

Posted in Agile, CI, Continuous Improvement, General, Test Automation, XP

Surfing the Wavelets : Multi-Scale Innovation

The nice thing about Wavelets is that they allow us to study signals at multiple resolutions. This is exactly the idea we nurture with Innovation. Let’s take the tour…

Today is the kick-off of the in-house PhD we set up on Web Buzz mechanics. When you think about what you can do in a 3-year PhD program, you might think that it’s not the kind of schedule that fits in a Web startup-like company. A PhD could be compared to running a Marathon: it will take a fair amount of time and pain to reach the finish line, and you’d better keep a steady rhythm to post a good time. Gaining knowledge and expertise in a challenging field, developing new algorithms, experimenting with real-world data, feeding back theory, on and on…

Long distance running...

On the other side, a Lean-Agile company like BoM (see here or there) is committed to embracing change and meeting challenges every day or so. For sure, our R&D teams keep a rhythm that is a little more upbeat than that of a PhD. Daily activities resemble those of a beehive: standups, short iterations, delivering projects, testing (first of course!), lots of interactions, taking up new projects as soon as they hit the ground.

Scrum Board

So is there really a contradiction? I don’t think so. Lean has always insisted on the importance of building deep knowledge. And this is exactly what a PhD is for: developing theory(ies) in a given field, based on experiments, scientific evaluation and confrontation with leading thinkers. Of course, it does not mean we stop being Agile or reactive because we take the time to build such deep knowledge. On the contrary, we manage both scales and organize things so that each one can benefit from the other.

A first example is Standups. Participating in standups allows a PhD candidate to confront whatever ideas (s)he may have about any featured topic with the ground reality faced by practitioners. I’m convinced it can avoid a few perils when taking ideas from the lab to the live testing ground.

A reverse example is the building of a Roadmap. Frontline engineers have a number of ideas about what should be done to fix the most urgent problems and capitalize on the company’s technical assets. Scientists have a list of brand new, cool models and algorithms they want to play with. Managers may have more long-term ideas for the company and a prioritized list of high-level goals. And PhDs may have what it takes to feed everyone with facts and theory that get to the bottom of things, which everyone will digest and use to make the whole business model evolve towards something more robust in the long term.

Indeed, this is really what this kind of organization is all about : addressing the short term challenges with the ferocity of the lion and building for the future with the wisdom of the elephant.

 

 

Posted in Agile, Innovation, machine_learning, Science

Development to production pipeline

As I mentioned in an earlier post on test automation strategy, we want code to flow smoothly into production in small increments, in an automated fashion.

To be honest, I don’t know whether we’ll do continuous deployment all the way once hundreds of thousands of users depend on our site. I sure hope so, but as one of Bestofmedia’s managers says, “By aiming for it we’re sure to end up with a very healthy and automated process. Developers will take on a very qualitative attitude to their work”. What’s more, those who do it (IMVU, Digg4 and Flickr, for instance) are very satisfied with the results.

We’re not a big company that can afford to have dozens of engineers work for months on a CI and deployment infrastructure, but we’re a small, dynamic and fit company, and that’s all we need because we look for simplicity. Here’s the anatomy of our current system.

A walking skeleton

So we start off on a new product and, instead of jumping into development right away, we start by creating the walking skeleton as described in Freeman and Pryce’s excellent book GOOS. It’s a skeleton because it does the simplest possible thing, a “Hello world” for instance. But it is walking because it is under version control, it runs its unit tests and some sort of functional tests, the code is continuously inspected by Sonar and it gets deployed to some sort of production-like environment. All of this is scripted, of course. By the way, that was our first story.

Code review by pair programming

We pair-program for most development work. We don’t force ourselves to do it 100% of the time, but most of the team members feel that pair-programming is beneficial for

  • ensuring code is always reviewed
  • reducing the number of bugs getting into the system
    • Edsger Dijkstra – The Humble Programmer (Turing Award 1972)
      If you want more effective programmers, you will discover that they should not waste their time debugging, they should not introduce the bugs to start with.
  • not getting blocked
  • removing the need for technical documentation
  • avoiding SPOFs on the team (every part of the application is known by at least two people)

We also try to analyze our pair programming to get better and better at it. For instance, we’ve found it very useful to have a second laptop per pair so that the pair can split and join again without moving. Sure, pair programming is sometimes inefficient, but to be a good judge of where it helps, we have to practice it in the first place.

The pipeline

This is what we thought we needed

As it turns out, we didn’t need all that, at least not yet. In particular, we don’t create versioned binaries (the release stage above). The risk is that we can’t roll back to a previous binary and we can’t be certain that the combination of binaries tested in the QA stage is in fact the one we deploy to production. Sounds dangerous! Well, it isn’t, at least not today. That risk hasn’t materialized yet, and we’re more concerned with getting features out to our alpha users (and it won’t hamper our speed later on).

Anyway, the workflow goes as follows: the pair commits some code to revision control. A Hudson job runs unit and framework tests before building snapshot binaries. These binaries get published to a mix of Artifactory, a Debian repository and an FTP server.

The QA Sanity job deploys the application to our QA server and runs our sanity tests (see previous post).

The preprod sanity job deploys to servers that are not in production but in a very production-like environment, that is, with no easy root access. It runs the same sanity tests and, if all is successful, proceeds to production, where we run the sanity tests again.

Last time we looked, this cycle ran 48 times a day.

What if something goes wrong in the pipeline?

The Hudson job goes red and the team is notified by mail. Tasks further down the pipeline won’t run until the previous ones are green. In the future, perhaps we’ll need to add an automatic rollback if the deployment phase fails. But let’s wait and see.
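The gating rule above (a stage only runs when every earlier stage is green) can be sketched in a few lines of Python. This is only an illustration of the logic, not our Hudson configuration: the function name `run_pipeline` and the lambda stand-ins for the jobs are hypothetical.

```python
def run_pipeline(stages):
    """Run stages in order and stop at the first one that goes red.

    `stages` is a list of (name, job) pairs, where job() returns True
    when the stage is green. Returns (completed stage names, name of the
    failed stage or None).
    """
    completed = []
    for name, job in stages:
        if not job():
            # Red stage: nothing further down the pipeline runs.
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stand-ins for the real jobs; preprod is made to fail here.
stages = [
    ("unit and framework tests", lambda: True),
    ("QA sanity", lambda: True),
    ("preprod sanity", lambda: False),
    ("production", lambda: True),
]

completed, failed = run_pipeline(stages)
print(completed, failed)
```

In this run the production stage is never reached, because the preprod sanity stage reported a failure.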

Posted in Agile, Continuous Delivery, General, TDD, XP

Use Avro with Dumbo for Hadoop jobs

At Bestofmedia, we run a lot of jobs on Hadoop to process and analyze data from our web sites.
We use the Hadoop Java implementation for industrialized tasks, but we also need to write simpler jobs for quick experiments.

At first, we started using Hadoop Streaming and open sourced a component to read and write Avro files with Hadoop Streaming.
This works fine but using Hadoop Streaming still requires some boilerplate code (especially if the mapper and reducer are written in Python).

We have now started to make heavier use of Dumbo, a Python library for writing map/reduce jobs and running them on Hadoop.
Dumbo lets us write jobs as very simple Python code and still leverage a lot of the Hadoop features:

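# Emit (first space-separated token of the input value, 1).
def mapper(key, value):
    yield value.split(" ")[0], 1

# Sum the counts collected for each key.
def reducer(key, values):
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    # The reducer doubles as the combiner, since summing is associative.
    dumbo.run(mapper, reducer, combiner=reducer)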
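To see what this job computes without a cluster, here is a minimal in-memory simulation of the map, group and reduce steps. It is not how Dumbo or Hadoop actually execute the job: `run_locally` and the sample lines are made up for illustration.

```python
from collections import defaultdict


def mapper(key, value):
    yield value.split(" ")[0], 1


def reducer(key, values):
    yield key, sum(values)


def run_locally(lines):
    """Simulate map, group-by-key and reduce in memory (no Hadoop involved)."""
    groups = defaultdict(list)
    # Map phase: the key is just the line index here; the mapper ignores it.
    for index, line in enumerate(lines):
        for out_key, out_value in mapper(index, line):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct key, in sorted key order.
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


# Hypothetical input lines: the job counts occurrences of the first token.
sample = ["GET /index.html", "POST /login", "GET /about.html"]
print(run_locally(sample))  # {'GET': 2, 'POST': 1}
```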

 

Since our data are stored in Avro files, we have written Java utility classes to use Avro as the input and output formats of Dumbo jobs, and we have released them in our Open Source avro-utils project.

Avro input for Dumbo

To use Avro files as input for Dumbo, use the jar generated by the avro-utils project (as well as avro and avro-mapred jars) and set the input format to com.tomslabs.grid.avro.AvroAsTextTypedBytesInputFormat:

$ dumbo start <PYTHON_SCRIPT>                                            \
     -input /tmp/word-count.avro                                         \
     -output /tmp/out                                                    \
     -libjar avro-1.5.1.jar                                              \
     -libjar avro-mapred-1.5.1.jar                                       \
     -libjar avro-utils-1.5.3.jar                                        \
     -inputformat com.tomslabs.grid.avro.AvroAsTextTypedBytesInputFormat \
     -hadoop <HADOOP_HOME>                                               \
     -python <PYTHON_HOME>


The Python script’s mapper will get the Avro record as a JSON string in its value parameter (the key parameter is not used).
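As a minimal sketch of such a mapper, the JSON string can be decoded with the standard `json` module before emitting anything. The field name `word` is an assumption made for illustration; use whatever fields your Avro schema defines.

```python
import json


def mapper(key, value):
    # value holds one Avro record serialized as a JSON string; key is unused.
    record = json.loads(value)
    # "word" is a hypothetical field name, not something avro-utils defines.
    yield record["word"], 1


def reducer(key, values):
    yield key, sum(values)
```

These two functions are passed to dumbo.run in the same way as in the word-count example earlier in this post.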

Avro output for Dumbo

You can also use Avro files as the output for Dumbo. This expects the reducer to emit, as its value, a string containing the JSON representation of an Avro record. To store the output in Avro (binary) files instead of text files, you must specify the following properties when you start Dumbo:

   -libjar avro-1.5.1.jar                                                \
   -libjar avro-mapred-1.5.1.jar                                         \
   -libjar avro-utils-1.5.3.jar                                          \
   -outputformat com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat \
   -hadoopconf avro.output.schema=$SCHEMA                                \

 

where SCHEMA is a String containing the JSON representation of the Avro schema to use to create the Avro records.
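A reducer for this output format could look like the sketch below. The record name `WordCount` and its two fields are hypothetical, chosen only to match a word-count job. Emitting the original key alongside the JSON value is also an assumption, since the text above only specifies what the value must contain.

```python
import json

# Hypothetical Avro schema, serialized to the JSON string that would be
# passed as avro.output.schema.
SCHEMA = json.dumps({
    "type": "record",
    "name": "WordCount",
    "fields": [
        {"name": "word", "type": "string"},
        {"name": "count", "type": "long"},
    ],
})


def reducer(key, values):
    # Build a dict matching the schema, then emit it as a JSON string value.
    record = {"word": key, "count": sum(values)}
    yield key, json.dumps(record)
```

Building both the schema and the records with `json.dumps` avoids hand-written JSON strings, which are easy to get wrong when quoting.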

Please note that to use these input and output formats, you must run Dumbo on Hadoop; executing Dumbo locally will not use the specified input/output formats.

We are very pleased at Bestofmedia to contribute back to the Open Source communities which help us do our job every day.
Do not hesitate to clone our project and contribute back, or to report issues and suggest enhancements:

git clone git://github.com/tomslabs/avro-utils.git

 

Posted in hadoop

PhD Position : Statistical Modeling of Web Buzz Mechanics

Here at BestOfMedia we take to heart our mission of better understanding the mechanics of our business: Online, Social Media. And we do that by applying scientific methods that produce strong, reproducible results.

Of course we’re very keen on Machine Learning and believe this scientific field to be of utmost importance in our business. Our R&D team already produces interesting results with such methods on many parts of our business (more about that in later posts), but there is one topic that is central and yet needs a long-term focus to be correctly analyzed: Web Buzz. Indeed, correctly analyzing buzz mechanics on the Internet and on Social Networks may explain, and make reproducible, the mechanics that drive the audience to or from our sites. Truly, it’s a central topic!

Also, as we draw heavily from the existing Web Science literature (think of WSDM, WWW, etc.) on other tasks, we’d like to contribute back to the academic world. Another benefit of a PhD student is that, although integrated into the R&D team, he or she may have different and perhaps longer-term views on the topic.

For all these reasons we are opening a PhD position on the subject, integrated into our R&D team. The extended subject is here: phd-bestofmedia-buzz. But before I tell you more about it, just let me shout this: this subject is too cool! It’s the kinda stuff I would have loved to have an opportunity to work on when I was a student.

The objective of the PhD is to propose models that explain the emergence of new topics in news or RSS feeds as well as social networks and other sources of online information. A second and most important goal is to be able to rank the potential of a given topic to create a buzz or burst in interest.

The ideal candidate could look like this:

  • passionate about the Web
  • Master’s degree in Machine Learning or related fields
  • not afraid of big data (getting, cleaning, munching, digesting)

and ideally,

  • up-to-date knowledge on trendy, scalable algorithms (LDA or RBM, anyone?)
  • open yet ambitious personality

Drop me a line at ediemert (at) bestofmedia (dot) com if you’re interested!

 

Posted in Innovation, machine_learning, Science