Open recipe database: how to gather, curate and store data

I recently started thinking about creating an open recipe database. The major question that comes with this project is how to gather, curate and store recipes.

Gathering data

Let’s start with something obvious: we cannot rely on users to create recipes from scratch. This is a very time-consuming task, so the first approach would be to find a way to fully automate it.

As I mentioned in the previous article, the main goal of this project is to keep recipe information in a highly structured form. Even if we can easily scrape recipes from the internet, it is usually hard to analyse free text well enough to fit it into a very structured and restricted database. It would involve some advanced machine learning techniques which I have not mastered at all and which would greatly increase this project’s complexity.

To keep things simple, we need to find a way to let users curate the unstructured information for us.

We can find loosely structured recipe databases all around the internet. They usually come in two flavours:

  1. Websites formatting their HTML using hRecipe. It basically uses HTML classes to add semantic information and metadata about recipes. OK, it is more a way to add semantic information to a website than a database per se, but it makes scraping these websites easier.
  2. Open Recipes Format based databases. Fictivekin’s Open Recipes seems to be the most complete one, even though it is intended more as a collection of recipe bookmarks, as mentioned in the repository. This format is YAML based.

Neither of these formats is strict enough for this project, but they will be a good starting point, which will need to be human curated.
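
To give an idea of why the first flavour is easy to scrape, here is a minimal sketch using only Python's standard library. It assumes reasonably well-formed HTML and only looks at two hRecipe class names, fn (the recipe name) and ingredient. The HRecipeParser class is a name I made up for this example; a real scraper would need to handle many more properties and messier markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are ignored by the stack below.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HRecipeParser(HTMLParser):
    """Hypothetical helper: collects the text of hRecipe 'fn' and 'ingredient' elements."""

    def __init__(self):
        super().__init__()
        self.name = None
        self.ingredients = []
        self._stack = []   # one entry per open element: its hRecipe property or None
        self._buffer = []  # text collected for the property currently being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        prop = None
        if "fn" in classes:
            prop = "fn"
        elif "ingredient" in classes:
            prop = "ingredient"
        self._stack.append(prop)
        if prop:
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while we are somewhere inside a tracked element.
        if any(self._stack):
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        prop = self._stack.pop()
        text = " ".join("".join(self._buffer).split())  # normalise whitespace
        if prop == "fn":
            self.name = text
        elif prop == "ingredient":
            self.ingredients.append(text)
```

Feeding a page to the parser with feed() fills name and ingredients. The result is still free text ("2 eggs", "250 g flour"), which is exactly the part that would need human curation.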

Storing data

We will not go in depth here; we will just identify the various entities. This list is not exhaustive, and there are obviously a lot of dependencies between these entities, which will not be detailed in this post.

  • Recipe: obviously
  • Ingredient: it will contain one ingredient. Each ingredient should be unique in the database; we should avoid duplicates as much as possible.
  • Nutrition information: it will contain various nutritional values. It will be associated with the ingredient entity.
  • Diet information: whether this is vegan/kosher/halal/vegetarian/whatever friendly.
  • Technique: cooking technique, such as grill, peel, cut, etc.
  • Cooking step: Combination of a technique + several ingredients.
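
To make the list above a bit more concrete, here is a rough sketch of these entities as Python dataclasses. Every field name is an assumption of mine (for instance energy_kcal_per_100g), and so is the idea of deriving a recipe's diets from its ingredients; the real schema will be designed later.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Diet(Enum):
    VEGAN = "vegan"
    VEGETARIAN = "vegetarian"
    KOSHER = "kosher"
    HALAL = "halal"


@dataclass(frozen=True)
class Nutrition:
    energy_kcal_per_100g: float  # hypothetical field, for illustration only


@dataclass(frozen=True)
class Ingredient:
    name: str
    nutrition: Optional[Nutrition] = None
    diets: frozenset[Diet] = frozenset()  # diets this ingredient is compatible with


@dataclass(frozen=True)
class Technique:
    name: str  # e.g. "grill", "peel", "cut"


@dataclass
class CookingStep:
    technique: Technique
    ingredients: list[Ingredient]


@dataclass
class Recipe:
    title: str
    steps: list[CookingStep] = field(default_factory=list)

    def ingredients(self) -> set[Ingredient]:
        """All distinct ingredients used across the cooking steps."""
        return {i for step in self.steps for i in step.ingredients}

    def diets(self) -> frozenset[Diet]:
        """Assumption: a recipe fits a diet only if every ingredient does."""
        per_ingredient = [i.diets for i in self.ingredients()]
        if not per_ingredient:
            return frozenset()
        return frozenset.intersection(*per_ingredient)
```

Making Ingredient immutable and hashable is a deliberate choice in this sketch: it lets us put ingredients in sets, which goes in the direction of the "no duplicates" requirement.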

There will also be “links” between these entities. For the moment, I have identified just one of them:

  • Substitutes: link between two ingredients. This link will probably also contain information such as a weight/volume ratio, etc.
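
As a sketch of what such a link could look like, the class below stores a single conversion ratio between two ingredients. The Substitute name, its fields and the convention "units of target per unit of source" are all assumptions for the sake of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substitute:
    """Hypothetical link: `ratio` units of `target` replace one unit of `source`."""

    source: str
    target: str
    ratio: float

    def __post_init__(self):
        if self.ratio <= 0:
            raise ValueError("ratio must be strictly positive")

    def convert(self, quantity: float) -> float:
        """Quantity of `target` needed to replace `quantity` of `source`."""
        return quantity * self.ratio

    def inverse(self) -> "Substitute":
        """The same link read in the other direction."""
        return Substitute(self.target, self.source, 1 / self.ratio)
```

Whether a substitution is always reversible is an open question; inverse() simply assumes it is.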

Curating data

Now that we have some data, we need to curate it to make it fit our strict table schema. The idea here is to let users do that, a bit like Reddit lets users curate internet links.

We will need to design a proper algorithm here, and I will not get into that until another post. However, the outline will be the following:

  • Each entity (ingredient, recipe, …) will be in one of four states:
    • New: a user is editing the entity; it is not ready for voting.
    • Voting: the entity is currently a candidate for integration into the database.
    • Rejected: the entity has been rejected; it will not be inserted in the database.
    • Integrated: the entity is integrated in the database.
  • Every user is associated with a rank (karma) which will weight their votes.
  • During the voting state, users are invited to upvote or downvote the entity.
  • When an entity’s score reaches a threshold (1 or -1), it will be accepted or rejected, respectively.
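
The outline above can be sketched as a small state machine. This is only an illustration, not the final algorithm: I assume here that karma is a positive number, that the score is the sum of karma-weighted votes, and that each user has a single vote per entity which they can change. The Candidate class and its method names are made up.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum

ACCEPT_THRESHOLD = 1.0
REJECT_THRESHOLD = -1.0


class State(Enum):
    NEW = "new"
    VOTING = "voting"
    REJECTED = "rejected"
    INTEGRATED = "integrated"


@dataclass
class Candidate:
    """Hypothetical wrapper tracking the curation state of one entity."""

    name: str
    state: State = State.NEW
    votes: dict[str, float] = field(default_factory=dict)  # user -> signed karma

    @property
    def score(self) -> float:
        return sum(self.votes.values())

    def submit(self) -> None:
        """The author is done editing: open the entity for voting."""
        if self.state is not State.NEW:
            raise ValueError("only a new entity can be submitted")
        self.state = State.VOTING

    def vote(self, user: str, karma: float, up: bool) -> None:
        """Record a karma-weighted vote and settle the entity if a threshold is met."""
        if self.state is not State.VOTING:
            raise ValueError("entity is not open for voting")
        self.votes[user] = karma if up else -karma  # a new vote replaces the old one
        if self.score >= ACCEPT_THRESHOLD:
            self.state = State.INTEGRATED
        elif self.score <= REJECT_THRESHOLD:
            self.state = State.REJECTED
```

With these assumptions, two users with a karma of 0.5 each are enough to integrate an entity, and a single user with a karma of 1 can do it alone, which shows how sensitive the outcome is to the way karma is computed.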

Tricky parts

Every entity will need to be voted on in order to be integrated into the database. This will probably prevent us from integrating duplicates and crappy data. However, it may also discourage users from posting, as they will not know whether their work will be integrated.

Another problem will be entity dependencies: recipes will depend on ingredients, which means that in order to be included in the database, a recipe needs all its ingredients to be already included. What if some ingredients are rejected? How should we handle that? I need to give this more thought later…

Do we need moderators? I would prefer not, in order to keep the database as neutral as possible, but depending on how active users are, they could be needed.
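
I do not have an answer yet, but at least detecting the situation is simple. The sketch below only reports which ingredients block a recipe; it does not decide what to do about them. The function name, the report format and the choice to treat unknown ingredients as new are assumptions.

```python
from enum import Enum


class State(Enum):
    NEW = "new"
    VOTING = "voting"
    REJECTED = "rejected"
    INTEGRATED = "integrated"


def dependency_report(ingredient_names, states):
    """Split a recipe's ingredients into those still pending and those rejected.

    `states` maps an ingredient name to its State; unknown names are treated
    as NEW (an assumption made for this sketch).
    """
    pending = []
    rejected = []
    for name in ingredient_names:
        state = states.get(name, State.NEW)
        if state is State.REJECTED:
            rejected.append(name)
        elif state is not State.INTEGRATED:
            pending.append(name)
    return {
        "ready": not pending and not rejected,
        "pending": pending,
        "rejected": rejected,
    }
```

A recipe with pending ingredients could simply wait, while rejected ingredients are the real open question.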

Coming next

Alright, enough talk, we need to start implementing this.

Next step: entity system implementation.