Until now — at least in the research lab. Facebook AI researchers have built a system that can analyze a photo of food and then create a recipe from scratch. Snap a photo of a particular dish and, within seconds, the system can analyze the image and generate a recipe with a list of ingredients and steps needed to create the dish. It can’t look at a photo of a particular pie or pancake and determine the exact type of flour used or the skillet or oven temperature, but the system will come up with a recipe for a very credible (and tasty) approximation. While the system is only for research, it has proved to be an interesting challenge for the broader project of teaching machines to see and understand the world.
Their “inverse cooking” system uses computer vision, technology that extracts information from digital images and videos to give computers a high level of understanding of the visual world. Those smartphone apps that allow you to identify plant and dog species, or that scan your credit card so that you won’t have to tap in all the numbers? That’s computer vision.
But this is no ordinary computer vision system: It has extra gray matter. It leverages not one but two neural networks, algorithms that are designed to recognize patterns in digital images, whether they are fern fronds, long muzzles or embossed characters. Michal Drozdzal, a Research Scientist at Facebook AI Research, explains that the inverse cooking system splits the image-to-recipe problem into two parts: One neural network identifies the ingredients that it sees in the dish, while the other devises a recipe from the list.
Drozdzal says this enhanced computer vision system is more effective than retrieval-based image-to-recipe techniques, which work to recognize the tasty treat in question and then search a database of preexisting recipes. “Our system outperformed the retrieval system both on ingredient predictions and on generating plausible recipes,” says Drozdzal.
The proof is in the paella: Drozdzal and fellow Facebook AI Research scientist Adriana Romero, who met while studying for doctorates at the University of Barcelona, claim their system might even trot out a decent recipe for the devilishly complicated Spanish rice dish.
That is no mean feat, because food recognition is one of the toughest areas of natural image understanding. Food comes in all shapes and sizes (what AI scientists call “high intraclass variability”) and changes appearance when it’s cooked. Take an onion, which could be white, yellow or red. It can be sliced into rings, slivers or chunks. You could bake, boil, braise, grill, fry or roast an onion, or just choose to eat it raw. A sautéed onion will be translucent, but sizzle the vegetable in butter, sugar and balsamic vinegar and you get brown manna from heaven: caramelized onions. And, just like an onion, the problem has another layer of complexity: The vegetable is certainly present in paella, stews and curry, but it is often invisible to the naked eye.
This is why any system of visual ingredient detection and recipe generation benefits from some high-level reasoning and prior knowledge: A standard paella contains some quantity of chopped and fried onion, a cake will likely contain sugar and no more than a pinch of salt, and a croissant will presumably include butter.
Previous image-to-recipe programs were a bit simpler in their approach. In fact, they thought more like gopher librarians than like Le Cordon Bleu chefs. Drozdzal explains that these less sophisticated systems merely retrieved a recipe from a fixed data set based on the similarity of the photo to the images on file: “It was like having a photo of the food and then searching in a huge cookbook of pictures to match it up.”
Naturally, the success of this method depends on the size and quality of the cookbook, the handiwork of both the photographer and the chef, and some pot luck. “It is hard to match if a recipe isn’t in the data set and the image or dish appearance are different to the data set,” says Drozdzal. In other words, the retrieval approach is like finding a needle in a haystack when the system doesn’t know what a needle looks like.
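To make the retrieval idea concrete, here is a minimal Python sketch of a nearest-neighbor lookup. It assumes each photo has already been turned into a list of numbers (a feature vector). The three-number vectors, the `COOKBOOK` dictionary, and the function names are invented for illustration and are not the researchers' code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_recipe(query_embedding, cookbook):
    """Return (title, score) for the stored photo most similar to the query."""
    best_title, best_score = None, -1.0
    for title, embedding in cookbook.items():
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_title, best_score = title, score
    return best_title, best_score


# Hypothetical "cookbook of pictures": recipe title -> photo feature vector.
COOKBOOK = {
    "paella": [0.9, 0.1, 0.3],
    "risotto": [0.8, 0.3, 0.1],
    "pancakes": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    title, score = retrieve_recipe([0.9, 0.1, 0.35], COOKBOOK)
    print(title, round(score, 3))
```

Note the weakness Drozdzal describes: the lookup always returns the closest stored recipe, even when the dish in the photo is not in the cookbook at all.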
Drozdzal and Romero were convinced there was a better method. They wondered what would happen if they built in an extra step to the recipe generation pipeline: a system that could predict the ingredients.
The ingredient-predicting network works more or less according to the problem-solving principle of Occam’s razor: that the simplest explanation is usually the right one. For example, Drozdzal, Romero, and their team took the Recipe1M data set, which has nearly 17,000 ingredients, and whittled it down to a more manageable 1,500. They also trained the model to learn which ingredients often appear together, like salt and pepper, cheese and tomato, and cinnamon and sugar.
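The "appear together" idea can be illustrated by counting ingredient pairs in a handful of made-up recipes. This is only the raw statistic a model could pick up on. The actual system learns such dependencies inside a neural network, and the recipes and function name below are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy recipes, each reduced to its set of ingredients (invented data).
RECIPES = [
    {"salt", "pepper", "onion"},
    {"salt", "pepper", "tomato", "cheese"},
    {"cinnamon", "sugar", "flour"},
    {"cheese", "tomato", "basil"},
    {"cinnamon", "sugar", "butter"},
]


def cooccurrence_counts(recipes):
    """Count how many recipes contain each unordered pair of ingredients."""
    counts = Counter()
    for ingredients in recipes:
        # Sorting gives every pair one canonical (alphabetical) key.
        for pair in combinations(sorted(ingredients), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    counts = cooccurrence_counts(RECIPES)
    for pair in [("pepper", "salt"), ("cheese", "tomato"), ("onion", "sugar")]:
        print(pair, counts[pair])
```

Pairs that keep turning up together, such as salt and pepper, score high, while pairs that never share a recipe score zero.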
The recipe-generating network also works from the Recipe1M data set, which the team slimmed down from around 1 million recipes to approximately 350,000. Recipes that made the cut all contained images and had two or more ingredients or instructions. The data set furnishes the neural network with a vocabulary of nearly 25,000 unique words in addition to the information from the image and the ingredient list. The network also analyzes the interplay between image and ingredients for insights on how food was processed to produce the resulting dish.
Bingo — recipe prediction is now a game of divide and conquer rather than trial and error. Back to that paella: The first neural network might recognize rice, onions, tomatoes, and, depending on the generosity of the chef, some seafood. The second neural network starts generating a recipe from the inferred ingredients: Slice and fry the onion; stir in bomba rice; add chopped tomatoes and, finally, some prawns and mussels. The entire system is bringing its own high-level reasoning to bear on three sources of information: the image, the corresponding list of ingredients, and the system’s own prior knowledge. It makes well-educated guesses rather than turning recipe generation into a giant identity parade.
The inverse cooking project is already outperforming the retrieval approach. Drozdzal and Romero’s paper cites a recipe for an English muffin laden with cheese, broccoli and tomato. The inverse cooking system aced all the ingredients, while the retrieval system identified only the cheese and tomato. (The retrieval system also saw a cracker, some lettuce, and some Miracle Whip.) Around 55% of humans also judged the inverse cooking system’s recipes to be successful, compared with approximately 48% for the retrieval approach.
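The English muffin comparison can be put into numbers with two common set-overlap scores, intersection over union (IoU) and F1. The sketch below assumes the ingredient lists are exactly those mentioned above and that ingredients are compared as plain lowercase strings. Whether this matches the paper's evaluation procedure is not something this article states.

```python
def iou(predicted, truth):
    """Intersection over union of two ingredient sets."""
    union = predicted | truth
    return len(predicted & truth) / len(union) if union else 1.0


def f1(predicted, truth):
    """F1 score: harmonic mean of precision and recall over ingredient sets."""
    total = len(predicted) + len(truth)
    return 2 * len(predicted & truth) / total if total else 1.0


# Ingredient lists as described in the article (assumed to be complete).
TRUTH = {"cheese", "broccoli", "tomato"}
INVERSE_COOKING = {"cheese", "broccoli", "tomato"}
RETRIEVAL = {"cheese", "tomato", "cracker", "lettuce", "miracle whip"}

if __name__ == "__main__":
    for name, predicted in [("inverse cooking", INVERSE_COOKING), ("retrieval", RETRIEVAL)]:
        print(name, round(iou(predicted, TRUTH), 2), round(f1(predicted, TRUTH), 2))
```

Under these assumptions the inverse cooking prediction scores a perfect 1.0 on both measures, while the retrieval result gets an IoU of one third (two shared ingredients out of six distinct ones) and an F1 of 0.5.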
The research project will have educational as well as epicurean benefits, says Romero: “The food that we consume nowadays has changed from being home-cooked to takeout, so we’ve lost the information about how the food was prepared.”
And it is not just foodies who stand to benefit from the principle of conditional generation, or neural networks that generate outputs from two modalities: the image and the inferred text. “If you had an image of people participating in some kind of activity, the neural networks could tell you what they’re doing,” says Romero. For example, a system could analyze a photo of a street performance in Barcelona — in conjunction with inferred information about the movements of performers and their costumes — to tell tourists that they’re watching a sardana, a Catalan folk dance.
In the meantime, the inverse cooking creators are continuing to fine-tune the system. “Sometimes it can’t predict an ingredient, which means that it won’t be present in the recipe,” says Drozdzal. They also want to train the system to deal with the problem of visually similar foods, whether they’re spaghetti and noodles, mayonnaise and sour cream, or tofu and paneer.
Romero adds that she and Drozdzal still haven’t taken the final, most important step in their inverse cooking system: “We haven’t got around to cooking yet.”