Introduction

The rapid growth of food-related big data, driven by social media, the Internet of Things (IoT), and Artificial Intelligence (AI), has given rise to an interdisciplinary research field known as food computing1. Due to its significant implications for human health, diet, and disease, food computing has become a key area of focus in disciplines such as computer vision, multimedia, medicine, health informatics, agriculture, and bioengineering2. First introduced in 2015, the term “food computing” encompasses several major tasks, including food recognition, retrieval, and recommendation3. Among these, food recognition is foundational, involving the identification and classification of food items from visual data (e.g., images or videos), which is crucial for applications such as nutrition tracking, food authentication, smart restaurants and supermarkets, and waste management. Particularly in response to the rise in diet-related diseases, various tools are being developed to enable fast and accurate food logging for dietary monitoring3. Increasing people’s nutrition literacy and improving dietary habits start with raising their awareness of current eating patterns. The importance of these solutions lies in their potential to foster healthy dietary patterns and serve as a preventive measure against chronic diseases including obesity, diabetes, and cardiovascular disease (CVD), the latter of which is strongly associated with high intake of red meat and processed meat4,5.

According to the Global Burden of Disease Studies, the incidence of diet-related CVD deaths and Disability-adjusted Life Years (DALYs) in 2019 amounted to 6.9 million and 153.2 million, respectively, indicating a substantial increase of 43.8% and 34.3% since 19906,7. Many countries with high diet-related CVD death and DALY rates are situated in Central Asia, while the lowest death rates are found in the high-income Asia-Pacific region. Among the specific dietary patterns identified as major contributors to CVD deaths and DALYs are diets characterized by high salt intake, excessive empty calories, insufficient consumption of fruit, nuts, seeds, and vegetables, and low Omega-3 intake4,7,8.

Over the last decade, Artificial Intelligence (AI) has found a wide range of applications in the food industry and agriculture9. In combination with other components such as smart sensors, big data, and blockchain technologies, AI is being used at various stages of the food chain, such as food classification, product development, quality testing and improvement, supply chain management, and food safety monitoring10. For example, in food safety, machine learning has greatly improved the detection of potential contamination sources during production11. Convolutional Neural Networks (CNNs) can assist in automated quality control, preventing inferior products from entering the market by precisely identifying small defects or foreign objects on food items. Deep learning (DL) technologies such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) have been used for real-time monitoring and detailed classification of product appearance on production lines, enhancing overall product quality10. In agriculture, several AI-enabled surveillance systems have been developed to assist farmers with detecting pests and monitoring crop and soil issues to maximize harvest yield12,13. Precision agriculture, which applies AI systems at every step from planting and watering to the final crop harvest, is attracting growing interest14,15,16. To automate the manual crop collection process, several computer vision (CV) models have been developed for fruit and vegetable detection17,18,19. Further work has addressed ripeness assessment of fruits and vegetables, which is essential not only in the field for automating harvest collection but also in warehouses and food supply chains for checking product freshness20.

Beyond this, advances in CV have significantly increased the functionality of innovative tools, such as mobile applications, that offer convenient ways for individuals to automatically monitor and manage their nutritional intake21,22,23. In this work, we focus on the application of AI to facilitate nutrition literacy and dietary interventions. Empowering individuals to make well-informed decisions about their dietary selections can help reduce the burden of diet-related diseases and enhance overall health outcomes; this starts with raising awareness of current eating behaviors4,6. Capturing food images not only reduces the burden of maintaining traditional food diaries and benefits individuals lacking access to expert healthcare resources or consultation, but also boosts social support for collectively achieving healthy eating objectives when images are shared on social media platforms21,24. Furthermore, food images contain precise contextual information that healthcare professionals can use to provide personalized diagnoses and treatment recommendations.

The development of automatic food logging using CV requires a high-quality dataset24. A major challenge stems from the nature of the food domain: high intra-class and inter-class variability, diverse visual characteristics, and large variation in shape. Diet tracking with CV can be addressed with two approaches: image classification and object detection. In the first case, the entire image is assigned a single category, so only one food item may be present in the image. Several web-crawled, publicly available food classification datasets have been released, such as the Food-101 dataset25, which features 101 European food categories with 1,000 images per class and has established itself as a benchmark for numerous recognition models6,26. Furthermore, the ISIA Food-500 dataset, covering Asian, European, and African cuisines, contains 500 food categories with over 400,000 images27. A DL-based food recognition system has also been developed and trained on a dataset of 400,000 internet-scraped images, identifying 756 food classes predominantly consumed in Singapore28.

The second approach, food detection, is a more complex task, since it combines object classification and localization sub-tasks. In this case, food scenes containing multiple food items are considered. For this approach, very few publicly available datasets have been created. For example, the BTBUFood-60 dataset contains 60,000 images with 78,000 labeled instances across 60 food categories from Japanese cuisine29. Another publicly available dataset is UNIMIB2016, which contains 1,027 images with a total of 3,616 instances spanning 73 food categories from Western cuisine26. Currently, most works applying CV to food recognition solve the classification problem. The main limitation of classification datasets is that only one food item per image can be predicted, which is often unsuitable in real life, since food scenes comprising several food items are common. Therefore, for user convenience and fast food logging, food detection and localization datasets and models are required. Yet no publicly available dataset covers Central Asian cuisine. This study aims to develop a food scene dataset that contains commonly consumed food items in Central Asia and to train a computer vision model for automatic food detection. The dataset includes annotations for each food item, labeled with rectangular bounding boxes and corresponding food class names.

In our previous work, we presented the first Central Asian Food Dataset (CAFD) for a classification task that contains 16,499 images of 42 Central Asian food items30. This work presents the first Central Asian Food Scenes Dataset (CAFSD), containing 21,306 images with 69,865 instances across 239 food classes. The dataset encompasses a wide array of food categories, including local Central Asian cuisine as well as Western, Mediterranean, Chinese, and others commonly consumed in Central Asia. This dataset presents a significant contribution to CV-assisted food and dietary tracking applications, aiming to enhance their utility by encompassing the diverse culinary landscape of Central Asia.

Methods

Dataset annotation

Developing and training an object detection model requires a high-quality dataset (Fig. 1). Food object detection is a supervised learning task that requires input-output data pairs: the model learns distinctive food features to localize each food item and then identify it. Here, the input consists of images containing various food items, while the corresponding output is represented by label files. These label files encode both the food classes and the coordinates indicating the location of each food item within the image (i.e., the bounding box coordinates). Label coordinates can be rectangular or polygonal, and the appropriate format is selected depending on the domain and the type of object to be detected. For food object detection, we selected the rectangular bounding box format, which is preferred for its simplicity, efficiency, and consistency, making it faster and easier to implement, especially in large datasets. It is sufficient for many object detection tasks and requires less computational power than polygon annotations, which are more precise but complex. Furthermore, many object detection models are optimized for the rectangular bounding box format. Various image annotation platforms and software tools are available to generate the label files31. To annotate the dataset with bounding boxes, we used the Roboflow platform32 due to its convenience for dataset statistics extraction and annotation management. Fig. 2 illustrates sample annotations of Central Asian traditional food scenes with commonly consumed food items such as achichuk, taba-nan, and tandyr samsa.
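
To make the label format concrete, the sketch below parses a single YOLO-style label file; the file name is a placeholder, and we assume the common normalized (class id, x-center, y-center, width, height) convention rather than the exact export format used here.

```python
# Minimal sketch of parsing a YOLO-format label file (hypothetical file name).
# Each line: <class_id> <x_center> <y_center> <width> <height>, with all
# coordinates normalized to [0, 1] relative to image width and height.
from pathlib import Path

def parse_yolo_labels(label_path: str) -> list[dict]:
    boxes = []
    for line in Path(label_path).read_text().splitlines():
        class_id, xc, yc, w, h = line.split()
        boxes.append({
            "class_id": int(class_id),
            "x_center": float(xc),  # normalized box center (x)
            "y_center": float(yc),  # normalized box center (y)
            "width": float(w),      # normalized box width
            "height": float(h),     # normalized box height
        })
    return boxes

# Example: a food scene with several annotated items, e.g. achichuk and samsa.
# boxes = parse_yolo_labels("food_scene_0001.txt")
```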

Fig. 1: Statistics of dataset instances across 18 coarse classes used in the annotation pipeline.

Fig. 2: Annotation sample of the Kazakh traditional food scenes on the Roboflow platform.

Before starting the annotation process, we developed a protocol for an efficient annotation pipeline. Due to the large number of classes, we first created an ontology of food categories, and the annotation was performed in two stages. First, all food items were labeled with one of 18 coarse classes: vegetables, baked flour-based products, cooked dishes, fruits, herbs, meat dishes, desserts, salads, sauces, drinks, dairy, fast food, soups, sides, nuts and seeds, pickled and fermented foods, egg products, and cereals. In the second stage, the coarse classes were split into finer-grained labels, resulting in 239 classes.
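
As an illustration of this two-stage ontology, the coarse-to-fine mapping can be stored as a simple lookup table; the fine-grained names below are a small hypothetical excerpt drawn from classes mentioned in this paper, not the full 239-class list.

```python
# Hypothetical excerpt of the two-stage annotation ontology: each of the 18
# coarse classes maps to its fine-grained labels (239 in total).
COARSE_TO_FINE = {
    "baked flour-based products": ["taba-nan", "tandyr samsa"],
    "salads": ["achichuk"],
    "dairy": ["kurt", "smetana", "irimshik", "suzbe"],
    # ... remaining coarse classes omitted in this sketch
}

# Reverse lookup used in the second annotation stage: fine label -> coarse class.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in COARSE_TO_FINE.items()
    for fine in fines
}

assert FINE_TO_COARSE["achichuk"] == "salads"
```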

Dataset evaluation model

For the CAFSD evaluation and parametric experiments, the You Only Look Once (YOLO) model was used, a state-of-the-art object detection algorithm renowned for its speed and accuracy33. Unlike traditional object detection methods, YOLO employs a single convolutional neural network (CNN) to predict bounding boxes and class probabilities simultaneously, making it fast compared to two-stage detection alternatives. YOLO divides the input image into a grid and predicts bounding boxes and class probabilities for each grid cell. Each bounding box is associated with a confidence score indicating the model’s confidence that the box contains an object and the accuracy of its localization. YOLO also employs non-maximum suppression to filter out redundant bounding boxes and improve detection accuracy33.
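
To make the non-maximum suppression step concrete, below is a minimal NumPy sketch of the standard greedy algorithm; the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are illustrative assumptions, not the exact implementation inside YOLO.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list[int]:
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    order = scores.argsort()[::-1]  # process boxes by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        # Discard boxes that overlap the kept box too much (redundant detections).
        order = rest[iou <= iou_thresh]
    return keep
```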

YOLOv8, an evolution of the YOLO architecture, enhances its predecessor by introducing various improvements. YOLOv8 utilizes a novel CSPDarknet53 backbone architecture, which improves feature extraction efficiency while maintaining accuracy34. Additionally, it incorporates path-aggregation networks (PAN) and spatial pyramid pooling fast (SPPF) modules to capture multi-scale features effectively35,36. YOLOv8 also employs advanced data augmentation techniques and a progressive scaling strategy during training to further boost performance. These enhancements improve accuracy and robustness while retaining the real-time inference capability that YOLO is known for. Overall, YOLOv8 represents a significant advancement in object detection, offering superior performance and efficiency for various computer vision tasks.

Central Asian food scenes dataset

In this work, we propose the first Central Asian Food Scenes Dataset, which contains 21,306 images with 69,865 instances across 239 food classes. To ensure that the dataset covers a wide variety of food items, we took as a benchmark the ontology of the Global Individual Food consumption data Tool developed by the Food and Agriculture Organization (FAO) together with the World Health Organization (WHO)37. The dataset contains food items across 18 coarse classes: vegetables, baked flour-based products, cooked dishes, fruits, herbs, meat dishes, desserts, salads, sauces, drinks, dairy, fast food, soups, sides, nuts and seeds, pickled and fermented foods, egg products, and cereals. Figure 1 illustrates the overall distribution of class instances across these categories.

The dataset contains open-source images scraped from web search engines (i.e., Google, YouTube, and Yandex; 15,939 images) and food images we collected from everyday life (2,324 images). To further extend the number of instances of underrepresented classes, we scraped open-source videos and extracted frames at a rate of one frame per second (3,043 images). The dataset was checked and cleaned for duplicates using the Python ImageHash library. Furthermore, we filtered out images smaller than 30 kB and replaced them through additional iterative data scraping and duplicate checks to ensure the high quality of the dataset.
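
A minimal sketch of this cleaning step is shown below, assuming the `imagehash` and `Pillow` packages; the directory layout, file extension, and use of a perceptual hash are illustrative choices, not the exact pipeline used.

```python
# Sketch of duplicate and low-quality image filtering; paths, extensions,
# and the hash type are illustrative assumptions.
from pathlib import Path
from PIL import Image
import imagehash

MIN_SIZE_BYTES = 30 * 1024  # images below 30 kB are flagged for replacement

def clean_image_dir(image_dir: str):
    seen_hashes = {}
    duplicates, too_small = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        if path.stat().st_size < MIN_SIZE_BYTES:
            too_small.append(path)  # low-quality: queue for re-scraping
            continue
        h = imagehash.phash(Image.open(path))  # perceptual hash of image content
        if h in seen_hashes:
            duplicates.append(path)  # near-duplicate of an earlier image
        else:
            seen_hashes[h] = path
    return duplicates, too_small
```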

The proposed Central Asian Food Scenes Dataset is diverse in terms of image quality, lighting conditions, and capture device (i.e., low-quality snapshots from mobile phones, high-resolution camera images, etc.). The dataset is imbalanced, with a minimum of 40 and a maximum of 3,050 instances per class. Figure 3 illustrates the distribution of the number of bounding boxes per image in the test and validation sets. It can be seen that the majority of food scenes contain fewer than 15 bounding boxes per image.

Fig. 3: The distribution of the number of bounding boxes per image in the validation and test sets.

Figure 4 illustrates annotated food scene samples based on the annotation rules we followed to create the dataset: liquid items such as beverages and soups are annotated together with the dish itself (see Fig. 4a); solid food items are annotated without the plate (see Fig. 4a); when one class is located on top of another, the annotations are made as shown in Fig. 4b; and when one class is obscured by another such that part of the background class is not visible, we highlight only the visible part (see the ‘Salad leaves’ class annotation in Fig. 4b).

Fig. 4: Sample annotations based on the annotation protocol, illustrating overlapping and non-overlapping cases.

Results

To evaluate the dataset, we performed parametric experiments using the YOLOv8 model, a state-of-the-art object detection algorithm33. For model training, the dataset was split into three sets: a train set with 17,046 images (55,422 instances; 80% of the dataset), a validation set with 2,084 images (7,062 instances; 10%), and a test set with 2,176 images (7,381 instances; 10%). While splitting the dataset, we enforced a minimum of five instances per class in each set. Furthermore, considering that the data was originally collected from scraped images and video frames, we assigned whole videos to the training, validation, or test folds, so that frames from the same video never appear in different sets, avoiding data leakage during model training.
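
A video-level split of this kind can be expressed with a group-aware splitter; the sketch below uses scikit-learn’s `GroupShuffleSplit` as one possible implementation, not the exact procedure used, and the per-class minimum-instance check is left as a post-hoc step.

```python
# Sketch of a leakage-free split: all frames from one video share a group id,
# so GroupShuffleSplit keeps them in the same fold. Assumes scikit-learn.
from sklearn.model_selection import GroupShuffleSplit

def split_by_video(image_ids, video_ids, test_size=0.2, seed=42):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, holdout_idx = next(splitter.split(image_ids, groups=video_ids))
    return train_idx, holdout_idx

# The 20% holdout can then be split again (again by video) into validation and
# test folds, followed by a check that every class retains at least five
# instances in each of the three sets.
```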

We performed transfer learning on the PyTorch38 platform using YOLOv8 models pre-trained on the COCO (Common Objects in Context) dataset with 80 classes39. COCO is a large-scale, high-quality object detection dataset that contains over 330,000 images with more than 1.5 million object instances and is widely used for training and evaluating machine learning models on object detection and segmentation tasks.

The training was performed on a single Tesla V100 GPU on an Nvidia DGX A100 server. Models were trained for 150 epochs with a learning rate of 0.001, a batch size of 16, an image size of 640 × 640 pixels, and automatic YOLOv8 augmentation. The trained models were assessed using two metrics: Mean Average Precision 50 (mAP50), which evaluates the precision of detection results at an intersection-over-union (IoU) threshold of 0.5, and mAP95, which follows a similar principle but averages over IoU thresholds from 0.5 to 0.95, providing a measure of precision under more stringent criteria. Together, mAP50 and mAP95 offer a comprehensive assessment of object detection performance across IoU thresholds, with mAP50 covering typical detection scenarios and mAP95 emphasizing precise localization40.
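
Under the stated configuration, the transfer learning run can be reproduced with the Ultralytics API roughly as follows; the dataset YAML path is a placeholder, `lr0` is Ultralytics’ name for the initial learning rate, and the nano checkpoint can be swapped for the s/m/l/x variants.

```python
# Sketch of YOLOv8 transfer learning with the hyperparameters reported above;
# "cafsd.yaml" is a placeholder for the dataset configuration file.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")         # COCO-pretrained weights (nano variant)
model.train(
    data="cafsd.yaml",             # paths to train/val/test splits and 239 class names
    epochs=150,
    imgsz=640,
    batch=16,
    lr0=0.001,                     # initial learning rate
)
metrics = model.val(split="test")  # evaluates mAP50 and mAP50-95 on the test set
print(metrics.box.map50, metrics.box.map)
```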

Table 1 summarizes the models’ results on the validation and test sets and illustrates the improvement in mAP50 and mAP95 metrics with model size. The best performance was obtained by training the YOLOv8x model with more than 68 million parameters, for which the mAP50 and mAP95 scores on the test set are 0.677 and 0.601, respectively. The close mAP50 and mAP95 metrics on the validation and test sets across different model sizes indicate that no overfitting occurred during training. Besides model accuracy, inference time is important, as it determines how quickly the model can process data and return results, shapes the user experience, and helps ensure that AI systems are effective and practical in real-world scenarios. A shorter inference time means the model can analyze information faster, which is crucial for real-time tasks where prompt results are necessary, such as autonomous systems, medical diagnosis, or video and image processing. For comparison, we measured inference times both on the Nvidia Tesla V100 GPU on which the models were trained and on a high-performance computer with an Nvidia RTX 4090 GPU card. The results in Table 1 show that inference on the Nvidia Tesla V100 GPU for the largest YOLOv8x and smallest YOLOv8n models takes 5.1 ms and 0.7 ms, respectively, whereas on the Nvidia RTX 4090 GPU card, the inference time increases from 1.6 ms to 7.8 ms with model size.

Table 1: YOLOv8 model training results on the Central Asian Food Scenes Dataset, model sizes, and inference times.

Discussion

One of the main challenges for CV in the food domain is the variability of food images. Environmental and technical factors such as lighting conditions and camera angles, as well as differences in cooking styles and ingredients, make food recognition more difficult than many other image tasks. Moreover, fine-grained classification issues arise from intra-class variance, where foods within the same category exhibit diverse appearances due to factors such as cooking styles and cultural influences. These challenges highlight the complexity of accurately detecting and classifying food items41,42. Fig. 5 illustrates intra-class variation samples for six classes of our CAFSD and their respective mAP50 scores.

Fig. 5: Samples of six different classes illustrating intra-class variation, with their mAP50 scores.

Figure 6a illustrates the variation of mAP50 scores across classes with different numbers of instances for both the validation and test sets. Due to the complexity of the food domain, classes with low intra-class variation may need fewer samples to achieve a given class mAP than classes with high intra-class variation. Furthermore, image quality and bounding box size also affect the output predictions, as observed in Fig. 6b, which shows the variation of the mAP50 score with bounding box size for both the validation and test sets. Generally, larger bounding boxes tend to yield more accurate predictions and higher mAP50 scores, likely due to the presence of finer features and details.

Fig. 6: Detailed evaluation of model performance based on the mAP50 score and dataset parameters: (a) average mAP50 score for classes with different numbers of instances in the test and validation sets; (b) average mAP50 score variation with the bounding boxes’ diagonal size for the test and validation sets; (c) average mAP50 score per image for food scenes with different numbers of food items.

Besides the above-mentioned intra-class variation, several other factors affect model performance, such as cluttered backgrounds and a large number of bounding boxes per image. Figure 6c shows that, in general, the mAP50 score decreases as the number of bounding boxes per image increases in both the validation and test sets.

Overall, the training results on the CAFSD suggest a strong capacity for generalizing, localizing, and predicting 239 food classes, showing a noticeable improvement over results reported on other publicly available datasets. In the parametric experiments performed on our CAFSD, the best results were obtained using the YOLOv8x model (i.e., mAP50 of 69.9% and 67.7% for the validation and test sets, respectively). For comparison with other detection datasets, a macro average accuracy (MAA) of 56% across 65 classes has been reported on UNIMIB201628, and experiments on the BTBUFood-60 dataset with 60 food categories using the VGG16 model resulted in an mAP of 67.7%29. By comparison, parametric experiments on our earlier CAFD classification dataset with 42 classes yielded Top-1 and Top-5 accuracies of 88.70% and 98.59%, respectively, using the ResNet152 model30. Historically, the Central Asian diet is known for its high consumption of meat dishes and dairy products due to the nomadic lifestyle43. Figures 7 and 8 show the distribution of different dishes within these categories.

Fig. 7: Distribution of instances across meat and processed meat products.

Fig. 8: Distribution of instances across dairy products.

Figure 7 illustrates that the dataset’s meat-based classes with the highest number of instances include beef/lamb shashlik (11.9%), chicken shashlik (9.5%), sausages (8.8%), and fried beef/lamb (8.4%). Beef and lamb are grouped due to the difficulty in visually distinguishing these types of red meat. Notably, kazy-karta, a national dish based on horse meat, represents 6.9% of these instances. This distribution mirrors national statistics from the Bureau of National Statistics, indicating that beef is the most consumed meat in Kazakhstan, with an average of 24.68 kg per capita in 202344. The per capita consumption of horse meat is 6.74 kg. Lamb dishes and chicken are also popular, with per capita consumption at 5.29 kg and 5.04 kg, respectively. Additionally, sausages are consumed at 2.12 kg per capita, while minced meat products have a per capita consumption of 7.07 kg.

High meat consumption in Central Asia is influenced by cultural preferences and local availability. The average meat consumption in Central Asia ranges from 50 to 70 kg per capita annually, with an average daily intake of 124.76 g/day, among the highest globally. According to the World Population Review, Kazakhstan has the highest per capita lamb consumption worldwide, averaging 8.5 kg per person annually. Recent studies suggest that shifting from a diet high in animal-based proteins to one higher in plant-based proteins may help to reduce risk factors associated with cardiovascular diseases and overall mortality43.

As for the dairy products, Fig. 8 depicts the distribution of image instances in the dataset by dairy product type. The most frequently represented dairy products are smetana (18.4%), kurt (15.7%), and kymyz-kymyryan (15.6%), followed by cheese (14.4%), irimshik (9.2%), butter (8.1%), suzbe (6.7%), airan-katyk (5.6%), milk (4.9%), and condensed milk (1.5%). According to the Bureau of National Statistics, per capita dairy product consumption in Kazakhstan was 227.2 kg in 202344. This includes 14.889 liters of raw milk, 0.230 liters of concentrated milk without sugar, 13.166 liters of airan, 3.818 liters of smetana, 2.198 kg of irimshik, 0.581 kg of processed cheese, 4.224 kg of suzbe, and 1.084 kg of butter, among other products44.

Kazakhstan and Central Asia have a rich tradition of dairy consumption, with beverages like kumys and airan being particularly popular, especially in rural areas. Kumys, made from fermented mare’s milk, is known for its distinct taste and slight alcohol content, which results from combined lactic acid and alcoholic fermentation. It is a nutritious, protein- and vitamin-rich drink that may also offer probiotic benefits45,46. Similarly, airan, a fermented milk drink made from mixed goat, cow, and sheep milk, is a daily staple known for its probiotics and easily digestible fatty acids and amino acids, which promote digestive health45,46.

Dairy products such as kurt, smetana, irimshik, and suzbe are also common in local cuisine. Kurt, a hard cheese made from dried sour milk, is a concentrated source of protein and calcium and is often consumed as a snack or with tea. Smetana, similar to sour cream, is widely used in various dishes, contributing fats and vitamins A and D. Irimshik, or dried curd, has a sweet, baked-milk flavor and extended storage capacity, providing protein and minerals. Suzbe, a type of curd made by straining sour milk, is seasonally prepared and used in soups or consumed with milk or water, offering fats and proteins46.

These traditional dairy products are integral to local diets, carrying cultural significance and providing essential nutrients that support health and well-being. Their consumption reflects both their importance in Central Asian culinary practices and their nutritional benefits45,46.

To conclude, CAFSD makes a valuable contribution to computer vision-assisted food and dietary tracking applications, which can also be used in various settings like smart restaurants and supermarkets. It could play a significant role in advancing local CV food-tracking applications, with the potential to enhance nutrition literacy, increase dietary awareness, and promote healthier food choices. These advancements have the potential to impact overall agriculture, the environment, and the food system in the region.

The performance of the trained food object detection models on our dataset demonstrates the effectiveness of the YOLOv8x model, compared to the smaller models of the YOLOv8 family, in accurately retrieving food-related information. As next steps, we plan to integrate the model into a smartphone application and to develop a dataset with macronutrient values and corresponding prediction models for more detailed guidance. A follow-up project will extend the dataset composition and the number of classes by integrating regional food composition databases that encompass local foods, dishes, and their nutritional profiles. This step ensures our research is aligned with precise, region-specific data to support evidence-based decision-making for public health policies tailored to the target population. Furthermore, a comprehensive codebook will be developed to systematically link food images to their nutritional labels, categorizing the visual data by food type, portion size, and preparation method, with each image cross-referenced to nutritional data, including macronutrient and micronutrient profiles. This integration bridges the gap between visual representations of food and their corresponding nutritional values, making the dataset a valuable resource for dietary analysis and research, particularly in nutrition-related interventions.