Deep Learning for Image Classification: A Review

  • Conference paper
  • First Online: 06 March 2024
  • Cite this conference paper


  • Meng Wu,
  • Jin Zhou,
  • Yibin Peng,
  • Shuihua Wang &
  • Yudong Zhang

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 1166)

Included in the following conference series:

  • International Conference on Medical Imaging and Computer-Aided Diagnosis


Image classification is a cornerstone of computer vision and plays a crucial role in many fields. This paper pays close attention to traditional and deep-learning approaches to image classification. Although traditional approaches, including classical machine learning methods built on handcrafted feature extraction, were initially practical for image classification, they suffer from limitations such as poor scalability that have constrained their development. Deep learning approaches were therefore explored, marking a significant step forward in the quest for automated visual understanding. Deep learning methods, particularly CNNs, automatically learn and represent features from raw data, making them suitable for a wide range of image classification tasks, although, like any other approach, they have flaws of their own. In addition, datasets have been instrumental in benchmarking the capabilities of algorithms, and transfer learning approaches have positively impacted image classification models. In short, challenges have always existed, and continued innovation is needed to create a better future.

M. Wu and J. Zhou—Contributed equally to this work.




Acknowledgements

The research work was supported by the open project of the State Key Laboratory of Millimeter Waves (Grant No. K202218).

Author information

Authors and Affiliations

School of Physics and Information Engineering, Jiangsu Second Normal University, Nanjing, Jiangsu, 210016, People’s Republic of China

Meng Wu, Jin Zhou & Yibin Peng

Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, 215123, Jiangsu, China

Shuihua Wang

Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, 21589, Saudi Arabia

Yudong Zhang


Corresponding authors

Correspondence to Jin Zhou or Yudong Zhang.

Editor information

Editors and Affiliations

Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China

R. Su

Department of Informatics, University of Leicester, Leicester, UK

Yu-Dong Zhang

University of Manchester, Manchester, UK

Alejandro F. Frangi

Ethics declarations

Conflict of Interest

The authors declare that there is no conflict of interest regarding this paper.

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wu, M., Zhou, J., Peng, Y., Wang, S., Zhang, Y. (2024). Deep Learning for Image Classification: A Review. In: Su, R., Zhang, YD., Frangi, A.F. (eds) Proceedings of 2023 International Conference on Medical Imaging and Computer-Aided Diagnosis (MICAD 2023). MICAD 2023. Lecture Notes in Electrical Engineering, vol 1166. Springer, Singapore. https://doi.org/10.1007/978-981-97-1335-6_31


DOI: https://doi.org/10.1007/978-981-97-1335-6_31

Published: 06 March 2024

Publisher Name: Springer, Singapore

Print ISBN: 978-981-97-1334-9

Online ISBN: 978-981-97-1335-6

eBook Packages: Computer Science, Computer Science (R0)



Deep Learning Approaches for Image Classification

DOI: https://doi.org/10.1145/3573428.3573691. EITCE 2022: 2022 6th International Conference on Electronic Information Technology and Computer Engineering, Xiamen, China, October 2022

Deep learning models can achieve higher accuracy than traditional machine learning algorithms. They are widely used in many areas, especially image classification. In recent years, thanks to improvements in hardware and the discovery of new deep learning network structures, the accuracy and reliability of deep learning models for image classification have improved greatly. However, relative to the number and pace of recent studies, reviews of image classification with deep learning technology are lacking. This paper reviews recent research on deep-learning-based image classification, including the latest studies on improving deep learning performance. It also analyzes and discusses potential problems and challenges of deep learning technology as well as possible future improvements and research directions.

ACM Reference Format: Yanzheng Yu. 2022. Deep Learning Approaches for Image Classification. In 2022 6th International Conference on Electronic Information Technology and Computer Engineering (EITCE 2022), October 21-23, 2022, Xiamen, China. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3573428.3573691

1 INTRODUCTION

In recent years, deep learning technology for image classification has improved significantly. The AlexNet model introduced in [1] showed that deep learning methods can achieve very good classification performance. Image classification based on deep learning has also achieved high performance in autonomous driving, monitoring systems, pose detection, and other fields.

In early research, machine learning methods were used for image classification tasks. According to [2], the SVM method can be used for multi-label image classification. Yang et al. [3] used decision trees to classify hyperspectral images of plots under different tillage practices, achieving an accuracy of 0.89. These traditional machine learning methods first select and construct features from the samples manually; the learning process then fits these features to solve the classification task. However, such methods have two main problems. First, all the feature engineering is done by hand, which requires the operator to have sufficient background knowledge, and the cost of manual feature selection grows when facing big data samples. Second, manually selected features are usually shallow, statistics-based features; deeper features containing more valid information cannot be identified manually, which reduces the accuracy of machine learning methods. Deep learning methods solve these problems. A minimal sketch of the traditional pipeline appears below.
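To make the handcrafted-feature pipeline concrete, here is a minimal sketch using HOG descriptors and an SVM classifier. The feature choice, parameters, and toy data are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np
from skimage.feature import hog          # handcrafted feature extractor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data: 100 random grayscale 64x64 "images" with binary labels.
rng = np.random.default_rng(0)
images = rng.random((100, 64, 64))
labels = rng.integers(0, 2, size=100)

# Step 1: manual feature engineering -- extract a HOG descriptor per image.
features = np.array([
    hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for img in images
])

# Step 2: fit a classical classifier on the handcrafted features.
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Every design decision about the descriptor (orientations, cell size) must be made by hand here, which is exactly the manual feature-engineering burden the paper describes.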

Deep learning models are composed of multiple hidden layers, each with an activation function. Through these hidden layers, a deep learning model can automatically learn high-level abstract information from the samples. In this way, deep learning methods have a significant advantage in feature processing over manual processing. Therefore, in recent years there have been many papers exploring and improving deep learning, and image classification has many applications of deep learning.

However, compared with the number and progress of recent studies, reviews of deep learning methods for image classification tasks are lacking. As a result, it is difficult for beginners in deep learning to obtain the latest progress. In addition, a review is a phased summary of existing research and results; it is useful for avoiding repeated research and for pointing out directions for further research and development. Therefore, this review summarizes recent developments in deep learning methods for image classification tasks. First, some commonly used image classification datasets are introduced, along with how to judge whether a dataset is suitable for the deep learning process. Second, the review gives a detailed overview of recent advances in the image classification field based on deep learning models. Finally, some remaining challenges are listed, and possible future research directions and solutions in deep learning image classification are discussed.

2 BACKGROUND

This review focuses on recent research on image classification with deep learning models. Image classification is one application of machine learning technology: given an input image, a trained machine learning model predicts which class it belongs to. A good machine learning model has high accuracy, and some models' predictions are better than human predictions. There are many image classification applications, including medical image recognition, monitoring systems, autonomous driving, and other areas involving computer vision and classification. A large number of studies have shown that deep learning algorithms can achieve high performance on image classification tasks.

2.1 Dataset

Deep learning algorithms cannot be trained without datasets. By feeding a large number of samples from a dataset into the model during training, the model can learn and fit the dataset; the trained model can then predict new, unknown samples. Collecting and labeling data from scratch is costly for researchers, so it is worth being aware of recent public datasets.

2.1.1 PatchCamelyon. The PatchCamelyon dataset [4] is an open-source dataset of medical images. It contains more than 300,000 color images of 96×96 pixels taken from histopathologic scans of lymph node sections. This dataset can be used to train a model to predict whether a given lymph node section image contains metastatic tissue.

2.1.2 SI-Score. The SI-Score dataset [5] is a multi-class image dataset. In addition to objects from different categories, it contains different states of the same object and the same object against different backgrounds. For example, some samples are generated by fixing the background and then scaling, rotating, or flipping one object, and some objects are placed in different backgrounds. Because of these properties, SI-Score can be used not only to train deep learning models but also to test the robustness of trained models.

2.1.3 Quick Draw Dataset. The Quick Draw Dataset [6] contains 50 million bitmaps drawn by players across 345 categories, each 28×28 pixels. This dataset can be used to train models to identify drawings.
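For illustration, both PatchCamelyon and Quick Draw are published in the TensorFlow Datasets catalog (under the names patch_camelyon and quickdraw_bitmap, to the best of our knowledge); a minimal loading sketch, not taken from the paper:

```python
import tensorflow_datasets as tfds

# PatchCamelyon: 96x96x3 histopathology patches with a binary label.
pcam_train, pcam_info = tfds.load(
    "patch_camelyon", split="train", as_supervised=True, with_info=True
)
print(pcam_info.features)

# Quick Draw bitmaps: 28x28 grayscale sketches, 345 categories.
quickdraw = tfds.load("quickdraw_bitmap", split="train")

for image, label in pcam_train.take(1):
    print(image.shape, int(label))  # (96, 96, 3) and 0/1
```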

2.2 Evaluation index

2.2.1 Accuracy, Precision, and Recall. The confusion matrix displays the relationship between samples' truth values and predicted values; see Table 1.

Table 1. Confusion matrix.

                         Positive prediction      Negative prediction
Positive truth value     True positive (TP)       False negative (FN)
Negative truth value     False positive (FP)      True negative (TN)

According to Table 1, different model evaluation indexes can be defined as follows:

1. Accuracy:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

From the accuracy formula, it can be seen that accuracy evaluates the global correctness of the model. However, this evaluation is limited: it cannot characterize the model's behavior on only the positive or only the negative samples.

2. Precision:

$Precision = \frac{TP}{TP + FP}$

Precision gives the proportion of truly positive samples among all samples the model predicts as positive.

3. Recall:

$Recall = \frac{TP}{TP + FN}$

Recall gives the proportion of samples correctly predicted as positive among all truly positive samples.
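These three indexes follow directly from the confusion-matrix counts; a small self-contained sketch with purely illustrative values:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # correctness of positive predictions
    recall = tp / (tp + fn)      # coverage of the truly positive samples
    return accuracy, precision, recall

# Example: 80 TP, 10 FP, 20 FN, 90 TN (illustrative numbers only).
acc, prec, rec = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```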

2.2.2 ROC Curve and AUC. In binary classification problems, the model has a threshold value: predicted scores greater than the threshold are classified as positive, and scores below it as negative. According to the confusion matrix, the true positive rate (TPR) and the false positive rate (FPR) are defined as

$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN}$

Taking FPR as the x-axis and TPR as the y-axis defines the ROC space. The ROC curve of a model is drawn by plotting the (FPR, TPR) coordinates under each threshold value. Figure 1 displays an example ROC curve.

Figure 1. An example ROC curve.

From the ROC curve, the relationship between the TPR and FPR of the model at different thresholds can be observed intuitively.

A high-performance model should keep a low FPR and a high TPR. The AUC value is computed as the area under the ROC curve. A high AUC indicates that good classification performance can be obtained by setting an appropriate threshold for the model; an AUC equal to 1 means the model is a perfect classifier.
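As a brief illustration (our own snippet, not from the paper), the ROC curve and AUC can be computed with scikit-learn, which sweeps the threshold internally:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy ground-truth labels and predicted scores from a binary classifier.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

# roc_curve returns one (FPR, TPR) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", auc(fpr, tpr))  # area under the ROC curve
```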

3 RECENT RESEARCH

3.1 Image classification in the self-supervised field

The research in [7] studied the effectiveness of pre-training based on self-supervised learning in medical image classification. Two experiments were designed: one classifies dermatology conditions from digital images, and the other performs multi-label classification of chest X-ray images.

The research introduces a novel method, called MICLe, as one of its self-supervised components. In this method, the input can be images of one patient's pathology taken from different angles of view. The whole self-supervised process is as follows: first, perform self-supervised learning on unlabeled images; second, perform an additional self-supervised learning step with unlabeled medical images as input, using the MICLe method whenever one medical condition has multiple images; finally, fine-tune with labeled medical images.

The specific MICLe procedure is shown in Figure 2. If only one image exists for a medical condition, standard data augmentation is used to obtain two views of that image; if multiple images exist for the medical condition, an image pair can be created directly.

Figure 2. The MICLe pairing procedure.
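A minimal sketch of this pairing logic as just described; augment is a hypothetical stand-in for the paper's standard augmentation pipeline, and the sketch is our reading of the procedure rather than the authors' code:

```python
import random

def make_contrastive_pair(condition_images, augment):
    """Build a positive pair for one medical condition (MICLe-style pairing).

    condition_images: images belonging to the same patient/condition.
    augment: callable returning one randomly augmented view of an image.
    """
    if len(condition_images) == 1:
        # One image only: two augmented views of the same image.
        img = condition_images[0]
        return augment(img), augment(img)
    # Multiple images of the same condition: pair two distinct images directly.
    img_a, img_b = random.sample(condition_images, 2)
    return img_a, img_b
```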

The conclusions of the research are as follows:

1. Pre-training a self-supervised model on unlabeled medical images performs better than pre-training only on ImageNet.

2. The novel MICLe method improves the performance of the self-supervised model.

3. In more detail, compared with pre-training only on the ImageNet dataset, the top-1 accuracy on dermatology classification improves by 6.7%, and the mean AUC on chest X-ray classification improves by 1%.

4. Big self-supervised models have higher robustness and generalization.

3.2 Disease diagnosis with deep CNNs for image classification

This paper [8] focuses on using CNN models on a chest X-ray dataset to classify pneumonia. It lists methods previously used in medical image classification and finds that DNNs, especially CNNs, can make decisions with human-level accuracy.

The authors note that a transfer learning model, InceptionV3, performs well on a small X-ray dataset of pneumonia images. Another study's experiments show that a structure-capsule network also works well on small image datasets. Given the limitations of the existing research on InceptionV3 and the structure-capsule network, this paper designed new experiments: (1) compare the performance of three kinds of ML models, namely an SVM classifier with ORB features, the VGG16 and InceptionV3 models used for transfer learning, and the structure-capsule network; and (2) study how methods for avoiding over-fitting influence learning on the small X-ray dataset, namely dataset augmentation, adjusting the network complexity, and fine-tuning the convolutional layers.

The VGG16 model is a deep network with 16 weight layers [9]. The structure-capsule network contains multiple special capsules, each made of a group of neurons [10]. The InceptionV3 architecture is shown in Table 2 [11]; a transfer-learning sketch follows the table.

Table 2. InceptionV3 architecture [11].

Type                     Filter size/stride (or remarks)
conv layer               3×3/2
conv layer               3×3/1
conv padded layer        3×3/1
pool layer               3×3/2
conv layer               3×3/1
conv layer               3×3/2
conv layer               3×3/1
3 x Inception modules    see figure (a)
5 x Inception modules    see figure (b)
2 x Inception modules    see figure (c)
pool layer               8×8
linear                   logits
softmax                  classifier
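As a hedged sketch of the transfer-learning setup being compared (our own minimal PyTorch reconstruction, not the authors' code), one can load an ImageNet-pretrained InceptionV3, freeze its convolutional layers, and retrain only the classifier heads for the two pneumonia classes:

```python
import torch.nn as nn
from torchvision.models import inception_v3, Inception_V3_Weights

# ImageNet-pretrained InceptionV3 (expects 299x299 RGB input).
model = inception_v3(weights=Inception_V3_Weights.IMAGENET1K_V1)

# Freeze all pretrained parameters (pure feature extraction).
for p in model.parameters():
    p.requires_grad = False

# Replace the main and auxiliary classifier heads for 2 classes
# (pneumonia vs. normal); only these new layers receive gradients.
model.fc = nn.Linear(model.fc.in_features, 2)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 2)

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```

Fine-tuning the convolutional layers, as studied in the paper, would amount to unfreezing some of the deeper blocks instead of keeping them all frozen.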

3.3 Multi-scale relational network to improve image classification

This paper [12] focuses on meta-learning, the technique of teaching a machine to learn to learn. The authors studied how to learn quickly from a small image dataset using meta-learning. The research provides a new, model-independent meta-learning algorithm, on which a multi-scale relational network is built. In this network, following the idea of Meta-SGD, the learning process combines the learning rate and the model parameters. In addition, the MAML algorithm is used to optimize the model parameters. During meta-validation and meta-testing, inner gradient iteration is not used.

The paper used MAML and Meta-SGD experiments for comparison. In the experiments, the model structure is four CNN modules followed by a fully connected layer. Each CNN module contains 64 filters, a batch normalization layer, a rectified linear unit, and a 2×2 max pooling layer; the fully connected layer has 64 nodes, and cross-entropy is used as the loss function. A sketch of this backbone appears below.
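A minimal PyTorch sketch of this backbone as described; the 3×3 filter size is our assumption, since the original text does not state it clearly:

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch=64):
    """One CNN module: 64 filters + batch norm + ReLU + 2x2 max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # 3x3 assumed
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class FewShotBackbone(nn.Module):
    """Four CNN modules followed by a 64-node fully connected layer."""
    def __init__(self, in_ch=1, num_classes=5, image_size=28):
        super().__init__()
        self.features = nn.Sequential(
            conv_module(in_ch), conv_module(64), conv_module(64), conv_module(64)
        )
        flat = 64 * (image_size // 16) ** 2   # spatial size halves four times
        self.fc = nn.Sequential(nn.Linear(flat, 64), nn.Linear(64, num_classes))

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

loss_fn = nn.CrossEntropyLoss()                 # cross-entropy, as in the paper
logits = FewShotBackbone()(torch.randn(4, 1, 28, 28))
print(logits.shape)                             # torch.Size([4, 5])
```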

According to the experiments, the multi-scale relational network gives the model better generalizability and improves accuracy on the benchmark sets. Compared with MAML, this network avoids the fine-tuning work.

The paper also points out some limitations of the model. First, the network performs well on the smaller Omniglot dataset but is not ideal on the bigger miniImageNet dataset. Second, the network drops partial gradient information, which may lose some information.

3.4 CNN tricks for image classification

This paper [13] studies a collection of refinements to CNNs for image classification that improve model accuracy without increasing computational complexity. These refinements are summarized from previous papers by other authors and concern model structure, data pre-processing, the loss function, and the learning rate.

The authors used ablation studies to measure the influence of these refinements and showed that each one can individually improve learning performance.

They also showed that applying all the refinements together significantly improves the performance of CNN models.

The paper further shows that these refinements carry over to other networks and datasets: according to the experiments, they improve transfer learning for object detection and semantic segmentation. Two representative refinements are sketched below.
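Two of the refinements covered in [13], label smoothing and cosine learning-rate decay, fit in a few lines of PyTorch; the model and schedule values below are hypothetical placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)   # stand-in for a CNN classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Refinement 1: label smoothing softens the one-hot targets in cross-entropy.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Refinement 2: decay the learning rate along a cosine curve over training.
epochs = 120
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
    scheduler.step()         # one learning-rate update per epoch
```

Neither change adds inference-time computation, which is why the paper counts them as accuracy improvements at no extra computational cost.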

4 CHALLENGES

Although image classification based on deep learning models has achieved good performance, some challenges remain. These may be limitations waiting to be resolved or points awaiting further experimental verification.

The authors of [7] pointed out that although self-supervised learning has been shown to achieve highly accurate results in medical image classification, its limitations in this area are still unknown. It is therefore necessary to study those limitations by running self-supervised learning on huge unlabeled datasets. Another research direction is how to transfer from one image type and task to another.

In these studies, deep learning models are all trained on given datasets. In the real world, however, the input data distribution may differ greatly from the training distribution; there may even be data not contained in the training dataset at all, with completely different feature distributions. For example, in the medical image classification research of [7], the models are trained on images of medical conditions that have occurred and been recorded; when a new medical condition appears, the trained models may fail to recognize it and give a wrong prediction, which can have serious consequences. Improving the robustness and generalization of deep learning models on data with different feature distributions is an important challenge in deep learning research.

5 CONCLUSION

Deep learning performs well on image classification tasks. However, as deep learning model structures become more complex and deeper, the amount of data required for training increases significantly. Some of the research covered in this review addresses part of the big-data requirement of deep learning models, but shortcomings remain. In the future, research could focus on the big-data requirements of deep learning, and the challenges listed in this review are also possible future research directions.

REFERENCES

[1] Krizhevsky, A., Sutskever, I., & Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
[2] Li, X., Wang, L., & Sung, E. 2004. Multilabel SVM active learning for image classification. In 2004 International Conference on Image Processing (ICIP'04), Vol. 4, pp. 2207-2210. IEEE.
[3] Yang, C. C., Prasher, S. O., Enright, P., Madramootoo, C., Burgess, M., Goel, P. K., & Callum, I. 2003. Application of decision tree technology for image classification using remote sensing data. Agricultural Systems, 76(3), 1101-1117.
[4] Veeling, B. S., Linmans, J., Winkens, J., Cohen, T., & Welling, M. 2018. Rotation equivariant CNNs for digital pathology. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 210-218. Springer, Cham.
[5] Djolonga, J., Yung, J., Tschannen, M., Romijnders, R., Beyer, L., Kolesnikov, A., ... & Lucic, M. 2021. On robustness and transferability of convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16458-16468.
[6] Ha, D., & Eck, D. 2017. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477.
[7] Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., ... & Norouzi, M. 2021. Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3478-3488.
[8] Yadav, S. S., & Jadhav, S. M. 2019. Deep convolutional neural network based medical image classification for disease diagnosis. Journal of Big Data, 6(1), 1-18.
[9] Simonyan, K., & Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[10] Sabour, S., Frosst, N., & Hinton, G. E. 2017. Dynamic routing between capsules. Advances in Neural Information Processing Systems, 30.
[11] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826.
[12] Zheng, W., Liu, X., & Yin, L. 2021. Research on image classification method based on improved multi-scale relational network. PeerJ Computer Science, 7, e613.
[13] He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., & Li, M. 2019. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 558-567.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected] .

EITCE 2022, October 21–23, 2022, Xiamen, China

© 2022 Association for Computing Machinery. ACM ISBN 978-1-4503-9714-8/22/10…$15.00. DOI: https://doi.org/10.1145/3573428.3573691

Image Classification (Papers with Code task page)

4106 papers with code • 152 benchmarks • 251 datasets

Image Classification is a fundamental task in vision recognition that aims to understand and categorize an image as a whole under a specific label. Unlike object detection, which involves classification and localization of multiple objects within an image, image classification typically pertains to single-object images. When the classification becomes highly detailed or reaches instance level, it is often referred to as image retrieval, which also involves finding similar images in a large database.

Source: Metamorphic Testing for Object Detection Systems





Open Access | Published: 03 September 2024

A multi-task deep learning approach for real-time view classification and quality assessment of echocardiographic images

Xinyu Li, Hongmei Zhang, Jing Yue, Lixue Yin, Wenhua Li, Geqi Ding, Bo Peng & Shenghua Xie

Scientific Reports, volume 14, Article number: 20484 (2024)


Subjects: Biomedical engineering, Image processing, Medical imaging

High-quality standard views in two-dimensional echocardiography are essential for accurate cardiovascular disease diagnosis and treatment decisions. However, the quality of echocardiographic images is highly dependent on the practitioner’s experience. Ensuring timely quality control of echocardiographic images in the clinical setting remains a significant challenge. In this study, we aimed to propose new quality assessment criteria and develop a multi-task deep learning model for real-time multi-view classification and image quality assessment (six standard views and “others”). A total of 170,311 echocardiographic images collected between 2015 and 2022 were utilized to develop and evaluate the model. On the test set, the model achieved an overall classification accuracy of 97.8% (95%CI 97.7–98.0) and a mean absolute error of 6.54 (95%CI 6.43–6.66). A single-frame inference time of 2.8 ms was achieved, meeting real-time requirements. We also analyzed pre-stored images from three distinct groups of echocardiographers (junior, senior, and expert) to evaluate the clinical feasibility of the model. Our multi-task model can provide objective, reproducible, and clinically significant view quality assessment results for echocardiographic images, potentially optimizing the clinical image acquisition process and improving AI-assisted diagnosis accuracy.


Introduction

Two-dimensional transthoracic echocardiography is widely used as a non-invasive, radiation-free, low-cost, and real-time imaging modality for assessing cardiac function and diagnosing cardiovascular disease [1]. Cardiac images acquired at specific probe positions and angles are referred to as standard views, which together provide comprehensive information on cardiac structure and function [2]. High-quality standard views are the basis for reliable cardiac parameter measurements and accurate diagnosis [3, 4]. However, compared with other imaging modalities, the acquisition process for clinical echocardiographic images is less automated, with echocardiographers manually adjusting probe positions and parameters, subjectively recognizing individual standard views, and selecting high-quality image frames [5]. The entire process is cumbersome, time-consuming, highly dependent on the echocardiographer's experience and maneuvers, and prone to inter- and intra-observer variability [6, 7, 8]. When the acquired views are of poor quality or key views are missing, interpretation by human experts or artificial intelligence (AI) models is compromised. Current automated diagnostic frameworks for cardiac diseases do not yet incorporate image quality control (QC) into the analysis process, necessitating manual pre-filtering of low-quality images and thereby limiting clinical applicability [9]. Ideally, real-time QC should be implemented during patient image acquisition to ensure that the optimal images are captured within the same examination. Therefore, performing automatic view recognition and quality assessment during image acquisition or before downstream image analysis tasks is crucial: the former directly generates a high-quality image base, while the latter screens the optimal images for subsequent analysis.

Recently, deep learning has been widely used in echocardiographic image analysis, enabling automated clinical workflows for view classification, cardiac structure extraction, cardiac function quantification, and cardiac disease diagnosis [10, 11, 12, 13]. View classification is the first step in echocardiographic analysis. Previous studies have proposed view classification models based on convolutional neural networks (CNNs), such as VGGNet and ResNet, achieving good recognition performance [14, 15, 16, 17]. Ongoing research on image quality assessment mainly targets natural images and focuses on the various distortions introduced during acquisition, compression, storage, and transmission [18, 19]. Noise and artifacts are common in ultrasound images due to the coherent interference of scattered waves [20]. However, unlike natural images, noisy ultrasound images are not always of low quality; their quality assessment must also consider clinical practice requirements, emphasizing the visibility and integrity of specific anatomical structures [21]. Several studies have implemented quality assessment of echocardiographic images, broadly in three forms: categorical confidence, quality-level classification, and quality-score regression. Huang et al. [22] and Zhang et al. [23] used the classification confidence of standard view recognition as the image quality score. The view classification confidence represents the model's confidence in its predicted results but does not directly represent the image quality level from a clinical practice perspective. Zamzmi et al. [24] proposed a MobileNetV2-s-based encoder-decoder network to recognize four standard views and classify them into two quality levels (good or poor). However, providing continuous numerical score feedback to the operator during image acquisition is more helpful than providing discrete quality-level feedback. Abdi et al. [25] categorized the quality of end-systolic apical 4-chamber (A4C) view frames into six levels (0 to 5) based on the visibility and clarity of anatomical structures and proposed a nine-layer CNN for score regression. Some studies were conducted on echocardiography videos. Luong et al. [26] set four quality score levels (0.25, 0.5, 0.75, and 1.0) based on the visibility of anatomical structures and combined DenseNet and LSTM networks to simultaneously achieve view recognition and quality assessment for nine echocardiographic videos. Labs et al. [27] combined four convolutional layers with an LSTM network to assign scores from zero to ten to four quality attributes in A4C and PLAX videos, respectively. These studies provide effective standard-view quality assessment methods; however, several limitations remain. First, current image quality assessment criteria are limited in scope and inadequate for clinically meaningful assessment of complex and diverse standard views. Second, most studies focus on a single task, with little research on a comprehensive multi-view classification and quality assessment pipeline. Third, traditional CNN architectures progressively abstract image information as network layers deepen, potentially losing the spatial details of lower layers [28].

In this study, we proposed four image quality attributes based on clinical practice needs and established evaluation criteria for six standard view categories accordingly. We developed a multi-task model that integrates view classification and image quality assessment into a unified framework. By sharing feature representations, multi-task learning enables the simultaneous learning of multiple related tasks in a single training step, effectively facilitating the exchange of information between tasks and thereby improving the overall performance and efficiency of the model [29, 30]. Furthermore, we introduced the Feature Pyramid Network (FPN) [31] into echocardiographic image quality assessment for the first time to achieve the fusion and utilization of multi-scale features.

An overview of the study is provided in Fig. 1. A multi-task deep learning model was trained on a dataset consisting of 170,311 echocardiographic images to automatically generate view categories and quality scores for clinical quality control workflows. This study conformed to the principles outlined in the Declaration of Helsinki and was approved by the Ethics Board of our institution (No. 2023-407).

Figure 1. Overview of the study design. (a) Seven types of echocardiographic standard views were collected, including apical 4-chamber (A4C), parasternal view of the pulmonary artery (PSPA), parasternal long axis (PLAX), parasternal short axis at the mitral valve level (PSAX-MV), parasternal short axis at the papillary muscle level (PSAX-PM), parasternal short axis at the apical level (PSAX-AP), and other views (Others). (b) Four quality attributes were summarized, including overall contour, key anatomical structural details, standard view display, and image display parameter adjustments. (c) Model development workflow for data collection, data labeling, data preprocessing, and model training. (d) Two clinical application workflows: the left side shows real-time quality control during image acquisition, and the right side shows pre-stored image screening prior to AI-assisted diagnosis. Artwork attribution in (c) and (d): www.flaticon.com.

This is a retrospective study. A large number of echocardiographic studies were randomly extracted from the picture archiving and communication system (PACS) of the Sichuan Provincial People's Hospital between 2015 and 2022 to establish the experimental dataset, with all subjects aged 18 and above. Images showing severe cardiac malformations that prevented recognition of anatomical structures were excluded. The dataset consists of 170,311 echocardiographic images and includes six standard views commonly used in clinical practice: the A4C view, parasternal view of the pulmonary artery (PSPA), parasternal long axis (PLAX), parasternal short axis at the mitral valve level (PSAX-MV), parasternal short axis at the papillary muscle level (PSAX-PM), and parasternal short axis at the apical level (PSAX-AP). All other views are classified as "others". For standard views with unevenly distributed quality levels, we performed undersampling to balance the data distribution. All images were acquired using ultrasound machines from different manufacturers, such as Philips, GE, Siemens, and Mindray. The dataset was randomly divided into training (70%), validation (10%), and test (20%) sets through stratified sampling (Table 1). The distribution of quality scores for the three subsets can be found in Supplementary Figure S1 online.

Quality scoring method

We established percentage scoring criteria for different standard views based on four attributes: overall contour, key anatomical structural details, standard view display (see Supplementary Fig. S2 online for an example), and image display parameter adjustments. Each attribute contributed to the score in a ratio of 3:4:2:1. Table 2 presents the scoring criteria for the PLAX view. Two accredited echocardiographers with at least five years of experience individually annotated all images in the dataset. The average of their annotations was used as the final expert score label. A third experienced cardiology expert, with over ten years of experience, conducted a review assessment of images with score differences of > 10. The “others” view was set to zero points for training purposes.

Model development

The model architecture is shown in Fig. 2 and mainly consists of a backbone network, a neck network, and two branch modules for view classification and quality assessment. The backbone network learns and extracts the multi-scale image features. We choose the output feature maps $\{S_2, S_3, S_4, S_5\}$ (with output sizes of 1/4, 1/8, 1/16, and 1/32 of the original resolution, respectively) from the last four stages of the backbone network as the input to the neck network. To obtain the best backbone network, we compared six different deep CNN architectures, namely MobileNetV3 [32], DenseNet121 [33], VGG16 [34], EfficientNet [35], ResNet50 [36], and ConvNeXt [37], and selected VGG16. A sketch of such a multi-stage feature extractor appears below.
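A hedged sketch of this multi-stage feature extraction (our own reconstruction; the stage boundaries follow torchvision's VGG16 definition, not code released by the authors):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class VGG16Backbone(nn.Module):
    """Returns feature maps at 1/4, 1/8, 1/16, and 1/32 resolution (S2..S5)."""
    def __init__(self):
        super().__init__()
        feats = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        # Split at the max-pool boundaries of VGG16's five conv stages.
        self.stage1 = feats[:5]     # -> 1/2
        self.stage2 = feats[5:10]   # -> 1/4  (S2)
        self.stage3 = feats[10:17]  # -> 1/8  (S3)
        self.stage4 = feats[17:24]  # -> 1/16 (S4)
        self.stage5 = feats[24:31]  # -> 1/32 (S5)

    def forward(self, x):
        x = self.stage1(x)
        s2 = self.stage2(x)
        s3 = self.stage3(s2)
        s4 = self.stage4(s3)
        s5 = self.stage5(s4)
        return s2, s3, s4, s5

s2, s3, s4, s5 = VGG16Backbone()(torch.randn(1, 3, 224, 224))
print(s2.shape, s5.shape)  # (1, 128, 56, 56) and (1, 512, 7, 7)
```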

Figure 2. Proposed multi-task model architecture. The model consists of a backbone network, a neck network, a view classification branch, and a quality assessment branch. A single-frame image is input into the backbone network to extract features. Subsequently, the neck network enhances and fuses these multi-scale features. The highest-level feature from the neck network is fed into the view classification branch to get a view class, while the fused multi-scale feature is input into the quality assessment branch to generate a quality score.

The neck network serves as an intermediate feature layer that further processes and fuses the features extracted by the backbone network for the two subsequent tasks. The highest-layer feature, $S_5$, is a more discriminative high-level semantic feature that reflects the network's understanding of the overall context of the image and is suitable for classification tasks. Lv et al. [38] proposed that applying the self-attention mechanism to high-level features with richer semantic concepts can capture the connections between conceptual entities in an image. Therefore, to further enhance the expressiveness of the features, we input $S_5$ into a Vision Transformer Block (VTB) [39] that unites a multi-head attention layer and a feedforward layer to facilitate intra-scale feature interaction. The resulting feature map, denoted $S_5'$, is applied to view classification. Subsequently, the FPN fuses the features of the four scales $\{S_5', S_4, S_3, S_2\}$ layer by layer from the top down for cross-scale feature interaction. We denote the set of feature maps output by the FPN as $\{P_5, P_4, P_3, P_2\}$, each of which has strong semantic information. Next, we fuse all scale feature maps using an Adaptive Feature Fusion Block (AFFB) to better model image quality perception. As shown in Fig. 3, the AFFB module first upsamples the feature maps at different scales to the size of $P_2$ and then concatenates them. Subsequently, channel attention is calculated using a Squeeze-and-Excitation Block [40] to adaptively adjust the importance of each channel feature. Finally, element-wise addition is performed on the features from each scale to generate the final fused feature map $F$, which is used to perform the quality assessment task.

Figure 3. Adaptive Feature Fusion Block. The block integrates channel attention mechanisms to adaptively fuse feature outputs from the feature pyramid network at four scales, generating final quality-aware features for quality assessment.

For the view classification branch (VCB), a linear classifier is used to generate the view classification results. Simultaneously, a projection head is utilized to map the feature dimensions to a specified size to compute the Supervised Contrastive Loss 41 . The goal of supervised contrastive learning is to pull features of the same class closer together in the feature vector space while pushing the features of different classes apart. By applying supervised contrastive loss, we aimed to overcome the problem of small inter-class differences in echocardiographic images. For the quality assessment branch (QAB), a global average pooling is performed on feature map F to generate a K-dimensional feature vector, which is then fed to a multilayer perceptron (MLP) to fit and generate the final image quality score.
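A minimal sketch of the two branches, assuming the neck outputs a pooled K-dimensional vector for the view branch and the fused map F for the quality branch; the dimensions (k_feat, proj_dim, hidden) are illustrative assumptions, not values from the paper:

```python
import torch.nn as nn

class Heads(nn.Module):
    """View classification branch (VCB) and quality assessment branch (QAB)."""
    def __init__(self, k_feat=256, num_views=7, proj_dim=128, hidden=64):
        super().__init__()
        self.classifier = nn.Linear(k_feat, num_views)   # VCB: linear classifier
        self.projection = nn.Sequential(                 # for supervised contrastive loss
            nn.Linear(k_feat, k_feat), nn.ReLU(inplace=True),
            nn.Linear(k_feat, proj_dim))
        self.pool = nn.AdaptiveAvgPool2d(1)              # QAB: global average pooling
        self.mlp = nn.Sequential(                        # QAB: score regressor
            nn.Linear(k_feat, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1))

    def forward(self, view_vec, fused_map):
        logits = self.classifier(view_vec)               # view class scores
        embed = self.projection(view_vec)                # contrastive embedding
        score = self.mlp(self.pool(fused_map).flatten(1))  # quality score
        return logits, embed, score
```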

Model training

We jointly trained the model using the cross-entropy loss (for the view classification task), supervised contrastive loss, and mean squared error loss (for the quality assessment task). Additionally, to address the imbalance problem in multi-task training, an auto-tuning strategy 42 was applied to learn the relative loss weights for each task.
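One common form of such an auto-tuning strategy uses learnable uncertainty-style weights per task; the cited work by Liebel and Körner uses a closely related formulation. The sketch below is a generic variant under that assumption, not necessarily the exact scheme used in the paper:

```python
import torch
import torch.nn as nn

class AutoWeightedLoss(nn.Module):
    """Learnable relative weighting of the three task losses."""
    def __init__(self, num_tasks=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):  # e.g. [ce_loss, supcon_loss, mse_loss]
        total = 0.0
        for loss, log_var in zip(losses, self.log_vars):
            precision = torch.exp(-log_var)
            total = total + precision * loss + log_var  # weight + regularizer
        return total
```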

The model was implemented in Python v3.8.12 using PyTorch v1.12.0 and was iteratively trained on two NVIDIA GeForce RTX 3090 GPUs, each with 24 GB of memory. During training, the initial learning rate was set to 1e-5, and the batch size was set to 128. The Adam optimizer with a weight decay of 1e-5 was used. The input images were resized to 224 × 224, and pixel values were normalized to the range 0 to 1. No data augmentation was performed, to prevent changes in image quality. An early stopping strategy was used to end training and reduce overfitting. The best model on the validation set was applied to the test set to evaluate model performance.
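The reported configuration maps onto a short PyTorch setup like the following; the patience value and the helper functions are assumptions for illustration:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-5)

best_val, patience, bad_epochs = float('inf'), 10, 0  # patience is assumed
for epoch in range(200):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper
    if val_loss < best_val:                           # keep the best model
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), 'best_model.pt')
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                    # early stopping
            break
```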

Evaluation metrics

Five performance evaluation metrics, accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPE), and F1 score (F1), were applied to validate the view classification performance. A confusion matrix was constructed to analyze the classification effect on different views. For quality assessment, Pearson’s linear correlation coefficient (PLCC), Spearman’s rank-order correlation coefficient (SROCC), mean absolute error (MAE), and root mean square error (RMSE) were used as evaluation indices. Indicators, such as the number of model parameters and inference time, were also considered to comprehensively evaluate the model performance. The Kruskal-Wallis test was employed to assess significant differences among the independent groups, with p < 0.05 considered statistically significant. For multiple comparisons, the Dunn-Bonferroni tests were applied. The Bootstrap analysis technique was utilized to calculate the 95% confidence intervals. Statistical analyses were conducted using SPSS v27.0 or Python v3.8.12.
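The quality-assessment indices can be computed with standard SciPy calls; a minimal sketch:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def quality_metrics(pred, label):
    """PLCC, SROCC, MAE and RMSE between predicted and expert scores."""
    pred, label = np.asarray(pred, float), np.asarray(label, float)
    plcc, _ = pearsonr(pred, label)
    srocc, _ = spearmanr(pred, label)
    mae = np.mean(np.abs(pred - label))
    rmse = np.sqrt(np.mean((pred - label) ** 2))
    return plcc, srocc, mae, rmse
```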

Evaluation of view classification task

The overall accuracy of the view classification task on the test set was 97.8% (95% CI, 97.7–98.0), with macro-average PRE, SEN, SPE, and F1 all exceeding 94.8% (Table 3 ). The confusion matrix is shown in Fig.  4 , which indicates that the model is prone to confusion when recognizing the three parasternal short axis views. The Grad-CAM maps in Fig.  5 reveal the image regions on which the model focuses when making classification decisions. For the A4C view, the model focuses on the mitral valve, tricuspid valve, ventricular septum, and atrial septum; for the PLAX view, on the aortic and mitral valves; and for the PSPA view, on the pulmonary artery wall and pulmonary valve. Additionally, for the three similar parasternal short axis views, the model effectively focuses on key anatomical structural details, including the fish-mouth-like mitral valve orifice (mitral valve level), the two sets of strongly echogenic papillary muscles (papillary muscle level), and the annular left ventricular wall structure in three planes (apical level). To further show the robustness of the model, Grad-CAM maps following data augmentation and under pathological conditions can be found in Supplementary Figures S3 and S4 online, respectively.

Figure 4. Confusion matrix between different views. The Y-axis of the confusion matrix shows the true labels and the X-axis shows the labels predicted by the model. Reading across the true-label rows, the two numbers (upper and lower) in each box indicate the sample size and its percentage.

Figure 5. Grad-CAM maps for different views. Different colors indicate the activation level in different image areas: red areas indicate high activation, while blue areas indicate low activation.

Evaluation of quality assessment task

The results of the quality assessment task on the test set are presented in Table 4 . The average PLCC and SROCC values were 0.898 (95%CI, 0.893–0.902) and 0.893 (95%CI, 0.888–0.897), respectively, indicating a strong correlation between the model-predicted and expert subjective scores. The average MAE and RMSE values were 6.54 (95%CI, 6.43–6.66) and 9.42 (95%CI, 9.24–9.60), respectively, which are within the acceptable range relative to the label range of 0–100. The scoring effect of the samples for each view is shown in Fig.  6 . It can be seen that there is a significant image quality improvement as the score increases.

Figure 6. Examples of the test results for the six standard views. The orange value in each panel represents the expert score, while the green value is the prediction score of the proposed method.

Effect of different backbones and additional modules on the proposed method

The performance of the proposed method when implemented on different backbone networks is presented in Table 5 . Compared with the other CNNs, VGG16 achieved the best trade-off between accuracy, number of parameters, and inference time. To analyze the effectiveness of each module in the proposed method, ablation experiments were conducted using the VGG16-based quality assessment model (single task) as the baseline. As shown in Table 6 , with the sequential addition of the neck network, the view classification task, and the supervised contrastive loss module, the performance of our model improved significantly. Furthermore, we compared the impact of including or excluding the "others" view on model performance.

Application of the proposed method for echocardiographic image quality analysis

To verify the feasibility of our proposed method for standard view quality assessment, we compared the archived image quality among three groups of echocardiographers with different levels of experience (3 junior, 3 senior, and 3 expert). The junior group had 1–2 years of experience, the senior group 4–5 years, and the expert group over 10 years. The distribution of manufacturers among the three groups was relatively balanced. We hypothesized that the image quality of the expert group would be higher than that of the other groups. Images collected by the nine echocardiographers between July and December 2023 were scored with the proposed model. The subjects were males aged 18–40 years without obvious cardiac structural or functional abnormalities. Based on the predictions, 6000 images were randomly selected from each echocardiographer, comprising 1000 images per view type. In total, 54,000 images from nine echocardiographers, covering the six standard views, were used for statistical analysis.

The Kruskal-Wallis test indicated that there was a significant difference in quality scores across the three groups of echocardiographers on each view (p < 0.001). The box plot further illustrates the distribution of quality scores for the three groups across six standard views (Fig.  7 ). After adjusting for multiple comparisons, the median quality score of the expert group was higher than that of both the senior and junior groups for each view (p-adj < 0.001). Except for the PSPA view, the median quality score of the senior group was higher than that of the junior group (p-adj < 0.001).
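A sketch of the group comparison and the bootstrap confidence interval, assuming junior_scores, senior_scores, and expert_scores are NumPy arrays of per-image quality scores for one view (the Dunn-Bonferroni post hoc step is omitted here):

```python
import numpy as np
from scipy.stats import kruskal

h_stat, p_value = kruskal(junior_scores, senior_scores, expert_scores)

def bootstrap_ci(values, stat=np.median, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap 95% CI for a statistic of one group's scores."""
    rng = np.random.default_rng(seed)
    boots = [stat(rng.choice(values, size=len(values), replace=True))
             for _ in range(n_boot)]
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```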

Figure 7. Distribution of quality scores by view and group. The box plot provides a visual representation of the distribution differences between the junior, senior, and expert groups, detailing statistical measures such as the minimum, first quartile, median, third quartile, and maximum.

In this study, we developed and validated a multi-task model that simultaneously performs view recognition and percentile image quality assessment for seven types of views. The rationale for integrating these two tasks into a single model is that they are interrelated, as the view type determines the focus of the quality assessment. The results of the ablation experiments show that it is feasible to train a generic model to extract features from different echocardiographic views for quality assessment. Furthermore, introducing view classification as an auxiliary task can provide additional support for feature learning, which improves the quality assessment performance. The model performs well on both tasks. For the view classification task, misclassifications were mainly focused on three parasternal short axis views, which were mainly attributed to the high similarity between their anatomical structures. However, guided by the supervised contrastive learning loss to learn distinctive feature representations, relatively accurate recognition results can be obtained. For the quality assessment task, the results show that the proposed model incorporating quality-aware features at different scales effectively learned the judgment criteria used by human experts in echocardiographic image quality assessment. Even for the PSAX-AP view, our proposed model achieved acceptable results with a small sample size.

Compared with previous methods, our study has several strengths. For quality assessment, we summarized four clinically significant quality attributes that ensure image quality scores are closely aligned with diagnostic value. We applied the model to analyze archived images from echocardiographers of varying experience levels and confirmed that those with higher levels of experience produced higher-quality images. This demonstrates the model's ability to effectively analyze image quality from a clinical diagnostic perspective. Regarding model design, prior methods focused solely on high-level single-scale features and overlooked low-level details. To address this issue, we added a hierarchical neck network to perform multi-scale perception modeling at a low computational cost, simulating the human visual system's hierarchical processing of visual stimuli at different scales. The results indicate that the quality assessment task benefited significantly from adaptively integrating high-level semantic information with low-level detailed information through the neck network. From a clinical application perspective, our study utilizes echocardiographic images, making it more pertinent to real-world practice than video-based studies. The proposed model operates on static images captured at each moment, avoiding the complexity and computational cost incurred by using a 2D + t model to process dynamic video data. Additionally, our dataset encompasses six standard views and classifies all other views into an "others" category, enabling the model to directly differentiate between the six target views and other views. The results show that introducing the "others" view increases data diversity and improves view categorization, with only a minor compromise on quality assessment performance. In contrast, the model proposed by Luong et al. 26 does not include an "others" category and relies solely on a confidence threshold to classify images: images with confidence below the threshold are assigned to the "others" category, while those above the threshold are assigned to one of the target views. However, since lower-quality target views may exhibit lower confidence, setting an effective threshold to distinguish them from "others" is quite difficult. Particularly in AI-assisted diagnosis, misclassifying other views as the required standard views can significantly affect diagnostic accuracy.

The proposed multi-task model effectively reduces the deployment pressure by merging the two tasks, achieving a good trade-off between accuracy and inference time. After deployment on a 3090 GPU, the model required no more than 2.8 ms to process a frame of 224 × 224 pixels. The model can be developed as part of a QC system that is applicable in several clinical scenarios. In echocardiography training, immediate feedback regarding the types of views and quality scores can help novice operators master the technical essentials of echocardiography more quickly and alleviate the shortage of faculty in underdeveloped areas 43 . During echocardiographic examinations, the system can assist operators in standardizing imaging and monitoring the progress of view acquisition, reducing measurement variability, and improving diagnostic quality 44 , 45 . Furthermore, the system can perform post-analysis on large-scale stored images or serve as a preprocessing step for AI-assisted diagnostic systems, selecting high-quality, interpretable cardiac ultrasound images from pre-stored data.

Our study has several limitations. First, the standard views covered are limited, and commonly used apical series views such as the apical 2-chamber and apical 3-chamber need to be further incorporated. Since our method does not impose specific constraints on view selection, it can theoretically accommodate additional standard echocardiographic views. Second, our method generates an overall quality score for echocardiographic images, and the individual scoring of different quality attributes should be explored in future research. Third, although the model was developed with a diverse dataset, its robustness and reliability still necessitate further validation in real-world clinical settings.

Data availability

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Code availability

The code for this work is available upon request.

Lang, R. M. et al. Recommendations for cardiac chamber quantification by echocardiography in adults: an update from the American society of echocardiography and the european association of cardiovascular imaging. Eur. Heart J. Cardiovasc. Imaging 16 (3), 233–271 (2015).


Mitchell, C. et al. Guidelines for performing a comprehensive transthoracic echocardiographic examination in adults: recommendations from the American Society of Echocardiography. J. Am. Soc. Echocardiogr. 32 (1), 1–64 (2019).

Nagata, Y. et al. Impact of image quality on reliability of the measurements of left ventricular systolic function and global longitudinal strain in 2D echocardiography. Echo Res. Pract. 5 (1), 28–39 (2018).


Foley, T. A. et al. Measuring left ventricular ejection fraction-techniques and potential pitfalls. Eur. Cardiol. 8 (2), 108–114 (2012).

Zhou, J., Du, M., Chang, S. & Chen, Z. Artificial intelligence in echocardiography: detection, functional evaluation, and disease diagnosis. Cardiovasc. Ultrasound 19 (1), 1–11 (2021).

Letnes, J. M. et al. Variability of echocardiographic measures of left ventricular diastolic function. The HUNT study. Echocardiography 38 (6), 901–908 (2021).

Liao, Z. et al. On modelling label uncertainty in deep neural networks: automatic estimation of intra-observer variability in 2d echocardiography quality assessment. IEEE Trans. Med. Imaging 39 (6), 1868–1883 (2019).

Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 580 (7802), 252–256 (2020).


Liu, B. et al. A deep learning framework assisted echocardiography with diagnosis, lesion localization, phenogrouping heterogeneous disease, and anomaly detection. Sci. Rep. 13 (1), 3 (2023).

Barry, T. et al. The Role of Artificial Intelligence in Echocardiography. J. Imaging 9 (2), 50 (2023).


Sehly, A. et al. Artificial Intelligence in Echocardiography: The Time is Now. Rev. Cardiovasc. Med. 23 (8), 256 (2022).

Kusunose, K. Steps to use artificial intelligence in echocardiography. J. Echocardiogr. 19 (1), 21–27 (2021).

Wang, W. et al. An Automated Heart Shunt Recognition Pipeline Using Deep Neural Networks. J. Imaging Informatics Med. 1–16 (2024).

Madani, A., Arnaout, R., Mofrad, M. & Arnaout, R. Fast and accurate view classification of echocardiograms using deep learning. NPJ Digit. Med. 1 (1), 6 (2018).

Santosh Kumar, B. P. et al. Fine-tuned convolutional neural network for different cardiac view classification. J. Supercomput. 78 (16), 18318–18335 (2022).

Belciug, S. Deep learning and Gaussian mixture modelling clustering mix a new approach for fetal morphology view plane differentiation. J. Biomed. Inform. 143 , 104402 (2023).

Wu, L. et al. Standard echocardiographic view recognition in diagnosis of congenital heart defects in children using deep learning based on knowledge distillation. Front. Pediatr. 9 , 770182 (2022).

Yang, S. et al. Maniqa: Multi-dimension attention network for no-reference image quality assessment. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 1191–1200 (2022).

Zhang, S. et al. CNN-based medical ultrasound image quality assessment. Complexity 2021 (1), 9938367 (2021).

Zhang, F., Yoo, Y. M., Koh, L. M. & Kim, Y. Nonlinear diffusion in Laplacian pyramid domain for ultrasonic speckle reduction. IEEE Trans. Med. Imaging 26 (2), 200–211 (2007).

Czajkowska, J., Juszczyk, J., Piejko, L. & Glenc-Ambroży, M. High-frequency ultrasound dataset for deep learning-based image quality assessment. Sensors 22 (4), 1478 (2022).


Huang, K. C. et al. Artificial intelligence aids cardiac image quality assessment for improving precision in strain measurements. Cardiovasc. Imaging 14 (2), 335–345 (2021).


Zhang, J. et al. Fully automated echocardiogram interpretation in clinical practice: feasibility and diagnostic accuracy. Circulation 138 (16), 1623–1635 (2018).

Zamzmi, G., Rajaraman, S., Hsu, L. Y., Sachdev, V. & Antani, S. Real-time echocardiography image analysis and quantification of cardiac indices. Med. Image. Anal. 80 , 102438 (2022).

Abdi, A. H. et al. Automatic quality assessment of echocardiograms using convolutional neural networks: feasibility on the apical four-chamber view. IEEE Trans. Med. Imaging 36 (6), 1221–1230 (2017).

Luong, C. et al. Automated estimation of echocardiogram image quality in hospitalized patients. Int. J. Cardiovasc. Imaging 37 , 229–239 (2021).

Labs, R. B., Vrettos, A., Loo, J. & Zolgharni, M. Automated assessment of transthoracic echocardiogram image quality using deep neural networks. Intell. Med. 3 (03), 191–199 (2023).

Ding, Y. et al. AP-CNN: Weakly supervised attention pyramid convolutional neural network for fine-grained visual classification. IEEE Trans. Image Process. 30 , 2826–2836 (2021).


Zhang, Y. & Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 5 (1), 30–43 (2018).


Xu, Z., Zhang, Q., Li, W., Li, M. & Yip, P. S. F. Individualized prediction of depressive disorder in the elderly: a multitask deep learning approach. Int. J. Med. Inform. 132 , 103973 (2019).

Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B. & Belongie, S. Feature pyramid networks for object detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2117–2125 (2017).

Howard, A. et al. Searching for mobilenetv3. Proc. IEEE/CVF Int. Conf. Comput. Vis. 1314–1324 (2019).

Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 4700–4708 (2017).

Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Int. Conf. Learn. Representations 1–14 (2015).

Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. Int. Conf. Mach. Learn. 6105–6114 (2019).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 770–778 (2016).

Liu, Z., Mao, H., Wu, C. Y., Feichtenhofer, C., Darrell, T. & Xie, S. A convnet for the 2020s. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 11976–11986 (2022).

Zhao, Y. et al. Detrs beat yolos on real-time object detection. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 16965–16974 (2024).

Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint at https://doi.org/10.48550/arXiv.2010.11929 (2021).

Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 7132–7141 (2018).

Khosla, P. et al. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 33 , 18661–18673 (2020).

Liebel, L. & Körner, M. Auxiliary tasks in multi-task learning. Preprint at https://doi.org/10.48550/arXiv.1805.06334 (2018).

Narang, A. et al. Utility of a deep-learning algorithm to guide novices to acquire echocardiograms for limited diagnostic use. JAMA Cardiol. 6 (6), 624–632 (2021).

Ferraz, S., Coimbra, M. & Pedrosa, J. Assisted probe guidance in cardiac ultrasound: A review. Front. Cardiovasc. Med. 10 , 1056055 (2023).

Zhang, Z. et al. Artificial intelligence-enhanced echocardiography for systolic function assessment. J. Clin. Med. 11 (10), 2893 (2022).


This work was supported by the Sichuan Science and Technology Project (grant no. 2023YFQ0006).

Author information

These authors contributed equally: Xinyu Li and Hongmei Zhang 

Authors and Affiliations

School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu, 610500, China

Xinyu Li, Jing Yue & Bo Peng

Ultrasound in Cardiac Electrophysiology and Biomechanics Key Laboratory of Sichuan Province, Sichuan Provincial People’s Hospital, University of Electronic Science and Technology of China, 32# W. Sec 2, 1st Ring Rd., Chengdu, 610072, China

Hongmei Zhang, Lixue Yin, Wenhua Li, Geqi Ding & Shenghua Xie

Department of Cardiovascular Ultrasound & Noninvasive Cardiology, Sichuan Provincial People’s Hospital, University of Electronic Science and Technology of China, 32# W. Sec 2, 1st Ring Rd., Chengdu, 610072, China


Contributions

X.L: Software, Formal analysis, Methodology, Writing - Original Draft. H.Z: Conceptualization, Data curation, Writing - Review & Editing. J.Y: Methodology, Writing - Review & Editing. L.Y, W.L, G.D and B.P: Data curation, Writing - Review & Editing. S.X: Conceptualization, Methodology, Writing - Original Draft, Supervision, Funding acquisition. All authors approved the manuscript.

Corresponding author

Correspondence to Shenghua Xie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

This study was approved by the Ethics Board of Sichuan Provincial People's Hospital (No. 2023–407). The Ethics Board of Sichuan Provincial People's Hospital also approved the waiver of informed consent for this study.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary figures.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Li, X., Zhang, H., Yue, J. et al. A multi-task deep learning approach for real-time view classification and quality assessment of echocardiographic images. Sci Rep 14 , 20484 (2024). https://doi.org/10.1038/s41598-024-71530-z


Received : 12 July 2024

Accepted : 28 August 2024

Published : 03 September 2024

DOI : https://doi.org/10.1038/s41598-024-71530-z






  • Open access
  • Published: 11 February 2019

Research on image classification model based on deep convolution neural network

  • Mingyuan Xin 1 &
  • Yong Wang 2  

EURASIP Journal on Image and Video Processing, volume 2019, Article number: 40 (2019)


Based on an analysis of the error backpropagation algorithm, we propose an innovative training criterion for deep neural networks: the max-margin minimum classification error (M3CE). At the same time, cross-entropy and M3CE are analyzed and combined to obtain better results. Finally, we tested the proposed M3CE-CEc on two standard deep learning databases, MNIST and CIFAR-10. The experimental results show that M3CE can enhance cross-entropy and is an effective supplement to the cross-entropy criterion. M3CE-CEc obtained good results on both databases.

1 Introduction

Traditional machine learning methods (such as multilayer perceptrons and support vector machines) mostly use shallow structures to deal with a limited number of samples and computing units. When the target objects have rich semantics, their performance and generalization ability on complex classification problems are clearly insufficient. The convolutional neural network (CNN) developed in recent years has been widely used in the field of image processing because it is good at image classification and recognition problems and has brought great improvements in the accuracy of many machine learning tasks. It has become a powerful and universal deep learning model.

Convolutional neural networks (CNNs) are multilayer neural networks and the most classical and common deep learning framework. A new reconstruction algorithm based on convolutional neural networks was proposed by Newman et al. [ 1 ], and its advantages in speed and performance were demonstrated. Wang et al. [ 2 ] discussed three method categories: the CNN model with pretraining, the CNN model with fine-tuning, and the hybrid method. The first two pass the image through the network once, while the last uses a patch-based feature extraction scheme. Their survey provides a milestone in modern instance retrieval, reviews a wide selection of previous work in different categories, and provides insights into the link between SIFT-based and CNN-based approaches. After analyzing and comparing the retrieval performance of the different categories on several datasets, they discuss new directions for general and specialized instance retrieval. CNNs have attracted great interest in machine learning and show excellent performance in hyperspectral image classification. Al-Saffar et al. [ 3 ] proposed a classification framework called region-based pluralistic CNN, which can encode semantic context-aware representations to obtain promising features. By combining a set of different discriminative appearance factors, the CNN-based representation provides the spatial-spectral contextual sensitivity that is essential for accurate pixel classification. The proposed method, which learns contextual interaction features from various region-based inputs, is expected to have more discriminative power. The combined representation containing rich spectral and spatial information is then fed to a fully connected network, and the label of each pixel vector is predicted by the Softmax layer. Experimental results on widely used hyperspectral image datasets show that the proposed method can outperform traditional deep-learning-based classifiers and other advanced classifiers. Context-based CNNs with deep structures and pixel-based multilayer perceptrons (MLPs) with shallow structures are recognized neural network algorithms representing, respectively, the most advanced deep learning methods and classical non-neural-network algorithms. Said et al. [ 4 ] integrated these two algorithms, which exhibit very different behaviors, in a concise and efficient manner, using a rule-based decision fusion method to classify very fine spatial resolution (VFSR) remote sensing images. The decision fusion rules, mainly based on the CNN classification confidence, reflect the generally complementary patterns of the two classifiers. The resulting ensemble classifier, MLP-CNN, therefore combines results obtained from the CNN, based on deep spatial feature representation, and from the MLP, based on spectral discrimination. At the same time, the CNN's limitations resulting from the use of convolution filters, such as uncertainty in object boundary segmentation and the loss of useful fine spatial resolution details, are compensated. The validity of the ensemble MLP-CNN classifier was tested in urban and rural areas using aerial photography and additional satellite sensor datasets. The MLP-CNN classifier achieves promising performance and consistently outperforms pixel-based MLP, spectral and texture-based MLP, and context-based CNN in classification accuracy. This research paves the way for effectively solving the complex problem of VFSR image classification.
Periodic inspection of nuclear power plant components is important to ensure safe operation. However, current practice is time-consuming, tedious, and subjective, involving human technicians examining videos to identify reactor cracks. Some vision-based crack detection methods have been developed for metal surfaces, but they generally perform poorly when used to analyze nuclear inspection videos. Detecting these cracks is a challenging task because of their small size and the presence of noise patterns on the component surfaces. Huang et al. [ 5 ] proposed a deep learning framework based on a convolutional neural network and a Naive Bayes data fusion scheme (called NB-CNN), which can analyze individual video frames for crack detection, together with a new data fusion scheme that aggregates the information extracted from each video frame to enhance the overall performance and robustness of the system. The CNN detects the fissures in each video frame, the proposed data fusion scheme maintains the temporal and spatial coherence of the cracks across the video, and the Naive Bayes decision effectively discards false positives. The proposed framework achieves a hit rate of 98.3% at 0.1 false positives per frame, which is significantly higher than previous state-of-the-art methods. The prediction of visual attention data from any type of media is valuable to content creators and can be used to drive coding algorithms effectively. With the current trend in the field of virtual reality (VR), the adaptation of known technologies to this new medium is beginning to gain momentum. Gupta and Bhavsar [ 6 ] proposed an extension to any convolutional neural network architecture that fine-tunes traditional 2D saliency prediction to omnidirectional images (ODIs) in an end-to-end manner, and showed that each step in their pipeline makes the generated saliency map more accurate with respect to ground-truth data. The convolutional neural network (CNN) is a deep machine learning method derived from the artificial neural network (ANN) that has achieved great success in the field of image recognition in recent years. The training algorithm of a neural network is based on the error backpropagation (BP) algorithm, which relies on gradient descent. However, as the number of neural network layers increases, the number of weight parameters increases sharply, which slows the convergence of the BP algorithm and makes the training time too long. The CNN training algorithm is a variant of the BP algorithm: by means of local connections and weight sharing, the network structure becomes more similar to a biological neural network, which not only keeps the deep structure of the network but also greatly reduces the number of network parameters, so that the model has good generalization ability and is easier to train. This advantage is more obvious when the network input is a multi-dimensional image, since the image can be used directly as the network input, avoiding the complex feature extraction and data reconstruction processes of traditional recognition algorithms. Therefore, convolutional neural networks can also be interpreted as multilayer perceptrons designed to recognize two-dimensional shapes, which are highly invariant to translation, scaling, tilting, and other forms of deformation [ 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 ].

With the rapid development of mobile Internet technology, more and more image information is stored on the Internet, and images have become another important carrier of network information after text. Against this background, it is very important to use computers to classify and recognize these images intelligently so that they can better serve human beings. In the initial stage of image classification and recognition, the technology mainly met auxiliary needs; for example, Baidu's star-face function can help users find the most similar celebrity, and OCR technology can extract text and information from images. For graph-based semi-supervised learning methods, it is very important to construct good graphs that capture the intrinsic data structure. Such methods are widely used in hyperspectral image (HSI) classification with a small number of labeled samples. Among existing graph construction methods, sparse representation (SR)-based methods show impressive performance in semi-supervised HSI classification tasks. However, most SR-based algorithms fail to consider the rich spatial information of HSI, which has been proved beneficial to classification tasks. Yan et al. [ 16 ] proposed a space and class structure regularized sparse representation (SCSSR) graph for semi-supervised HSI classification. Specifically, spatial information is incorporated into the SR model through graph Laplacian regularization, which assumes that spatial neighbors should have similar representation coefficients, so the obtained coefficient matrix can more accurately reflect the similarity between samples. They also incorporate the probabilistic class structure (the probabilistic relationship between each sample and each class) into the SR model to further improve the discriminability of the graph. Results on Hyperion and AVIRIS hyperspectral data show that their method is superior to state-of-the-art methods. The invariances discussed by Zhang et al. [ 17 ], such as specificity to uniform samples and rotation invariance, are very important for object detection and classification applications. Current research focuses on specific invariances of features, such as rotation invariance. They proposed a new multichannel convolutional neural network (mCNN) to extract invariant features for object classification. Multiple convolution channels sharing the same weights are used to reduce the feature variance of sample pairs with different rotations in the same class. As a result, uniform-object invariance and rotation invariance are handled simultaneously to improve the invariance of the features. More importantly, the proposed mCNN is particularly effective for small training samples. Experimental results on two benchmark datasets for handwriting recognition show that the proposed mCNN is very effective at extracting invariant features from a small number of training samples. With the development of the big data era, convolutional neural networks with more hidden layers have more complex network structures and stronger feature learning and feature expression abilities than traditional machine learning methods. Since the introduction of CNN models trained by deep learning algorithms, significant achievements have been made in many large-scale recognition tasks in the field of computer vision. Chaib et al. [ 18 ] first introduced the rise and development of deep learning and convolutional neural networks and summarized the basic model structure, convolutional feature extraction, and pooling operations of convolutional neural networks. They then reviewed the research status and development trend of deep-learning-based CNN models in image classification, including typical network structures, training methods, and performance. Finally, they briefly summarized and discussed some problems in current research and predicted new directions for future development. Computer-aided diagnostic technology has played an important role in medical diagnosis from its beginning to the present. In particular, image classification technology, from initial theoretical research to clinical diagnosis, has provided effective assistance for the diagnosis of various diseases. An image is the concrete picture formed in the human brain by objective things existing in the natural environment, and it is an important source of information for humans to acquire knowledge of external things. With the continuous development of computer technology, general object recognition technology for natural scenes is applied more and more in daily life, from simple bar code recognition to text recognition (such as handwritten character recognition and optical character recognition, OCR) to biometric recognition (such as fingerprint, voice, iris, face, gesture, and emotion recognition), with many successful applications. Image recognition, especially object category recognition in natural scenes, is a unique human skill. In a complex natural environment, people can identify a concrete object (such as a teacup or a swallow) or a specific category of objects (household goods, birds, etc.) at a glance. However, there are still many open questions about how human beings do this and how to apply the related technologies to computers so that they have human-like intelligence. Therefore, research on image recognition algorithms remains active in the fields of machine vision, machine learning, deep learning, and artificial intelligence [ 19 , 20 , 21 , 22 , 23 , 24 ].

Therefore, this paper applies the advantages of deep convolutional neural networks to image classification, tests the loss function constructed by M3CE on the two standard deep learning databases MNIST and CIFAR-10, and advances a new direction for image classification research.

2 Proposed method

Image classification is one of the hot research directions in the computer vision field and the basis of other image application fields. An image classification system is usually divided into three important parts: image preprocessing, image feature extraction, and the classifier.

2.1 The ZCA whitening process

In this process, we first use PCA to zero-center the data. Let x_j denote the j-th image vector [ 25 ]; the mean of the m samples is \( \mu =\frac{1}{m}\sum \limits_{j=1}^m{x}_j \) and is subtracted from every sample.

Next, the covariance matrix of the zero-centered data is calculated:

\( \Sigma =\frac{1}{m}\sum \limits_{j=1}^m\left({x}_j-\mu \right){\left({x}_j-\mu \right)}^T \)

where \( \Sigma \) represents the covariance matrix. \( \Sigma \) is decomposed by SVD [ 26 ] to obtain its eigenvalues and corresponding eigenvectors:

\( \Sigma =US{U}^T \)

where U is the eigenvector matrix of \( \Sigma \) and S is the diagonal eigenvalue matrix of \( \Sigma \). Based on this, x can be whitened by PCA:

\( {x}_{\mathrm{PCAwhiten}}={S}^{-1/2}{U}^Tx \)

So \( {x}_{\mathrm{ZCAwhiten}} \) can be expressed as

\( {x}_{\mathrm{ZCAwhiten}}=U{x}_{\mathrm{PCAwhiten}}=U{S}^{-1/2}{U}^Tx \)

For the dataset in this paper, because the training and test samples are not pre-separated [ 27 ], random splitting is used to avoid the subjective bias of manual partitioning.
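The whole ZCA pipeline above fits in a few lines of NumPy; a minimal sketch (the eps regularizer is a common numerical-stability assumption, not from the paper):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X (one flattened image per row)."""
    X = X - X.mean(axis=0)                         # zero-mean the data
    sigma = X.T @ X / X.shape[0]                   # covariance matrix
    U, S, _ = np.linalg.svd(sigma)                 # eigendecomposition via SVD
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # U S^{-1/2} U^T
    return X @ W
```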

2.2 Image feature extraction based on time-frequency composite weighting

Feature extraction is a concept in computer vision and image processing. It refers to using a computer to extract image information and determine whether each point of an image belongs to a given image feature. The purpose of feature extraction is to divide the points on the image into different subsets, which are often isolated points, continuous curves, or regions. There are many kinds of features used to describe an image, and they can be classified according to different criteria, such as point features, line features, and regional features according to their representation in the image data. According to the region size used for extraction, features can be divided into two categories: global features and local features [ 24 ]. The image features used by the feature extraction methods in this paper include color features and texture features, as well as corner features and edge features.

The time-frequency composite weighting algorithm for multi-frame blurred images processes the blurred image data simultaneously in the frequency domain and the time domain. Based on the weighted character of the algorithm and on target-image feature extraction in both the time and frequency domains, the technique extracts the target information from the image. The main steps of the time-frequency composite weighted feature extraction method are as follows:

Step 1: Construct a time-frequency composite weighted signal model for the multiple blurred images. Assuming the model takes the standard scaling form, it can be written as

\( {f}_S(t)=\sqrt{S}\,f(St) \)

where f ( t ) is the original signal and S  = ( c  −  v )/( c  +  v ) is called the image scale factor (referred to as the scale); it represents the signal scaling applied by the image time-frequency composite weighting algorithm, and \( \sqrt{S} \) is its normalization factor.

Step 2: Map the one-dimensional function to a two-dimensional function y ( a ,  b ) of the time scale a and the time shift b , and perform a time-frequency composite weighted transform on the continuous image using a square-integrable function ψ, as shown below:

\( y\left(a,b\right)=\frac{1}{\sqrt{\mid a\mid }}\int f(t){\psi}^{\ast}\left(\frac{t-b}{a}\right)dt \)

where the divisor \( 1/\sqrt{\mid a\mid } \) ensures the energy normalization of the unitary transformation, and ψ a , b is obtained from ψ ( t ) through the affine group U ( a ,  b ):

\( {\psi}_{a,b}(t)=\frac{1}{\sqrt{\mid a\mid }}\psi \left(\frac{t-b}{a}\right) \)

Step 3: Substitute the variables a  = 1/ s and b  =  τ into the expression for the original image f ( t ) and rewrite it to obtain the corresponding expression.

Step 4: Build the multi-frame blurred-image time-frequency composite weighted signal form, where rect( t ) = 1 for ∣ t  ∣  ≤ 1/2.

Step 5: The frequency-modulation law of the time-frequency composite weighted signal of the multi-frame blurred image is a hyperbolic function, with K  =  Tf max f min / B and t 0  =  f 0 T / B , where f 0 is the arithmetic center frequency and f min and f max are the minimum and maximum frequencies, respectively.

Step 6: Use the image transformation formula of the multi-frame blurred-image time-frequency composite weighted signal to apply the time-frequency composite weighting to the image, where \( {b}_a=\left(1-a\right)\left(\frac{1}{a{f}_{\max }}-\frac{T}{2}\right) \) and Ei(·) denotes the exponential integral.

The final output is the time-frequency composite weighted image signal W u u ( a ,  b ). Compared with traditional time-domain techniques, image feature extraction is therefore better realized by the time-frequency composite weighting algorithm.

2.3 Application of deep convolution neural network in image classification

After obtaining the feature vectors from the image, the image can be described as a vector of fixed length, and then a classifier is needed to classify the feature vectors.

In general, a common convolutional neural network consists, from input to output, of an input layer, convolutional layers, activation layers, pooling layers, fully connected layers, and a final output layer. The convolutional layers establish relationships between different computational neural nodes and transfer the input information layer by layer, and the repeated convolution-pooling structure decodes, deduces, converges, and maps the feature signals of the original data to a hidden-layer feature space [ 28 ]. The subsequent fully connected layers then perform classification and output according to the extracted features.

2.3.1 Convolution neural network

Convolution is an important operation in mathematical analysis. It is an operator that generates a third function from two functions f and g , representing the area of overlap between f and a flipped and translated copy of g . For discrete signals it is usually defined as

\( \left(f\ast g\right)(n)=\sum \limits_{\tau }f\left(\tau \right)g\left(n-\tau \right) \)

and its integral form is

\( \left(f\ast g\right)(t)={\int}_{-\infty }^{+\infty }f\left(\tau \right)g\left(t-\tau \right)d\tau \)

In image processing, a digital image can be regarded as a discrete function over a two-dimensional space, denoted f ( x , y ). Given a two-dimensional convolution kernel g ( x , y ), the output image z ( x , y ) can be represented as

\( z\left(x,y\right)=\left(f\ast g\right)\left(x,y\right)=\sum \limits_m\sum \limits_nf\left(m,n\right)g\left(x-m,y-n\right) \)

In this way, the convolution operation can be used to extract image features. Similarly, in deep learning applications, when the input is a color image containing the three RGB channels, the input is a high-dimensional array of size 3 × image width × image height; accordingly, the kernel (called the "convolution kernel" in a convolutional neural network) defined in the learning algorithm is also a high-dimensional array of learnable parameters. For a convolution kernel of size m  ×  n , the corresponding operation on a two-dimensional image can be expressed as

\( z\left(x,y\right)=\sum \limits_{i=1}^m\sum \limits_{j=1}^nf\left(x+i,y+j\right)g\left(i,j\right) \)

where f represents the input image and g the convolution kernel of size m  ×  n . In a computer, convolution is usually realized as a matrix product. Suppose the size of an image is M × M and the size of the convolution kernel is n  ×  n . In computation, the convolution kernel multiplies each n  ×  n image region, which is equivalent to extracting each n  ×  n region and expressing it as a column vector of length n  ×  n . With a stride of 1 and no padding, a total of ( M  −  n  + 1)  ∗  ( M  −  n  + 1) results are obtained; when each small image region is represented as a column vector of length n  ×  n , the original image can be represented by an [ n ∗ n ] × [( M  −  n  + 1) ∗ ( M  −  n  + 1)] matrix. Assuming the number of convolution kernels is K , the output of the above convolution operation is K ∗ ( M  −  n  + 1)  ∗  ( M  −  n  + 1), that is, the number of convolution kernels × the convolved image width × the convolved image height.
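The matrix-product (im2col) realization described above can be written directly in NumPy; a minimal sketch for a valid convolution with stride 1 (as is standard in CNNs, the kernel is applied without flipping):

```python
import numpy as np

def conv2d_as_matmul(image, kernels):
    """Convolve an M x M image with K kernels of size n x n via im2col."""
    M, (K, n, _) = image.shape[0], kernels.shape
    out = M - n + 1
    cols = np.empty((n * n, out * out))          # one n*n patch per column
    for i in range(out):
        for j in range(out):
            cols[:, i * out + j] = image[i:i + n, j:j + n].ravel()
    W = kernels.reshape(K, n * n)                # one kernel per row
    return (W @ cols).reshape(K, out, out)       # K x (M-n+1) x (M-n+1)

feature_maps = conv2d_as_matmul(np.random.rand(28, 28), np.random.rand(8, 5, 5))
```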

2.3.2 Loss function constructed by M3CE

In the process of neural network training, the loss function is the evaluation standard for the whole network model. It not only represents the current state of the network parameters but also provides the gradients used in gradient descent, so the loss function is an important part of deep learning training. In this section, we introduce the loss function proposed by M3CE. Finally, the combined loss function of M3CE and cross-entropy is obtained by gradient analysis.

According to the definition of MCE, we use the output of the Softmax function as the discriminant function. The misclassification measure is then redefined as

\( {d}_k(z)={P}_q-{P}_k,\kern1em q=\arg \underset{l\ne k}{\max }{P}_l \)

where k is the label of the sample and P q is the Softmax output for the most confusable class. If we use the logistic loss function \( {\ell}_k=\frac{1}{1+{e}^{-\xi {d}_k(z)}} \), we can find the gradient of the loss function with respect to z .

This gradient is used in the backpropagation algorithm to obtain the gradient of the entire network. It is worth noting that if z is severely misclassified, ℓ k will be infinitely close to 1, and ℓ k (1 − ℓ k ) will be close to 0. The gradient will then be close to 0, so that almost no gradient is propagated back to the previous layers, which hinders the training process [ 29 ].

The sigmoid function was used as the activation function in traditional neural networks, but the same problem occurs during training: when the activation value is high, the backpropagated gradient is very small, which is called saturation. In earlier shallow neural networks, its influence was not very large, but as the number of network layers increases, this situation affects the learning of the whole network. In particular, if a sigmoid function saturates at a higher layer, it affects all the lower-layer gradients before it. Therefore, in current deep neural networks, an unsaturated activation function, the rectified linear unit (ReLU), is used to replace the sigmoid function. When the input value is positive, the gradient of the rectified linear unit is 1, so the gradient of the upper layer can be propagated back to the lower layer without attenuation. The literature shows that rectified linear units can accelerate the training process and prevent gradient dispersion.

Just as a saturating activation function in the middle of the network is not conducive to training a deep network, a saturating function in the top-level loss also has a great influence on the deep neural network.

We call it the max-margin loss, where the margin is defined as \( {\epsilon}_k=-{d}_k(z)={P}_k-{P}_q \).

Since P k is a probability, that is, P k  ∈ [0, 1], we have d k  ∈ [−1, 1]. As a sample gradually changes from correctly classified to misclassified, d k increases from −1 to 1; compared with the original logistic loss function, even a severely misclassified sample still receives the largest loss value. Because 1 +  d k  ≥ 0, the loss \( \max \left(0,1+{d}_k\right) \) can be simplified to \( 1+{d}_k \).

When we need to give a larger loss value to wrongly classified samples, the above formula can be extended to

\( {L}_{MM}={\left(1+{d}_k\right)}^{\gamma } \)

where γ is a positive integer. If γ  = 2 is set, we obtain the squared max-margin loss function. If the function is to be applied to training deep neural networks, the gradient needs to be calculated according to the chain rule.

Here, three cases need to be discussed: (1) the dimension corresponding to the sample label, (2) the dimension corresponding to the confused class label, and (3) dimensions corresponding to neither the sample label nor the confused class label. The gradient is derived for each case accordingly.
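Based on the formulas reconstructed above, the max-margin loss can be sketched in PyTorch as follows. It follows the description d_k = P_q − P_k with loss (1 + d_k)^γ; it is an illustrative sketch, not released code:

```python
import torch

def m3ce_loss(logits, labels, gamma=2):
    """Max-margin MCE loss: d_k = P_q - P_k, loss = max(0, 1 + d_k) ** gamma."""
    probs = torch.softmax(logits, dim=1)
    p_k = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # true-class prob
    masked = probs.scatter(1, labels.unsqueeze(1), -1.0)   # hide true class
    p_q = masked.max(dim=1).values                         # most confusing class
    d_k = p_q - p_k
    return torch.clamp(1.0 + d_k, min=0.0).pow(gamma).mean()

loss = m3ce_loss(torch.randn(4, 10), torch.tensor([1, 0, 3, 7]))
```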

3 Experimental results

3.1 Experimental platform and data preprocessing

The MNIST (Mixed National Institute of Standards and Technology) database is a standard database in machine learning. It consists of ten classes of handwritten digit grayscale images at a resolution of 28 × 28, with 60,000 training images and 10,000 test images.

In this paper, we mainly use ZCA whitening to process the image data, first reading the data into arrays and reshaping them to the required size (Figs.  1 , 2 , 3 , 4 , and 5 ). The images of the dataset are normalized and whitened, respectively, so that all pixels have the same mean value and variance; this eliminates the white-noise problem in the images and removes the correlation between pixels.

Figure 1. ZCA whitening flow chart.

Figure 2. Sample selection of different fonts and different colors.

Figure 3. Comparison of image feature extraction.

Figure 4. Image classification and modeling based on deep convolution neural network.

Figure 5. Comparison of recognition rates among different species.

At the same time, a common way to improve image training results is to randomly distort, crop, or sharpen the training input. This has the advantage of extending the effective size of the training data, thanks to all the possible variations of the same image, and it tends to help the network learn to cope with all the distortions that will occur in real use of the classifier. Therefore, when the training results are abnormal, the images are deformed randomly to prevent individual abnormal images from strongly interfering with the whole model.

3.2 Build a training network

Classification algorithms form a relatively large class of algorithms, and image classification algorithms are among them. Common classification algorithms include the support vector machine, the k-nearest neighbor algorithm, and the random forest. In image classification, the support vector machine (SVM), based on the maximum margin, is the most widely used classification algorithm, especially the SVM that uses kernel techniques. SVM is based on VC-dimension theory and structural risk minimization. Its main purpose is to find the optimal classification hyperplane in a high-dimensional space so that the classification margin is maximized and the classification error rate is minimized. However, it is better suited to cases where the feature dimension of the image is small and the amount of data after feature extraction is large.

Another commonly used target recognition method is the deep learning model, which describes the image by hierarchical feature representations. Mainstream deep learning networks include the restricted Boltzmann machine, the deep belief network, the autoencoder, the convolutional neural network, biological models, and so on. We tested the proposed M3CE-CEc, designing different convolutional neural networks for different datasets. The experimental settings are as follows: the weight parameters are initialized randomly, the bias parameters are set as constants, the base learning rate is set to 0.01, and the momentum term is set to 0.9. During training, when the error rate no longer decreases, the learning rate is multiplied by 0.1.
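These settings correspond to a standard SGD-with-momentum setup with a plateau-based learning-rate decay; a sketch (the model and helper functions are assumed to exist elsewhere):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1)           # lr * 0.1 on plateau

for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    error_rate = evaluate_error(model, val_loader)    # hypothetical helper
    scheduler.step(error_rate)                   # decay when error stops falling
```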

3.3 Image classification and modeling based on deep convolution neural network

The following is a model for image classification based on deep convolution neural networks.

Input: Input is a collection of N images; each image label is one of the K classification tags. This set is called the training set.

Learning: The task of this step is to use the training set to learn exactly what each class looks like. This step is generally called a training classifier or learning a model.

Evaluation: The classifier is used to predict the classification labels of images it has not seen before, in order to evaluate the quality of the classifier. We compare the labels predicted by the classifier with the true labels of the images; the more predictions that match the true labels, the better the classifier.

3.4 Evaluation index

In this paper, the image recognition effect is evaluated in three parts: the overall classification accuracy, the classification accuracy of each category, and the classification time. Assuming that $n_{ij}$ denotes the number of images of category $i$ that are classified into category $j$, the overall classification accuracy is

$$\mathrm{Accuracy} = \frac{\sum_{i} n_{ii}}{\sum_{i} \sum_{j} n_{ij}}$$

The accuracy of each category is

$$\mathrm{Accuracy}_i = \frac{n_{ii}}{\sum_{j} n_{ij}}$$

The run time is the average time from reading an image to obtaining its classification result.
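Both accuracies follow directly from the confusion matrix $n$; a small sketch with an illustrative 3-class example:

```python
import numpy as np

def classification_accuracies(n):
    """n[i, j] = number of images of true category i classified as j."""
    overall = np.trace(n) / n.sum()          # sum of n_ii over all images
    per_class = np.diag(n) / n.sum(axis=1)   # n_ii over images of class i
    return overall, per_class

# Example: 3-class confusion matrix (made-up counts).
n = np.array([[50, 3, 2],
              [4, 45, 1],
              [2, 2, 41]])
print(classification_accuracies(n))
```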

4 Discussion

4.1 Comparison of classification effects of different loss functions

We compare the curve of the traditional logistic loss function with that of our proposed maximum-margin loss function. It can be clearly seen that the loss value increases with the severity of the misclassification, which indicates that the loss function effectively expresses the degree of classification error.
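For intuition, the snippet below contrasts the logistic (softmax cross-entropy) loss with one generic maximum-margin loss available in PyTorch; this is an illustration only, not necessarily the authors' M3 CE formulation.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],    # sample 0: confident and correct
                       [-1.0, 0.5, 2.0]])   # sample 1: confidently wrong
targets = torch.tensor([0, 0])              # true class is 0 for both

for i in range(len(targets)):
    ce = F.cross_entropy(logits[i:i + 1], targets[i:i + 1])
    mm = F.multi_margin_loss(logits[i:i + 1], targets[i:i + 1])
    # Both losses grow as the misclassification becomes more severe.
    print(f"sample {i}: cross-entropy={ce.item():.3f}, multi-margin={mm.item():.3f}")
```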

4.2 Comparison of recognition rates between the same species

| Classification   | Bicycle | Car  | Bus  | Motor | Flower |
|------------------|---------|------|------|-------|--------|
| Recognition rate | 0.82    | 0.84 | 0.81 | 0.80  | 0.85   |

4.3 Comparison of recognition rates among different species

As can be seen from the table above, the recognition rates of this method are broadly similar across different species, all exceeding 80%, and the accuracy is relatively high for clearly defined images such as cars. This may be because clearly defined images have greater advantages in feature extraction.

4.4 Time consumption comparison of SVM, KNN, BP, and CNN methods

On the premise that features are extracted with the same M3 CE loss function, the choice of classifier is the key factor affecting classification accuracy. Therefore, this part discusses the influence of different classifiers on classification accuracy (Table 1). The following table summarizes the influence of some common classifiers on classification accuracy: the linear-kernel support vector machine (SVM-Linear), the Gaussian-kernel support vector machine (SVM-RBF), Naive Bayes (NB), k-nearest neighbor (KNN), random forest (RF), decision tree (DT), and gradient boosting decision tree (GBDT).

The experimental results show that the CNN classifier achieves higher accuracy than the other classifiers on both the training set and the test set. Although DT is the fastest classifier in the comparison experiment, its accuracy on the test set is only 69.47%, which is unacceptable. From this comparison we can draw the following conclusion: compared with the other six common classifiers, CNN has the highest accuracy, and its time cost of 6 s is acceptable among the seven classifiers compared.
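The comparison protocol can be reproduced in outline with scikit-learn; the dataset and default hyperparameters here are illustrative, not the paper's.

```python
import time
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)   # small stand-in image dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "SVM-Linear": SVC(kernel="linear"),
    "SVM-RBF": SVC(kernel="rbf"),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}
for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)                      # train, then score on held-out data
    acc = clf.score(X_te, y_te)
    print(f"{name}: accuracy={acc:.4f}, time={time.perf_counter() - start:.2f}s")
```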

A classifier such as KNN must compare each test image with all stored training images; it therefore takes up a lot of storage space, consumes a lot of computing resources, and is slow at prediction time. This matters because, in practice, test efficiency is far more important than training efficiency. The convolutional neural network sits at the other extreme of this trade-off: training takes a long time, but once training is complete, classifying new test data is very fast. Such a model meets the requirements of practical use.

5 Conclusions

Deep convolutional neural networks can recognize images under scaling, translation, and other forms of distortion. To avoid explicit feature extraction, the convolutional network learns implicitly from the training data through its feature detection layers, and because of the weight-sharing mechanism, neurons on the same feature map share the same weights. The network can therefore extract features in parallel, and its parameter count and computational complexity are clearly smaller than those of a traditional neural network. Its layout is also closer to that of real biological neural networks. Weight sharing greatly reduces the complexity of the network structure, and accepting multi-dimensional images directly as input avoids the complexity of data reconstruction during feature extraction and image classification. Deep convolutional neural networks thus have unmatched advantages in image feature representation and classification. However, many researchers still regard the deep convolutional neural network as a black-box feature extractor. Further research is needed to explore the connection between each layer of a deep convolutional neural network and the visual nervous system of the human brain, and to make deep neural networks learn incrementally, as humans do, compensating for earlier learning and deepening their understanding of the details of target objects.

Abbreviations

ANN: Artificial neural network

BP: Backpropagation

CNN-NB: Convolutional neural network and Naive Bayes

CNN: Convolutional neural network

MLP: Multilayer perceptron

ODI: Omnidirectional image

VFSR: Very fine spatial resolution

VR: Virtual reality


Acknowledgements

The authors thank the editor and anonymous reviewers for their helpful comments and valuable suggestions.

About the author

Xin Mingyuan was born in Heihe, Heilongjiang, P.R. China, in 1983. She received her Master's degree from Harbin University of Science and Technology, P.R. China. She now works in the School of Computer and Information Engineering, Heihe University. Her research interests include artificial intelligence, data mining, and information security.

Wang Yong was born in Suihua, Heilongjiang, P.R. China, in 1979. She received her Master's degree from Qiqihar University, P.R. China. She now works at Heihe University. Her research interests include artificial intelligence and education information management.

This work was supported by the University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (No. UNPYSCT-2017104) and by the basic research fund for provincial higher education institutions of the Heilongjiang Provincial Department of Education (No. 2017-KYYWF-0353).

Availability of data and materials

Please contact author for data requests.

Author information

Authors and affiliations

School of Computer and Information Engineering, Heihe University, No. 1 Xueyuan Road education science and technology zone, Heihe, Heilongjiang, China

Mingyuan Xin

Heihe University, No. 1 Xueyuan Road education science and technology zone, Heihe, Heilongjiang, China

Yong Wang

Contributions

All authors took part in the discussion of the work described in this paper. XM wrote the first version of the paper. XM and WY performed parts of the experiments. XM revised successive versions of the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yong Wang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Xin, M., Wang, Y. Research on image classification model based on deep convolution neural network. J Image Video Proc. 2019 , 40 (2019). https://doi.org/10.1186/s13640-019-0417-8


Received: 17 October 2018

Accepted: 07 January 2019

Published: 11 February 2019

DOI: https://doi.org/10.1186/s13640-019-0417-8


Keywords

  • Convolution neural network
  • Image classification


Title: Quantum machine learning for image classification

Abstract: Image classification, a pivotal task in multiple industries, faces computational challenges due to the burgeoning volume of visual data. This research addresses these challenges by introducing two quantum machine learning models that leverage the principles of quantum mechanics for effective computations. Our first model, a hybrid quantum neural network with parallel quantum circuits, enables the execution of computations even in the noisy intermediate-scale quantum era, where circuits with a large number of qubits are currently infeasible. This model demonstrated a record-breaking classification accuracy of 99.21% on the full MNIST dataset, surpassing the performance of known quantum-classical models, while having eight times fewer parameters than its classical counterpart. Also, the results of testing this hybrid model on a Medical MNIST (classification accuracy over 99%), and on CIFAR-10 (classification accuracy over 82%), can serve as evidence of the generalizability of the model and highlights the efficiency of quantum layers in distinguishing common features of input data. Our second model introduces a hybrid quantum neural network with a Quanvolutional layer, reducing image resolution via a convolution process. The model matches the performance of its classical counterpart, having four times fewer trainable parameters, and outperforms a classical model with equal weight parameters. These models represent advancements in quantum machine learning research and illuminate the path towards more accurate image classification systems.
Comments: 13 pages, 10 figures, 1 table
Subjects: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as: arXiv:2304.09224 [quant-ph]

Journal reference: Mach. Learn.: Sci. Technol. 5(1), 015040 (2024)


Fusion of local and important channel information for multi-attention hyperspectral image classification

  • Yin, Zhijian

In recent years, methods based on deep convolutional neural networks (CNNs) have gradually become the focus of research in the field of hyperspectral image (HSI) classification. It is well known that hyperspectral data itself contains spatial and spectral information. While CNN-based methods have advantages in extracting local spatial features, they are not good at handling spectral features and global information. Therefore, this paper proposes a multi-attention network that fuses local and key channel information to complete the task of HSI classification. First, principal component analysis (PCA) is used to pre-process the HSI data. Second, a feature information fusion module based on the SE module and 2D convolution is constructed to fuse local spatial information and enhanced feature channel information. Third, a global covariance pooling function accelerates the convergence rate of the network. Finally, the fused features are sent to a Vision Transformer (ViT) module for position encoding to capture global sequential information and improve the hyperspectral image classification results. Experiments carried out on three typical public datasets demonstrate that the proposed network provides competitive results compared with other state-of-the-art HSI classification networks.
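As a side note, the PCA pre-processing step mentioned above is typically applied to the spectral dimension of the hyperspectral cube; a minimal sketch (the cube size and component count are illustrative, not from the cited work):

```python
import numpy as np
from sklearn.decomposition import PCA

cube = np.random.rand(64, 64, 200)      # toy HSI cube: H x W x spectral bands
h, w, bands = cube.shape

pca = PCA(n_components=30)              # keep 30 spectral components
flat = cube.reshape(-1, bands)          # one row per pixel spectrum
cube_pca = pca.fit_transform(flat).reshape(h, w, 30)  # spatially intact, fewer bands
```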

