TEXT-TO-IMAGE SYNTHESIS: A REVIEW OF DATASETS

Suci Ramadhani Arifin(1*)

(1) Universitas Hasanuddin
(*) Corresponding Author

Abstract


This study provides an in-depth analysis of the datasets used in text-to-image synthesis research. Its main focus is on understanding the characteristics of each dataset, the influence of dataset selection on research outcomes, and the strengths and weaknesses of each dataset. The datasets examined include MS COCO, CUB-200-2011, and Oxford 102 Flower, together with other relevant domain-specific datasets. The research method comprises a descriptive analysis of the number of images, the visual characteristics, and the text descriptions associated with each dataset. The collected data are analyzed qualitatively to obtain deeper insight into each dataset. The results are expected to guide researchers in selecting a dataset that suits their research goals in text-to-image synthesis. The study closes with recommendations and conclusions that summarize the main findings and their relevance to this line of research.
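As a brief illustration of the descriptive analysis mentioned above, the sketch below computes basic statistics (image count, caption count, captions per image, average caption length) from an MS COCO-style captions file. It is a minimal Python sketch, assuming only the standard COCO captions JSON layout ("images" and "annotations" keys); the file path is hypothetical and must be adjusted to the local copy of the annotations.

import json
from statistics import mean

# Hypothetical path to a COCO-style captions file (e.g. captions_train2017.json);
# adjust to wherever the annotation archive was extracted.
ANNOTATION_FILE = "annotations/captions_train2017.json"

with open(ANNOTATION_FILE, encoding="utf-8") as f:
    coco = json.load(f)

images = coco["images"]         # one record per image
captions = coco["annotations"]  # one record per caption ("caption" text, keyed by "image_id")

caption_lengths = [len(ann["caption"].split()) for ann in captions]

print(f"images:              {len(images)}")
print(f"captions:            {len(captions)}")
print(f"captions per image:  {len(captions) / len(images):.2f}")
print(f"mean caption length: {mean(caption_lengths):.1f} words")

The same counts could, in principle, be collected for CUB-200-2011 or Oxford 102 Flower after converting their caption files to the same JSON layout, which keeps the comparison across datasets uniform.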


Keywords


synthesis; text-to-image; dataset


References


S. Frolov, T. Hinz, F. Raue, J. Hees, and A. Dengel, “Adversarial text-to-image synthesis: A review,” Neural Netw., vol. 144, pp. 187–209, Dec. 2021, doi: 10.1016/j.neunet.2021.07.019.

Y. Dong, Y. Zhang, L. Ma, Z. Wang, and J. Luo, “Unsupervised text-to-image synthesis,” Pattern Recognit., vol. 110, p. 107573, Feb. 2021, doi: 10.1016/j.patcog.2020.107573.

Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, “Visual question answering: A survey of methods and datasets,” Comput. Vis. Image Underst., vol. 163, pp. 21–40, Oct. 2017, doi: 10.1016/j.cviu.2017.05.001.

L. Alzubaidi et al., “A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications,” J. Big Data, vol. 10, no. 1, p. 46, Apr. 2023, doi: 10.1186/s40537-023-00727-2.

E. S. Jo and T. Gebru, “Lessons from archives: strategies for collecting sociocultural data in machine learning,” in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, in FAT* ’20. New York, NY, USA: Association for Computing Machinery, Jan. 2020, pp. 306–316. doi: 10.1145/3351095.3372829.

Md. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, and M. Bennamoun, “Text to Image Synthesis for Improved Image Captioning,” IEEE Access, vol. 9, pp. 64918–64928, 2021, doi: 10.1109/ACCESS.2021.3075579.

M. M. Bejani and M. Ghatee, “A systematic review on overfitting control in shallow and deep neural networks,” Artif. Intell. Rev., vol. 54, no. 8, pp. 6391–6438, Dec. 2021, doi: 10.1007/s10462-021-09975-1.

Y. Yang, Y. Zhuang, and Y. Pan, “Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies,” Front. Inf. Technol. Electron. Eng., vol. 22, no. 12, pp. 1551–1558, Dec. 2021, doi: 10.1631/FITEE.2100463.

T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2014, pp. 740–755. doi: 10.1007/978-3-319-10602-1_48.

K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 2980–2988. doi: 10.1109/ICCV.2017.322.

J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement.” arXiv, Apr. 08, 2018. doi: 10.48550/arXiv.1804.02767.

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection.” arXiv, Feb. 07, 2018. doi: 10.48550/arXiv.1708.02002.

M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and Efficient Object Detection.” arXiv, Jul. 27, 2020. doi: 10.48550/arXiv.1911.09070.

Y. Wang et al., “Pruning from Scratch.” arXiv, Sep. 27, 2019. doi: 10.48550/arXiv.1909.12579.

K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “HAQ: Hardware-Aware Automated Quantization with Mixed Precision.” arXiv, Apr. 06, 2019. doi: 10.48550/arXiv.1811.08886.

G. Ghiasi, T.-Y. Lin, R. Pang, and Q. V. Le, “NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection.” arXiv, Apr. 15, 2019. doi: 10.48550/arXiv.1904.07392.

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid Scene Parsing Network.” arXiv, Apr. 27, 2017. doi: 10.48550/arXiv.1612.01105.

L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2018, pp. 833–851. doi: 10.1007/978-3-030-01234-2_49.

Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet: Criss-Cross Attention for Semantic Segmentation,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, pp. 603–612. doi: 10.1109/ICCV.2019.00069.

H. Zhang et al., “Context Encoding for Semantic Segmentation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 7151–7160. doi: 10.1109/CVPR.2018.00747.

R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for Semantic Segmentation,” presented at the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE Computer Society, Oct. 2021, pp. 7242–7252. doi: 10.1109/ICCV48922.2021.00717.

X. Chen, H. Fan, R. Girshick, and K. He, “Improved Baselines with Momentum Contrastive Learning.” arXiv, Mar. 09, 2020. doi: 10.48550/arXiv.2003.04297.

M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, in NIPS’20. Red Hook, NY, USA: Curran Associates Inc., Dec. 2020, pp. 9912–9924.

J. Aneja, H. Agrawal, D. Batra, and A. Schwing, “Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning.” arXiv, Aug. 22, 2019. doi: 10.48550/arXiv.1908.08529.

Q. Cai, Y. Pan, C.-W. Ngo, X. Tian, L. Duan, and T. Yao, “Exploring Object Relation in Mean Teacher for Cross-Domain Detection.” arXiv, Dec. 25, 2019. doi: 10.48550/arXiv.1904.11245.

D. Mahajan et al., “Exploring the Limits of Weakly Supervised Pretraining.” arXiv, May 02, 2018. doi: 10.48550/arXiv.1805.00932.

M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-Memory Transformer for Image Captioning.” arXiv, Mar. 20, 2020. doi: 10.48550/arXiv.1912.08226.

J. Lu, J. Yang, D. Batra, and D. Parikh, “Neural Baby Talk.” arXiv, Mar. 26, 2018. doi: 10.48550/arXiv.1803.09845.

H. Nam, J.-W. Ha, and J. Kim, “Dual Attention Networks for Multimodal Reasoning and Matching.” arXiv, Mar. 21, 2017. doi: 10.48550/arXiv.1611.00471.

A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic Feature Pyramid Networks,” presented at the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Jun. 2019, pp. 6392–6401. doi: 10.1109/CVPR.2019.00656.

S. Jin et al., “Towards Multi-Person Pose Tracking: Bottom-up and Top-down Methods,” 2017. Accessed: Nov. 11, 2023. [Online]. Available: https://www.semanticscholar.org/paper/Towards-Multi-Person-Pose-Tracking-%3A-Bottom-up-and-Jin-Ma/d5ad0521f148bd74cf90ac150786f8467775a58b

Y. Zhang and T. Funkhouser, “Deep Depth Completion of a Single RGB-D Image.” arXiv, May 01, 2018. doi: 10.48550/arXiv.1803.09326.

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep Ordinal Regression Network for Monocular Depth Estimation.” arXiv, Jun. 06, 2018. doi: 10.48550/arXiv.1806.02446.

Y.-X. Wang, D. Ramanan, and M. Hebert, “Learning to Model the Tail,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2017. Accessed: Nov. 11, 2023. [Online]. Available: https://papers.nips.cc/paper_files/paper/2017/hash/147ebe637038ca50a1265abac8dea181-Abstract.html

Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked Attention Networks for Image Question Answering.” arXiv, Jan. 26, 2016. doi: 10.48550/arXiv.1511.02274.

C. Gu et al., “AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions.” arXiv, Apr. 30, 2018. doi: 10.48550/arXiv.1705.08421.

E. Teh, M. Rochan, and Y. Wang, “Attention Networks for Weakly Supervised Object Localization,” in Proceedings of the British Machine Vision Conference (BMVC), 2016, pp. 52.1–52.11. doi: 10.5244/C.30.52.

C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 843–852. doi: 10.1109/ICCV.2017.97.

G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 5000–5009. doi: 10.1109/ICCV.2017.534.

D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari, “Training object class detectors with click supervision.” arXiv, May 19, 2017. doi: 10.48550/arXiv.1704.06189.

C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” California Institute of Technology, Tech. Rep. CNS-TR-2011-001, Jul. 2011.

J. Gao, T. Zhang, and C. Xu, “Graph Convolutional Tracking,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019, pp. 4644–4654. doi: 10.1109/CVPR.2019.00478.

W. Ge, X. Lin, and Y. Yu, “Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification from the Bottom Up.” arXiv, Mar. 07, 2019. doi: 10.48550/arXiv.1903.02827.

W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A Closer Look at Few-shot Classification.” arXiv, Jan. 12, 2020. doi: 10.48550/arXiv.1904.04232.

H. Gharoun, F. Momenifar, F. Chen, and A. H. Gandomi, “Meta-learning approaches for few-shot learning: A survey of recent advances.” arXiv, Mar. 13, 2023. doi: 10.48550/arXiv.2303.07502.

J. He et al., “TransFG: A Transformer Architecture for Fine-Grained Recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, Jun. 2022, pp. 852–860. doi: 10.1609/aaai.v36i1.19967.

J. Liu et al., “Evidence for dynamic attentional bias toward positive emotion-laden words: A behavioral and electrophysiological study,” Front. Psychol., vol. 13, 2022, Accessed: Nov. 12, 2023. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fpsyg.2022.966774

S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in Vision: A Survey,” ACM Comput. Surv., vol. 54, no. 10s, p. 200:1-200:41, Sep. 2022, doi: 10.1145/3505244.

Z. Tian, S. Chen, M. Li, K. Liao, P. Zhang, and W. Zhao, “Dual-Modality Feature Extraction Network Based on Graph Attention for RGBT Tracking,” in Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System, in CCRIS ’22. New York, NY, USA: Association for Computing Machinery, Oct. 2022, pp. 248–253. doi: 10.1145/3562007.3562054.

K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu, “Image-Text Embedding Learning via Visual and Textual Semantic Reasoning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 641–656, Jan. 2023, doi: 10.1109/TPAMI.2022.3148470.

C. Hu, L. Zhu, W. Qiu, and W. Wu, “Data Augmentation Vision Transformer for Fine-grained Image Classification.” arXiv, Nov. 24, 2022. doi: 10.48550/arXiv.2211.12879.

X. Li, B. Wang, and B. Qiu, “Research on fine-grained image classification based on deep learning,” in 4th International Conference on Information Science, Electrical, and Automation Engineering (ISEAE 2022), M. (Milly) Cen and L. Wang, Eds., Guangzhou, China: SPIE, Aug. 2022, p. 41. doi: 10.1117/12.2640112.

M.-E. Nilsback and A. Zisserman, “Delving deeper into the whorl of flower segmentation,” Image Vis. Comput., vol. 28, no. 6, pp. 1049–1062, Jun. 2010, doi: 10.1016/j.imavis.2009.10.001.

C. Li, C. Liu, L. Duan, P. Gao, and K. Zheng, “Reconstruction Regularized Deep Metric Learning for Multi-Label Image Classification,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2294–2303, Jul. 2020, doi: 10.1109/TNNLS.2019.2924023.

X.-M. Zhang, L. Liang, L. Liu, and M.-J. Tang, “Graph Neural Networks and Their Current Applications in Bioinformatics,” Front. Genet., vol. 12, 2021, Accessed: Nov. 12, 2023. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fgene.2021.690049

Y. Wang et al., “Capsule Networks Showed Excellent Performance in the Classification of hERG Blockers/Nonblockers,” Front. Pharmacol., vol. 10, 2020, Accessed: Nov. 12, 2023. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fphar.2019.01631

K. Dwivedi and G. Roig, “Representation Similarity Analysis for Efficient Task Taxonomy & Transfer Learning,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12387–12396. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2019/html/Dwivedi_Representation_Similarity_Analysis_for_Efficient_Task_Taxonomy__Transfer_Learning_CVPR_2019_paper.html

F. Zhao, P. Zhang, R. Zhang, and M. Li, “UnifiedFace: A Uniform Margin Loss Function for Face Recognition,” Appl. Sci., vol. 13, no. 4, Art. no. 4, Jan. 2023, doi: 10.3390/app13042350.

Y.-Y. Zheng, J.-L. Kong, X.-B. Jin, X.-Y. Wang, T.-L. Su, and J.-L. Wang, “Probability Fusion Decision Framework of Multiple Deep Neural Networks for Fine-Grained Visual Classification,” IEEE Access, vol. 7, pp. 122740–122757, 2019, doi: 10.1109/ACCESS.2019.2933169.

J. Zhang et al., “Class-incremental Learning via Deep Model Consolidation,” presented at the Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1131–1140. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_WACV_2020/html/Zhang_Class-incremental_Learning_via_Deep_Model_Consolidation_WACV_2020_paper.html

M. Oza, S. Chanda, and D. Doermann, “Semantic Text-to-Face GAN -ST^2FG.” arXiv, Aug. 26, 2022. doi: 10.48550/arXiv.2107.10756.

O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery,” presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2085–2094. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2021/html/Patashnik_StyleCLIP_Text-Driven_Manipulation_of_StyleGAN_Imagery_ICCV_2021_paper.html

S. Liu, T. Wang, D. Bau, J.-Y. Zhu, and A. Torralba, “Diverse Image Generation via Self-Conditioned GANs.” arXiv, Feb. 09, 2022. doi: 10.48550/arXiv.2006.10728.

S. Naveen, M. S. S. Ram Kiran, M. Indupriya, T. V. Manikanta, and P. V. Sudeep, “Transformer models for enhancing AttnGAN based text to image generation,” Image Vis. Comput., vol. 115, p. 104284, Nov. 2021, doi: 10.1016/j.imavis.2021.104284.

S. Pande, S. Chouhan, R. Sonavane, R. Walambe, G. Ghinea, and K. Kotecha, “Development and deployment of a generative model-based framework for text to photorealistic image generation,” Neurocomputing, vol. 463, pp. 1–16, Nov. 2021, doi: 10.1016/j.neucom.2021.08.055.

A. Nickabadi, M. S. Fard, N. M. Farid, and N. Mohammadbagheri, “A comprehensive survey on semantic facial attribute editing using generative adversarial networks.” arXiv, May 21, 2022. doi: 10.48550/arXiv.2205.10587.

C. Schuhmann et al., “LAION-5B: An open large-scale dataset for training next generation image-text models,” Adv. Neural Inf. Process. Syst., vol. 35, pp. 25278–25294, Dec. 2022.

A. Birhane, V. Prabhu, S. Han, V. N. Boddeti, and A. S. Luccioni, “Into the LAIONs Den: Investigating Hate in Multimodal Datasets.” arXiv, Nov. 06, 2023. doi: 10.48550/arXiv.2311.03449.

P. Chambon, C. Bluethgen, C. P. Langlotz, and A. Chaudhari, “Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains.” arXiv, Oct. 08, 2022. doi: 10.48550/arXiv.2210.04133.

J. Qiu et al., “Large AI Models in Health Informatics: Applications, Challenges, and the Future,” IEEE J. Biomed. Health Inform., pp. 1–14, 2023, doi: 10.1109/JBHI.2023.3316750.

N. Carlini et al., “Poisoning Web-Scale Training Datasets is Practical.” arXiv, Feb. 20, 2023. doi: 10.48550/arXiv.2302.10149.

P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22522–22531. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2023/html/Schramowski_Safe_Latent_Diffusion_Mitigating_Inappropriate_Degeneration_in_Diffusion_Models_CVPR_2023_paper.html

Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning.” arXiv, Feb. 19, 2020. doi: 10.48550/arXiv.1911.05371.

N. Rostamzadeh et al., “Fashion-Gen: The Generative Fashion Dataset and Challenge.” arXiv, Jul. 30, 2018. doi: 10.48550/arXiv.1806.08317.

F. Li, L. Zhu, T. Wang, J. Li, Z. Zhang, and H. T. Shen, “Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions.” arXiv, Oct. 26, 2023. doi: 10.48550/arXiv.2308.14263.

S. Luo, “A Survey on Multimodal Deep Learning for Image Synthesis: Applications, methods, datasets, evaluation metrics, and results comparison,” in Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence, in ICIAI ’21. New York, NY, USA: Association for Computing Machinery, Sep. 2021, pp. 108–120. doi: 10.1145/3461353.3461388.

A. M. Shoib, J. Summaira, C. Wang, and A. Jabbar, “Methods and advancement of content-based fashion image retrieval: A Review.” arXiv, Mar. 30, 2023. doi: 10.48550/arXiv.2303.17371.

L. Kumar and D. K. Singh, “A comprehensive survey on generative adversarial networks used for synthesizing multimedia content,” Multimed. Tools Appl., vol. 82, no. 26, pp. 40585–40624, Nov. 2023, doi: 10.1007/s11042-023-15138-x.

S. Mirchandani et al., “FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning.” arXiv, Oct. 26, 2022. doi: 10.48550/arXiv.2210.15028.

Y. Ding, Z. Lai, P. Y. Mok, and T.-S. Chua, “Computational Technologies for Fashion Recommendation: A Survey,” ACM Comput. Surv., Oct. 2023, doi: 10.1145/3627100.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848.

B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, “Do ImageNet Classifiers Generalize to ImageNet?” arXiv, Jun. 12, 2019. doi: 10.48550/arXiv.1902.10811.

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2012. Accessed: Nov. 12, 2023. [Online]. Available: https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv, Apr. 10, 2015. doi: 10.48550/arXiv.1409.1556.

K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/He_Deep_Residual_Learning_CVPR_2016_paper.html

M. Tan and Q. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” in Proceedings of the 36th International Conference on Machine Learning, PMLR, May 2019, pp. 6105–6114. Accessed: Nov. 12, 2023. [Online]. Available: https://proceedings.mlr.press/v97/tan19a.html

P. Stock and M. Cisse, “ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases,” presented at the Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 498–512. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_ECCV_2018/html/Pierre_Stock_ConvNets_and_ImageNet_ECCV_2018_paper.html

O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge.” arXiv, Jan. 29, 2015. doi: 10.48550/arXiv.1409.0575.

A. Kuznetsova et al., “The Open Images Dataset V4,” Int. J. Comput. Vis., vol. 128, no. 7, pp. 1956–1981, Jul. 2020, doi: 10.1007/s11263-020-01316-z.

A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection.” arXiv, Apr. 22, 2020. doi: 10.48550/arXiv.2004.10934.

S. Qiao, L.-C. Chen, and A. Yuille, “DetectoRS: Detecting Objects With Recursive Feature Pyramid and Switchable Atrous Convolution,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10213–10224. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2021/html/Qiao_DetectoRS_Detecting_Objects_With_Recursive_Feature_Pyramid_and_Switchable_Atrous_CVPR_2021_paper.html

F. J. Moreno-Rodríguez, V. J. Traver, F. Barranco, M. Dimiccoli, and F. Pla, “Visual Event-Based Egocentric Human Action Recognition,” in Pattern Recognition and Image Analysis, A. J. Pinho, P. Georgieva, L. F. Teixeira, and J. A. Sánchez, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2022, pp. 402–414. doi: 10.1007/978-3-031-04881-4_32.

Z.-W. Yuan and J. Zhang, “Feature extraction and image retrieval based on AlexNet,” in Eighth International Conference on Digital Image Processing (ICDIP 2016), SPIE, Aug. 2016, pp. 65–69. doi: 10.1117/12.2243849.

S. Antol et al., “VQA: Visual Question Answering,” presented at the Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2015/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html

S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3558–3568. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2021/html/Changpinyo_Conceptual_12M_Pushing_Web-Scale_Image-Text_Pre-Training_To_Recognize_Long-Tail_Visual_CVPR_2021_paper.html

A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in Proceedings of the 38th International Conference on Machine Learning, PMLR, Jul. 2021, pp. 8748–8763. Accessed: Nov. 12, 2023. [Online]. Available: https://proceedings.mlr.press/v139/radford21a.html

Y. Huang et al., “CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition,” presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5901–5910. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/html/Huang_CurricularFace_Adaptive_Curriculum_Learning_Loss_for_Deep_Face_Recognition_CVPR_2020_paper.html

A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A Survey on Contrastive Self-Supervised Learning,” Technologies, vol. 9, no. 1, Art. no. 1, Mar. 2021, doi: 10.3390/technologies9010002.

P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Trans. Assoc. Comput. Linguist., vol. 2, pp. 67–78, 2014, doi: 10.1162/tacl_a_00166.

T. Liu, K. Wang, L. Sha, B. Chang, and Z. Sui, “Table-to-Text Generation by Structure-Aware Seq2seq Learning,” Proc. AAAI Conf. Artif. Intell., vol. 32, no. 1, Art. no. 1, Apr. 2018, doi: 10.1609/aaai.v32i1.11925.

J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical Question-Image Co-Attention for Visual Question Answering,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2016. Accessed: Nov. 12, 2023. [Online]. Available: https://proceedings.neurips.cc/paper/2016/hash/9dcb88e0137649590b755372b040afad-Abstract.html

Y. Gu, C. Li, and J. Xie, “Attention-Aware Generalized Mean Pooling for Image Retrieval.” arXiv, Jan. 28, 2019. doi: 10.48550/arXiv.1811.00202.

A. Sadhu, K. Chen, and R. Nevatia, “Zero-Shot Grounding of Objects From Natural Language Queries,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South): IEEE, Oct. 2019, pp. 4693–4702. doi: 10.1109/ICCV.2019.00479.

M. Kilickaya, B. K. Akkus, R. Cakici, A. Erdem, E. Erdem, and N. Ikizler‐Cinbis, “Data‐driven image captioning via salient region discovery,” IET Comput. Vis., vol. 11, no. 6, pp. 398–406, Sep. 2017, doi: 10.1049/iet-cvi.2016.0286.

Y. Cui, G. Yang, A. Veit, X. Huang, and S. Belongie, “Learning to Evaluate Image Captioning,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5804–5812. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/html/Cui_Learning_to_Evaluate_CVPR_2018_paper.html

I. Laina, C. Rupprecht, and N. Navab, “Towards Unsupervised Image Captioning With Shared Multimodal Embeddings,” presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7414–7424. Accessed: Nov. 12, 2023. [Online]. Available: https://openaccess.thecvf.com/content_ICCV_2019/html/Laina_Towards_Unsupervised_Image_Captioning_With_Shared_Multimodal_Embeddings_ICCV_2019_paper.html

B. Thomee et al., “YFCC100M: the new data in multimedia research,” Commun. ACM, vol. 59, no. 2, pp. 64–73, Jan. 2016, doi: 10.1145/2812802.

C. Jia et al., “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision,” in Proceedings of the 38th International Conference on Machine Learning, PMLR, Jul. 2021, pp. 4904–4916. Accessed: Nov. 12, 2023. [Online]. Available: https://proceedings.mlr.press/v139/jia21b.html




DOI: http://dx.doi.org/10.31602/eeict.v7i1.13066



EEICT (Electric, Electronic, Instrumentation, Control, Telecommunication) is distributed under a Creative Commons Attribution 4.0 International License.