Optimization of software performance for classification and linking of administrative documents
- Authors: Slavin O.A.1,2
-
Affiliations:
- Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences
- LLC Smart Engines Service
- Issue: No 6 (2024)
- Pages: 48-58
- Section: DATA ANALYSIS
- URL: https://gynecology.orscience.ru/0132-3474/article/view/677612
- DOI: https://doi.org/10.31857/S0132347424060057
- EDN: https://elibrary.ru/dykmmm
- ID: 677612
Cite item
Abstract
The paper discusses technologies for optimizing software performance. Optimization methods are divided into high-level and low-level, as well as parallelization. An algorithm for classifying and linking fields in a recognized image of an administrative document is described. The features of the implementation of classification and linking tasks are listed, consisting of the use of constellations of text feature points and the modified Levenshtein distance. SDK Smart Document Engine and OCR Tesseract were used. Several ways are described to optimize the performance of the functions for classifying and linking document content. Optimization of the performance of the system for sorting a stream of images of administrative documents is also described. The proposed methods for optimizing software performance are suitable not only for implementing image processing algorithms but also for computational algorithms in which cyclic information processing is carried out. The method can be applied in modern CAD systems to analyze the content of recognized textual files.
Full Text

About the authors
O. A. Slavin
Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences; LLC Smart Engines Service
Author for correspondence.
Email: oslavin@isa.ru
Russian Federation, 44-2 Vavilova str, Moscow, 119333; 9 Prospect 60-Letiya Oktyabrya, Moscow, 117312
References
- Acar U.A., Blelloch G.E., Harper R. Selective memorization. ACM SIGPLAN Notices. 2003. V. 38. № 1. P. 14–25. https://doi.org/10.1145/640128.604133
- Tatarowicz A.L., Curino C., Jones E.P.C. and Madden S. Lookup Tables: Fine-Grained Partitioning for Distributed Databases. IEEE28th International Conference on Data Engineering. 2012. P. 102–113. https://doi.org/10.1109/ICDE.2012.26
- Harris D.M., Harris S.L. Digital Design and Computer Architecture, 2nd Edition. Morgam Kaufmann is an imprint of Elsevier Inc., Waltham, 2013. 720 p.
- Rusiñol M., Frinken V., Karatzas D., Bagdanov A.D., Lladós J. Multimodal page classification in Administrative document image streams. In: IJDAR. 2014. V. 17. № 4. P. 331–341. https://doi.org/10.1007/s10032-014-0225-8
- Slavin O.A., Pliskin E.L. Method for analyzing the structure of noisy images of administrative documents. Bulletin of the South Ural State University. Ser. Mathematical Modelling, Programming & Computer Software (Bulletin SUSU MMCS). 2022. V. 15. № 4. P. 80–89. https://doi.org/10.14529/mmp220407
- Slavin O.A., Farsobina V., Myshev A.V. Analyzing the content of business documents recognized with a large number of errors using modified Levenshtein distance. Cyber-Physical Systems: Intelligent Models and Algorithms. Springer Nature Switzerland AG. 2022. V. 417. P. 267–279. https://doi.org/10.1007/978-3-030-95116-0
- Bellavia F. SIFT Matching by Context Exposed. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022. https://doi.org/10.1109/TPAMI.2022.3161853
- Bay H., Tuytelaars T., Van Gool Luc. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding – CVIU. 2003. V. 110. № 3. P. 404–417.
- Du X., Wumo P., Bui T.D. Text line segmentation in handwritten documents using Mumford–Shah model. Pattern Recognition. 2009. V. 42. P. 3136–3145. https://doi.org/10.1016/j.patcog.2008.12.021
- Maraj A., Martin M.V., Makrehchi M. A More Effective Sentence-Wise Text Segmentation Approach Using BERT. In: Lladós J., Lopresti D., Uchida S. (eds) Document Analysis and Recognition – ICDAR2021. Lecture Notes in Computer Science, Springer, Cham. 2021. V. 12824. https://doi.org/10.1007/978-3-030-86337-1_16
- Kravets A.G., Salnikova N.A., Shestopalova E.L. Development of a Module for Predictive Modeling of Technological Development Trends. Cyber-Physical Systems. 2021. P. 125–136. https://doi.org/10.1007/978-3-030-67892-0_11
- Sabitov A., Minnikhanov R., Dagaeva M., Katasev A., Asliamov T. Text Classification in Emergency Calls Management Systems. Cyber-Physical Systems. 2021. P. 199–210. https://doi.org/10.1007/978-3-030-67892-0_17
- Deza M.M., Deza E. Encyclopedia of distances. Springer-Verlag, Berlin, xiv+590 pp. (2009)
- Yujian L., Bo L. A Normalized Levenshtein Distance Metric // IEEE Transactions on Pattern Analysis and Machine Intelligence. V. 29. № 6. P. 1091–1095. https://doi.org/10.1109/TPAMI.2007.1078 (2007)
- Intel® VTune™ Profiler Performance Analysis Cookbook. https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023–2/overview.html. Accessed 23 Sep. 2023.
- Smart Document Engine – automatic analysis and data extraction from business documents for desktop, server and mobile platforms. https://smartengines.com/ocr-engines/document-scanner. Accessed 23 Sep. 2023.
- Intel(R) oneAPI Threading Building Blocks (oneTBB) Developer Guide and API Reference. https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021–10/overview.html. Accessed 23 Sep. 2023.
- OCR Tesseract. https://github.com/tesseract-ocr/tesseract. Accessed 23 Sep. 2023.
- NIST Special Database. https://www.nist.gov/srd/nist-special-database-2. Accessed 23 Sep. 2023.
- Tobacco-3482. https://www.kaggle.com/patrickaudriaz/tobacco3482jpg. Accessed 23 Sep. 2023.
- Kravets A.G., Egunov V. The Software Cache Optimization-Based Method for Decreasing Energy Consumption of Computational Clusters // Energies [Special Issue Smart Energy and Sustainable Environment]. 2022. V. 15. № 20. P. 7509. https://doi.org/10.3390/en15207509
- Crow F.C. Summed-area tables for texture mapping ACM SIGGRAPH Computer Graphics. 1984. V. 18. № 3. P. 207–212.
- Trusov A., Limonova E., Nikolaev D., Arlazarov V.V. 4.6-bit Quantization for Fast and Accurate Neural Network Inference on CPUs // Mathematics. 2024. V. 12. № 5. P. 651. https://doi.org/10.3390/math12050651
- Rybakova E.O., Limonova E.E., Nikolaev D.P. Fast Gaussian Filter Approximations Comparison on SIMD Computing Platforms // Applied Sciences. 2024. V. 14. № 11. P. 4664. https://doi.org/10.3390/app14114664
Supplementary files
