Optimization of software performance for classification and linking of administrative documents

Abstract

The paper discusses techniques for optimizing software performance. Optimization methods are divided into high-level methods, low-level methods, and parallelization. An algorithm for classifying and linking fields in a recognized image of an administrative document is described. The implementation of the classification and linking tasks relies on constellations of text feature points and a modified Levenshtein distance. The Smart Document Engine SDK and the Tesseract OCR engine were used. Several ways to optimize the performance of the functions that classify and link document content are described, along with optimization of a system that sorts a stream of images of administrative documents. The proposed optimization methods are suitable not only for image processing algorithms but also for computational algorithms that process information in loops. The methods can be applied in modern CAD systems to analyze the content of recognized text files.
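The abstract names a modified Levenshtein distance and, in the figure captions, a substCost function, but the full text is not reproduced here. The following is therefore only a minimal sketch of the general idea, not the author's implementation: an edit distance whose substitution cost is lower for characters that OCR tends to confuse, with the cost precomputed into a lookup table so the inner loop avoids repeated comparisons. The confusable groups, the 0.5 cost, and the names `subst_cost`, `build_cost_table`, and `weighted_levenshtein` are all assumptions made for illustration.

```python
import string
from typing import Dict, Tuple

# Hypothetical groups of glyphs that OCR often confuses (an assumption,
# not taken from the paper).
CONFUSABLE = [set("0OoQ"), set("1lI|"), set("5S"), set("8B")]


def subst_cost(a: str, b: str) -> float:
    """Illustrative cost: 0 for equal characters, 0.5 for look-alikes, else 1."""
    if a == b:
        return 0.0
    for group in CONFUSABLE:
        if a in group and b in group:
            return 0.5
    return 1.0


def build_cost_table(alphabet: str) -> Dict[Tuple[str, str], float]:
    """Precompute all pairwise costs once (a high-level, lookup-table optimization)."""
    return {(a, b): subst_cost(a, b) for a in alphabet for b in alphabet}


def weighted_levenshtein(s: str, t: str,
                         table: Dict[Tuple[str, str], float]) -> float:
    """Edit distance with unit insert/delete cost and table-driven substitution cost."""
    prev = [float(j) for j in range(len(t) + 1)]  # row for the empty prefix of s
    for i, a in enumerate(s, start=1):
        cur = [float(i)]
        for j, b in enumerate(t, start=1):
            cost = table.get((a, b))
            if cost is None:  # character outside the precomputed alphabet
                cost = subst_cost(a, b)
            cur.append(min(prev[j] + 1.0,          # delete a
                           cur[j - 1] + 1.0,       # insert b
                           prev[j - 1] + cost))    # substitute a with b
        prev = cur
    return prev[-1]


TABLE = build_cost_table(string.ascii_letters + string.digits + "|")
```

With this table, a keyword recognized with look-alike errors stays close to its reference spelling, which is the property that makes such a distance useful for matching noisy OCR output against expected field names.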

About the authors

O. A. Slavin

Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences; LLC Smart Engines Service

Author for correspondence.
Email: oslavin@isa.ru
Russian Federation, 44-2 Vavilova str, Moscow, 119333; 9 Prospect 60-Letiya Oktyabrya, Moscow, 117312


Supplementary files

Fig. 1. Implementation of the original version of the substCost function.
Fig. 2. Implementation of the optimized version of the substCost function.
Fig. 3. Parallel implementation of the sorting system.

Copyright (c) 2024 Russian Academy of Sciences