IS&T | Library

Automatic Reading Order Sequencing: A Novel Reading Order Generator for Text-based Documents

Abstract

The usability and accessibility of digitised archival data can be improved using deep learning solutions. In this paper, the authors present their work in developing a named entity recognition (NER) model for digitised archival data, specifically state authority documents. The entities for the model were chosen based on surveying different user groups. In addition to common entities, two new entities were created to identify businesses (FIBC) and archival documents (JON). The NER model was trained by fine-tuning an existing Finnish BERT model. The training data also included modern digitally born texts to achieve good performance with various types of inputs. The finished model performs fairly well with OCR-processed data, achieving an overall F1 score of 0.868, and particularly well with the new entities (F1 scores of 0.89 and 0.97 for JON and FIBC, respectively).

Digital Library: ARCHIVING

Published Online: April 2024

Proceedings Paper

105 26

machine learning
automatic curriculum generation
automatic document sequencing
Document Reading Sequence
Reading order

Md. Manzoor Murshed, Steven J. Simske, Arturo N. Villanueva

DOI

10.2352/EI.2024.36.10.IPAS-252

Volume 36

Issue 10

Captchas on Darknet Marketplaces: Overview and Automated Solvers

Abstract

There are many electronic documents salient to read for each given topic; however, finding a suitable reading order for pedagogical purposes has historically been underserved by the text analytics community. In this research, we propose an automatic reading order generation technique that can quantitatively suggest a suitable and optimal reading order for curriculum generation. It is necessary to read the relevant documents in some logical order to understand the topics clearly. Many learning pedagogies have been advanced, so for our purposes we use the author-supplied reading orders of salient content sets as ground truth. Our method suggests the best reading order automatically by checking the relevant topics, document distances, and semantic structure of the given documents. The system generates a suitable and efficient reading sequence by analyzing the information, similarity, overlap of contents, and distances using word frequency and topic sets. We measure the similarity, relevance, distance, and overlap of different documents using cosine similarity, entropy relevance, Euclidean distances, and Jaccard similarities, respectively. We propose an algorithm that generates the best possible reading order for a set of given documents. We evaluated the performance of our system against the ground truth reading order using different kinds of textbooks and generalized the finding for any given set of documents.
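As a rough illustration of three of the measures named in this abstract (cosine similarity, Euclidean distance, and Jaccard similarity), the following minimal sketch computes them on raw word-frequency vectors. The whitespace tokenizer, the function names, and the two sample strings are assumptions made for illustration only; the abstract does not specify the authors' preprocessing, and their entropy relevance measure and ordering algorithm are not reproduced here.

```python
import math
from collections import Counter


def term_counts(text):
    """Word-frequency vector from lowercase whitespace tokens (an assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance between two word-frequency vectors."""
    terms = a.keys() | b.keys()
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))


def jaccard_similarity(a, b):
    """Vocabulary overlap: shared terms divided by all distinct terms."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical sample documents.
doc_a = term_counts("graphs and trees")
doc_b = term_counts("trees and heaps")

print(cosine_similarity(doc_a, doc_b))
print(euclidean_distance(doc_a, doc_b))
print(jaccard_similarity(doc_a, doc_b))
```

Pairwise scores like these could feed a sequencing step, but how the paper combines them into a reading order is not stated in the abstract.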

Digital Library: EI

Published Online: January 2024

Proceedings

231 109

captcha solving
darknet marketplaces
machine learning

York Yannikos, Julian Heeger

DOI

10.2352/EI.2024.36.4.MWSF-330

Volume 36

Issue 4

Abstract

Captchas are used on many websites on the Internet to prevent automated web requests. Likewise, marketplaces in the darknet commonly use captchas to secure themselves against DDoS attacks and automated web scrapers. This complicates research and investigations regarding the content and activity of darknet marketplaces. In this work we focus on the darknet and provide an overview of the variety of captchas found in darknet marketplaces. We propose a workflow and recommendations for building automated captcha solvers and present solvers based on machine learning models for 5 different captcha types we found. With our solvers we were able to achieve accuracies between 65% and 99%, which significantly improved our ability to collect data from the corresponding marketplaces with automated web scrapers.

Digital Library: EI

Published Online: January 2024

Data Extraction, Visualization, and Storytelling: A Case Study in Headline Analysis on "The Hongkong News" with Deep Learning

104 37

data mining
data visualization
AI
machine learning
storytelling

Michael Kin-Fu Yip, Vincent Wai-Yip Lum

DOI

10.2352/issn.2168-3204.2023.20.1.6

Volume 20

Issue 1

Abstract

This paper presents methodologies for extracting the headlines and illustrations from a historical newspaper for storytelling to support digital scholarship. It explores the ways in which new digital tools can facilitate the understanding of newspaper content in the setting of time and space. "The Hongkong News" was selected from Hong Kong Early Tabloid Newspaper for the case study owing to its unique historical value to scholars. The proposed methodologies were evaluated in OCR (Optical Character Recognition) with scraping and Deep Learning Object Detection models. Two visualization products were developed to showcase the feasibility of our proposed methods for storytelling purposes.

Digital Library: ARCHIVING

Published Online: June 2023

Toward an Ontological Model for Audiovisual Archive Datafication

36 7

Metadata management
ontology
machine learning
digitisation
audiovisual archive

Yuchen Yang

DOI

10.2352/issn.2168-3204.2023.20.1.20

Volume 20

Issue 1

Autonomous Vehicles and Machines 2023 Conference Overview and Papers Program

Abstract

With initiatives like Collections as Data and Computational Archival Science, archives are no longer seen as static documentation of objects, but as evolving sources of cultural and historical data. This work emphasizes the potential updates in preserving and documenting digital audiovisual (AV) content from a data perspective, considering the recent developments in natural language processing and computer vision tasks, as well as the emergence of interactive and embodied experiences and interfaces for innovatively accessing archival content. As part of a Swiss national scientific fund Sinergia project, this work was able to engage end-to-end with real-world AV archives like Télévision Suisse Romande (RTS). Resorting to an updated narrative model for mapping data that can be obtained from the content as well as the consumer, this work proposes an experimental attempt to build an ontology that formally sums up the potential new paradigm for preservation and accessibility from a data perspective for modern archives, in the hope of nurturing a digital and data-driven mind-set for archive practices.

Digital Library: ARCHIVING

Published Online: June 2023

Front Matter

62 33

autonomous vehicles
perception
machine learning
deep learning
sensors and processors

DOI

10.2352/EI.2023.35.16.AVM-A16

Volume 35

Issue 16

Abstract

Advancements in sensing, computing, image processing, and computer vision technologies are enabling unprecedented growth and interest in autonomous vehicles and intelligent machines, from self-driving cars to unmanned drones, to personal service robots. These new capabilities have the potential to fundamentally change the way people live, work, commute, and connect with each other, and will undoubtedly provoke entirely new applications and commercial opportunities for generations to come. The main focus of AVM is perception. This begins with sensing. While imaging continues to be an essential emphasis in all EI conferences, AVM also embraces other sensing modalities important to autonomous navigation, including radar, LiDAR, and time of flight. Realization of autonomous systems also includes purpose-built processors, e.g., ISPs, vision processors, DNN accelerators, as well as core image processing and computer vision algorithms, system design and architecture, simulation, and image/video quality. AVM topics are at the intersection of these multi-disciplinary areas. AVM is the Perception Conference that bridges the imaging and vision communities, connecting the dots for the entire software and hardware stack for perception, helping people design globally optimized algorithms, processors, and systems for intelligent “eyes” for vehicles and machines.

Digital Library: EI

Published Online: January 2023

Active learning approaches to analysis of thin-film printed sensors for determining nitrate levels in soil

79 14

active learning
machine learning
image processing

Xihui Wang, Ali Shakouri, Bruno Ribeiro, George T.C. Chiu, Jan P. Allebach

DOI

10.2352/EI.2023.35.15.COLOR-194

Volume 35

Issue 15

Abstract

Training a learning-based prediction model typically requires large datasets. One of the major restrictions of machine learning applications using customized databases is the cost of human labor. Previous papers [3, 4, 5] demonstrated through experiments that a correlation exists between thin-film nitrate sensor performance and surface texture. Those papers also explored several methods for extracting texture features from sensor images, performed repeated cross-validation and hyperparameter auto-tuning, and built several machine learning models to improve prediction accuracy. This paper presents a new way to achieve the same accuracy with a much smaller set of labels by using an active learning structure.

Digital Library: EI

Published Online: January 2023

Machine learning estimation of camera spectral sensitivity functions with non-RGB color filters

123 36

spectral sensitivity functions
machine learning
color filter arrays
estimation

Abraham Sachs, Ramakrishna Kakarala

DOI

10.2352/EI.2023.35.15.COLOR-199

Volume 35

Issue 15

Abstract

The spectral sensitivity functions of a digital image sensor determine the sensor’s color response to scene-radiated light. Knowing these spectral sensitivity functions is very important for applications that require accurate color, such as computer vision. Traditional measurements of these functions are time-consuming and require expensive lab equipment to generate narrow-band monochromatic light. Previous works have shown that sensitivity curves can be estimated using images of a color checker chart with known spectral reflectances, using either numerical optimization or machine learning. However, previous works in the literature have not considered sensitivity functions for CFAs (color filter arrays) other than RGB, such as RCCB (Red Clear Clear Blue) or RYYCy (Red Yellow Yellow Cyan). Non-RGB CFAs have been shown to be useful for automotive and security camera applications, especially in low-light situations. We propose a machine learning method to estimate the sensitivity curves of sensors with non-RGB filters, in addition to the RGB filters addressed previously in the literature, using a single image of a color chart under unknown illumination. Including non-RGB filters makes the estimation problem much more challenging, since the resulting space of color filters is no longer modelled by simple Gaussian shapes.

Digital Library: EI

Published Online: January 2023

Evaluating active learning for blind imbalanced domains

190 36

machine learning
deep learning
active learning
data augmentation
out of distribution
blind imbalanced domains

Hiroshi Kuwajima, Masayuki Tanaka, Masatoshi Okutomi

DOI

10.2352/EI.2023.35.9.IPAS-288

Volume 35

Issue 9

Imaging and Multimedia Analytics at the Edge 2023 Conference Overview and Papers Program

Abstract

Deep learning, which has been very successful in recent years, requires a large amount of data. Active learning has been widely studied and used for decades to reduce annotation costs and now attracts lots of attention in deep learning. Many real-world deep learning applications use active learning to select the informative data to be annotated. In this paper, we first investigate laboratory settings for active learning. We show significant gaps between the results from different laboratory settings and describe our practical laboratory setting that reasonably reflects the active learning use cases in real-world applications. Then, we introduce a problem setting of blind imbalanced domains. Any data set includes multiple domains, e.g., individuals in handwritten character recognition with different social attributes. Major domains have many samples, and minor domains have few samples in the training set. However, we must accurately infer both major and minor domains in the test phase. We experimentally compare different methods of active learning for blind imbalanced domains in our practical laboratory setting. We show that a simple active learning method using softmax margin and a model training method using distance-based sampling with center loss, both working in the deep feature space, perform well.
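The softmax-margin selection mentioned in this abstract can be illustrated with a generic margin-sampling sketch: for each unlabeled sample, take the gap between the two largest softmax probabilities, then send the samples with the smallest gaps for annotation. This is a textbook form of the idea, not the authors' implementation. The function names, the logits, and the budget below are hypothetical, and the distance-based sampling with center loss described in the abstract is not covered.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_margin(logits):
    """Gap between the two most probable classes; a small gap means an uncertain sample."""
    top = sorted(softmax(logits), reverse=True)
    return top[0] - top[1]


def select_for_annotation(pool_logits, budget):
    """Indices of the `budget` pool samples with the smallest softmax margin."""
    ranked = sorted(range(len(pool_logits)),
                    key=lambda i: softmax_margin(pool_logits[i]))
    return ranked[:budget]


# Hypothetical logits for three unlabeled samples over three classes.
pool = [
    [5.0, 0.0, 0.0],  # confident prediction, large margin
    [1.0, 0.9, 0.0],  # two classes nearly tied, small margin
    [2.0, 1.0, 0.0],  # moderately uncertain
]

print(select_for_annotation(pool, budget=2))
```

In an active learning loop, the selected indices would be labeled, added to the training set, and the model retrained before the next selection round.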

Digital Library: EI

Published Online: January 2023

Front Matter

42 21

multimedia analysis
machine learning
mobile imaging
web imaging
Internet of Things edge imaging

DOI

10.2352/EI.2023.35.7.IMAGE-A07

Volume 35

Issue 7