Journal Description

Data

Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability. The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.

Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
High Visibility: indexed within Scopus, ESCI (Web of Science), dblp, Inspec, RePEc, and other databases.
Journal Rank: CiteScore - Q2 (Information Systems and Management)
Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 22 days after submission; acceptance to publication is undertaken in 3.9 days (median values for papers published in this journal in the second half of 2023).
Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.

Impact Factor: 2.6 (2022); 5-Year Impact Factor: 3.0 (2022)

ISSN: 2306-5729

Latest Articles

10 pages, 1827 KiB

Open Access Data Descriptor

Proteomic and Metabolomic Analyses of the Blood Samples of Highly Trained Athletes

Kristina A. Malsagova

Alexander A. Stepanov

Liudmila I. Kulikova

Vladimir R. Rudnev

and

Anna L. Kaysheva

Data 2024, 9(1), 15; https://doi.org/10.3390/data9010015 (registering DOI) - 16 Jan 2024

Abstract

High exercise loading causes intricate and ambiguous proteomic and metabolic changes. This study aims to describe the dataset on protein and metabolite contents in plasma samples collected from highly trained athletes across different sports disciplines. The proteomic and metabolomic analyses of the plasma samples of highly trained athletes engaged in sports disciplines of different intensities were carried out using HPLC-MS/MS. The results are reported as two datasets (proteomic data in a derived mgf-file and metabolomic data in processed format), each containing the findings obtained by analyzing 93 mass spectra. Variations in the protein and metabolite contents of the biological samples are observed, depending on the intensity of training load for different sports disciplines. Mass spectrometric proteomic and metabolomic studies can be used for classifying different athlete phenotypes according to the intensity of sports discipline and for the assessment of the efficiency of the recovery period. Full article

28 pages, 1002 KiB

Open Access Article

GeMSyD: Generic Framework for Synthetic Data Generation

Ramona Tolas

Raluca Portase

and

Rodica Potolea

Data 2024, 9(1), 14; https://doi.org/10.3390/data9010014 - 11 Jan 2024

Abstract

In the era of data-driven technologies, the need for diverse and high-quality datasets for training and testing machine learning models has become increasingly critical. In this article, we present a versatile methodology, the Generic Methodology for Constructing Synthetic Data Generation (GeMSyD), which addresses the challenge of synthetic data creation in the context of smart devices. GeMSyD provides a framework that enables the generation of synthetic datasets, aligning them closely with real-world data. To demonstrate the utility of GeMSyD, we instantiate the methodology by constructing a synthetic data generation framework tailored to the domain of event-based data modeling, specifically focusing on user interactions with smart devices. Our framework leverages GeMSyD to create synthetic datasets that faithfully emulate the dynamics of human–device interactions, including the temporal dependencies. Furthermore, we showcase how the synthetic data generated using our framework can serve as a valuable resource for machine learning practitioners. By employing these synthetic datasets, we perform a series of experiments to evaluate the performance of a neural-network-based prediction model in the domain of smart device interaction. Our results underscore the potential of synthetic data in facilitating model development and benchmarking. Full article

23 pages, 1346 KiB

Open Access Review

Adaptive Forecasting in Energy Consumption: A Bibliometric Analysis and Review

Manuel Jaramillo

Wilson Pavón

and

Lisbeth Jaramillo

Data 2024, 9(1), 13; https://doi.org/10.3390/data9010013 - 11 Jan 2024

Abstract

This paper addresses the challenges in forecasting electrical energy in the current era of renewable energy integration. It reviews advanced adaptive forecasting methodologies while also analyzing the evolution of research in this field through bibliometric analysis. The review highlights the key contributions and limitations of current models with an emphasis on the challenges of traditional methods. The analysis reveals that Long Short-Term Memory (LSTM) networks, optimization techniques, and deep learning have the potential to model the dynamic nature of energy consumption, but they also have higher computational demands and data requirements. This review aims to offer a balanced view of current advancements and challenges in forecasting methods, guiding researchers, policymakers, and industry experts. It advocates for collaborative innovation in adaptive methodologies to enhance forecasting accuracy and support the development of resilient, sustainable energy systems. Full article

9 pages, 1840 KiB

Open Access Data Descriptor

DeepSpaceYoloDataset: Annotated Astronomical Images Captured with Smart Telescopes

Olivier Parisot

Data 2024, 9(1), 12; https://doi.org/10.3390/data9010012 - 10 Jan 2024

Abstract

Recent smart telescopes allow the automatic collection of a large quantity of data for specific portions of the night sky, with the goal of capturing images of deep sky objects (nebulae, galaxies, globular clusters). Nevertheless, human verification is still required afterwards to check whether celestial targets are effectively visible in the images produced by these instruments. Depending on the magnitude of deep sky objects, the observation conditions and the cumulative time of data acquisition, it is possible that only stars are present in the images. In addition, unfavorable external conditions (light pollution, bright moon, etc.) can make capture difficult. In this paper, we describe DeepSpaceYoloDataset, a set of 4696 RGB astronomical images captured by two smart telescopes and annotated with the positions of deep sky objects that are effectively in the images. This dataset can be used to train detection models on this type of image, enabling better control of the duration of capture sessions, and also to detect unexpected celestial events such as supernovae. Full article

16 pages, 2257 KiB

Open Access Article

ADAS Simulation Result Dataset Processing Based on Improved BP Neural Network

Songyan Zhao

Lingshan Chen

and

Yongchao Huang

Data 2024, 9(1), 11; https://doi.org/10.3390/data9010011 - 05 Jan 2024

Abstract

The autonomous driving simulation field lacks evaluation and forecasting systems for simulation results. The data obtained from the simulation of target algorithms and vehicle models cannot be reasonably estimated. This problem affects subsequent vehicle improvement and parameter calibration. We relied on the simulation results of the AEB algorithm. We selected the BP neural network as the basis and improved it with a genetic algorithm optimized via a roulette algorithm. The regression evaluation indicators of the prediction results show that the GA-BP neural network has better prediction accuracy and generalization ability than the original BP neural network and other optimized BP neural networks. This GA-BP neural network also fills the gap in evaluation and prediction systems. Full article

15 pages, 5777 KiB

Open Access Article

Experimental Dataset of Tunable Mode Converter Based on Long-Period Fiber Gratings Written in Few-Mode Fiber: Impacts of Thermal, Wavelength, and Polarization Variations

Juan Soto-Perdomo

Erick Reyes-Vera

Jorge Montoya-Cardona

and

Pedro Torres

Data 2024, 9(1), 10; https://doi.org/10.3390/data9010010 - 31 Dec 2023

Abstract

Mode division multiplexing (MDM) is currently one of the most attractive multiplexing techniques in optical communications, as it allows for an increase in the number of channels available for data transmission. Optical modal converters are one of the main devices used in this technique. Therefore, the characterization and improvement of these devices are of great current interest. In this work, we present a dataset of 49,736 near-field intensity images of a modal converter based on a long-period fiber grating (LPFG) written on a few-mode fiber (FMF). This characterization was performed experimentally at various wavelengths, polarizations, and temperature conditions when the device converted from $LP_{01}$ mode to $LP_{11}$ mode. The results show that the modal converter can be tuned by adjusting these parameters, and that its operation is optimal under specific circumstances which have a great impact on its performance. Additionally, the potential application of the database is validated in this work. A modal decomposition technique based on the particle swarm algorithm (PSO) was employed as a tool for determining the most effective combinations of modal weights and relative phases from the spatial distributions collected in the dataset. The proposed dataset can open up new opportunities for researchers working on image segmentation, detection, and classification problems related to MDM technology. In addition, we implement novel artificial intelligence techniques that can help in finding the optimal operating conditions for this type of device. Full article

26 pages, 6610 KiB

Open Access Article

Wi-Gitation: Replica Wi-Fi CSI Dataset for Physical Agitation Activity Recognition

Nikita Sharma

Jeroen Klein Brinke

L. M. A. Braakman Jansen

Paul J. M. Havinga

and

Duc V. Le

Data 2024, 9(1), 9; https://doi.org/10.3390/data9010009 - 30 Dec 2023

Abstract

Agitation is a commonly found behavioral condition in persons with advanced dementia. It requires continuous monitoring to gain insights into agitation levels to assist caregivers in delivering adequate care. The available monitoring techniques use cameras and wearables which are distressful and intrusive and are thus often rejected by older adults. To enable continuous monitoring in older adult care, unobtrusive Wi-Fi channel state information (CSI) can be leveraged to monitor physical activities related to agitation. However, to the best of our knowledge, there are no realistic CSI datasets available for facilitating the classification of physical activities demonstrated during agitation scenarios such as disturbed walking, repetitive sitting–getting up, tapping on a surface, hand wringing, rubbing on a surface, flipping objects, and kicking. Therefore, in this paper, we present a public dataset named Wi-Gitation. For Wi-Gitation, the Wi-Fi CSI data were collected with twenty-three healthy participants depicting the aforementioned agitation-related physical activities at two different locations in a one-bedroom apartment with multiple receivers placed at different distances (0.5–8 m) from the participants. The validation results on the Wi-Gitation dataset indicate higher accuracies ($F_{1}$-scores $\geq 0.95$) when employing mixed-data analysis, where the training and testing data share the same distribution. Conversely, in scenarios where the training and testing data differ in distribution (i.e., leave-one-out), the accuracies experienced a notable decline ($F_{1}$-scores $\leq 0.21$). This dataset can be used for fundamental research on CSI signals and in the evaluation of advanced algorithms developed for tackling domain invariance in CSI-based human activity recognition. Full article

(This article belongs to the Section Information Systems and Data Management)

12 pages, 1825 KiB

Open Access Data Descriptor

DNA Methylome and Transcriptome Maps of Primary Colorectal Cancer and Matched Liver Metastasis

Priyadarshana Ajithkumar

and

Data 2024, 9(1), 8; https://doi.org/10.3390/data9010008 - 29 Dec 2023

Cited by 1

Abstract

Sequencing-based genome-wide DNA methylation, gene expression studies and associated data on paired colorectal cancer (CRC) primary and liver metastasis are very limited. We have profiled the DNA methylome and transcriptome of matched primary CRC and liver metastasis samples from the same patients. Genome-scale methylation and expression levels were examined using Reduced Representation Bisulfite Sequencing (RRBS) and RNA-Seq, respectively. To investigate DNA methylation and expression patterns, we generated a total of 1.01 × 10⁹ RRBS reads and 4.38 × 10⁸ RNA-Seq reads from the matched cancer tissues. Here, we describe in detail the sample features, experimental design, methods and bioinformatic pipeline for these epigenetic data. We demonstrate the quality of both the samples and sequence data obtained from the paired samples. The sequencing data obtained from this study will serve as a valuable resource for studying underlying mechanisms of distant metastasis and the utility of epigenetic profiles in cancer metastasis. Full article

20 pages, 4338 KiB

Open Access Article

Data-Driven Analysis of MRI Scans: Exploring Brain Structure Variations in Colombian Adolescent Offenders

Germán Sánchez-Torres

Nallig Leal

and

Mariana Pino

Data 2024, 9(1), 7; https://doi.org/10.3390/data9010007 - 26 Dec 2023

Abstract

With the advancements in neuroimaging techniques, understanding the relationship between brain morphology and behavioral tendencies such as criminal behavior has garnered interest. This research addresses the investigation of disparities in neuroanatomical structures between adolescent offenders and non-offenders and considers the implications of such distinctions regarding offender behavior within adolescent populations. Employing data-driven methodologies, MRI scans of adolescents from Barranquilla, Colombia, were analyzed to explore morphological variations. Utilizing a 1.5 Tesla Siemens resonator (Siemens Healthineers, Erlangen, Germany), T1-weighted MPRAGE anatomical images were acquired and analyzed using a systematic five-step methodology including data acquisition, MRI pre-processing, feature selection, model selection, and model validation and evaluation. Participants, both offenders and non-offenders, were aged 14–18 and selected based on education, criminal history, and physical conditions. The research identified significant disparities in the volumes of 42 brain structures between adolescent offenders (AOs) and non-offenders (NOs), highlighting particular brain regions potentially associated with offending behavior. Additionally, a considerable proportion of AOs emanated from lower socioeconomic backgrounds and showcased marked substance use. The findings suggest that neuroanatomical disparities potentially correlate with criminal behavior among adolescents at a neurobiological level. Noticeable socio-environmental factors, such as lower socioeconomic status and substance abuse, were substantially prevalent among AOs. Particularly, neurobiological deviations in structures like ctx-lh-rostralmiddlefrontal and ctx-lh-caudalanteriorcingulate perhaps represent a link between neurological factors and external stimuli. Full article

21 pages, 1932 KiB

Open Access Article

A Profit Maximization Model for Data Consumers with Data Providers’ Incentives in Personal Data Trading Market

Hyojin Park

Hyeontaek Oh

and

Jun Kyun Choi

Data 2024, 9(1), 6; https://doi.org/10.3390/data9010006 - 25 Dec 2023

Abstract

This paper proposes a profit maximization model for a data consumer when it buys personal data from data providers (by obtaining consent) through data brokers and provides their new services to data providers (i.e., service consumers). To observe the behavioral models of data providers, the data consumer, and service consumers, this paper proposes the willingness-to-sell model of personal data of data providers (which is affected by data providers’ behavior related to explicit consent), the service quality model obtained by the collected personal data from the data consumer’s perspective, and the willingness-to-pay model of service consumers regarding provided new services from the data consumer. Particularly, this paper jointly considers the behavior of data providers and service users under a limited budget. With parameters inspired by real-world surveys on data providers, this paper shows various numerical results to check the feasibility of the proposed models. Full article

(This article belongs to the Section Information Systems and Data Management)

9 pages, 2088 KiB

Open Access Data Descriptor

Single-Nucleotide Variants in PADI2 and PADI4 and Ancestry Informative Markers in Interstitial Lung Disease and Rheumatoid Arthritis among a Mexican Mestizo Population

Karol J. Nava-Quiroz

Jorge Rojas-Serrano

Gloria Pérez-Rubio

Ivette Buendia-Roldan

Mayra Mejía

Juan Carlos Fernández-López

Espiridión Ramos-Martínez

Luis A. López-Flores

Alma D. Del Ángel-Pablo

and

Ramcés Falfán-Valencia

Data 2024, 9(1), 5; https://doi.org/10.3390/data9010005 - 25 Dec 2023

Abstract

Rheumatoid arthritis (RA) is an autoimmune disease mainly characterized by joint inflammation. It presents extra-articular manifestations, with the lungs being one of the affected areas. Among these, damage to the pulmonary interstitium (Interstitial Lung Disease—ILD) has been linked to proteins involved in the inflammatory process and related to extracellular matrix deposition and lung fibrosis establishment. Peptidyl arginine deiminase enzymes (PAD), which carry out protein citrullination, play a role in this context. A genetic association analysis was conducted on genes encoding two PAD isoforms: PAD2 and PAD4. This analysis also included ancestry informative markers and protein level determination in samples from patients with RA, RA-associated ILD, and clinically healthy controls. Significant single nucleotide variants (SNV) and one haplotype were identified as susceptibility factors for RA-ILD development. Elevated levels of PAD4 were found in RA-ILD cases, while PADI2 showed an association with RA susceptibility. This work presents data obtained from previously published research. Population variability has been noticed in genetic association studies. We present data for 14 SNVs that show geographical and genetic variation across the Mexican population, which provides highly informative content and greater intrapopulation genetic diversity. Further investigations in the field should be considered in addition to AIMs. The data presented in this study were analyzed in association with SNV genotypes in PADI2 and PADI4 to assess susceptibility to ILD in RA, as well as with changes in PAD2 and PAD4 protein levels according to carrier genotype, in addition to the use of covariates such as ancestry markers. Full article

12 pages, 4901 KiB

Open Access Data Descriptor

An Urban Traffic Dataset Composed of Visible Images and Their Semantic Segmentation Generated by the CARLA Simulator

Sergio Bemposta Rosende

David San José Gavilán

Javier Fernández-Andrés

and

Javier Sánchez-Soriano

Data 2024, 9(1), 4; https://doi.org/10.3390/data9010004 - 24 Dec 2023

Abstract

A dataset of aerial urban traffic images and their semantic segmentation is presented for training computer vision algorithms, among which those based on convolutional neural networks stand out. This article explains the process of creating the complete dataset, which includes the acquisition of the images and the labeling of vehicles, pedestrians, and pedestrian crossings, as well as a description of the structure and content of the dataset (which amounts to 8694 images, including visible images and those corresponding to the semantic segmentation). The images were generated using the CARLA simulator, but resemble those that could be obtained with fixed aerial cameras or multi-copter drones in the field of intelligent transportation management. The presented dataset is available and accessible to improve the performance of vision and road traffic management systems, especially for the detection of incorrect or dangerous maneuvers. Full article

20 pages, 3448 KiB

Open Access Article

Unlocking Insights: Analysing COVID-19 Lockdown Policies and Mobility Data in Victoria, Australia, through a Data-Driven Machine Learning Approach

and

Data 2024, 9(1), 3; https://doi.org/10.3390/data9010003 - 21 Dec 2023

Abstract

The state of Victoria, Australia, implemented one of the world’s most prolonged cumulative lockdowns in 2020 and 2021. Although lockdowns have proven effective in managing COVID-19 worldwide, this approach faced challenges in containing the rising infection in Victoria. This study evaluates the effects of short-term (less than 60 days) and long-term (more than 60 days) lockdowns on public mobility and the effectiveness of various social restriction measures within these periods. The aim is to understand the complexities of pandemic management by examining various measures over different lockdown durations, thereby contributing to more effective COVID-19 containment methods. Using restriction policy, community mobility, and COVID-19 data, a machine-learning-based simulation model was proposed, incorporating analysis of correlation, infection doubling time, and effective lockdown date. The model result highlights the significant impact of public event cancellations in preventing COVID-19 infection during short- and long-term lockdowns and the importance of international travel controls in long-term lockdowns. The effectiveness of social restriction was found to decrease significantly with the transition from short to long lockdowns, characterised by increased visits to public places and increased use of public transport, which may be associated with an increase in the effective reproduction number (R_t) and infected cases. Full article

14 pages, 283 KiB

Open Access Article

Medical Opinions Analysis about the Decrease of Autopsies Using Emerging Pattern Mining

Isaac Machorro-Cano

Ingrid Aylin Ríos-Méndez

José Antonio Palet-Guzmán

Nidia Rodríguez-Mazahua

Lisbeth Rodríguez-Mazahua

Giner Alor-Hernández

and

José Oscar Olmedo-Aguirre

Data 2024, 9(1), 2; https://doi.org/10.3390/data9010002 - 21 Dec 2023

Abstract

An autopsy is a widely recognized procedure to guarantee ongoing enhancements in medicine. It finds extensive application in legal, scientific, medical, and research domains. However, declining autopsy rates in hospitals constitute a worldwide concern. For example, the Regional Hospital of Rio Blanco in Veracruz, Mexico, has substantially reduced the number of autopsies at hospitals in recent years. Since there are no documented historical records of a decrease in the frequency of autopsy cases, it is crucial to establish a methodological framework to substantiate any actual trends in the data. Emerging pattern mining (EPM) allows for finding differences between classes or data sets because it builds a descriptive data model concerning some given remarkable property. Data set description has become a significant application area in various contexts in recent years. In this research study, various EPM algorithms were used to extract emerging patterns from a data set collected based on medical experts’ perspectives on reducing hospital autopsies. Notably, the top-performing EPM algorithms were iEPMiner, LCMine, SJEP-C, Top-k minimal SJEPs, and Tree-based JEP-C. Among these, iEPMiner and LCMine demonstrated faster performance and produced superior emerging patterns when considering metrics such as Confidence, Weighted Relative Accuracy Criteria (WRACC), False Positive Rate (FPR), and True Positive Rate (TPR). Full article

26 pages, 5854 KiB

Open Access Data Descriptor

Expert-Annotated Dataset to Study Cyberbullying in Polish Language

and

Data 2024, 9(1), 1; https://doi.org/10.3390/data9010001 - 20 Dec 2023

Abstract

We introduce the first dataset of harmful and offensive language collected from the Polish Internet. This dataset was meticulously curated to facilitate the exploration of harmful online phenomena such as cyberbullying and hate speech, which have exhibited a significant surge both within the Polish Internet as well as globally. The dataset was systematically collected and then annotated using two approaches. First, it was annotated by two proficient layperson volunteers, operating under the guidance of a specialist in the language of cyberbullying and hate speech. To enhance the precision of the annotations, a secondary round of annotations was carried out by a team of adept annotators with specialized long-term expertise in cyberbullying and hate speech annotations. This second phase was further overseen by an experienced annotator, acting as a super-annotator. In its initial application, the dataset was leveraged for the categorization of cyberbullying instances in the Polish language. Specifically, the dataset serves as the foundation for two distinct tasks: (1) a binary classification that segregates harmful and non-harmful messages and (2) a multi-class classification that distinguishes between two variations of harmful content (cyberbullying and hate speech), as well as a non-harmful category. Alongside the dataset itself, we also provide the models that showed satisfying classification performance. These models are made accessible for third-party use in constructing cyberbullying prevention systems. Full article

4 pages, 416 KiB

Open Access Data Descriptor

Genome Sequence of the Plant-Growth-Promoting Endophyte Curtobacterium flaccumfaciens Strain W004

Vladimir K. Chebotar

Maria S. Gancheva

Elena P. Chizhevskaya

Maria E. Baganova

Oksana V. Keleinikova

Kharon A. Husainov

and

Veronika N. Pishchik

Data 2023, 8(12), 187; https://doi.org/10.3390/data8120187 - 09 Dec 2023

Abstract

We report the whole-genome sequence of the endophyte Curtobacterium flaccumfaciens strain W004 isolated from the seeds of winter wheat, cv. Bezostaya 100. The genome was obtained using Oxford Nanopore MinION sequencing. The bacterium has a circular chromosome consisting of 3.63 Mbp with a G+C content of 70.89%. We found that Curtobacterium flaccumfaciens strain W004 could promote the growth of spring wheat plants, resulting in an increase in grain yield of 54.3%. Sequencing the genome of this new strain can provide insights into its potential role in plant–microbe interactions. Full article

19 pages, 11983 KiB

Open Access Data Descriptor

A Qualitative Dataset for Coffee Bio-Aggressors Detection Based on the Ancestral Knowledge of the Cauca Coffee Farmers in Colombia

Juan Felipe Valencia-Mosquera

David Griol

Mayra Solarte-Montoya

Cristhian Figueroa

Juan Carlos Corrales

and

David Camilo Corrales

Data 2023, 8(12), 186; https://doi.org/10.3390/data8120186 - 08 Dec 2023

Abstract

This paper describes a novel qualitative dataset regarding coffee pests based on the ancestral knowledge of coffee farmers in the Department of Cauca, Colombia. The dataset has been obtained from a survey applied to coffee growers with 432 records and 41 variables collected weekly from September 2020 to August 2021. The qualitative dataset includes climatic conditions, productive activities, external conditions, and coffee bio-aggressors. This dataset allows researchers to find patterns for coffee crop protection through the ancestral knowledge not detected by real-time agricultural sensors. As far as we know, there are no datasets like the one presented in this paper with similar characteristics of qualitative value that express the empirical knowledge of coffee farmers used to detect triggers of causal behaviors of pests and diseases in coffee crops. Full article

(This article belongs to the Section Information Systems and Data Management)

23 pages, 6555 KiB

Open Access Article

Land Cover Classification in the Antioquia Region of the Tropical Andes Using NICFI Satellite Data Program Imagery and Semantic Segmentation Techniques

Luisa F. Gomez-Ossa

German Sanchez-Torres

and

John W. Branch-Bedoya

Data 2023, 8(12), 185; https://doi.org/10.3390/data8120185 - 04 Dec 2023

Abstract

Land cover classification, generated from satellite imagery through semantic segmentation, has become fundamental for monitoring land use and land cover change (LULCC). The tropical Andes territory provides opportunities due to its significance in the provision of ecosystem services. However, the lack of reliable data for this region, coupled with challenges arising from its mountainous topography and diverse ecosystems, hinders the description of its coverage. Therefore, this research proposes the Tropical Andes Land Cover Dataset (TALANDCOVER). It is constructed from three sample strategies: aleatory, minimum 50%, and 70% of representation per class, which address imbalanced geographic data. Additionally, the U-Net deep learning model is applied for enhanced and tailored classification of land covers. Using high-resolution data from the NICFI program, our analysis focuses on the Department of Antioquia in Colombia. The TALANDCOVER dataset, presented in TIF format, comprises multiband R-G-B-NIR images paired with six labels (dense forest, grasslands, heterogeneous agricultural areas, bodies of water, built-up areas, and bare-degraded lands) with an estimated 0.76 F1 score compared to ground truth data by expert knowledge and surpassing the precision of existing global cover maps for the study area. To the best of our knowledge, this work is a pioneer in its release of open-source data for segmenting coverages with pixel-wise labeled NICFI imagery at a 4.77 m resolution. The experiments carried out with the application of the sample strategies and models show F1 score values of 0.70, 0.72, and 0.74 for aleatory, balanced 50%, and balanced 70%, respectively, over the expert segmented sample (ground truth), which suggests that the personalized application of our deep learning model, together with the TALANDCOVER dataset offers different possibilities that facilitate the training of deep architectures for the classification of large-scale covers in complex areas, such as the tropical Andes. This advance has significant potential for decision making, emphasizing sustainable land use and the conservation of natural resources. Full article

12 pages, 7250 KiB

Open Access Data Descriptor

An Urban Image Stimulus Set Generated from Social Media

Ardaman Kaur

André Leite Rodrigues

Sarah Hoogstraten

Diego Andrés Blanco-Mora

Bruno Miranda

Paulo Morgado

and

Dar Meshi

Data 2023, 8(12), 184; https://doi.org/10.3390/data8120184 - 01 Dec 2023

Abstract

Social media data, such as photos and status posts, can be tagged with location information (geotagging). This geotagged information can be used for urban spatial analysis to explore neighborhood characteristics or mobility patterns. With increasing rural-to-urban migration, there is a need for comprehensive data capturing the complexity of urban settings and their influence on human experiences. Here, we share an urban image stimulus set from the city of Lisbon that researchers can use in their experiments. The stimulus set consists of 160 geotagged urban space photographs extracted from the Flickr social media platform. We divided the city into 100 × 100 m cells to calculate the cell image density (number of images in each cell) and the cell green index (Normalized Difference Vegetation Index of each cell) and assigned these values to each geotagged image. In addition, we also computed the popularity of each image (normalized views on the social network). We also categorized these images into two putative groups by photographer status (residents and tourists), with 80 images belonging to each group. With the rise in data-driven decisions in urban planning, this stimulus set helps explore human–urban environment interaction patterns, especially if complemented with survey/neuroimaging measures or machine-learning analyses. Full article

9 pages, 4934 KiB

Open Access Data Descriptor

Spectrogram Dataset of Korean Smartphone Audio Files Forged Using the “Mix Paste” Command

Yeongmin Son, Won Jun Kwak and Jae Wan Park

Data 2023, 8(12), 183; https://doi.org/10.3390/data8120183 - 01 Dec 2023

Abstract

This study focuses on the field of voice forgery detection, which is increasing in importance owing to the introduction of advanced voice editing technologies and the proliferation of smartphones. This study introduces a unique dataset that was built specifically to identify forgeries created using the “Mix Paste” technique. This editing technique can overlay audio segments from similar or different environments without creating a new timeframe, making it nearly infeasible to detect forgeries using traditional methods. The dataset consists of 4665 and 45,672 spectrogram images from 1555 original audio files and 15,224 forged audio files, respectively. The original audio was recorded using iPhone and Samsung Galaxy smartphones to ensure a realistic sampling environment. The forged files were created from these recordings and subsequently converted into spectrograms. The dataset also provides the metadata of the original voice files, offering additional context that can be used for analysis and detection. This dataset not only fills a gap in existing research but also provides valuable support for developing more efficient deep learning models for voice forgery detection. By addressing the “Mix Paste” technique, the dataset caters to a critical need in voice authentication and forensics, potentially contributing to enhancing security in society.

(This article belongs to the Section Information Systems and Data Management)

News

2 January 2024
MDPI Insights: The CEO's Letter #7 - Nobel Laureates Entrust MDPI with Their Research

30 November 2023
MDPI Insights: The CEO's Letter #6 - MDPI Spain Summit and ResearchGate

21 November 2023
769 Editorial Board Members of MDPI Journals Achieve Highly Cited Researcher Recognition in 2023

Topics

Topic in Data, Future Internet, Information, Mathematics, Symmetry

Application of Deep Learning Method in 6G Communication Technology
Topic Editors: Mohamed Abouhawwash, K. Venkatachalam
Deadline: 31 March 2024

Topic in Applied Sciences, Batteries, Buildings, Data, Electricity, Electronics, Energies, Smart Cities

Smart Energy Systems, 2nd Edition
Topic Editors: Hugo Morais, Rui Castro, Cindy Guzman
Deadline: 31 May 2024

Topic in Algorithms, Data, Information, Mathematics, Symmetry

Decision-Making and Data Mining for Sustainable Computing
Topic Editors: Sunil Jha, Malgorzata Rataj, Xiaorui Zhang
Deadline: 30 November 2024

Topic in BDCC, Data, Environments, Geosciences, Remote Sensing

Database, Mechanism and Risk Assessment of Slope Geologic Hazards
Topic Editors: Chong Xu, Yingying Tian, Xiaoyi Shao, Zikang Xiao, Yulong Cui
Deadline: 28 February 2025