The study reviews state-of-the-art datasets and solutions for automatic fact-checking and tests their applicability in production environments. The authors discovered overfitting issues in those models and proposed a data filtering method that improves the models’ performance and generalization. They then designed an unsupervised fine-tuning of masked language models to improve their accuracy when working with Wikipedia.
Category: Publications
This study presents the challenges faced and the solutions adopted while evolving the web-based graphical user interface (GUI) of a tabular data preparation tool from data sets that fit in memory to Big Data sets. Traditional standalone processing and rendering solutions are no longer usable in a Big Data context.
Wikidata is an outstanding data source with potential application in many scenarios. Wikidata provides its data openly in RDF. This study aims to evaluate the usability of Wikidata as a data source for robots operating on the web of data, according to specifications and practices of linked data, the Semantic Web and ontology reasoning.
OSM-interaction tilesets are vector tiles containing GeoJSON features that represent interactions between mappers (contributors) and objects in OpenStreetMap (OSM). Interactions are abstractions of edits to OSM elements called nodes, ways, and relations. Example interactions are contributors “adjusting the corners of a building” or “re-aligning a road” while the edit to the database is recorded as a “modification to the coordinates of a node.”
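To make the idea of an interaction concrete, the following is a minimal sketch of how a low-level edit might be turned into a GeoJSON feature carrying a higher-level label. The input dictionary layout, the property names, and the labelling rules are assumptions made for illustration; they are not the schema used by the OSM-interaction tilesets.

```python
import json

def interaction_feature(edit):
    """Map a raw node edit to a GeoJSON Feature with an interaction label.

    `edit` is a hypothetical dict: the contributor, the kind of change,
    the node's new coordinates, and the tags of the way the node belongs to.
    """
    parent_tags = edit.get("parent_way_tags", {})

    # Hypothetical rules: interpret a coordinate change by what the node is part of.
    if edit["change"] == "node_coordinates_modified":
        if "building" in parent_tags:
            label = "building corner adjusted"
        elif "highway" in parent_tags:
            label = "road re-aligned"
        else:
            label = "node moved"
    else:
        label = "other edit"

    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [edit["lon"], edit["lat"]]},
        "properties": {"contributor": edit["user"], "interaction": label},
    }

# Example edit record (made-up values).
example_edit = {
    "user": "mapper_a",
    "change": "node_coordinates_modified",
    "lon": 8.68,
    "lat": 49.41,
    "parent_way_tags": {"building": "yes"},
}

if __name__ == "__main__":
    print(json.dumps(interaction_feature(example_edit), indent=2))
```

The database records only the coordinate change; the label is derived from context, which is the abstraction step the summary describes.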
In this paper, the authors present a new concept of geospatial quality assurance that is planned for implementation at the German Federal Agency for Cartography and Geodesy. Linked open data is being enriched with Semantic Web data in order to create thematic maps relevant to the population.
Data-driven security has become essential for many organisations attempting to tackle cyber security incidents. Although the dominant approach still relies on mining private, internal data, there is an increasing trend towards more open data through the sharing of cyber security information and experience over public and community platforms. However, questions remain over the quality and quantity of such open data.
The authors describe an approach to improving the quality and interoperability of open data related to small molecules, such as metabolites, drugs, natural products, food additives, and environmental contaminants. The approach involves a computer implementation of an extended version of the IUPAC International Chemical Identifier (InChI) system that uses the three-dimensional structure of a compound to generate reproducible compound identifiers (standard InChI strings) and universally reproducible designators for all constituent atoms of each compound.
The authors report on the development of a new software tool (AutoQC4Env) for automated quality control (QC) of environmental time series data. Novel features of this tool include a flexible Python software architecture, which makes it easy for users to configure the sequence of tests as well as their statistical parameters, and a statistical concept to assign each value a probability of being a valid data point.
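The general idea of a configurable test sequence that assigns each value a validity probability can be sketched as follows. This is not the AutoQC4Env API: the function names, the two example tests, and the choice to combine test results by multiplication are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a test returns one validity probability per value.
QCTest = Callable[[Sequence[float]], List[float]]

def range_test(low: float, high: float) -> QCTest:
    """Values outside [low, high] are treated as certainly invalid."""
    def run(values: Sequence[float]) -> List[float]:
        return [1.0 if low <= v <= high else 0.0 for v in values]
    return run

def step_test(max_step: float, p_suspect: float = 0.5) -> QCTest:
    """A jump larger than max_step from the previous value lowers the probability."""
    def run(values: Sequence[float]) -> List[float]:
        probs = [1.0] * len(values)
        for i in range(1, len(values)):
            if abs(values[i] - values[i - 1]) > max_step:
                probs[i] = p_suspect
        return probs
    return run

def run_qc(values: Sequence[float], tests: Sequence[QCTest]) -> List[float]:
    """Apply the tests in order; combine by multiplying probabilities (an assumption)."""
    probs = [1.0] * len(values)
    for test in tests:
        probs = [p * q for p, q in zip(probs, test(values))]
    return probs

# The test sequence and its parameters are plain data the user can rearrange.
temperatures = [10.0, 11.0, 50.0, 12.0, -999.0]
tests = [range_test(-50.0, 60.0), step_test(5.0)]

if __name__ == "__main__":
    print(run_qc(temperatures, tests))
```

Because each test is an ordinary callable built from its parameters, reordering the list or changing a threshold reconfigures the pipeline without touching the combining logic.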
Open Datasets provide one of the most popular ways to acquire insight and information about individuals, organizations and multiple streams of knowledge. Exploring Open Datasets with comprehensive and rigorous data processing techniques can provide the ground for innovation and value for everyone, provided the data are handled in a legal and controlled way. In this study, the authors propose an argumentation and abductive reasoning approach for data processing that is grounded in data quality.
Open Government Data initiatives are valuable in promoting transparency, accountability, and openness. The expectation is to increase participation by engaging citizens, non-profit organisations, and companies in reusing Open Data (OD). A potential barrier to the exploitation of OD and the engagement of the target audience is the low quality of available datasets.
