TamperedNews & News400 (IJMIR'21 Update)

Dataset

People

  • Eric Müller-Budack (Creator)
  • Jonas Theiner (Creator)
  • Sebastian Diering (Creator)
  • Maximilian Idahl (Creator)
  • Sherzod Hakimov (Creator)
  • Ralph Ewerth (Creator)

External organisations

  • Technische Informationsbibliothek (TIB) Leibniz-Informationszentrum Technik und Naturwissenschaften und Universitätsbibliothek

Details

Date made available: 2022
Publisher: Forschungsdaten-Repositorium der LUH

Description

Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency

Content
For both datasets, TamperedNews and News400, we provide:

  • *dataset*.tar.gz containing the *dataset*.jsonl with:
      • Web links to the news texts
      • Web links to the news images
      • Outputs of the named entity recognition and disambiguation (NERD) approach
      • Untampered and tampered entities
  • *dataset*_features.tar.gz with visual features for events, locations, and persons
  • news400_wordembeddings.tar.gz with word embeddings of all nouns in the news texts of the News400 dataset

Please note that the word embeddings of the TamperedNews dataset (tamperednews_wordembeddings.tar.gz) were already provided in the first version (Link).
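
Since the dataset files are JSON Lines files packed into gzip-compressed tar archives, they can be read without unpacking them to disk. The following is a minimal sketch, not part of the official source code; the function name iter_jsonl_from_targz is hypothetical, and it assumes only that each archive holds one or more UTF-8 encoded .jsonl members with one JSON object per line. The record fields are not specified here, so the sketch does not access any of them.

```python
import io
import json
import tarfile
from typing import Iterator


def iter_jsonl_from_targz(archive_path: str, suffix: str = ".jsonl") -> Iterator[dict]:
    """Yield one parsed record per non-empty line of every JSONL member.

    Hypothetical helper: it assumes UTF-8 text and one JSON object per line.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories and any file that is not a JSONL file.
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            raw = archive.extractfile(member)
            if raw is None:
                continue
            # Decode the member lazily instead of loading it into memory.
            with io.TextIOWrapper(raw, encoding="utf-8") as text:
                for line in text:
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

A call such as iter_jsonl_from_targz("news400.tar.gz") would then yield the records one at a time; the archive name is inferred from the *dataset*.tar.gz pattern above and may differ from the actual file name.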

For all entities detected in both datasets, we provide:

  • entities.tar.gz containing an *entity_type*.jsonl for each entity type (events, locations, and persons) with:
      • Wikidata ID
      • Wikidata label
      • Meta information used for tampering
      • Web links to all reference images crawled from Google, Bing, and Wikidata
  • entities_features.tar.gz containing the visual features of the reference images for all entities
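
Because every entity record carries a Wikidata ID, a lookup table keyed by that ID is a convenient way to join entity information with the news records. The sketch below illustrates the idea on two made-up sample lines. The field names "wd_id" and "wd_label" are assumptions for illustration only and should be replaced with the keys actually used in the *entity_type*.jsonl files.

```python
import json
from typing import Dict, Iterable


def index_by_key(jsonl_lines: Iterable[str], key: str) -> Dict[str, dict]:
    """Map record[key] -> record for every JSONL line that carries the key."""
    index: Dict[str, dict] = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Records without the key are skipped rather than raising an error.
        if key in record:
            index[str(record[key])] = record
    return index


# Made-up sample records; "wd_id" and "wd_label" are hypothetical field names.
sample = [
    '{"wd_id": "Q64", "wd_label": "Berlin"}',
    '{"wd_id": "Q1055", "wd_label": "Hamburg"}',
]
entities = index_by_key(sample, key="wd_id")
print(entities["Q64"]["wd_label"])  # prints "Berlin"
```

Passing the key name as a parameter keeps the helper independent of the real schema, which is not documented in this description.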

Source Code

The source code to reproduce our results as well as download scripts to crawl news texts and images can be found on our GitHub page: https://github.com/TIBHannover/cross-modal_entity_consistency