Welcome to dsa-tdb’s documentation!
DSA Transparency database tools
dsa-tdb provides a set of tools to work with daily or total dumps coming from the DSA Transparency Database.
Requirements
The Transparency database is a large dataset. As of October 2024, you will require a minimum of:

- 4.1 TB of disk space to store the daily dump files as downloaded from the DSA Transparency Database website;
- 500 GB to store the daily dumps in a "chunked" form (see documentation below);
- 1 GB to store the aggregated dataset with the default aggregation configuration.

Overall, the data throughput is in the range of 5 to 10 GB per day (meaning you should have, as a bare minimum, 5 GB of free disk space per daily dump you want to process).
The dsa-tdb Python package aims to make working with such a large dataset easier by providing convenience functions to convert the raw dumps into more efficient storage formats, as well as scripts that handle the conversion over a sliding time window to reduce the disk space requirements (see documentation below).
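The sliding-window idea can be sketched in a few lines of Python. This helper is purely illustrative and not part of dsa-tdb's API: it shows how, given the dates of already-processed daily dumps, one would pick the dumps older than the window for deletion.

```python
from datetime import date, timedelta

def dumps_to_delete(dump_dates, window_days=30, today=None):
    """Illustrative sliding-window helper (NOT part of dsa-tdb's API):
    return the dump dates older than `window_days` days; those dumps
    have already been processed and can be deleted to free disk space."""
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    return sorted(d for d in dump_dates if d < cutoff)

dates = [date(2024, 9, 1), date(2024, 9, 25), date(2024, 10, 10)]
print(dumps_to_delete(dates, window_days=30, today=date(2024, 10, 15)))
# -> [datetime.date(2024, 9, 1)]
```

The 30-day window here is an arbitrary example value; the actual retention you need depends on how far back you want to reprocess.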
Installation
With Docker/podman (recommended)
Install Docker or podman (recommended) on your machine (if using podman, replace `docker` with `podman` in the following commands, or install the `podman-docker` extension to have a compatible CLI). You can also download the Desktop versions of the two (here for Docker and here for podman).

[optional] If you need to edit some of the build-time variables, you can clone/download the zip of the repository, cd into it and use the provided `docker-compose.yml` file. The build allows you to customize:

- the user name, user id and group id of the container (defaults are `user`, `uid=1000` and `gid=1000`);
- the ports exposed (see the Ports section below).
You can customize them:

- directly in the `docker-compose.yml` file, building with `podman-compose build`;
- or at build time with `--build-arg`. For instance, this sets the user name, user id and group id to those of the host user:

```shell
docker build --build-arg DOCKER_USER=$(id -un) --build-arg DOCKER_USER_ID=$(id -u) --build-arg DOCKER_GROUP_ID=$(id -g) -t localhost/dsa-tdb-nb .
```
Build the container using `docker-compose`. After customizing the values in `docker-compose.yml`, simply run `docker-compose build` (required once to build the image). If using `podman-compose`, please upgrade it to at least version 1.2.0 with `pip install --user --upgrade podman-compose`.
NOTE: If using podman, be sure to have podman >= 4.3.0. Also, always prepend `PODMAN_USERNS=keep-id:uid=1000` to all `podman run` or `podman-compose up -d` commands, where `1000` is the container's user id (the default one; change it if you edited the build). Alternatively, you can add `--userns=keep-id:uid=1000` to the podman command or set an environment variable (`PODMAN_USERNS=keep-id:uid=1000`). In the desktop application, you can set it in the Security -> "Specify user namespace to use" tab of the Create container interface. This is needed for the mounted folders to be writeable by the container's user.
A `/cache` directory can be mounted; it will be used as the `spark.local.dir`, that is, where Spark writes its cache. To do so, uncomment the matching line in the `docker-compose.yml` file. You should mount there a folder located on a fast volume with a lot of free space (~300 GB per month analyzed if you want to analyze the full database).

To start the container interactively, just use:

```shell
docker-compose up
```
NOTE: The default user in the Docker container is `user`, with user and group ids `1000`. These can be changed to match your user name and id by specifying `DOCKER_USER=your_user` and `DOCKER_USER_ID=1234` when building, as outlined above.
Ports: the Docker container will expose these ports (edit the `docker-compose.yml` file to change the mapping):

- `8765`: the Jupyter lab home
- `4040`: the Spark status page for the user's application
- `5555`: Celery's flower dashboard to check the status of your tasks
- `8000`: the FastAPI webapp (visit the docs to see the API usage)
- `8088`: the Superset instance. Default credentials: `admin`/`admin`.
- `8080`: the Spark master and thrift server web UIs, for advanced users.
To stop the containers, just use `docker-compose down`.
NOTE: You can change the mount point of the `data` folder when running `docker-compose up` with the `DOCKER_DATA_DIR` environment variable, to point to a data folder outside of the Git repository. Example: `DOCKER_DATA_DIR=/path/to/data/ podman-compose up`.
With pip / poetry
We ship a Python package providing the command line interface. You can install it with:

pip:

```shell
pip install dsa-tdb --index-url https://code.europa.eu/api/v4/projects/943/packages/pypi/simple
```

poetry:

Add the source:

```shell
poetry source add --priority=supplemental code_europa https://code.europa.eu/api/v4/projects/943/packages/pypi/simple
```

Install the package:

```shell
poetry add --source code_europa dsa-tdb
```
From source (with poetry)
- Install poetry ^1.8 on your system, either with `pip install --user "poetry>=1.8"` or by other methods.
- Download and extract the code folder and cd into it.
- Create the venv and install the dependencies using `poetry install` (with `--with dev` if you also want the jupyter notebook kernel and the developer tools).
Usage
CLI
The package installs a command line interface (CLI), adding the command `dsa-tdb-cli` to your path.
The command has three subcommands:
- `preprocess` will download the specified daily dumps (optionally filtered by platform or time window), verify their SHA1 checksums and check for new files, then chunk them into smaller csv or parquet files. Optionally, it will delete the original dumps as they are processed (to save disk space), leaving the sha1 files as a proof of work. This allows you to repeatedly run the `preprocess` step on a daily basis and always have the files in place. The resulting "chunked" files are stored as regular flat csv or parquet files which can be conveniently and efficiently loaded into the data processing pipeline of your choice (Spark, Dask, etc.), without having to go through the complex data structure of the daily dumps (zipped csv files).
- `aggregate` will use a separate configuration file (templates are provided in the repo: the Aggregation Configuration Template, a simple version that reproduces the data used in the online dashboard, and a complete version, which is the default used in the Superset dashboard) to perform aggregation, that is, to count the number of Statements of Reasons (SoRs) corresponding to a given combination of the fields in the database (such as `content_date`, `platform_name`, `category`, etc.). This command will considerably reduce the size of the database by aggregating similar rows together: each statement of reasons is a new row in the chunked data files, but when rows share the same values of the fields defined in the aggregation configuration, they are represented as a single row with an incremented count in the aggregated files.

  This command will also write an auxiliary csv file (with the same name as the `out_file_name`) containing the files and dates of the daily dumps used for the aggregation. It will also make a copy of the configuration file used, stored in the same folder as the output file with the same name and a `configuration.yaml` suffix, for later reference.

  If the aggregation mode is set to `append` in the configuration, it will load only the files that are not already listed in the (possibly existing) dates auxiliary file and will append the aggregated data to the (possibly already existing) output file. Note that the `append` mode only works if:
  - the schema of the aggregated data is the same as the one of the existing file;
  - the input files are in the same relative or global path as found in the dates auxiliary file;
  - the `parquet` output format is used.

  NOTE: if grouping by the `created_at` column, all the files produced with the `append` mode will have to be aggregated again on the desired keys, as there is no guarantee that all the SoRs from one day are in the corresponding daily dump file.
- `filter` will use a separate configuration file (a template of which is provided in the repo under the Filtering Configuration Template) to filter the raw SoRs, that is, to keep only the ones respecting all the filters set (in an "AND" fashion). This command will also write an auxiliary csv file (with the same name as the `out_file_name`) containing the files and dates of the daily dumps used for the filtering. It will also make a copy of the configuration file used, stored in the same folder as the output file with the same name and a `configuration.yaml` suffix, for later reference.

  If the filtering mode is set to `append` in the configuration, it will load only the files that are not already listed in the (possibly existing) dates auxiliary file and will append the filtered data to the (possibly already existing) output file. Note that the `append` mode only works if:
  - the schema of the filtered data is the same as the one of the existing file;
  - the input files are in the same relative or global path as found in the dates auxiliary file;
  - the `parquet` output format is used.
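As a plain-Python illustration of the aggregation and filtering semantics (the field names follow the examples in the text, but the rows and predicates are made up; the actual package operates on csv/parquet files, not in-memory dicts):

```python
from collections import Counter

# Hypothetical chunked rows: one dict per Statement of Reasons (SoR).
rows = [
    {"content_date": "2024-10-01", "platform_name": "A", "category": "scam"},
    {"content_date": "2024-10-01", "platform_name": "A", "category": "scam"},
    {"content_date": "2024-10-01", "platform_name": "B", "category": "spam"},
]

# aggregate: rows sharing the same values for the configured fields
# collapse into a single row with an incremented count.
keys = ("content_date", "platform_name", "category")
aggregated = Counter(tuple(r[k] for k in keys) for r in rows)
print(len(rows), "raw rows ->", len(aggregated), "aggregated rows")

# filter: a row is kept only if it satisfies every configured filter
# (logical AND across all filters).
filters = [
    lambda r: r["platform_name"] == "A",
    lambda r: r["category"] == "scam",
]
kept = [r for r in rows if all(f(r) for f in filters)]
print(len(kept), "rows pass all filters")
```

The point of the aggregation is visible even in this toy case: the two identical "A"/"scam" rows become one row with count 2, which is how the aggregated files end up far smaller than the chunked ones.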
You can see the help and documentation of the CLI by running `dsa-tdb-cli --help` or `dsa-tdb-cli subcommand --help`.
Scripts
The scripts folder contains some examples of how to use the library. They can also be readily used in an automated manner to ingest and process the data dumps on a daily basis (e.g. with a cron task).
There are two examples:
- `scripts/daily_routine.py` is a script that can be called with the platform name and dump version (`full` or `light`). Without any further arguments, it will:
  - preprocess (download and chunk) all the missing/newest daily dumps from the full version of the daily dumps for all available platforms;
  - aggregate them using the default configuration;
  - (optionally) delete the chunked files to save disk space.
- `scripts/download_platform.py` is a subset of the previous script: it just preprocesses (downloads and chunks) the files for a specific platform and version (`full` or `light`).
NOTE: The daily routine script can be called on a daily basis and will update the files and dumps with the newest ones (leaving the latest as a checkpoint for the next run).
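For instance, a cron entry running the daily routine overnight could look like the following (all paths here are placeholders to adapt to your setup; the script is invoked without extra arguments, which triggers the default behaviour described above):

```shell
# crontab entry: every day at 03:00, run the dsa-tdb daily routine and
# append its output to a log file (paths are placeholders).
0 3 * * * cd /path/to/dsa-tdb && /path/to/venv/bin/python scripts/daily_routine.py >> /var/log/dsa-tdb-daily.log 2>&1
```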
Dashboards with Apache Superset
Starting from version 0.5.1, the dsa-tdb package comes with a pre-built dashboard based on Apache Superset. The dashboard lets you visualize the aggregated data for the global, full version of the dumps. These dashboards and the corresponding dataset definitions are located in the superset_exports folder.
The default dashboard expects the aggregated data to be under the `/data/tdb_data/global___full/aggregations/aggregated-global-full.parquet` directory inside the container.
To view the dashboard:
1. Launch the docker container with `docker-compose up`, using the `docker-compose.yml` file provided in the repo, as shown in the Docker section above.
2. Create an aggregated view of the global full dataset, using either the CLI or the `daily_routine` script in the scripts folder. Using the API:
   - do a prepare with root data folder `/data/tdb_data`, `global` platform and `full` version;
   - do an aggregate with the same root folder, `global` platform and `full` version, and the output file set to `/data/tdb_data/global___full/aggregations/aggregated-global-full`. Please note that this might take a lot of time, so test the procedure with a short time period first.
3. Visit the Superset UI at `http://localhost:8088` (default username and password are both `admin`).
Notebooks
An example usage notebook is available in `notebooks/Example.ipynb`.
Contributing
If you’d like to report issues, suggest code modifications or contribute in any other form, please head to the CONTRIBUTING.md documentation file.
License
dsa-tdb is licensed under the European Union Public Licence (EUPL) version 1.2.
See the LICENSE for details.
The data contained in the daily dumps are licensed under the CC BY 4.0 license. See the data release for details.
If you use the data from the DSA Transparency Database for your research work, please cite it using the following information:
European Commission-DG CONNECT, Digital Services Act Transparency Database, Directorate-General for Communications Networks, Content and Technology, 2023.
Documentation
Documentation about the fields and values can be found in the official API documentation.
Interactive online documentation for the package is available on the dsa-tdb page.