dsa_tdb.fetch module

dsa_tdb.fetch.check_local_storage(root_folder: str, force_sha1_check: bool = False, chunked_file_subfoder: str = 'daily_dumps_chunked') DataFrame

Function to check the local storage for the daily dumps. It checks the consistency between the daily dump files, the sha1 files and the chunked files, if present. It also verifies the sha1 of the files if force_sha1_check is True.

Parameters:
  • root_folder (str) – The root folder where to look for the daily dumps.

  • force_sha1_check (bool, optional) – Whether to force the sha1 check of the files, by default False.

  • chunked_file_subfoder (str, optional) – The name of the subfolder where to look for the chunked files, by default dsa_tdb.types.CHUNKED_FILES_SUBFOLDER_NAME. Do not use this parameter unless you know what you are doing.

Returns:

The DataFrame with the local files and their status. The dataframe will have the following columns:

  • platform: the platform name.

  • version: the version of the daily dump.

  • date: the date of the daily dump.

  • dump_file_name: the original daily dump file name.

  • dump_file_path: the original daily dump file path, if present, otherwise None.

  • dump_sha1_path: the sha1 file name if present, otherwise None.

  • dump_sha1: the sha1 value of the daily dump file if present, otherwise None.

  • dump_sha1_check: whether the sha1 check passed for the daily dump file. This will be True if:
    • force_sha1_check is False and no sha1 file or raw dump is found.

    • the sha1 file is consistent with the one found on the web page (if still listed), regardless of whether the daily dump file itself is present.

    • force_sha1_check is True and the sha1 check passed.

  • chunked_folder: the chunked folder path if present, otherwise None.

  • chuncked_success: whether the chunking was successful.

  • chunked_folder_sha1: the sha1 value of the chunked folder if present, otherwise None.

  • chunked_sha1_check: whether the sha1 corresponds to the daily dump’s one.

Return type:

pd.DataFrame
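
As a usage sketch, the returned status frame can be filtered for dumps whose sha1 check failed. The rows below are made up for illustration and use only a subset of the documented columns; this is not output from the actual function:

```python
import pandas as pd

# Illustrative status frame with a subset of the documented columns.
status = pd.DataFrame({
    "platform": ["global", "global"],
    "date": ["2024-01-01", "2024-01-02"],
    "dump_sha1_check": [True, False],
})

# Dates whose daily dump failed the sha1 check and may need re-downloading.
failed = status.loc[~status["dump_sha1_check"], "date"].tolist()
```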

dsa_tdb.fetch.chunkFile(csv_zip_in: str | Path, folder_out: str | Path, header_in: bool = True, header_out: bool = True, chunk_size: int = 100000, platforms_to_exclude: List[str] | None = None, overwrite: bool = True, extension: str = '.zip', out_format: str = 'csv', delete_original: bool = False) str

Chunks a csv file into multiple files.

Parameters:
  • csv_zip_in (str) – The input csv zipped file.

  • folder_out (str) – The output folder. This will be populated with a folder named as the csv file without the .zip extension containing the chunks in the format part-0000.csv.gz.

  • header_in (bool, optional) – Whether the input file has a header, by default True

  • header_out (bool, optional) – Whether the output files have a header, by default True

  • chunk_size (int, optional) – The size of the chunks, by default 100000

  • platforms_to_exclude (Union[List[str], None], optional) – The list of the platforms’ names to exclude, by default None (keep all).

  • overwrite (bool, optional) – Whether to overwrite the output folder, by default True

  • extension (str, optional) – The extension of the input file, by default ‘.zip’

  • out_format (str, optional) – The format of the output files, by default ‘csv’. Must be one of [‘csv’, ‘parquet’]. If ‘parquet’ the output files will be named folder_out/basename_csv/part-0000.parquet.

  • delete_original (bool, optional) – Whether to delete the original file, by default False

Returns:

The output folder where the chunks of the file are saved.

Return type:

str
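
The chunking behaviour described above (fixed-size chunks written as part-0000-style gzipped files under a folder named after the input file) can be sketched with the standard library alone. The helper below is a hypothetical illustration of that layout, not the dsa_tdb implementation:

```python
import csv
import gzip
from pathlib import Path


def _write_part(out_dir, part, header, rows):
    """Write one gzipped chunk named part-NNNN.csv.gz."""
    path = out_dir / f"part-{part:04d}.csv.gz"
    with gzip.open(path, "wt", newline="") as g:
        w = csv.writer(g)
        if header:
            w.writerow(header)
        w.writerows(rows)


def chunk_csv(csv_in: str, folder_out: str, chunk_size: int = 100000,
              header_out: bool = True) -> str:
    """Split a CSV into gzipped chunks of at most chunk_size rows each."""
    out_dir = Path(folder_out) / Path(csv_in).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(csv_in, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        part, rows = 0, []
        for row in reader:
            rows.append(row)
            if len(rows) == chunk_size:
                _write_part(out_dir, part, header if header_out else None, rows)
                part, rows = part + 1, []
        if rows:  # flush the last, possibly smaller, chunk
            _write_part(out_dir, part, header if header_out else None, rows)
    return str(out_dir)
```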

dsa_tdb.fetch.download_file(url: str, local_filename: str | Path, check_sha1: bool = True, BUF_SIZE: int = 65536, raise_on_error: bool = True, n_tries: int = 2) str | None

Function to download a file from a URL. It will check the sha1 if check_sha1 is True.

Parameters:
  • url (str) – The URL of the file to download.

  • local_filename (str) – The output file name.

  • check_sha1 (bool, optional) – Whether to check the sha1 of the file, by default True

  • BUF_SIZE (int, optional) – The buffer size in bytes to use when computing the sha1, by default 65536

  • raise_on_error (bool, optional) – Whether to raise an exception if the download fails, by default True

  • n_tries (int, optional) – The number of times to try downloading the file before giving up/raising an exception, by default 2

Returns:

The path of the downloaded file, or None if the download failed. In that case a message is sent to the logger process and the partially downloaded file is erased before returning None.

Return type:

str, None
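
The buffered sha1 computation this function relies on can be sketched as follows; the helper name is hypothetical, but the pattern (reading BUF_SIZE bytes at a time so large dumps never sit fully in memory) is the standard one:

```python
import hashlib


def sha1_of_file(path: str, buf_size: int = 65536) -> str:
    """Compute the sha1 of a file, reading buf_size bytes at a time."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(buf_size):
            h.update(chunk)
    return h.hexdigest()
```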

dsa_tdb.fetch.fetch_aggregate_month(out_folder: str | Path, platform: str, aggregation: str, month: str, version: str = TDB_dailyDumpsVersion.full, dump_format: TDB_agg_data_format = TDB_agg_data_format.parquet, unzip: bool = True, with_aux_files: bool = True) str | None

Fetches the aggregated data zips and the dates and config files for a specific month (in YYYY-MM-01 format).

Parameters:
  • out_folder (str, Path) – The folder where to put the extracted files.

  • platform (str) – The name of the platform to fetch.

  • aggregation (str) – The aggregation level, either complete or simple.

  • version (str, optional) – The version of the daily dumps to fetch, by default TDB_dailyDumpsVersion.full.

  • month (str) – The YYYY-MM-01 string format of the month to parse (if a datetime is passed, it will be formatted with strftime).

  • dump_format (str) – The file format of the aggregations to download, either parquet or csv.

  • unzip (bool) – Unzips the data if True [default].

  • with_aux_files (bool) – Also downloads the dates and config files if True [default].

Returns:

out_file – The path to the monthly zipped data downloaded.

Return type:

str
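
Since the month parameter expects the YYYY-MM-01 string form, a datetime argument can be normalized with strftime as the description suggests. The helper below is an illustrative sketch, not part of the library:

```python
from datetime import datetime


def month_key(month) -> str:
    """Normalize a month argument to the YYYY-MM-01 string the endpoint expects."""
    if isinstance(month, datetime):
        return month.strftime("%Y-%m-01")
    return month
```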

dsa_tdb.fetch.fetch_aggregates_aux_files(out_folder: str, aggregation: str, platform: str, version: str, dump_format: str, get_dates: bool = True, get_conf: bool = True) str | None

Fetches the aggregates’ auxiliary files (dates and config).

dsa_tdb.fetch.fetch_available_platforms(page_url: str = 'https://transparency.dsa.ec.europa.eu/explore-data/download', select_name: str = 'platform_id', select_id: str = 'platform_id', select_class: str = 'ecl-select', return_name: bool = False) dict

Function to fetch the available platforms from the Transparency DB website.

Parameters:
  • page_url (str, optional) – The URL of the daily table. The function will append the query parameters to the URL ‘?page=[0-9]’. By default dsa_tdb.types.DAILY_FILES_TABLE_URL.

  • select_name (str, optional) – The name of the select element to find, by default ‘platform_id’.

  • select_id (str, optional) – The id of the select element to find, by default ‘platform_id’.

  • select_class (str, optional) – The class of the select element to find, by default ‘ecl-select’.

  • return_name (bool, optional) – Whether to return the name of the select element, by default False. If False, the value of the selected platform (the platform uid) is returned. If True, the full name of the selected element is returned.

Returns:

The dictionary of available platforms. The keys are the platform’s sanitized names and the values are the platform’s uids. One special key is ‘global’ which contains the global dump.

Return type:

dict
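
The parsing step behind this function, locating a <select> element by id and mapping each option's text to its value, can be sketched with the standard library's html.parser. The class below and the sample HTML in the test are illustrative assumptions, not the dsa_tdb implementation:

```python
from html.parser import HTMLParser


class SelectOptionsParser(HTMLParser):
    """Collect {option text: option value} from a <select> with a given id."""

    def __init__(self, select_id: str):
        super().__init__()
        self.select_id = select_id
        self.in_select = False
        self.current_value = None
        self.options = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.select_id:
            self.in_select = True
        elif tag == "option" and self.in_select:
            self.current_value = attrs.get("value", "")

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False
        elif tag == "option":
            self.current_value = None

    def handle_data(self, data):
        # Record the option text only while inside a tracked <option>.
        if self.current_value is not None and data.strip():
            self.options[data.strip()] = self.current_value
```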

dsa_tdb.fetch.fetch_daily_table(platform: str = 'global', max_pages: int = 10000) DataFrame

Function to fetch the daily table from the Transparency DB website. It uses pagination and will stop when it finds a page with no data.

Parameters:
  • platform (str, optional) – The platform to fetch the daily table from, by default ‘global’.

  • max_pages (int, optional) – The maximum number of pages to fetch, by default 10000.

Returns:

The daily table with the Date field converted to a datetime.

Return type:

pd.DataFrame
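
The pagination strategy described above (request page after page, stop at the first empty one, cap at max_pages) can be sketched as follows; fetch_page here is a hypothetical stand-in for the actual HTTP request:

```python
from typing import Callable, List


def paginate(fetch_page: Callable[[int], List[dict]],
             max_pages: int = 10000) -> List[dict]:
    """Accumulate rows page by page, stopping at the first page with no data."""
    rows: List[dict] = []
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:  # an empty page marks the end of the table
            break
        rows.extend(batch)
    return rows
```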

dsa_tdb.fetch.fetch_parquet_data_table(platform: str = 'global', version: str = TDB_dailyDumpsVersion.full, max_pages: int = 10000)

Fetches the table of parquet files already available on the bucket.

This function fetches the table for a given platform and version, and then fetches the web CSV and parquet SHA1 checksums for each row in the table.

Parameters:
  • platform (str, optional) – The name of the platform to fetch the table for, by default ‘global’ (dsa_tdb.types.ALL_PLATFORMS_PLATFORM_NAME).

  • version (str, optional) – The version of the daily dumps to fetch the table for. Defaults to TDB_dailyDumpsVersion.full.

  • max_pages (int, optional) – The maximum number of pages to fetch, by default 10000.

Returns:

The fetched table with the CSV and parquet SHA1 checksums.

Return type:

pandas.DataFrame

Notes

This function assumes that the parquet files are already available on the bucket.

Examples

>>> fetch_parquet_data_table('global', 'full')
>>> # returns a pandas DataFrame with the fetched table and SHA1 checksums

dsa_tdb.fetch.get_platform_full_name(platform: str) str

dsa_tdb.fetch.get_platform_uid(platform: str) str

dsa_tdb.fetch.get_platforms_fullnames(platforms: List[str] | None) List[str] | None
dsa_tdb.fetch.prepare_daily_dumps(dump_files_root_folder: str, version: TDB_dailyDumpsVersion = TDB_dailyDumpsVersion.full, platform: str = 'global', platforms_to_exclude: List[str] | None = None, check_sha1: bool = True, force_sha1: bool = False, from_date: date | None = None, to_date: date | None = None, n_processes: int = 1, do_chunking: bool = False, chunk_size: int = 1000000, chunk_format: TDB_chunkFormat = TDB_chunkFormat.parquet, delete_original: bool = False, loglevel: int | str = 20, override_chunked_subfolder: str = 'daily_dumps_chunked', raise_on_error: bool = True) DataFrame

Function to download the daily dump files for a platform and version into a local folder, optionally checking their sha1 and chunking them.

Parameters:
  • dump_files_root_folder (str) – The root folder where to create a platform___version subfolder where to save the downloaded files.

  • version (str, optional) – The version of the files to download, by default ‘full’. Can be ‘full’ or ‘light’.

  • platform (str, optional) – The platform to download the files from, by default ‘global’. The other platforms’ names can be retrieved visiting the [transparency db page](https://transparency.dsa.ec.europa.eu/data-download) and getting their name. The name will be sanitized to be used as a folder name using the dsa_tdb.fetch.sanitize_platform_name() function.

  • check_sha1 (bool, optional) – Whether to check the sha1 of the downloaded file, by default True.

  • force_sha1 (bool, optional) – Whether to force the sha1 check of the already existing files, by default False.

  • from_date (str, optional) – The date from which to start downloading the files, by default None. It must be in the form YYYY-MM-DD.

  • to_date (str, optional) – The date until which to download the files, by default None. It must be in the form YYYY-MM-DD. Can also be the same as from_date to download only one file.

  • n_processes (int, optional) – The number of processes to use to download the files, by default 1. If > 1, it will use multiprocessing to download the files in parallel.

  • do_chunking (bool, optional) – Whether to contextually chunk the downloaded files, by default False. This can be useful to speed up the processing of the files and to limit the disk space usage.

  • platforms_to_exclude (List[str], optional) – The platforms to exclude from the download, by default None. Here the names of the platforms should be the original ones, as found on the website and without sanitization/manipulations.

  • chunk_size (int, optional) – The size of the chunks in number of SoR, by default 1000000.

  • chunk_format (str, optional) – The format of the chunks, by default ‘parquet’. Can be one of dsa_tdb.types.TDB_chunkFormat.

  • delete_original (bool, optional) – Whether to delete the original file after chunking (the SHA1 files are kept for archival purposes), by default False.

  • loglevel (int, optional) – The log level to use, by default logging.INFO.

  • override_chunked_subfolder (str, optional) – The name of the subfolder where to save the chunked files, by default dsa_tdb.types.CHUNKED_FILES_SUBFOLDER_NAME (‘daily_dumps_chunked’). Do not use this parameter unless you know what you are doing.

  • raise_on_error (bool, optional) – If True (default), failing sha1 checks on dumps and chunks will raise an exception. If False, the function will try to re-download / re-chunk the failing files.

Returns:

The DataFrame with the local files and their status. See the dsa_tdb.fetch.check_local_storage function for more details.

Return type:

pd.DataFrame
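
The from_date/to_date window is inclusive on both ends (passing the same date for both selects a single day, as noted above). That selection logic can be sketched with a small, illustrative helper; the function name is an assumption, not part of the library:

```python
from datetime import date
from typing import List, Optional


def filter_dates(dates: List[date], from_date: Optional[date] = None,
                 to_date: Optional[date] = None) -> List[date]:
    """Keep only dates inside the inclusive [from_date, to_date] window.

    A None bound leaves that side of the window open.
    """
    return [d for d in dates
            if (from_date is None or d >= from_date)
            and (to_date is None or d <= to_date)]
```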

dsa_tdb.fetch.read_file_from_url(url: str) str | None

Function to read a file from a URL.

Parameters:

url (str) – The URL of the file to read.

Returns:

The content of the file, if the download was successful. None if the download failed.

Return type:

str, None