dsa_tdb.advanced_utils module

dsa_tdb.advanced_utils.create_advanced_data(root_folder: str, platform: str, version: TDB_dailyDumpsVersion, from_date: str | None = None, to_date: str | None = None, n_workers: int = 0, memory_limit: str | None = None, spark_local_dir: str | None = None, local_build: bool = False, local_folder: str | None = None) None

Preprocess the daily dumps and build the parquet files and aggregations that are made available online. It also copies the configuration file used into the same folder as the output file, together with the configuration.yaml file, for later reference.

Parameters:
  • root_folder (str) – The root folder where to save the raw and aggregated data. WARNING: this folder and all its contents will be DELETED! Use the local_folder argument below if you want to pass local data.

  • platform (str) – Platform to filter the files from. [global|facebook|…|tiktok|X] The name will be sanitized to be used as a folder name using the dsa_tdb.utils.sanitize_platform_name() function.

  • version (str) – Version of the files to aggregate. [full|light]

  • from_date (str, optional) – Date from which (included) to look at dump files. Format YYYY-MM-DD. If not specified, will start from the first date found in the files.

  • to_date (str, optional) – Date until (included) which to look at dump files. Format YYYY-MM-DD. If not specified, will end at the last date found in the files.

  • n_workers (int, optional) – Number of workers to use for the chunking. If <=0, will use all the available CPUs.

  • memory_limit (str, optional) – The memory limit of the driver of Spark. See https://spark.apache.org/docs/latest/configuration.html#memory-management for more details. Do not change it unless you know what you are doing.

  • spark_local_dir (str, optional) – The local directory where to save the temporary files. If None/empty uses the Spark’s default. Do not change it unless you know what you are doing.

  • local_build (bool, optional) – If True, it will ignore online aggregated data and rebuild the whole aggregation from scratch (from the online or local daily dumps). Note that the aggregates configuration files will still be downloaded and used as reference.

  • local_folder (str, optional) – If given, data from this folder will be used as input (the preprocessing step is run to retrieve the files needed). Provide the root folder containing the platform___version data folder. In this case the original csv.zip daily dumps will not be deleted.

Return type:

None
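A minimal usage sketch follows, assuming dsa_tdb is installed. The import path of TDB_dailyDumpsVersion (here dsa_tdb.types) and the folder path are assumptions; check your installed package. The date-window helper is a hypothetical convenience, not part of the library.

```python
from datetime import date, timedelta


def last_full_week(today: date) -> tuple[str, str]:
    """Return (from_date, to_date) YYYY-MM-DD strings covering the last 7 full days."""
    end = today - timedelta(days=1)
    start = end - timedelta(days=6)
    return start.isoformat(), end.isoformat()


def rebuild_last_week(root_folder: str) -> None:
    # Deferred imports so this sketch stays importable without dsa_tdb installed.
    from dsa_tdb.advanced_utils import create_advanced_data
    from dsa_tdb.types import TDB_dailyDumpsVersion  # import path is an assumption

    from_d, to_d = last_full_week(date.today())
    create_advanced_data(
        root_folder=root_folder,  # WARNING: this folder's contents are deleted
        platform="tiktok",
        version=TDB_dailyDumpsVersion.full,
        from_date=from_d,
        to_date=to_d,
        n_workers=0,       # <= 0 means: use all available CPUs
        local_build=True,  # rebuild the aggregation from scratch
    )
```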

dsa_tdb.advanced_utils.download_and_unpack_pqt_zip(tmp_url: str, root_folder: str, delete_original: bool = True)
dsa_tdb.advanced_utils.download_daily_dumps(dump_files_root_folder: str, version: TDB_dailyDumpsVersion = TDB_dailyDumpsVersion.full, platform: str = 'global', from_date: date | None = None, to_date: date | None = None, n_processes: int = 1, delete_original: bool = False, loglevel: int | str = 20) DataFrame

New in 0.7.1. Directly downloads the daily generated parquets, skipping the CSV download and chunking steps.

Parameters:
  • dump_files_root_folder (str) – The root folder where to create a platform___version subfolder where to save the downloaded files.

  • version (str, optional) – The version of the files to download, only ‘full’ supported so far.

  • platform (str, optional) – The platform to download the files from, only ‘global’ supported so far.

  • from_date (datetime.date, optional) – The date from which (inclusive) to start downloading the files, by default None.

  • to_date (datetime.date, optional) – The date until which (inclusive) to download the files, by default None. Can be the same as from_date to download only one file.

  • n_processes (int, optional) – The number of processes to use to download the files, by default 1. If > 1, it will use multiprocessing to download the files in parallel.

  • delete_original (bool, optional) – Whether to delete the original file after chunking (SHA1 files are kept for archival purposes), by default False.

  • loglevel (int or str, optional) – The log level to use, by default logging.INFO.

Returns:

The DataFrame with the local files and their status. See the dsa_tdb.fetch.check_local_storage function for more details.

Return type:

pd.DataFrame
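A usage sketch, assuming dsa_tdb is installed and using the keyword names from the signature above. Per that signature, from_date/to_date are datetime.date values; the string-parsing helper is a hypothetical convenience added here.

```python
import logging
from datetime import date


def single_day_window(day: str) -> tuple[date, date]:
    """Parse a YYYY-MM-DD string into an identical (from_date, to_date) pair."""
    d = date.fromisoformat(day)
    return d, d


def fetch_one_day(root_folder: str, day: str):
    # Deferred import so this sketch stays importable without dsa_tdb installed.
    from dsa_tdb.advanced_utils import download_daily_dumps

    from_d, to_d = single_day_window(day)
    # Returns a pd.DataFrame with the local files and their status.
    return download_daily_dumps(
        dump_files_root_folder=root_folder,
        platform="global",   # only 'global' supported so far
        from_date=from_d,
        to_date=to_d,        # same day: download a single file
        n_processes=4,       # > 1 downloads in parallel via multiprocessing
        loglevel=logging.INFO,
    )
```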

dsa_tdb.advanced_utils.download_monthly_aggs(output_folder: str, format: TDB_agg_data_format = TDB_agg_data_format.parquet, agg_version: TDB_agg_data_versions = TDB_agg_data_versions.complete, version: TDB_dailyDumpsVersion = TDB_dailyDumpsVersion.full, platform: str = 'global', from_date: date | None = None, to_date: date | None = None, delete_original: bool = False, force_output_rm: bool = False, loglevel: int | str = 20) DataFrame

New in 0.7.1. Directly downloads the pre-generated monthly aggregates, skipping the manual aggregation step. Note that the program will re-download the data and overwrite any colliding files in the output_folder.

Parameters:
  • output_folder (str) – The root folder where to create a platform___version subfolder where to save the downloaded files.

  • format (str, optional) – Format of the aggregates to download: parquet (pqt, the default) or CSV (csv).

  • agg_version (str, optional) – Version of the aggregates to download (complete or simple).

  • version (str, optional) – The version of the files to download, only ‘full’ supported so far.

  • platform (str, optional) – The platform to download the files from, only ‘global’ supported so far.

  • from_date (datetime.date, optional) – The date from which (inclusive) to start downloading the files, by default None.

  • to_date (datetime.date, optional) – The date until which (inclusive) to download the files, by default None. Can be the same as from_date to download only one file.

  • delete_original (bool, optional) – Whether to delete the original file after chunking (SHA1 files are kept for archival purposes), by default False.

  • force_output_rm (bool, optional) – Whether to delete the output folder if it exists. By default False and it will raise a RuntimeError in case the folder is there.

  • loglevel (int or str, optional) – The log level to use, by default logging.INFO.

Returns:

The DataFrame with the local files and their status. See the dsa_tdb.fetch.check_local_storage function for more details.

Return type:

pd.DataFrame

Raises:

RuntimeError – If the output folder already exists and force_output_rm is not set.
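A sketch of the RuntimeError behaviour described above: retry with force_output_rm=True when the output folder already exists. It assumes dsa_tdb is installed; the subfolder-naming helper only reflects the platform___version convention stated in the docs and is a hypothetical addition.

```python
def agg_subfolder(platform: str, version: str) -> str:
    """Name of the platform___version subfolder created under output_folder."""
    return f"{platform}___{version}"


def fetch_monthly_aggs(output_folder: str):
    # Deferred import so this sketch stays importable without dsa_tdb installed.
    from dsa_tdb.advanced_utils import download_monthly_aggs

    try:
        # Default: raises RuntimeError if the output folder already exists.
        return download_monthly_aggs(output_folder=output_folder)
    except RuntimeError:
        # Folder exists: explicitly opt in to removing it and re-download.
        return download_monthly_aggs(
            output_folder=output_folder,
            force_output_rm=True,
        )
```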