DSA TDB CLI
dsa-tdb-cli
usage: dsa-tdb-cli [-h]
{preprocess,aggregate,filter,advanced_data_pipeline} ...
- -h, --help
show this help message and exit
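For example, the detailed help of a single subcommand (here the aggregate one) can be printed with:
dsa-tdb-cli aggregate --help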
dsa-tdb-cli advanced_data_pipeline
Internal command to preprocess the csv dumps and build the parquet files and aggregations to be made available online. WARNING: this command will delete the root folder passed in -d/--input and all its content.
usage: dsa-tdb-cli advanced_data_pipeline [-h] -d INPUT [-p PLATFORM]
[-v VERSION] [-n N_WORKERS]
[-i FROM_DATE] [-f TO_DATE]
[-m MEMORY_LIMIT]
[--spark_local_dir SPARK_LOCAL_DIR]
[--local-build]
[--loglevel LOGLEVEL]
- -h, --help
show this help message and exit
- -d <input>, --input <input>
Root folder in which to save the raw and aggregated data. WARNING: this folder and all its content will be DELETED.
- -p <platform>, --platform <platform>
Platform to filter the files from. [global|facebook|…|tiktok|X]
- -v <version>, --version <version>
Version of the files to aggregate. [full|light]
- -n <n_workers>, --n_workers <n_workers>
Number of workers to use for the filtering. If <=0, will use all the available CPUs.
- -i <from_date>, --from_date <from_date>
Date from which (included) to look at dump files. Format YYYY-MM-DD. If not specified, will start from the first date found in the files.
- -f <to_date>, --to_date <to_date>
Date until (included) which to look at dump files. Format YYYY-MM-DD. If not specified, will end at the last date found in the files.
- -m <memory_limit>, --memory_limit <memory_limit>
The driver memory of Spark. See https://spark.apache.org/docs/latest/configuration.html#memory-management for more details. Do not change it unless you know what you are doing.
- --spark_local_dir <spark_local_dir>
The local directory in which to save the temporary files. If None/empty, Spark's default is used. Do not change it unless you know what you are doing.
- --local-build
Ignore the web status and build files locally. Will still rely on the web-hosted configuration files.
- --loglevel <loglevel>
The logging level. [DEBUG|INFO|WARNING|ERROR|CRITICAL]
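As an illustrative sketch (the folder, platform, dates and number of workers below are hypothetical placeholders, and remember that the folder passed with -d/--input is deleted together with all its content), a pipeline run could look like:
dsa-tdb-cli advanced_data_pipeline -d ./tdb_pipeline_root -p tiktok -v light -n 4 -i 2024-01-01 -f 2024-01-31 --loglevel INFO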
dsa-tdb-cli aggregate
Aggregates files into a single file containing the counts for each combination of the attributes. It will also save a copy of the configuration file used in the same folder as the output file, with the same name and a configuration.yaml suffix, for later reference. If the aggregation mode is set to append in the configuration, it will load only the dates that are not already in the (possibly existing) dates auxiliary file and will append the aggregated data to the (possibly already existing) file. Note that the append mode only works if:
- the schema of the aggregated data is the same as that of the existing file,
- the input files are in the same relative or global path as found in the dates auxiliary file,
- and the parquet or csv output format is used.
Warning
Also note that, if grouping by the created_at column, all the files produced with the append mode will have to be aggregated again on the desired keys, as there is no guarantee that all the SoRs from one day are in the corresponding daily dump file.
usage: dsa-tdb-cli aggregate [-h] -d INPUT [-p PLATFORM] [-v VERSION] -o
OUTPUT -c CONFIG [-i FROM_DATE] [-f TO_DATE]
[-n N_WORKERS] [-m MEMORY_LIMIT]
[--spark_local_dir SPARK_LOCAL_DIR]
[--override_chunked_subfolder OVERRIDE_CHUNKED_SUBFOLDER]
[--loglevel LOGLEVEL]
- -h, --help
show this help message and exit
- -d <input>, --input <input>
Root folder containing the different platforms / version directories.
- -p <platform>, --platform <platform>
Platform to aggregate the files from. [global|facebook|…|tiktok|X]
- -v <version>, --version <version>
Version of the files to aggregate. [full|light]
- -o <output>, --output <output>
Output file name in the format name.{}. The placeholder is for the extension.
- -c <config>, --config <config>
The YAML file containing the aggregation configuration. See
dsa_tdb.types.AggregationConfig
for more details.
- -i <from_date>, --from_date <from_date>
Date from which (included) to look at dump files. Format YYYY-MM-DD. If not specified, will start from the first date found in the files.
- -f <to_date>, --to_date <to_date>
Date until (included) which to look at dump files. Format YYYY-MM-DD. If not specified, will end at the last date found in the files.
- -n <n_workers>, --n_workers <n_workers>
Number of workers to use for the aggregation. If <=0, will use all the available CPUs.
- -m <memory_limit>, --memory_limit <memory_limit>
The driver memory of Spark. See https://spark.apache.org/docs/latest/configuration.html#memory-management for more details. Do not change it unless you know what you are doing.
- --spark_local_dir <spark_local_dir>
The local directory in which to save the temporary files. If None/empty, Spark's default is used. Do not change it unless you know what you are doing.
- --override_chunked_subfolder <override_chunked_subfolder>
Override the default chunked subfolder name. Do not change/use it unless you know what you are doing.
- --loglevel <loglevel>
The logging level. [DEBUG|INFO|WARNING|ERROR|CRITICAL]
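As an illustrative sketch (paths, platform, dates and file names below are hypothetical placeholders; the configuration file must follow dsa_tdb.types.AggregationConfig), aggregating one month of facebook light dumps could look like:
dsa-tdb-cli aggregate -d ./tdb_data -p facebook -v light -o aggregated_counts.{} -c aggregation_config.yaml -i 2024-01-01 -f 2024-01-31 --loglevel INFO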
dsa-tdb-cli filter
Filters the SoRs from the chunked daily dumps. It will save the filtered SoRs in a custom folder in either csv, parquet or pickle format, along with the configuration file used for the filtering, for later reference. SoRs can also be appended to an existing file if the append mode is set in the configuration. If the filtering mode is set to append in the configuration, it will load and filter only the dates that are not already in the (possibly existing) dates auxiliary file and will append the filtered data to the (possibly already existing) file. Note that the append mode only works if:
- the schema of the filtered data is the same as that of the existing file,
- the input files are in the same relative or global path as found in the dates auxiliary file,
- and the parquet or csv output format is used.
Warning
Also note that, if grouping by the created_at column, all the files produced with the append mode will have to be aggregated again on the desired keys, as there is no guarantee that all the SoRs from one day are in the corresponding daily dump file.
usage: dsa-tdb-cli filter [-h] -d INPUT [-p PLATFORM] [-v VERSION] -o OUTPUT
-c CONFIG [-i FROM_DATE] [-f TO_DATE] [-n N_WORKERS]
[-m MEMORY_LIMIT]
[--spark_local_dir SPARK_LOCAL_DIR]
[--override_chunked_subfolder OVERRIDE_CHUNKED_SUBFOLDER]
[--loglevel LOGLEVEL]
- -h, --help
show this help message and exit
- -d <input>, --input <input>
Root folder containing the different platforms / version directories.
- -p <platform>, --platform <platform>
Platform to filter the files from. [global|facebook|…|tiktok|X]
- -v <version>, --version <version>
Version of the files to filter. [full|light]
- -o <output>, --output <output>
Output file name in the format name.{}. The placeholder is for the extension.
- -c <config>, --config <config>
The YAML file containing the filtering configuration. See
dsa_tdb.types.FilteringConfig
for more details.
- -i <from_date>, --from_date <from_date>
Date from which (included) to look at dump files. Format YYYY-MM-DD. If not specified, will start from the first date found in the files.
- -f <to_date>, --to_date <to_date>
Date until (included) which to look at dump files. Format YYYY-MM-DD. If not specified, will end at the last date found in the files.
- -n <n_workers>, --n_workers <n_workers>
Number of workers to use for the filtering. If <=0, will use all the available CPUs.
- -m <memory_limit>, --memory_limit <memory_limit>
The driver memory of Spark. See https://spark.apache.org/docs/latest/configuration.html#memory-management for more details. Do not change it unless you know what you are doing.
- --spark_local_dir <spark_local_dir>
The local directory in which to save the temporary files. If None/empty, Spark's default is used. Do not change it unless you know what you are doing.
- --override_chunked_subfolder <override_chunked_subfolder>
Override the default chunked subfolder name. Do not change/use it unless you know what you are doing.
- --loglevel <loglevel>
The logging level. [DEBUG|INFO|WARNING|ERROR|CRITICAL]
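As an illustrative sketch (paths and file names below are hypothetical placeholders; the configuration file must follow dsa_tdb.types.FilteringConfig), filtering the full X dumps could look like:
dsa-tdb-cli filter -d ./tdb_data -p X -v full -o filtered_sors.{} -c filtering_config.yaml -n 4 --loglevel INFO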
dsa-tdb-cli preprocess
Download and chunk files from the webpage. Useful for the first download to save some disk space.
usage: dsa-tdb-cli preprocess [-h] -o OUTPUT [-v VERSION] [-p PLATFORM]
[-x EXCLUDE] [--skip_sha1] [-r] [-i FROM_DATE]
[-f TO_DATE] [-n NPROCS] [--skip_chunking]
[--chunk_size CHUNK_SIZE] [--format FORMAT] [-d]
[--override_chunked_subfolder OVERRIDE_CHUNKED_SUBFOLDER]
[--loglevel LOGLEVEL]
- -h, --help
show this help message and exit
- -o <output>, --output <output>
Output folder. This is the root folder into which the files will be downloaded, inside the platform___version subfolder.
- -v <version>, --version <version>
Version of the files to download. [full|light]
- -p <platform>, --platform <platform>
Platform to download the files from. [global|facebook|…|tiktok|X]
- -x <exclude>, --exclude <exclude>
The comma separated list of the platforms to exclude, if the global platform is specified. Use the original platform names without sanitization.
- --skip_sha1
Whether to skip the sha1 check of the newly downloaded file.
- -r, --force_sha1
Whether to force the sha1 check of the already downloaded files. If the check fails, the file will be downloaded again.
- -i <from_date>, --from_date <from_date>
Date from which to start downloading the files. Format YYYY-MM-DD.
- -f <to_date>, --to_date <to_date>
Date until which to download the files. Format YYYY-MM-DD.
- -n <nprocs>, --nprocs <nprocs>
Number of processes.
- --skip_chunking
Whether to skip chunking the files after downloading them.
- --chunk_size <chunk_size>
The number of SoRs to put in each chunk. Do not change it unless you know what you are doing.
- --format <format>
The format of the output files ("csv" or "parquet"), default "parquet".
- -d, --delete_original
Whether to delete the original files, default False.
- --override_chunked_subfolder <override_chunked_subfolder>
Override the default chunked subfolder name. Do not change/use it unless you know what you are doing.
- --loglevel <loglevel>
The logging level. [DEBUG|INFO|WARNING|ERROR|CRITICAL]
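As an illustrative sketch (the output folder and dates below are hypothetical placeholders), downloading and chunking one week of tiktok light dumps into parquet chunks, deleting the original archives afterwards, could look like:
dsa-tdb-cli preprocess -o ./tdb_data -p tiktok -v light -i 2024-01-01 -f 2024-01-07 -n 4 --format parquet --delete_original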