Configuration

The config file is where you tell the program what you are looking for, which outputs you want, and what those outputs should contain. This section explains how to edit it and what each option is used for.

How to edit a config.ini file

An .ini file is simply a text file that follows a few rules so that the parsing function can read it. More information on the syntax of config files and how they are read by the configparser Python library can be found in the configparser documentation.

Config Options Breakdown

This section explains each option in the config file. The .ini structure is quite flexible and easy to read, but here are some important facts to know:

  • Boolean values can be “true”/“false” or “yes”/“no”, and the two styles can be mixed in the same file without causing any trouble.

  • In most sections, a bool option enables a feature, and the script then reads the options that follow it only when that feature is enabled. This means that when a feature is deactivated by setting its bool to false, the related options that follow can be ignored and left empty.

  • Not putting any values after the “=” will produce an empty string as the value.

  • The type mentioned here for each option is how the script will import and interpret the value. This means for example that an option that is expecting an int will raise an error if you write letters in the value.
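
The rules above can be checked directly with the configparser library. The following is a minimal sketch, not the program's own loading code: the section and option names are taken from this page, but the values (including the deliberately invalid "ten") are made up for illustration.

```python
import configparser

# Hypothetical fragment: names come from this page, values are invented.
SAMPLE = """
[Filter]
filter_hidden = yes
filter_custom = false
filter_custom_list =

[Measures.Averaging]
time_rounding_seconds = ten
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# "yes"/"no" and "true"/"false" can be mixed in the same file.
hidden = parser.getboolean("Filter", "filter_hidden")
custom = parser.getboolean("Filter", "filter_custom")

# Nothing after "=" yields an empty string.
custom_list = parser.get("Filter", "filter_custom_list")

# Letters in an option that is read as an int raise a ValueError.
try:
    parser.getint("Measures.Averaging", "time_rounding_seconds")
    int_error = None
except ValueError as exc:
    int_error = exc
```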

Categorization

regular_expression_file_identifier = string

The regular expression used to extract the identifier from file names. It is passed to re.search() on each file’s name, and the first match found is kept as the identifier. (deprecated)

regular_expression_directory_identifier = string

The regular expression used to extract the identifier from directory names. (deprecated)
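
As an illustration of how a first-match lookup with re.search() behaves, here is a small sketch. The helper name extract_identifier and the pattern are hypothetical and are not taken from the program.

```python
import re

def extract_identifier(name, pattern):
    """Return the first match of pattern in name, or None if nothing matches."""
    match = re.search(pattern, name)
    return match.group(0) if match else None

# Hypothetical pattern: one or more lowercase letters.
PATTERN = r"[a-z]+"

# Only the first match is kept, even though later parts of the name match too.
first = extract_identifier("sub-01_task-rest_bold.json", PATTERN)
missing = extract_identifier("1234", PATTERN)
```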

Filter

filter_files = bool

Whether or not the search criteria will be used to discard files whose names do not match the regular expression.

filter_directories = bool

Whether or not the search criteria will be used to discard directories whose names do not match.

filter_hidden = bool

Whether or not to discard hidden files, i.e. files whose names begin with a ‘.’. Turned on by default.

filter_custom = bool

Whether or not to use a custom list to discard specific files and directories. Must be used in conjunction with filter_custom_list.

filter_custom_list = string

List of file and directory names for the program to ignore. Delimit the names with commas ‘,’. White space is trimmed. Ex: code,logs,sourcedata or code, logs, sourcedata
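
A sketch of how such a comma-separated value could be turned into a list of names. The function name parse_custom_list is hypothetical, and dropping empty entries is an assumption, not documented behaviour.

```python
def parse_custom_list(raw):
    """Split a comma-separated option value and trim white space around names."""
    # Assumption: empty entries (e.g. from a trailing comma) are dropped.
    return [name.strip() for name in raw.split(",") if name.strip()]

compact = parse_custom_list("code,logs,sourcedata")
spaced = parse_custom_list("code, logs, sourcedata")
empty = parse_custom_list("")
```

Both spellings shown in the example above produce the same list of three names.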

Measures

file_count = bool

Whether or not to measure the number of files present in each directory (does not include files in sub-directories).

dir_count = bool

Whether or not to measure the number of sub-directories present in each directory (does not include directories nested inside those sub-directories).

file_size = bool

Whether or not to measure the size of the file/directory in bytes. If using this measure, it is highly recommended to also use size_rounding_percentage.

modified_time = bool

Whether or not to measure the time of last modification, in seconds since 1 January 1970 (epoch time). If using this measure, it is recommended to use time_rounding_seconds as well.
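
To make the four measures concrete, the sketch below collects them with the standard library on a throwaway directory. The helper names are hypothetical and this is not the program's implementation. In particular, it only measures the size of a file, since how the program computes the size of a directory is not described here.

```python
import os
import tempfile

def count_children(directory):
    """Count files and sub-directories directly inside directory (no recursion)."""
    file_count = dir_count = 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir():
                dir_count += 1
            elif entry.is_file():
                file_count += 1
    return file_count, dir_count

def file_measures(path):
    """Return (size in bytes, last modification time in epoch seconds)."""
    stats = os.stat(path)
    return stats.st_size, stats.st_mtime

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub", "nested"))
    for rel in ("a.txt", "b.txt", os.path.join("sub", "c.txt")):
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("hello")
    # Entries nested inside "sub" are not counted for the root directory.
    counts = count_children(root)
    size, mtime = file_measures(os.path.join(root, "a.txt"))
```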

Measures.Averaging

time_rounding_seconds = integer

For each file type determined by the program, the mean modified time is calculated. This parameter sets the number of seconds around the mean within which all files are rounded.

size_rounding_percentage = float

For each file type determined by the program, the mean size is calculated. This parameter specifies the percentage of deviation from the mean to round to. For example, if the parameter is set to .01, a percentage of 1% is used.
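
The exact rounding rule is not spelled out here. The sketch below shows one possible reading, stated as an assumption: a value that lies within the tolerance of its group's mean is replaced by the mean, and other values are left unchanged. The function names are hypothetical.

```python
from statistics import mean

def round_times_to_mean(times, tolerance_seconds):
    """Assumed rule: times within tolerance_seconds of the mean become the mean."""
    avg = mean(times)
    return [avg if abs(t - avg) <= tolerance_seconds else t for t in times]

def round_sizes_to_mean(sizes, percentage):
    """Assumed rule: sizes within percentage * mean of the mean become the mean."""
    avg = mean(sizes)
    return [avg if abs(s - avg) <= percentage * avg else s for s in sizes]

# Mean is 100; values within 5 seconds of it are rounded, the rest stand out.
rounded_times = round_times_to_mean([98, 100, 102, 140, 60], 5)

# Mean is 1000; .01 means values within 1% (10 bytes) of it are rounded.
rounded_sizes = round_sizes_to_mean([1000, 1005, 995, 1200, 800], 0.01)
```

Under this reading, small fluctuations collapse onto the mean, so only values that deviate by more than the tolerance remain distinct.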

Output

create_summary = bool

Whether or not to create the text summary file, which highlights common file configurations if requested and points to the outliers and the norm for each measure and file/directory type.

summary_output_path = string

The path where the text summary file should be saved. By default, it is saved to a directory called results in the current working directory.

create_text_tree = bool

Whether or not to create the tree-like file structure visualization in a text file.

tree_output_path = string

The path to the text file where the file tree output will be saved. If none, this type of output is skipped. By default, it is saved to a directory called results in the current working directory.

create_csv = bool

Whether or not to create the CSV file containing a row for each file and directory found, along with the metrics for each.

csv_output_path = string

The path where the CSV should be saved. By default, it is saved to a directory called results in the current working directory.

Output.Visualization

Use is not recommended at this time.

create_plots = bool

Whether or not to create the plots that show the distribution of the collected metrics between files and directories with the same identifier.

number_plot_per_measure = int

How many identifiers will be included in the plots, starting from the ones with the highest number of occurrences. This corresponds to the number of columns of the plot figure, its rows being dictated by the number of measures taken.

save_plots = bool

Whether or not to save the generated plots as an image file.

image_path = string

The path where the graphs should be saved. If not given or None, the plots will not be saved to a file.

Output.Piping

pipe_data = bool

Whether to output the data from each file found directly to the standard output during execution. By default this prints to the console, which is not recommended for large datasets. If the script is followed by a pipe, the data is passed to the next script or command. Only files are output, because directories should not be relevant for custom tests. The output format is a single string per file: ‘path,identifier,file_size,modified_time’. file_size is in bytes and modified_time is in seconds (epoch time).
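
A script on the receiving end of the pipe could parse these lines as sketched below. This is an illustration only: the function names are hypothetical, and it assumes that the identifier contains no comma, that file_size is printed as an integer, and that modified_time may be printed with decimals.

```python
import io

def parse_record(line):
    """Parse one 'path,identifier,file_size,modified_time' line into a dict."""
    # Split from the right so that commas inside the path are preserved.
    path, identifier, file_size, modified_time = line.rstrip("\n").rsplit(",", 3)
    return {
        "path": path,
        "identifier": identifier,
        "file_size": int(file_size),            # bytes (assumed integer)
        "modified_time": float(modified_time),  # epoch seconds
    }

def read_records(stream):
    """Parse every non-empty line of a text stream such as sys.stdin."""
    return [parse_record(line) for line in stream if line.strip()]

# Stand-in for piped input; the paths and values are made up.
sample = io.StringIO(
    "data/sub-01/anat.nii,nii,2048,1700000000.5\n"
    "data/a,b.txt,txt,10,1700000100\n"
)
records = read_records(sample)
```

In a real consumer script, read_records(sys.stdin) would take the place of the io.StringIO stand-in.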

Configurations

get_configurations = bool

Whether or not to compare the configuration of the folders in the repeating structure. The summary will display the different configurations found for each directory type. A configuration is the list of directories and files within a directory.

target_depth = int

Specifies which folder depth to use for configuration comparison. Useful if you only want to compare one level (e.g. to compare each sub-folder found directly under the target folder given to the script, use depth=1).

use_depth_range = bool

Whether or not to specify a range of depths to compare configurations.

Must be used in conjunction with range_start and range_end.

range_start = int

The beginning of the range of depths to compare configurations. This is inclusive, so if range_start = 0, then the root directory will be included.

range_end = int

The end of the range of depths to compare configurations. This is inclusive as well.
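
The inclusive bounds can be summarized with a small predicate. This is a sketch with a hypothetical function name. It assumes that target_depth is used when use_depth_range is false and that the range replaces it otherwise, which is not stated explicitly on this page.

```python
def depth_is_compared(depth, target_depth, use_depth_range=False,
                      range_start=0, range_end=0):
    """Tell whether directories at this depth take part in the comparison."""
    if use_depth_range:
        # Both ends are inclusive; depth 0 is the root directory.
        return range_start <= depth <= range_end
    # Assumption: without a range, only target_depth is compared.
    return depth == target_depth

with_range = [d for d in range(5)
              if depth_is_compared(d, 1, use_depth_range=True,
                                   range_start=0, range_end=2)]
single_level = [d for d in range(5) if depth_is_compared(d, 1)]
```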

limit_depth = bool

Whether or not to limit the depth of tree discovery. This can be useful when dealing with very large trees whose contents beyond a certain depth are not of interest. The specified depth limit is inclusive, so if the depth limit is 1, the children of the root directory will be found but nothing deeper than that.
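
The effect of an inclusive depth limit can be sketched with os.walk on a throwaway tree. The function name discover is hypothetical and this is not the program's discovery code. Depth 0 is the root directory and its children are at depth 1.

```python
import os
import tempfile

def discover(root, depth_limit):
    """Return (depth, relative_path) pairs down to depth_limit, inclusive."""
    found = [(0, ".")]
    for current, dirs, files in os.walk(root):
        rel = os.path.relpath(current, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= depth_limit:
            dirs[:] = []  # stop descending: anything below exceeds the limit
            continue
        for name in sorted(dirs + files):
            child = os.path.relpath(os.path.join(current, name), root)
            found.append((depth + 1, child))
    return found

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.txt", os.path.join("a", "b", "deep.txt")):
        open(os.path.join(root, rel), "w").close()
    shallow = discover(root, depth_limit=1)  # root and its children only
    deeper = discover(root, depth_limit=2)   # also finds a/b, not deep.txt
```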

Logging

log_path = string

The path to the file where the logs are saved. If None, the logs will not be saved to any file. By default, logs are saved to the results folder in the current working directory.

file_log_level = string

The level of logging for the log file. One of:

  • “CRITICAL”

  • “ERROR”

  • “WARNING”

  • “INFO”

  • “DEBUG”

  • “NOTSET”
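
These names match the level constants of Python's logging module. A minimal sketch of how such a string could be validated and applied to a handler follows. The function name level_from_name is hypothetical, and a NullHandler stands in for the file handler that would write to log_path.

```python
import logging

VALID_LEVELS = ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET")

def level_from_name(name):
    """Translate a level name from the config file into a logging constant."""
    name = name.strip()
    if name not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {name!r}")
    return getattr(logging, name)

# Stand-in for logging.FileHandler(log_path), to avoid creating a file here.
handler = logging.NullHandler()
handler.setLevel(level_from_name("WARNING"))

try:
    level_from_name("verbose")
    level_error = None
except ValueError as exc:
    level_error = exc
```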

Input

root_config = bool

Whether or not to use root path from config file.

root_path = Path

This is the path to the directory you would like the program to search on.

use_file_tree = bool

Whether or not to use file_tree. The default is yes, because the program depends on file_tree.

file_tree_path = Path

Path to file_tree to be used.