Configuration
Editing the config file is how you tell the program what you are looking for, which outputs you want, and what those outputs should contain. This section explains how to edit the file and what each option does.
How to edit a config.ini file
An .ini file is simply a text file that follows a few rules so that the parsing function can read it. More information on the syntax of config files and how they are read can be found in the documentation of the configparser Python library.
Config Options Breakdown
This section explains each option in the config file. The .ini structure is quite flexible and easy to read, but here are some important facts to know:
Boolean values can be “true”/“false” or “yes”/“no”, and the two styles can be mixed without causing any trouble.
In most sections, a bool option enables a feature, and the program then reads the options that follow it only if that feature is enabled. This means that when deactivating a feature by setting the bool to false, the related options that follow can be ignored and left empty.
Not putting any values after the “=” will produce an empty string as the value.
The type listed here for each option is how the script will import and interpret the value. For example, an option that expects an int will raise an error if the value contains letters.
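The rules above can be checked directly with configparser. The following is a minimal sketch, assuming the section names match the headings on this page; the values are made up for illustration.

```python
import configparser

# Hypothetical fragment: section and option names mirror this page,
# the values are invented for the example.
SAMPLE = """
[Measures]
file_count = yes
file_size = False

[Measures.Averaging]
time_rounding_seconds = 60
size_rounding_percentage =
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Boolean spellings can be mixed and are case-insensitive.
file_count = parser.getboolean("Measures", "file_count")   # True
file_size = parser.getboolean("Measures", "file_size")     # False

# Nothing after "=" yields an empty string.
empty_value = parser.get("Measures.Averaging", "size_rounding_percentage")

# An int option parses cleanly, but letters raise ValueError.
seconds = parser.getint("Measures.Averaging", "time_rounding_seconds")
parser.set("Measures.Averaging", "time_rounding_seconds", "abc")
try:
    parser.getint("Measures.Averaging", "time_rounding_seconds")
    int_error = None
except ValueError as exc:
    int_error = exc
```

How the program itself reacts to the ValueError is not shown here; the sketch only demonstrates where the error comes from.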
Categorization
regular_expression_file_identifier = string
The regular expression used to extract the identifier from file names. It is applied with re.search() to each file’s name, and the first match found is kept as the identifier. (deprecated)
regular_expression_directory_identifier = string
The regular expression used to extract the identifier from directory names. (deprecated)
Search Criteria (search criteria use is not recommended at this time)
use_search_criteria = bool
Whether or not to filter files and/or directories so that data and metrics are extracted only from the subset that matches the regular expression below.
regular_expression_search_criteria = string
A regular expression used to filter the files and/or directories included in the analysis. The filter is applied with re.match().
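Because the search criteria use re.match() while the identifier options use re.search(), the same pattern can behave differently in the two places. A minimal sketch of the difference, with a hypothetical pattern and hypothetical file names:

```python
import re

# Hypothetical pattern and names, for illustration only.
pattern = r"sub-\d+"
names = ["sub-01_T1w.nii", "anat_sub-02.nii", "README"]

# re.match() only succeeds when the pattern matches at the start of the name.
matched = [n for n in names if re.match(pattern, n)]

# re.search() scans the whole name, so it also finds "sub-02" mid-string.
searched = [n for n in names if re.search(pattern, n)]

# Keeping the first match as the identifier, as described for re.search().
identifier = re.search(pattern, "anat_sub-02.nii").group(0)
```

With re.match(), a pattern meant to match anywhere in a name needs a leading `.*`.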
Filter
filter_files = bool
Whether or not the search criteria will be used to discard files whose names do not match the regular expression.
filter_directories = bool
Whether or not the search criteria will be used to discard directories whose names do not match.
filter_custom = bool
Whether or not to use a custom list to discard specific files and directories. Must be used in conjunction with filter_custom_list.
filter_custom_list = string
List of file and directory names for the program to ignore. Delineate the names with commas ‘,’. White space will be trimmed. Ex: code,logs,sourcedata or code, logs, sourcedata
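As a rough illustration of the comma-delimited format, the sketch below splits a value and trims white space. The function name parse_custom_list is hypothetical, and dropping empty entries is an assumption, not documented behaviour.

```python
def parse_custom_list(raw: str) -> list:
    """Split a comma-delimited option value and trim surrounding white space."""
    # Assumption: empty entries (e.g. from a trailing comma) are dropped.
    return [name.strip() for name in raw.split(",") if name.strip()]


compact = parse_custom_list("code,logs,sourcedata")
spaced = parse_custom_list("code, logs, sourcedata")
```

Both spellings from the example above produce the same list of names.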
Measures
file_count = bool
Whether or not to measure the number of files present in each directory (does not include files in sub-directories).
dir_count = bool
Whether or not to measure the number of sub-directories present in each directory (does not include directories nested inside those sub-directories).
file_size = bool
Whether or not to measure the size of the file/directory in bytes. If using this measure, it is highly recommended to use size_rounding_percentage.
modified_time = bool
Whether or not to measure the time of last modification, in seconds since 1 January 1970 (epoch time). If using this measure, it is recommended to use time_rounding_seconds as well.
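The four measures correspond to information the Python standard library exposes through os.scandir() and os.stat(). The sketch below is an assumption about how they could be collected, not the program's actual code; measure_directory is a hypothetical helper.

```python
import os
import tempfile


def measure_directory(path: str) -> dict:
    """Collect the four measures for one directory, without recursing."""
    file_count = 0
    dir_count = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                file_count += 1      # direct children only
            elif entry.is_dir():
                dir_count += 1       # nested directories are not counted
    info = os.stat(path)
    return {
        "file_count": file_count,
        "dir_count": dir_count,
        # Assumption: st_size is the size of the directory entry itself;
        # how the program sizes a directory is not stated on this page.
        "file_size": info.st_size,
        "modified_time": info.st_mtime,  # seconds since the epoch
    }


# Throwaway tree: two files and one sub-directory holding a nested file.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("data")
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "nested.txt"), "w") as handle:
        handle.write("data")
    measures = measure_directory(root)
```

The nested file is not counted, which matches the "does not include files in sub-directories" rule.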
Measures.Averaging
time_rounding_seconds = integer
For each file type determined by the program, the mean modified time is calculated. This parameter rounds all files whose modified time is within the specified number of seconds of the mean.
size_rounding_percentage = float
For each file type determined by the program, the mean size is calculated. This parameter specifies the percentage of deviation from the mean to round to, expressed as a fraction: a value of .01 means 1%.
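One possible reading of the two rounding options is that values close enough to the mean are treated as equal to it, so that small differences do not show up as outliers. The sketch below follows that reading only; the function names are hypothetical and the program's actual rounding may differ.

```python
from statistics import mean


def snap_to_mean(values, tolerance):
    """Replace values within `tolerance` of the mean by the mean itself."""
    centre = mean(values)
    return [centre if abs(v - centre) <= tolerance else v for v in values]


def snap_sizes(sizes, percentage):
    """Use a fraction of the mean (0.01 means 1%) as the tolerance."""
    return snap_to_mean(sizes, mean(sizes) * percentage)


# time_rounding_seconds = 10: only the value within 10 s of the mean moves.
times = snap_to_mean([100, 130, 70], tolerance=10)

# size_rounding_percentage = .01: everything within 1% of the mean moves.
sizes = snap_sizes([1000, 1004, 996], percentage=0.01)
```

Under this reading, the three sizes collapse to a single value while the two distant modified times stay distinct.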
Output
create_summary = bool
Whether or not to create the text summary file, which highlights common file configurations if requested and points to the outliers/norm for each measure and file/directory type.
summary_output_path = string
The path to where the text summary file should be saved. By default, it is saved to a directory called results in the current working directory.
create_text_tree = bool
Whether or not to create the tree-like file structure visualization in a text file.
tree_output_path = string
The path to the text file where the file tree output will be saved. If none, this type of output is skipped. By default, it is saved to a directory called results in the current working directory.
create_csv = bool
Whether or not to create the CSV file containing a row for each file and directory found, along with the metrics for each.
csv_output_path = string
Path to where the CSV should be saved. By default, it is saved to a directory called results in the current working directory.
Output.Visualization
Use is not recommended at this time.
create_plots = bool
Whether or not to create the plots that show the distribution of the collected metrics between files and directories with the same identifier.
number_plot_per_measure = int
How many identifiers will be included in the plots, starting with those that have the highest number of occurrences. This corresponds to the number of columns in the plot figure; its rows are dictated by the number of measures taken.
print_plots = bool
Whether or not to show the plots with matplotlib.pyplot.show() before exiting the function.
save_plots = bool
Whether or not to save the generated plots as an image file.
image_path = string
The path to where the graphs should be saved. If not given or None, the plots will not be saved to a file.
Output.Piping
pipe_data = bool
Whether or not to output the data from each file found directly to standard output during execution. By default this prints to the console, which is not recommended for large datasets. If the script is followed by a pipe, the data is passed to the next script or command. Only files are output, because directories should not be relevant for custom tests. Each file is output as a single string in the format ‘path,identifier,file_size,modified_time’, where file_size is in bytes and modified_time is in seconds (epoch time).
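A downstream script can consume the piped lines from standard input. The sketch below is one way to parse them, under two assumptions: the identifier, size and time fields contain no commas, and the fields are not quoted. FileRecord, parse_record and read_records are hypothetical names.

```python
import io
from typing import NamedTuple


class FileRecord(NamedTuple):
    path: str
    identifier: str
    file_size: int        # bytes
    modified_time: float  # seconds since the epoch


def parse_record(line: str) -> FileRecord:
    # Split from the right so that commas inside the path are preserved.
    path, identifier, size, mtime = line.rstrip("\n").rsplit(",", 3)
    return FileRecord(path, identifier, int(size), float(mtime))


def read_records(stream):
    """Yield one FileRecord per non-empty line of the stream."""
    for line in stream:
        if line.strip():
            yield parse_record(line)


# In a real pipeline the stream would be sys.stdin; a StringIO stands in here.
sample = io.StringIO("/data/sub-01/scan,1.nii,scan,2048,1700000000.5\n\n")
records = list(read_records(sample))
```

Splitting from the right is a deliberate choice: the last three fields have a fixed shape, while a path may legitimately contain a comma.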
Configurations
get_configurations = bool
Whether or not to compare the configurations of the folders in the repeating structure. The summary will display the different configurations found for each directory type. A configuration is the list of directories and files within a directory.
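To make the notion of a configuration concrete, the sketch below groups directories by the names they contain and reports the ones that differ from the most common layout. The tree is hypothetical, and comparing raw names (rather than identifiers) is a simplifying assumption.

```python
from collections import Counter

# Hypothetical tree: each directory maps to the names directly inside it.
children = {
    "sub-01": ["anat", "func", "notes.txt"],
    "sub-02": ["func", "anat", "notes.txt"],
    "sub-03": ["anat"],
}

# Assumption: a configuration is order-independent, so names are sorted first.
configurations = Counter(tuple(sorted(names)) for names in children.values())

# The most frequent configuration is treated as the norm.
most_common_config, occurrences = configurations.most_common(1)[0]

# Directories whose contents differ from the norm.
outliers = [
    directory
    for directory, names in children.items()
    if tuple(sorted(names)) != most_common_config
]
```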
target_depth = int
Specifies which folder depth to use for configuration comparison. Useful if you only want to compare one level (e.g. to compare the sub-folders found directly under the target folder given to the script, use depth = 1).
use_depth_range = bool
Whether or not to specify a range of depths to compare configurations. Must be used in conjunction with range_start and range_end.
range_start = int
The beginning of the range of depths to compare configurations. This is inclusive, so if range_start = 0, the root directory will be included.
range_end = int
The end of the range of depths to compare configurations. This is inclusive as well.
limit_depth = bool
Whether or not to limit the depth of tree discovery. This can be useful when dealing with very large trees whose contents beyond a certain depth are not of interest. The specified depth limit is inclusive, so if the depth limit is 1, the children of the root directory will be found but nothing deeper than that.
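The inclusive depth conventions can be sketched as follows, with the root directory at depth 0. The helpers walk_with_depth and in_range are hypothetical and only illustrate the counting, not the program's traversal code.

```python
import os
import tempfile


def walk_with_depth(root: str, depth_limit: int):
    """Yield (depth, path) pairs; the root is depth 0, the limit is inclusive."""
    stack = [(0, root)]
    while stack:
        depth, path = stack.pop()
        yield depth, path
        # Stop descending once the limit is reached.
        if depth >= depth_limit or not os.path.isdir(path):
            continue
        with os.scandir(path) as entries:
            for entry in entries:
                stack.append((depth + 1, entry.path))


def in_range(depth: int, range_start: int, range_end: int) -> bool:
    """Both ends of the range are inclusive."""
    return range_start <= depth <= range_end


# Throwaway tree root/a/b/c, explored with a depth limit of 1.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b", "c"))
    found = sorted(depth for depth, _ in walk_with_depth(root, depth_limit=1))
```

With a limit of 1, only the root and its direct child are visited, which matches the description above.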
Logging
log_path = string
The path to the file where the logs are saved. If it is None, the logs will not be saved to any file. By default, logs are saved to the results folder in the current working directory.
file_log_level = string
The level of logging for the log file. One of:
“CRITICAL”
“ERROR”
“WARNING”
“INFO”
“DEBUG”
“NOTSET”
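These names are the standard level constants of Python's logging module. A minimal sketch of turning the option value into a file handler, assuming the value is one of the names above; make_file_handler is a hypothetical helper, and accepting lowercase input is a choice made for this example.

```python
import logging
import os
import tempfile


def make_file_handler(log_path: str, file_log_level: str) -> logging.FileHandler:
    """Create a file handler that only records messages at or above the level."""
    # Look up the numeric constant, e.g. logging.WARNING == 30.
    level = getattr(logging, file_log_level.strip().upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {file_log_level!r}")
    handler = logging.FileHandler(log_path)
    handler.setLevel(level)
    return handler


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    handler = make_file_handler(os.path.join(tmp, "run.log"), "warning")
    handler_level = handler.level
    handler.close()
```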
Input
root_config = bool
Whether or not to use the root path from the config file.
root_path = Path
The path to the directory you would like the program to search.
use_file_tree = bool
Whether or not to use file_tree. The default is yes, because the program depends on file_tree usage.
file_tree_path = Path
Path to the file_tree to be used.