file_tree_check package

Submodules

file_tree_check.fileChecker module

file_tree_check.fileChecker.check_permissions(path: str | Path, min_permissions: tuple[int, int, int] = (4, 4, 0))
file_tree_check.fileChecker.check_size(path, min_size: int = 50) bool
file_tree_check.fileChecker.get_total_file_count(path: str | Path, print_items: bool = False)

Count and optionally list all files in directory using pathlib.
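A minimal sketch of what counting files with pathlib can look like. The helper name count_files is hypothetical, and the sketch assumes the count is recursive and only includes regular files; the actual get_total_file_count may differ.

```python
from pathlib import Path


def count_files(path, print_items=False):
    """Count regular files under ``path`` recursively (hypothetical helper)."""
    total = 0
    for item in sorted(Path(path).rglob("*")):
        if item.is_file():
            total += 1
            if print_items:
                # Optionally list each file as it is counted.
                print(item)
    return total
```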

file_tree_check.identifierEngine module

class file_tree_check.identifierEngine.IdentifierEngine(file_expression: str, directory_expression: str, check_file: bool)

Bases: object

Class that can extract an “identifier” from a path.

The identifier is a name that is used for comparison between directories and should not contain any part that is unique (like a subject number). The regular expressions in the config file are relied on to remove these unique parts of the file/directory name.

Example:

“sub-15464521_image.png” and “sub-25484441_image.png” would both have the same identifier “_image.png” if using a regular expression that is keeping everything after the first “_” like “/_.*$/”.

Attributes

file_expression: string

The regular expression to filter the identifier from the name of files.

directory_expression: string

The regular expression to filter the identifier from the name of directories.

logger: logging.Logger

Logger to save info and debug messages. Will send the log lines to the appropriate outputs following the logger configuration in main.py.

get_identifier(path: str | Path, parent: SmartPath | None, file_tree: FileTree | None) str
get_identifier_base(path: str | Path, prefix_file_with_parent_directory: bool = False) str

Extract the identifier from the file/directory.

The identifier should be the repeating part of the name that ties it to its type for comparison, for example:

“sept_6_weekly_report.txt” -> “_weekly_report.txt”

However the details of what precisely to extract and treat as an identifier is handled by the regular expressions given on creating the class instance. Since the regular expression is used with re.search(), only the first match is kept.

If no match is found, the entire file/directory name is used instead, since an identifier that is unique but still usable in the output is preferable to an empty identifier.

Extraction method with the default regular expression:

Files = "_.*$" ; Keep everything after the first "_" to remove the subject number.
Directories = "^.*-" ; Keep everything until the first "-".

; This way, subject directories like "sub-012012" are all aggregated
; under "sub-" while directory names without "-" are kept entirely.
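The extraction rule described above (first re.search() match, falling back to the whole name) can be sketched as follows. The function name extract_identifier is hypothetical and this is not the package's implementation; only the two default expressions are taken from the text.

```python
import re

# Default expressions quoted in the documentation above.
DEFAULT_FILE_EXPRESSION = r"_.*$"
DEFAULT_DIRECTORY_EXPRESSION = r"^.*-"


def extract_identifier(name, expression):
    """Return the first match of ``expression`` in ``name``, or ``name`` itself."""
    match = re.search(expression, name)
    # Fall back to the full name so the identifier is never empty.
    return match.group(0) if match else name
```

With these defaults, "sub-012012_image.png" gives "_image.png", "sub-012012" gives "sub-", and a directory such as "anat" is kept whole. Because ".*" is greedy, a directory name with several "-" characters (for instance "sub-01-ses-02") matches up to the last "-", not the first.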

Parameters

path: pathlib.Path or string

The path to the file/directory for which to extract the identifier.

prefix_file_with_parent_directory: bool, default = False

Whether to include the parent directory as a prefix to the file’s identifier. This is used to distinguish files that have the same name but are located under different subdirectories, when the same filename is expected to be found in multiple places for each subject/configuration.

Returns

identifier: string

The path’s extracted identifier. Will be used to aggregate data on files/directories with the same identifier across the repeating file structure.

get_identifier_template(path: str | Path, templates: list[str], prefix_file_with_parent_directory: bool = True)
get_identifier_tree(path: str | Path, parent: SmartPath | None, tree: FileTree) str
get_identifier_tree_old(path: str | Path, tree: FileTree)
parse_string_to_regex(string)

file_tree_check.main module

Explore a file structure and build a distribution of file counts and file sizes.

class file_tree_check.main.Configuration(pars: Parser)

Bases: object

Helper class for configuration.

Stores configuration parameters for easier passing to get_data functions.

file_tree_check.main.add_configuration(path: SmartPath, configurations: dict, target_depth: int | None = None, depth_range: bool = False, start_depth: int | None = None, end_depth: int | None = None, tree: FileTree | None = None) dict

For each directory, look at how its content is structured and save that structure as a configuration.

Only compare directories at a specific depth relative to the original target directory. This prevents computing configurations for folders that aren’t relevant, since the repeating file structure is expected to have each unit to compare at the same depth.

Parameters

path: SmartPath

configurations: dict

Contains the file configurations found for each file/directory identifier with the following structure:

configurations={
    'identifier1': [
        {'structure': ['identifier3', 'identifier4', 'identifier5'],
         'paths': ['path1', 'path2']},
        {'structure': ['identifier3', 'identifier5'], 'paths': ['path4']},
        ...
    ],
    'identifier2': [
        {'structure': [], 'paths': []},
        ...
    ],
}
identifier: IdentifierEngine

Used to extract the identifier of each path to aggregate it with similar ones present in the file structure. Currently only used for the children in a configuration, but this is to be updated to use the file_tree identifier schema.

target_depth: int

The depth of the repeating directories whose configurations are to be compared. Any directory found at a different depth will be ignored by this function.

depth_range: bool

Whether to use a depth range to limit the frame of analysis.

start_depth: int

The start of the depth range.

end_depth: int

The end of the depth range.

Returns

configurations: dict

The updated dict now containing the configuration data from the path that was given as parameter.
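As an illustration of the configurations structure above, the following sketch records one directory's structure under its identifier. The helper record_configuration is hypothetical, and sorting the structure so that ordering does not create distinct configurations is an assumption, not something the source states.

```python
def record_configuration(configurations, identifier, structure, path):
    """Add ``path`` to the entry of ``identifier`` whose structure matches."""
    entries = configurations.setdefault(identifier, [])
    # Assumption: child order is irrelevant, so compare sorted structures.
    structure = sorted(structure)
    for entry in entries:
        if entry["structure"] == structure:
            entry["paths"].append(path)
            return configurations
    # First time this structure is seen for this identifier.
    entries.append({"structure": structure, "paths": [path]})
    return configurations
```

Directories that share both an identifier and a structure end up in the same 'paths' list, which is what makes uncommon configurations easy to spot.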

file_tree_check.main.data_from_paths_helper(path: SmartPath, measures: list[str] = [], configuration: Configuration | None = None, pipe_file_data: bool = False, stat_dict: dict = {}, configurations: dict = {}, tree: FileTree | None = None) tuple[dict, dict]

Helper function for get_data_from_paths.

Calls add_stats method and add_configuration if specified. Also pipes data to standard out if specified.

file_tree_check.main.generate_tree(root: str | Path, criteria: Pattern | None = None, filter_files: bool = False, filter_dir: bool = False, filter_hidden: bool = False, depth_limit: int = None, ignore: list = None, file_tree: FileTree = None)

Create a SmartFilePath or SmartDirectoryPath generator object.

These generator objects will have to be iterated over to get the SmartPath items themselves. Returns an iterable of the file structure that can be used over each element. When a subdirectory is found under the given path, will call another instance of this method with it as the root.

Parameters

root: string or pathlib.Path

The path for which to generate the SmartPath instance and recursively run this function on its child files and directories.

parent: SmartPath

The instance of the parent SmartPath. This reference allows the children SmartPath to calculate their depth relative to the first root of the file structure.

is_last: bool

Indicates whether or not this file/directory was the last to be generated in its parent folder. Only relevant for visual display of the file structure.

criteria: re.Pattern

A regular expression compiled into a Pattern object, used to filter the files and/or directories included in the generator output. Files/directories that do not match the regular expression will be discarded; for directories, this includes all their children regardless of their names. If no criteria is given, every file and directory will be included in the generation.

filter_files: bool

Whether or not the search criteria will be used to discard files whose names do not match the regular expression.

filter_dir: bool

Whether or not the search criteria will be used to discard directories whose names do not match.

filter_hidden: bool

Whether or not to discard hidden files and directories. Hidden files and directories are those whose name starts with a dot.

ignore: list

A list of file and directory names to ignore.

depth_limit: int

The maximum depth to which to generate the file structure relative to the root (level 0).

file_tree: FileTree

The FileTree object that will be used to template the file structure and assign identities to each file and directory.

Yields

generator object

Each execution of generate_tree yields a single instance of SmartFilePath or SmartDirectoryPath. However, since generate_tree() is called again on each child found, the end result is that an object is yielded for every file and directory under the initial root (or for every one that matches the criteria). Iterating over the output generator therefore gives access to a SmartPath instance for every file and directory after calling generate_tree() only once on the target folder.
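The recursive generator pattern described above can be sketched with plain pathlib paths. This is a simplified illustration under stated assumptions: walk_tree is a hypothetical name, it yields (path, depth) tuples rather than SmartPath objects, and it applies the criteria to files and directories alike.

```python
from pathlib import Path


def walk_tree(root, criteria=None, filter_hidden=False, depth_limit=None, depth=0):
    """Yield ``(path, depth)`` for ``root`` and everything below it."""
    root = Path(root)
    yield root, depth
    if not root.is_dir():
        return
    if depth_limit is not None and depth >= depth_limit:
        return
    for child in sorted(root.iterdir()):
        if filter_hidden and child.name.startswith("."):
            continue
        if criteria is not None and not criteria.search(child.name):
            continue  # a rejected directory drops its whole subtree
        # Each recursive call is itself a generator; delegate to it.
        yield from walk_tree(child, criteria, filter_hidden, depth_limit, depth + 1)
```

A single call such as `for path, depth in walk_tree("data", filter_hidden=True): ...` then visits every retained file and directory, with criteria expected to be a compiled re.Pattern when given.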

file_tree_check.main.generate_tree_actual(smart_root: SmartDirectoryPath, is_last: bool = False, criteria: Pattern | None = None, filter_files: bool = False, filter_dir: bool = False, filter_hidden: bool = False, depth_limit: int = None, ignore: list = None, file_tree: FileTree = None)
file_tree_check.main.get_data_from_paths(paths, output_path: Path | None = None, measures: list[str] = [], configuration: Configuration | None = None, pipe_file_data: bool = False, tree: FileTree | None = None) tuple[dict, dict]

Iterate over each file/directory in the generator to get the measures.

Parameters

paths: iterable containing SmartPath objects

Expected to be the generator object created by generate_tree() but can theoretically be any iterable containing SmartPath objects.

identifier: IdentifierEngine

Used to extract the identifier of each path to aggregate it with similar ones present in the file structure. This IdentifierEngine is also passed to add_configuration to allow it to extract identifiers as well. This is mostly deprecated at this point.

output_path: pathlib.Path

The path to the text file where the file tree output will be saved. If None, this type of output is skipped.

measures: list of string

The name of the measures to be used in the outputs. Each corresponds to a dictionary nested in stat_dict.

configuration: Configuration

Configuration object that contains the following attributes:

  • target_depth: int passed to add_configuration() to specify the depth of the folders to compare.

  • get_configurations: bool passed to add_configuration() to specify whether to get the configuration.

  • depth_range: bool passed to add_configuration() to specify whether to use the depth range.

  • start_depth: int passed to add_configuration() to specify the start depth of the range.

  • end_depth: int passed to add_configuration() to specify the end depth of the range.

  • limit_depth: bool passed to add_configuration() to specify whether to use the depth limit.

  • depth_limit: int passed to add_configuration() to limit depth of analysis relative to root.

pipe_file_data: bool

Whether to output the data from each file found directly to the standard output during execution. By default this prints to the console, which is not recommended for large datasets. If the script is followed by a pipe, this passes the data to the next script or command. Only files are output, because directories shouldn’t be relevant for custom tests. The output is a single string per file in the format: ‘path,identifier,file_size,modified_time’. file_size is in bytes and modified_time is in seconds (epoch time).

tree: FileTree

The FileTree object used for templating. Not currently used.

Returns

stat_dict: dict

The dictionary containing the values for each measure.

stat_dict contains nested dictionaries with the following structure:

stat_dict={
    'measure1': {
        'identifier1': {'path1': value, 'path2': value, ...},
        'identifier2': {'path3': value, 'path4': value},
        ...
    },
    'measure2': {'identifier1': {}, 'identifier2': {}, ...},
}
configurations: dict

Contains the file configurations found for each file/directory identifier with the following structure:

configurations={
    'identifier1': [
        {'structure': ['identifier3', 'identifier4', 'identifier5'],
         'paths': ['path1', 'path2']},
        {'structure': ['identifier3', 'identifier5'], 'paths': ['path4']},
        ...
    ],
    'identifier2': [
        {'structure': [], 'paths': []},
        ...
    ],
}
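A downstream script receiving the pipe_file_data output of get_data_from_paths could parse each ‘path,identifier,file_size,modified_time’ line roughly as follows. FileRecord and parse_record are hypothetical names; splitting from the right tolerates commas in the path, under the assumption that the identifier itself contains no comma.

```python
from typing import NamedTuple


class FileRecord(NamedTuple):
    path: str
    identifier: str
    file_size: int      # bytes
    modified_time: int  # seconds since the epoch


def parse_record(line):
    """Parse one 'path,identifier,file_size,modified_time' line."""
    # Split on the last three commas only, so commas in the path survive.
    path, identifier, file_size, modified_time = line.rstrip("\n").rsplit(",", 3)
    return FileRecord(path, identifier, int(file_size), int(modified_time))
```

In a piped custom test, each line read from sys.stdin would be passed to parse_record before applying any check.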
file_tree_check.main.main()

file_tree_check.smartDirectoryPath module

class file_tree_check.smartDirectoryPath.SmartDirectoryPath(path: Path, parent_smart_path: SmartPath | None, is_last: bool, file_tree: FileTree | None = None)

Bases: SmartPath

The child class of SmartPath for directories (folders).

add_children(child: SmartPath) None
property dir_count: int

The number of directories found directly under this one.

display(measures=(), name_max_length: int = 60)

Call the SmartPath display and add some relevant measures to be printed alongside it.

property file_count: int

For a directory, indicates how many files are directly under it.

Does not count subdirectories or files contained in them.
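The "directly under" semantics of file_count and dir_count can be illustrated with a non-recursive pathlib scan. This is only a sketch with a hypothetical helper name (direct_counts); the class itself may derive these values from its registered children instead.

```python
from pathlib import Path


def direct_counts(directory):
    """Return ``(file_count, dir_count)`` for the immediate children only."""
    file_count = dir_count = 0
    for child in Path(directory).iterdir():
        if child.is_dir():
            dir_count += 1
        elif child.is_file():
            file_count += 1
    return file_count, dir_count
```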

file_tree_check.smartFilePath module

class file_tree_check.smartFilePath.SmartFilePath(path: Path, parent_smart_path: SmartPath | None, is_last: bool, file_tree: FileTree | None = None)

Bases: SmartPath

The child class of SmartPath for files.

add_children(child: SmartPath) None
property children: list[file_tree_check.smartPath.SmartPath]
property dir_count

Since this is not a directory, this measure is meaningless; the returned None is handled by the calling function.

display(measures=(), name_max_length: int = 60) str

Call the SmartPath display and add some relevant measures to be printed alongside it.

property file_count

Since this is not a directory, this measure is meaningless; the returned None is handled by the calling function.

file_tree_check.smartPath module

class file_tree_check.smartPath.SmartPath(path: Path, parent_smart_path: SmartPath | None, is_last: bool, file_tree: FileTree | None = None)

Bases: ABC

A SmartPath object is tied to a single path (file or directory). It can be printed in a readable format and gives access to some statistics.

Each instance stores its parent directory and its depth relative to the first path.

This is an abstract class for both file and directory paths.

Attributes

path: pathlib.Path

The path to the file/directory in question.

parent: SmartPath or None

Reference to the parent SmartPath. Used to determine this path’s depth recursively.

is_last: bool

Whether or not this path is the last one to be displayed in its directory. Used to create the tree-like output.

depth: int

The path’s depth in the file structure relative to the initial target directory.

Credit to Stack Overflow user abstrus for the visual part.

add_parent() None
add_stats(stat_dict: dict, identifier: str, measures: list[str] = []) dict

For each requested measure, add the value from this path to the dictionary.

Parameters

stat_dict: dict

The dictionary containing the values for each measure.

stat_dict contains nested dictionaries with the following structure:

stat_dict={
    'measure1': {
        'identifier1': {'path1': value, 'path2': value, ...},
        'identifier2': {'path3': value, 'path4': value},
        ...
    },
    'measure2': {'identifier1': {}, 'identifier2': {}, ...},
}
identifier: string

The path’s identifier. Used to aggregate this path’s values to the correct place in order to add it with files/directories with the same identifier across the repeating file structure.

measures: list of string

The name of the measures to be used in the outputs. Each corresponds to a dictionary nested in stat_dict.

Returns

dict

The same dictionary that was given but with the path’s values added in.
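The nested update that add_stats performs on stat_dict can be sketched as below. The helper add_value is hypothetical and handles a single measure; the real method reads the values from the path itself.

```python
def add_value(stat_dict, measure, identifier, path, value):
    """Store ``value`` at stat_dict[measure][identifier][path]."""
    # setdefault creates the intermediate dictionaries on first use.
    stat_dict.setdefault(measure, {}).setdefault(identifier, {})[path] = value
    return stat_dict
```

Calling it once per measure and path builds exactly the measure, identifier, path nesting shown in the structure above.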

abstract dir_count()
display(measures=(), name_max_length: int = 60) str

Return the name of the file/folder with whitespaces to fit the standard length.

Parameters

measures: list of string

name_max_length: int

Returns

string

display_filename_prefix_last = '└──'
display_filename_prefix_middle = '├──'
display_parent_prefix_last = '│   '
display_parent_prefix_middle = '    '
displayable(measures=(), name_max_length: int = 60) str

Return a string corresponding to a single line in the file structure tree visualisation.
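How the four display prefixes combine into one line of the tree can be sketched as follows. The constants mirror the class attributes listed above; the function render_line and its ancestors_are_last argument are hypothetical, and the rule for choosing between the two parent prefixes follows the usual tree-drawing convention, which is an assumption about this class.

```python
FILENAME_PREFIX_MIDDLE = "├──"
FILENAME_PREFIX_LAST = "└──"
PARENT_PREFIX_MIDDLE = "    "
PARENT_PREFIX_LAST = "│   "


def render_line(name, is_last, ancestors_are_last):
    """Build one tree line.

    ``ancestors_are_last`` lists, from the outermost ancestor below the root
    down to the direct parent, whether each was the last entry in its directory.
    """
    # An ancestor that was last needs no vertical bar under it.
    indent = "".join(
        PARENT_PREFIX_MIDDLE if ancestor_is_last else PARENT_PREFIX_LAST
        for ancestor_is_last in ancestors_are_last
    )
    prefix = FILENAME_PREFIX_LAST if is_last else FILENAME_PREFIX_MIDDLE
    return f"{indent}{prefix} {name}"
```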

abstract file_count()
property file_size: int
get_identifier(path: Path, parent_smart_path: SmartPath | None, file_tree: FileTree | None) str

Determine which identifier function to use.

Currently only the tree-based one is implemented.

get_identifier_base(path: str | Path) str
get_identifier_tree(path: str | Path, parent: SmartPath | None, tree: FileTree | None) str

Determine the identifier of the path based on the file_tree templates.

If the path has a parent, the search is based on the templates of the parent. From the relevant subset of templates, the one with the longest match is selected. If no match is found, the path’s name is returned.
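The longest-match selection with fallback described above can be sketched independently of file_tree. Here the templates are assumed to have already been converted to compiled regular expressions keyed by a template name; that mapping, the function name longest_template_match, and returning the key as the identifier are all illustrative assumptions.

```python
import re


def longest_template_match(name, patterns):
    """Return the key of the pattern with the longest match, or ``name``."""
    best_key, best_length = None, -1
    for key, pattern in patterns.items():
        match = pattern.search(name)
        if match and len(match.group(0)) > best_length:
            best_key, best_length = key, len(match.group(0))
    # No template matched: fall back to the path's own name.
    return best_key if best_key is not None else name
```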

identifier() str
property modified_time: int
parse_string_to_regex(string) Pattern

Convert a string to a regex pattern based on the file_tree templates.

template_children(tree: FileTree | None, parent_template: Template | None) dict

Return a dictionary of the children of the parent_template.

unique_config(tree: FileTree | None, template: Template | None, path: str | Path) dict

file_tree_check.statBuilder module

class file_tree_check.statBuilder.StatBuilder(stat_dict, measures=(), size_averaging: float = 0.0, time_averaging: int = 0)

Bases: object

Store the data in a dictionary and create the output plots and files.

Attributes

stat_dict: dict

The dictionary containing the values for each measure. stat_dict contains nested dictionaries with the following structure:

stat_dict={
    'measure1': {
        'identifier1': {'path1': value, 'path2': value, ...},
        'identifier2': {'path3': value, 'path4': value},
        ...
    },
    'measure2': {'identifier1': {}, 'identifier2': {}, ...},
}
measures: list of string

The name of the measures to be used in the outputs. Each name corresponds to a dictionary nested in stat_dict.

logger: logging.Logger

Logger to save info and debug messages. Will send the log lines to the appropriate outputs following the logger configuration in main.py.

average_file_size(stat_dict, size_averaging) dict
average_modified_time(stat_dict, time_averaging) dict
create_csv(output_path: str | Path)

Produce the CSV (comma-separated value) file output at the target path.

As the name says, a CSV contains values separated by commas. In this case, each line represents a single file/directory with its identifier and the measures taken.

Format of each line is:

path,identifier,measure1,measure2,measure3,…

Order of measures (which ones are present depends on the config file options):

file_count, dir_count, file_size, modified_time

All are integers:

  • file_count = number of files directly under given directory (ignores files inside subdirectories)

  • dir_count = number of directories directly under given directory (ignores directories inside subdirectories)

  • file_size = size in bytes rounded to the nearest integer

  • modified_time = time of last modification in number of seconds since 1st January 1970 (Epoch time)

Parameters

output_path: pathlib.Path or string

Path to where the CSV should be saved.
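The line format above can be produced from stat_dict with the standard csv module. This is a sketch under assumptions: write_csv is a hypothetical helper, it writes no header row, and a measure missing for a given path is written as an empty field.

```python
import csv


def write_csv(stat_dict, measures, fileobj):
    """Write one 'path,identifier,measure1,...' row per path."""
    rows = {}
    # Invert measure -> identifier -> path into one row per (path, identifier).
    for measure in measures:
        for identifier, values in stat_dict.get(measure, {}).items():
            for path, value in values.items():
                rows.setdefault((path, identifier), {})[measure] = value
    writer = csv.writer(fileobj, lineterminator="\n")
    for (path, identifier), values in rows.items():
        writer.writerow([path, identifier] + [values.get(m, "") for m in measures])
```

Passing an open text file (or an io.StringIO for testing) as fileobj keeps the helper independent of where the CSV is saved.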

create_plots(save_path=None, show_plot=True, plots_per_measure=8)

Create a comparison plot for each measure given in a single figure.

One row of plots per measure. Will show the distribution for a given number of file/directory identifiers per measure. Will prioritize the distributions with the highest number of data points.

E.g. will show distributions for a directory type that was found 1000 times before directories that were found 500 times in the file structure, since each directory contributes one data point to its directory type distribution.

Parameters

save_path: pathlib.Path or string, default=None

The path to where the graphs should be saved. If not given or None, the plots will not be saved as a file.

show_plot: bool, default=True

Whether to show the plots with matplotlib.pyplot.show() before exiting the function.

plots_per_measure: int

How many identifiers will be included in the plots, starting from the ones with the highest number of occurrences. Corresponds to the number of columns of the plot figure, its rows being dictated by the number of measures taken.
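The selection of identifiers to plot, most frequent first, can be sketched for a single measure as follows. top_identifiers is a hypothetical helper operating on one stat_dict[measure] dictionary; it does no plotting.

```python
def top_identifiers(measure_values, plots_per_measure=8):
    """Return the identifiers with the most data points, most frequent first."""
    ranked = sorted(
        measure_values,
        key=lambda identifier: len(measure_values[identifier]),
        reverse=True,
    )
    return ranked[:plots_per_measure]
```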

create_summary(root: Path, configurations: dict) str

Produce the ‘Summary’ text file output.

This output highlights common file configurations if requested and will point to outliers for each measure and file/directory type.

Parameters

root: pathlib.Path

Path to the root directory (target directory) of the file structure.

configurations: dict

Contains the file configurations found for each file/directory identifier with the following structure:

configurations={
    'identifier1': [
        {'structure': ['identifier3', 'identifier4', 'identifier5'],
         'paths': ['path1', 'path2']},
        {'structure': ['identifier3', 'identifier5'], 'paths': ['path4']},
        ...
    ],
    'identifier2': [
        {'structure': [], 'paths': []},
        ...
    ],
}

Returns

string

The entire generated summary text as a single string, to be handled by main.py for saving or printing.

Module contents

Structure of the script.

  • main.py is where the sequential series of processing steps takes place.

  • statBuilder.py handles the creation of the output files.

  • identifierEngine.py contains the IdentifierEngine class.

    This class is used to extract the identifier string from files and directories based on the regular expression given to it (from the config file).

  • smartPath.py contains the SmartPath abstract class.

    The SmartFilePath and SmartDirectoryPath both inherit from it.

  • smartFilePath.py contains the SmartFilePath class,

    the implementation of SmartPath for files.

  • smartDirectoryPath.py contains the SmartDirectoryPath class,

    the implementation of SmartPath for directories.