url2features package
Submodules
url2features.cli module
url2features.main: provides entry point main().
- url2features.cli.get_cmd_line_params(argv)[source]
Parse the options from an array of command-line arguments.
url2features.config module
url2features.dns module
- url2features.dns.add_dns_features(df, col, add_prefix=True)[source]
Given a pandas dataframe and the name of a column containing URLs, calculate the DNS features.
- url2features.dns.dns_features(df, columns, add_prefix=True)[source]
Given a pandas dataframe and a set of column names, calculate the DNS summary features and add them.
url2features.domain module
url2features.extension module
url2features.featurize module
- url2features.featurize.generate_feature_function(parameters)[source]
This function takes the processed command-line arguments that determine which features to apply and partially applies them to the process_df function, returning a function that can be used to apply those parameters to multiple chunks of a dataframe.
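The partial-application pattern described above can be sketched with `functools.partial`. This is a minimal illustration, not the package's implementation: `process_chunk`, `make_feature_function`, and the keys of the `parameters` dict are hypothetical stand-ins, and plain lists of dicts are used in place of dataframe chunks.

```python
from functools import partial


def process_chunk(chunk, features, add_prefix=True):
    """Hypothetical stand-in for process_df: tag each row with feature names."""
    prefix = "url_" if add_prefix else ""
    return [{**row, "features": [prefix + f for f in features]} for row in chunk]


def make_feature_function(parameters):
    """Bind the parsed parameters once; the result takes only a chunk."""
    return partial(
        process_chunk,
        features=parameters["features"],
        add_prefix=parameters.get("add_prefix", True),
    )


# Build the function once, then reuse it for every chunk.
apply_features = make_feature_function({"features": ["simple", "domain"]})
chunks = [[{"url": "example.com"}], [{"url": "example.org/a.pdf"}]]
results = [apply_features(chunk) for chunk in chunks]
```

Binding the parameters up front keeps the per-chunk call site free of configuration details.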
url2features.file module
- url2features.file.add_file_features(df, col, add_prefix)[source]
Given a pandas dataframe and a column name, calculate the file features.
- url2features.file.file_extension_lookup_old(ext)[source]
Given a file extension, return its frequency and type.
url2features.pipeline module
- class url2features.pipeline.URLTransform(columns, transforms=['simple'])[source]
Bases: TransformerMixin, BaseEstimator
This class implements a scikit-learn compatible Transformer for converting a column containing a URL into a series of numeric values. You specify the transformations you want as an array of named feature groups.
columns: Array(String) Names of the text columns to process.
transforms: Array(String) Names of the feature sets to generate. Options: simple, domain, extension
url2features.process module
- url2features.process.add_protocol_if_missing(x)[source]
Determine whether the URL begins with any form of protocol, and add a default protocol if it is absent.
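One way such a check could work is a regular expression that looks for a leading scheme. The sketch below is illustrative only: the name `ensure_protocol`, the pattern, and the `http://` default are assumptions, since the package's actual default protocol is not stated here.

```python
import re

# Matches a leading scheme such as "http://", "HTTPS://" or "ftp://".
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://")


def ensure_protocol(url, default="http://"):
    """Return url unchanged if it has a scheme, else prepend the default.

    The "http://" default is an assumption made for this sketch.
    """
    url = url.strip()
    if SCHEME_RE.match(url):
        return url
    return default + url
```

Normalising URLs this way lets a standard parser such as `urllib.parse.urlparse` recognise the host portion consistently.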
- url2features.process.count_lines(path_to_file)[source]
Return a count of the total lines in a file, computed in a way that works regardless of file size.
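A common way to count lines without loading the whole file is to read fixed-size binary blocks and count newline bytes. The following is a sketch under that assumption; `count_lines_buffered` and `block_size` are hypothetical names, and the package may use a different technique.

```python
import os
import tempfile


def count_lines_buffered(path_to_file, block_size=1024 * 1024):
    """Count newline bytes by reading the file one block at a time."""
    total = 0
    with open(path_to_file, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            total += block.count(b"\n")
    return total


# Small demonstration on a temporary file with three newline-terminated lines.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("url\nexample.com\nexample.org\n")
line_total = count_lines_buffered(tmp.name)
os.remove(tmp.name)
```

Memory use is bounded by `block_size`, so the same code handles small and very large files. Note that a final line without a trailing newline is not counted by this approach.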
- url2features.process.len_or_null(val)[source]
Alternative len function that simply returns numpy.NA for invalid values. This is needed to get sensible results when running len over a column that may contain nulls.
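The idea of a null-tolerant length can be sketched as follows. This is an illustration, not the package's code: `safe_len` is a hypothetical name, and `math.nan` stands in for the null value mentioned in the description so that the example needs only the standard library.

```python
import math


def safe_len(val):
    """Return len(val), or NaN when val has no usable length."""
    try:
        return len(val)
    except TypeError:
        # None, floats (including NaN) and other unsized values land here.
        return math.nan


lengths = [safe_len(v) for v in ["example.com", None, 3.5, ""]]
```

Returning a null marker instead of raising means the function can be mapped over a whole column without first filtering out missing values.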
- url2features.process.load_complete_dataframe(path_to_file)[source]
We load the entire dataset into memory, using the file extension to determine the expected format. We are using encoding='latin1' because it appears to permit loading of the largest variety of files. Representation of strings may not be perfect, but is not important for generating a summarization of the entire dataset.
- url2features.process.load_dictionary(filename, escape=False)[source]
Utility function to load a json serialised dictionary
- url2features.process.load_word_list(filename, escape=False)[source]
Utility function to load topic vocab word lists for pattern matching.
- url2features.process.load_word_pattern(filename, prefix='', pluralize=True, bound=True, escape=False)[source]
- url2features.process.process_file_in_chunks(path_to_file, function_to_apply)[source]
Given a path to a large dataset, we iteratively load it in chunks, apply the supplied function to each chunk, and write the result to the output stream.
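The chunk-at-a-time pattern can be sketched without any third-party dependency. This is an assumption-laden illustration: `process_lines_in_chunks`, `chunk_size`, and `out` are hypothetical, and the sketch takes an iterable of records rather than a file path so that it stays self-contained.

```python
import io
import sys
from itertools import islice


def process_lines_in_chunks(lines, function_to_apply, chunk_size=2, out=sys.stdout):
    """Apply function_to_apply to successive chunks and write each result."""
    iterator = iter(lines)
    while True:
        # Pull at most chunk_size items; an empty list means we are done.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        for item in function_to_apply(chunk):
            out.write(f"{item}\n")


# Demonstration: write the length of each URL, two URLs per chunk.
buffer = io.StringIO()
urls = ["a.com", "example.org", "x.io"]
process_lines_in_chunks(urls, lambda chunk: [len(u) for u in chunk], out=buffer)
```

Only one chunk is held in memory at a time, which is the property that makes this approach suitable for large datasets.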
- url2features.process.remove_escapes_and_non_printable(text)[source]
Apply the codecs escape decoding to decode any escaped characters, then apply a regex to remove any non-printable characters.
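The two steps described above might look like the sketch below. The name `clean_text`, the choice of the `unicode_escape` codec, and the definition of "printable" as the ASCII range from space to tilde are all assumptions made for illustration.

```python
import codecs
import re

# Anything outside the printable ASCII range (space through tilde).
NON_PRINTABLE_RE = re.compile(r"[^\x20-\x7E]")


def clean_text(text):
    """Decode backslash escapes, then drop non-printable characters."""
    try:
        decoded = codecs.decode(text, "unicode_escape")
    except UnicodeDecodeError:
        # Malformed escape sequence: fall back to the raw text.
        decoded = text
    return NON_PRINTABLE_RE.sub("", decoded)
```

Under this definition, tabs and newlines produced by decoding are removed along with other control characters, and non-ASCII input is not preserved, so the sketch suits ASCII-oriented text only.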