url2features package

Submodules

url2features.cli module

url2features.main: provides entry point main().

url2features.cli.get_cmd_line_params(argv)[source]: Parse the options from an array of command-line arguments.

url2features.cli.main()[source]: Main url2features application entry point. Parses the command-line options and determines the size of the file, then processes the file to generate the requested features.

url2features.cli.print_usage(args)[source]: Command-line application usage instructions.

url2features.config module

url2features.dns module

url2features.dns.add_dns_features(df, col, add_prefix=True)[source]: Given a pandas dataframe and the name of a column containing a URL, calculate the DNS features.

url2features.dns.count_ips(url_parts)[source]: Return the number of resolved IPs (IPv4).

url2features.dns.count_mx_servers(url_parts)[source]: Return the number of resolved MX servers.

url2features.dns.count_name_servers(url_parts)[source]: Return the number of resolved name servers (NS).

url2features.dns.dns_features(df, columns, add_prefix=True)[source]: Given a pandas dataframe and a set of column names, calculate the DNS summary features and add them.

url2features.dns.get_country(url_parts)[source]: Return the country associated with IP.

url2features.dns.get_ptr(url_parts)[source]: Return PTR associated with IP.

url2features.dns.split_url_into_parts(url)[source]: Split URL into: protocol, host, path, params, query and fragment.
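A six-way split of this kind can be sketched with the standard library. This is only an illustration, assuming urllib.parse.urlparse is an acceptable stand-in; the real implementation and its return type may differ, and the name split_url_sketch is hypothetical.

```python
from urllib.parse import urlparse


def split_url_sketch(url):
    """Hypothetical helper: split a URL into its six components."""
    parts = urlparse(url)
    # The dictionary keys mirror the names used in the description above;
    # they are an assumption, not the library's documented return format.
    return {
        "protocol": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "params": parts.params,
        "query": parts.query,
        "fragment": parts.fragment,
    }


parts = split_url_sketch("https://example.com/a/b.html;v=1?q=2#top")
```

Note that urlparse only recognizes the host when the URL carries a protocol, which is one reason a step such as add_protocol_if_missing is useful before splitting.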

url2features.dns.valid_ip(host)[source]: Return whether the host has a valid IP format (IPv4 or IPv6).
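One way to perform such a format check is with the standard ipaddress module. The sketch below is an assumption about the approach, not the library's code, and valid_ip_sketch is a hypothetical name.

```python
import ipaddress


def valid_ip_sketch(host):
    """Hypothetical helper: True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Raised for hostnames, out-of-range octets, and malformed input.
        return False
```

Under this sketch, "192.168.0.1" and "::1" are accepted, while "example.com" and "999.1.1.1" are rejected.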

url2features.domain module

url2features.extension module

url2features.featurize module

url2features.featurize.generate_feature_function(parameters)[source]: Take the processed command-line arguments that determine which features to apply and partially apply them to the process_df function, returning a function that can be used to apply those parameters to multiple chunks of a dataframe.
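The partial-application pattern described here can be sketched with functools.partial. The stand-in process_rows_sketch works on plain lists of dictionaries rather than dataframes, and both function names are hypothetical; the real process_df does considerably more.

```python
from functools import partial


def process_rows_sketch(rows, params):
    """Hypothetical stand-in for process_df: tag each row with the requested features."""
    return [{**row, "features": sorted(params)} for row in rows]


def generate_feature_function_sketch(parameters):
    """Bind the parameters once so the result can be reused for every chunk."""
    return partial(process_rows_sketch, params=parameters)


apply_features = generate_feature_function_sketch({"simple", "dns"})
chunk_a = apply_features([{"url": "example.com"}])
chunk_b = apply_features([{"url": "example.org"}])
```

Binding the parameters up front means the chunk-processing loop only needs to know about a one-argument function.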

url2features.featurize.process_df(df, params)[source]: Coordinate the process of generating the features.

url2features.file module

url2features.file.add_file_features(df, col, add_prefix)[source]: Given a pandas dataframe and a column name, calculate the file features.

url2features.file.extract_word_stats(path)[source]

url2features.file.file_extension_lookup(ext)[source]: Given a file extension, return its type.

url2features.file.file_extension_lookup_old(ext)[source]: Given a file extension, return its frequency and type.

url2features.file.file_features(df, columns, add_prefix=True)[source]: Given a pandas dataframe and a set of column names, calculate the file type summary features and add them.

url2features.file.remove_extension(file_name)[source]

url2features.pipeline module

class url2features.pipeline.URLTransform(columns, transforms=['simple'])[source]

Bases: TransformerMixin, BaseEstimator

This class implements a scikit-learn compatible transformer that converts a column containing a URL into a series of numeric values. You specify the transformations you want as an array of named feature groups.

columns: Array(String). Names of the text columns to process.

transforms: Array(String). Names of the feature sets to generate.

Options: simple, domain, extension

fit(X, y=None, **fit_params)[source]

generate_feature_config(columns, params)[source]: Process the params into the data structure that the dataframe processing function expects.

transform(X, y=None, **transform_params)[source]

Transform the matrix of values. This needs to handle both single and multiple columns.

url2features.process module

url2features.process.add_protocol_if_missing(x)[source]: Determine if the URL begins with any form of protocol and add a default protocol if it is absent.
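A minimal sketch of this check, assuming a protocol is anything of the form scheme:// and that the default is "http://". Both choices are assumptions, as is the name add_protocol_sketch.

```python
import re

# A scheme starts with a letter, followed by letters, digits, "+", "." or "-",
# and is terminated here by "://". Schemes without "//" (such as mailto:) are
# not recognized by this simplified pattern.
_PROTOCOL_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://")


def add_protocol_sketch(url, default="http://"):
    """Hypothetical helper: prepend a default protocol when none is present."""
    if _PROTOCOL_RE.match(url):
        return url
    return default + url
```

With this sketch, "example.com/page" becomes "http://example.com/page", while "ftp://example.org" is returned unchanged.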

url2features.process.count_lines(path_to_file)[source]: Return a count of the total lines in a file, computed in a way that makes file size irrelevant.
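One common way to count lines without loading the whole file is to read fixed-size binary chunks and count newline bytes. The sketch below assumes that approach; the actual implementation is not shown here, and count_lines_sketch and its chunk_size parameter are hypothetical.

```python
import os
import tempfile


def count_lines_sketch(path, chunk_size=1024 * 1024):
    """Hypothetical helper: count newline bytes while holding one chunk in memory."""
    count = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    # A final line without a trailing newline is not counted by this sketch.
    return count


# Demonstration on a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
line_total = count_lines_sketch(tmp.name)
os.remove(tmp.name)
```

Memory use is bounded by chunk_size regardless of how large the file is.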

url2features.process.end_profile(proc_name)[source]

url2features.process.eprint(*args, **kwargs)[source]

url2features.process.extract_file_extension(path_to_file)[source]

url2features.process.initialise_profile()[source]

url2features.process.isNaN(num)[source]

url2features.process.len_or_null(val)[source]: Alternative len function that will simply return numpy.NA for invalid values. This is needed to get sensible results when running len over a column that may contain nulls
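The idea of a null-tolerant length can be sketched as follows. This version returns math.nan from the standard library as the missing-value marker, which is an assumption made to keep the example dependency-free; len_or_null_sketch is a hypothetical name.

```python
import math


def len_or_null_sketch(val):
    """Hypothetical helper: length of val, or NaN when val has no usable length."""
    # Missing values in a column typically appear as None or a float NaN.
    if val is None or (isinstance(val, float) and math.isnan(val)):
        return math.nan
    try:
        return len(val)
    except TypeError:
        # Objects without a length (for example, integers) also map to NaN.
        return math.nan
```

Returning NaN instead of raising lets the function be applied across an entire column without special-casing the null rows.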

url2features.process.load_complete_dataframe(path_to_file)[source]: We load the entire dataset into memory, using the file extension to determine the expected format. We are using encoding='latin1' because it appears to permit loading of the largest variety of files. Representation of strings may not be perfect, but is not important for generating a summarization of the entire dataset.

url2features.process.load_dictionary(filename, escape=False)[source]: Utility function to load a json serialised dictionary

url2features.process.load_file(filename)[source]: Utility function to load a raw data file

url2features.process.load_word_list(filename, escape=False)[source]: Utility function to load topic vocab word lists for pattern matching.

url2features.process.load_word_pattern(filename, prefix='', pluralize=True, bound=True, escape=False)[source]

url2features.process.padded(k, padto=20)[source]

url2features.process.print_output(df, header=True)[source]

url2features.process.print_profiles()[source]

url2features.process.process_file_in_chunks(path_to_file, function_to_apply)[source]: Given a path to a large dataset, iteratively load it in chunks, apply the supplied function to each chunk, and write the result to the output stream.
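The chunked load-apply-write loop can be illustrated with plain iterators. The sketch below deliberately differs from the documented signature: it takes an iterable of lines and an explicit output stream so that it runs without pandas or a file on disk. The name process_in_chunks_sketch and the chunk_size parameter are hypothetical.

```python
import io
from itertools import islice


def process_in_chunks_sketch(lines, function_to_apply, out, chunk_size=2):
    """Hypothetical helper: apply a function chunk by chunk and stream the results."""
    iterator = iter(lines)
    while True:
        # Pull at most chunk_size items; an empty list means the input is exhausted.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        for result in function_to_apply(chunk):
            out.write(result + "\n")


source = io.StringIO("a.com\nb.org\nc.net\n")
sink = io.StringIO()
process_in_chunks_sketch(
    (line.strip() for line in source),
    lambda chunk: [f"{url},{len(url)}" for url in chunk],
    sink,
)
```

Only one chunk is held in memory at a time, which is the property that makes this pattern suitable for large datasets.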

url2features.process.remove_escapes_and_non_printable(text)[source]: Apply the codecs escape decoding to decode any escaped characters, then apply a regex to remove any non-printable characters.
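A two-step cleanup of this kind might look like the sketch below. It assumes the "unicode_escape" codec for decoding and treats only printable ASCII (space through tilde) as printable; both are assumptions, and clean_text_sketch is a hypothetical name.

```python
import codecs
import re

# Assumption: "printable" means ASCII 0x20 through 0x7E. This also strips tabs,
# newlines, and any non-ASCII characters, which may be stricter than the library.
_NON_PRINTABLE_RE = re.compile(r"[^\x20-\x7E]")


def clean_text_sketch(text):
    """Hypothetical helper: decode escape sequences, then drop non-printable characters."""
    decoded = codecs.decode(text, "unicode_escape")
    return _NON_PRINTABLE_RE.sub("", decoded)


# The raw string contains a literal backslash-t and backslash-x07.
cleaned = clean_text_sketch(r"hello\tworld\x07!")
```

Decoding first matters: the escaped tab and bell only become removable control characters after the escape sequences are interpreted.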

url2features.process.remove_tags(text)[source]

url2features.process.remove_urls(text)[source]

url2features.process.remove_urls_and_tags(text)[source]: Remove any obvious text elements that appear to be either URLs or HTML tags
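A regex-based sketch of this kind of cleanup follows. The two patterns are simplified assumptions about what counts as an "obvious" URL or tag, and remove_urls_and_tags_sketch is a hypothetical name.

```python
import re

# Simplified patterns: anything between angle brackets is treated as a tag,
# and a URL is a run of non-space characters starting with http(s):// or www.
_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"https?://\S+|www\.\S+")


def remove_urls_and_tags_sketch(text):
    """Hypothetical helper: strip tags first, then URLs, then tidy whitespace."""
    without_tags = _TAG_RE.sub(" ", text)
    without_urls = _URL_RE.sub(" ", without_tags)
    return " ".join(without_urls.split())


cleaned = remove_urls_and_tags_sketch(
    "<p>See <b>this</b> at https://example.com/x or www.example.org now</p>"
)
```

Tags are removed before URLs so that a closing tag directly after a URL is not swallowed by the URL pattern.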

url2features.process.start_profile(proc_name)[source]

url2features.registration module

url2features.registration.get_domain_registration_date(domain)[source]

url2features.registration.get_registration_year(domain)[source]

url2features.simple module

url2features.simple.add_simple_features(df, col, add_prefix=True)[source]: Given a pandas dataframe and a column name, calculate the simple features.

url2features.simple.null_tolerant_depth(x)[source]

url2features.simple.null_tolerant_len(x)[source]

url2features.simple.remove_protocol_and_trim(url)[source]

url2features.simple.simple_features(df, columns, add_prefix=True)[source]: Given a pandas dataframe and a set of column names, calculate the simple text summary features and add them.

url2features.suffixes module

url2features.suffixes.split_domain_and_suffix(domain)[source]: Strip a domain of its suffix
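Suffix stripping can be sketched as a longest-match lookup against a list of known suffixes. The tiny suffix list, the tuple return value, and the name split_domain_and_suffix_sketch are all assumptions made for illustration; the library's own suffix data and return format may differ.

```python
# Hypothetical, deliberately tiny suffix list; a real list would be far larger.
KNOWN_SUFFIXES = ("co.uk", "com.au", "com", "org", "net")


def split_domain_and_suffix_sketch(domain):
    """Hypothetical helper: return (domain without suffix, suffix)."""
    # Try the longest suffixes first so multi-part suffixes are matched
    # before any shorter candidate.
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if domain.endswith("." + suffix):
            return domain[: -(len(suffix) + 1)], suffix
    # No known suffix: return the domain unchanged with an empty suffix.
    return domain, ""
```

For example, "example.co.uk" splits into ("example", "co.uk") and "www.example.com" into ("www.example", "com") under this sketch.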