Usage Guide
Command Line Utility
After installation with pip, url2features can be invoked from the command line:
>url2features
Without parameters it will print out an error and the following usage :
ERROR: MISSING ARGUMENTS
USAGE
url2features [ARGS] <PATH TO DATASET>
<PATH TO DATASET> - Supported file types: csv, tsv, xls, xlsx, odf
[ARGS] In most cases these are switches that turn on the feature type
-columns=<COMMA SEPARATED LIST>. REQUIRED
-simple Default: False. Features derived from the URL string: length, depth, components
-host Default: False. Features from the subdomain and domain registration (requires internet).
-tld Default: False. Features about the Top Level Domain.
-protocol Default: False. Features from the URL protocol.
-path Default: False. Features derived from the path between host and file
-file Default: False. Features derived from the final file type
-dns Default: False. Features derived from the DNS records (requires internet).
-np Deactivate use of column name prefix. Only works for a single column.
The list of columns to process and the path to the dataset are both mandatory.
The rest of the options turn on or off particular groups of features.
Python Package Usage
You can import the url2features package within python and then make use of the SciKit Learn Compatible Transformer for your ML Pipeline.
In the example below we initialise a URLTransform object that will generate the host and top level domain features for any dataframe that has a column of data named ‘URL_COL_NAME’
from url2features.pipeline import URLTransform
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('urltransform', URLTransform(['URL_COL_NAME'],['host','tld']) ),
('clf', SGDClassifier(loss='log') ),
])
Note that the transformer version of url2features will remove the original URL columns so that the resulting data set can be fed into an algorithm that requires numerical columns only. This means that if you need to do any other text feature engineering it be placed earlier in the pipeline.