login_field_detector package

Submodules

login_field_detector.field_detector module

class login_field_detector.field_detector.LoginFieldDetector(model_dir=None, labels=['UNLABELED', 'USERNAME', 'PHONE_NUMBER', 'PASSWORD', 'LOGIN_BUTTON', 'TWO_FACTOR_AUTH', 'SOCIAL_LOGIN_BUTTONS', 'CAPTCHA', 'LANGUAGE_SWITCH', 'FORGOT_PASSWORD', 'SIGN_UP', 'REMEMBER_ME', 'HELP_LINK', 'PRIVACY_POLICY', 'TERMS_OF_SERVICE', 'NAVIGATION_LINK', 'BANNER', 'ADVERTISEMENTS', 'COOKIE_POLICY', 'IMPRINT'], device=None)

Bases: object

Model for login field detection using BERT.

create_dataset(inputs, labels): Align 1D labels with tokenized inputs.

evaluate(dataset): Evaluate the model on a dataset and plot confusion matrix.

plot_confusion_matrix(true_labels, pred_labels): Plot confusion matrix with improved x-axis label spacing.

predict(url=None, html_content=None, probability_threshold=0.9)

Make predictions on new HTML content.

Multiple entries per label are allowed when their probability is above the specified threshold; the entries are sorted by probability.
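
The thresholding and sorting described above can be sketched as follows. This is a minimal illustration, not the package's code: the candidate dictionaries, their keys (`label`, `probability`, `xpath`), the helper name `filter_predictions`, and the use of `>=` for the threshold comparison are all assumptions, since the return format of `predict` is not documented here.

```python
from collections import defaultdict


def filter_predictions(candidates, probability_threshold=0.9):
    """Group candidates by label, drop those below the threshold,
    and sort each group by descending probability.

    `candidates` is a hypothetical list of dicts; the real predict()
    output may be shaped differently.
    """
    grouped = defaultdict(list)
    for item in candidates:
        if item["probability"] >= probability_threshold:
            grouped[item["label"]].append(item)
    for items in grouped.values():
        items.sort(key=lambda c: c["probability"], reverse=True)
    return dict(grouped)


# Hypothetical scored elements from one page.
candidates = [
    {"label": "USERNAME", "probability": 0.97, "xpath": "/html/body/form/input[1]"},
    {"label": "USERNAME", "probability": 0.99, "xpath": "/html/body/form/input[3]"},
    {"label": "PASSWORD", "probability": 0.95, "xpath": "/html/body/form/input[2]"},
    {"label": "CAPTCHA", "probability": 0.40, "xpath": "/html/body/div[2]"},
]
result = filter_predictions(candidates)
```

With the default threshold of 0.9, both USERNAME candidates survive (highest probability first) and the CAPTCHA candidate is dropped.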

process_urls(urls, force=False, o_label_ratio=0.5, screenshots=False)

Preprocess the URLs for training the model and balance the data.

Parameters:

urls – List of URLs to process.
force – Force re-fetching all URLs.
screenshots – Whether to take screenshots of the URLs being processed.
o_label_ratio – Ratio of ‘UNLABELED’ labels to retain.

Returns:

Filtered inputs and labels.
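
One way to picture the effect of o_label_ratio is random downsampling of the 'UNLABELED' class. The sketch below is an assumption about the balancing step, not the package's implementation: the helper name `downsample_unlabeled`, the `seed` parameter, and the choice of uniform random sampling are all hypothetical.

```python
import random


def downsample_unlabeled(inputs, labels, o_label_ratio=0.5, seed=0):
    """Keep every labeled sample and a random fraction of 'UNLABELED' ones.

    Hypothetical helper: the real process_urls() may balance differently.
    """
    rng = random.Random(seed)
    unlabeled_idx = [i for i, lab in enumerate(labels) if lab == "UNLABELED"]
    keep_count = int(len(unlabeled_idx) * o_label_ratio)
    kept = set(rng.sample(unlabeled_idx, keep_count))

    filtered_inputs, filtered_labels = [], []
    for i, (text, lab) in enumerate(zip(inputs, labels)):
        if lab != "UNLABELED" or i in kept:
            filtered_inputs.append(text)
            filtered_labels.append(lab)
    return filtered_inputs, filtered_labels


# Eight unlabeled fields and two labeled ones.
inputs = [f"field-{i}" for i in range(10)]
labels = ["UNLABELED"] * 8 + ["USERNAME", "PASSWORD"]
filtered_inputs, filtered_labels = downsample_unlabeled(inputs, labels)
```

With the default ratio of 0.5, four of the eight 'UNLABELED' samples are retained alongside both labeled samples.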

train(urls=None, epochs=10, batch_size=16, force=False, screenshots=False)

Train the model.

Parameters:

urls – List of URLs to fetch and process for training.
epochs – Number of training epochs.
batch_size – Batch size for training.
screenshots – Whether to take screenshots of the URLs being processed.
force – Force re-fetching and reprocessing all URLs.

visualize_class_distribution(labels): Plot class distribution.

class login_field_detector.field_detector.WeightedTrainer(class_weights, device, *args, **kwargs)

Bases: Trainer

Custom Trainer with weighted loss for imbalanced classes.

compute_loss(model, inputs, return_outputs=False, **kwargs)

How the loss is computed by Trainer. By default, all models return the loss in the first element.

Subclass and override for custom behavior.
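
To illustrate what a weighted loss for imbalanced classes does, here is a dependency-free sketch of class-weighted cross-entropy. It is not the code of WeightedTrainer: the inverse-frequency weighting scheme and both helper names are assumptions, and the loss is normalized by the sum of the weights, which is the convention PyTorch's CrossEntropyLoss uses with a weight tensor and mean reduction.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


def weighted_cross_entropy(probabilities, targets, class_weights):
    """Weighted mean of -log p(target), normalized by the sum of weights."""
    weighted_sum = 0.0
    weight_total = 0.0
    for probs, target in zip(probabilities, targets):
        w = class_weights[target]
        weighted_sum += -w * math.log(probs[target])
        weight_total += w
    return weighted_sum / weight_total


# An imbalanced toy label set: the rare class gets the larger weight.
train_labels = ["UNLABELED"] * 8 + ["PASSWORD"] * 2
weights = inverse_frequency_weights(train_labels)

# Hypothetical predicted distributions for two samples.
probabilities = [
    {"UNLABELED": 0.9, "PASSWORD": 0.1},
    {"UNLABELED": 0.6, "PASSWORD": 0.4},
]
targets = ["UNLABELED", "PASSWORD"]
loss = weighted_cross_entropy(probabilities, targets, weights)
```

Because the PASSWORD sample carries four times the weight of the UNLABELED one here, its misclassification dominates the loss, which is the intended effect of weighting on a skewed label distribution.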

login_field_detector.field_detector.compute_metrics(pred)

login_field_detector.field_detector.download_model_files(root_dir)

Downloads the necessary model files from Hugging Face Hub.

Returns paths to the downloaded model and tokenizer files.

login_field_detector.html_feature_extractor module

class login_field_detector.html_feature_extractor.HTMLFeatureExtractor(label2id, oauth_providers=None)

Bases: object

get_features(html_text): Extract tokens, labels, xpaths, and bounding boxes from an HTML file.

login_field_detector.html_feature_extractor.determine_label(tag): Determine the label of an HTML tag based on patterns.

login_field_detector.html_feature_extractor.generate_language_switch()

login_field_detector.html_feature_extractor.get_xpath(element): Generate XPath for a given BeautifulSoup element.
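
The general technique of building an XPath by walking from an element up to the root can be sketched as below. Note the substitution: this example uses the standard library's xml.etree.ElementTree with an explicit parent map rather than BeautifulSoup, so that it runs without third-party packages. The helper names `build_parent_map` and `xpath_for` are hypothetical, and the exact path format the package's get_xpath produces is not documented here.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """ElementTree nodes have no parent pointer, so record one per child."""
    return {child: parent for parent in root.iter() for child in parent}


def xpath_for(element, parent_map):
    """Walk up to the root, recording each tag and its 1-based position
    among same-tag siblings."""
    parts = []
    node = element
    while node is not None:
        parent = parent_map.get(node)
        if parent is None:
            parts.append(node.tag)  # root element: no index needed
        else:
            same_tag = [sib for sib in parent if sib.tag == node.tag]
            parts.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    return "/" + "/".join(reversed(parts))


# A well-formed snippet standing in for a login page.
root = ET.fromstring(
    "<html><body><form>"
    "<input name='user'/><input name='pass'/>"
    "</form></body></html>"
)
parent_map = build_parent_map(root)
password_input = root.find(".//input[@name='pass']")
path = xpath_for(password_input, parent_map)  # "/html/body[1]/form[1]/input[2]"
```

A BeautifulSoup version would follow the same loop, using the element's own parent reference instead of a separate parent map.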

login_field_detector.html_feature_extractor.is_item_visible(tag)

login_field_detector.html_feature_extractor.preprocess_field(tag): Preprocess an HTML token to include text, parent, sibling, and metadata.

login_field_detector.html_fetcher module

class login_field_detector.html_fetcher.HTMLFetcher(cache_dir=None, ttl=604800, max_concurrency=4)

Bases: object

fetch_all(urls, force=False, screenshot=False)

Synchronously fetch multiple URLs.

Parameters:

urls – List of URLs to fetch.
force – Force fetching even if URLs are cached or marked as failed.
screenshot – Whether to take screenshots of the pages.

Returns:

A dictionary mapping URLs to their HTML content.

fetch_html(url, force=False, screenshot=False)

Synchronously fetch a single URL.

Parameters:

url – URL to fetch.
force – Force fetching even if the URL is cached or marked as failed.
screenshot – Whether to take a screenshot of the page.

Returns:

HTML content as a string or None if failed.
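
The default ttl of 604800 is seven days expressed in seconds. How ttl and force might interact when deciding whether to fetch can be sketched as below. This is an assumption about the caching behavior, not the fetcher's actual logic: the cache entry shape (a dict with `fetched_at` and `html`, where `html` is None for a failed fetch) and the helper name `should_fetch` are hypothetical.

```python
TTL_SECONDS = 604800  # default ttl from the HTMLFetcher signature (7 days)


def should_fetch(entry, now, force=False, ttl=TTL_SECONDS):
    """Decide whether a URL needs a network fetch.

    `entry` is a hypothetical cache record such as
    {"fetched_at": <unix time>, "html": <str or None if the fetch failed>},
    or None when the URL has never been seen.
    """
    if force or entry is None:
        return True
    # Assumed: both cached pages and recorded failures expire after ttl.
    return (now - entry["fetched_at"]) >= ttl


now = 1_700_000_000  # fixed timestamp so the example is deterministic
fresh = {"fetched_at": now - 3600, "html": "<html></html>"}
stale = {"fetched_at": now - TTL_SECONDS - 1, "html": "<html></html>"}
failed = {"fetched_at": now - 3600, "html": None}

decisions = {
    "never seen": should_fetch(None, now),
    "fresh": should_fetch(fresh, now),
    "stale": should_fetch(stale, now),
    "failed, not forced": should_fetch(failed, now),
    "failed, forced": should_fetch(failed, now, force=True),
}
```

Under these assumptions, only the fresh entry and the unforced recent failure skip the network; passing force=True overrides both.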

Module contents

class login_field_detector.HTMLFeatureExtractor(label2id, oauth_providers=None)

Bases: object

get_features(html_text): Extract tokens, labels, xpaths, and bounding boxes from an HTML file.

class login_field_detector.HTMLFetcher(cache_dir=None, ttl=604800, max_concurrency=4)

Bases: object

fetch_all(urls, force=False, screenshot=False)

Synchronously fetch multiple URLs.

Parameters:

urls – List of URLs to fetch.
force – Force fetching even if URLs are cached or marked as failed.
screenshot – Whether to take screenshots of the pages.

Returns:

A dictionary mapping URLs to their HTML content.

fetch_html(url, force=False, screenshot=False)

Synchronously fetch a single URL.

Parameters:

url – URL to fetch.
force – Force fetching even if the URL is cached or marked as failed.
screenshot – Whether to take a screenshot of the page.

Returns:

HTML content as a string or None if failed.

class login_field_detector.LoginFieldDetector(model_dir=None, labels=['UNLABELED', 'USERNAME', 'PHONE_NUMBER', 'PASSWORD', 'LOGIN_BUTTON', 'TWO_FACTOR_AUTH', 'SOCIAL_LOGIN_BUTTONS', 'CAPTCHA', 'LANGUAGE_SWITCH', 'FORGOT_PASSWORD', 'SIGN_UP', 'REMEMBER_ME', 'HELP_LINK', 'PRIVACY_POLICY', 'TERMS_OF_SERVICE', 'NAVIGATION_LINK', 'BANNER', 'ADVERTISEMENTS', 'COOKIE_POLICY', 'IMPRINT'], device=None)

Bases: object

Model for login field detection using BERT.

create_dataset(inputs, labels): Align 1D labels with tokenized inputs.

evaluate(dataset): Evaluate the model on a dataset and plot confusion matrix.

plot_confusion_matrix(true_labels, pred_labels): Plot confusion matrix with improved x-axis label spacing.

predict(url=None, html_content=None, probability_threshold=0.9)

Make predictions on new HTML content.

Multiple entries per label are allowed when their probability is above the specified threshold; the entries are sorted by probability.

process_urls(urls, force=False, o_label_ratio=0.5, screenshots=False)

Preprocess the URLs for training the model and balance the data.

Parameters:

urls – List of URLs to process.
force – Force re-fetching all URLs.
screenshots – Whether to take screenshots of the URLs being processed.
o_label_ratio – Ratio of ‘UNLABELED’ labels to retain.

Returns:

Filtered inputs and labels.

train(urls=None, epochs=10, batch_size=16, force=False, screenshots=False)

Train the model.

Parameters:

urls – List of URLs to fetch and process for training.
epochs – Number of training epochs.
batch_size – Batch size for training.
screenshots – Whether to take screenshots of the URLs being processed.
force – Force re-fetching and reprocessing all URLs.

visualize_class_distribution(labels): Plot class distribution.

login_field_detector.determine_label(tag): Determine the label of an HTML tag based on patterns.