login_field_detector package
Submodules
login_field_detector.field_detector module
- class login_field_detector.field_detector.LoginFieldDetector(model_dir=None, labels=['UNLABELED', 'USERNAME', 'PHONE_NUMBER', 'PASSWORD', 'LOGIN_BUTTON', 'TWO_FACTOR_AUTH', 'SOCIAL_LOGIN_BUTTONS', 'CAPTCHA', 'LANGUAGE_SWITCH', 'FORGOT_PASSWORD', 'SIGN_UP', 'REMEMBER_ME', 'HELP_LINK', 'PRIVACY_POLICY', 'TERMS_OF_SERVICE', 'NAVIGATION_LINK', 'BANNER', 'ADVERTISEMENTS', 'COOKIE_POLICY', 'IMPRINT'], device=None)
Bases: object
Model for login field detection using BERT.
- create_dataset(inputs, labels)
Align 1D labels with tokenized inputs.
- evaluate(dataset)
Evaluate the model on a dataset and plot confusion matrix.
- plot_confusion_matrix(true_labels, pred_labels)
Plot confusion matrix with improved x-axis label spacing.
- predict(url=None, html_content=None, probability_threshold=0.9)
Make predictions on new HTML content.
Multiple entries per label are allowed when their probability exceeds the specified threshold; entries are sorted by probability.
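The thresholding and sorting described for predict can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the `filter_predictions` helper and the `(label, text, probability)` tuple layout are hypothetical, and treating a probability equal to the threshold as a match is also an assumption.

```python
from collections import defaultdict


def filter_predictions(candidates, probability_threshold=0.9):
    """Group candidates by label, keeping those at or above the threshold.

    `candidates` is a hypothetical list of (label, text, probability) tuples;
    the actual return format of `predict` is not specified in this reference.
    """
    grouped = defaultdict(list)
    for label, text, probability in candidates:
        if probability >= probability_threshold:
            grouped[label].append((text, probability))
    # Sort each label's entries from most to least probable.
    return {
        label: sorted(entries, key=lambda entry: entry[1], reverse=True)
        for label, entries in grouped.items()
    }


candidates = [
    ("USERNAME", "input#email", 0.97),
    ("USERNAME", "input#user", 0.99),
    ("PASSWORD", "input#pass", 0.95),
    ("CAPTCHA", "div.captcha", 0.42),
]
result = filter_predictions(candidates)
```

Here both USERNAME candidates survive and are ordered by probability, while the low-confidence CAPTCHA candidate is dropped.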
- process_urls(urls, force=False, o_label_ratio=0.5, screenshots=False)
Preprocess the URLs for training the model and balance the data.
- Parameters:
urls – List of URLs to process.
force – Force re-fetching all URLs.
screenshots – Whether to take screenshots of the URLs being processed.
o_label_ratio – Ratio of ‘UNLABELED’ labels to retain.
- Returns:
Filtered inputs and labels.
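One way to read the balancing step is as random downsampling of the 'UNLABELED' class. The sketch below assumes that `o_label_ratio` is the fraction of 'UNLABELED' samples to keep; the `downsample_unlabeled` helper, its `seed` parameter, and the sample data are hypothetical and not part of the package.

```python
import random


def downsample_unlabeled(inputs, labels, o_label_ratio=0.5, seed=0):
    """Keep every labeled sample and a random fraction of 'UNLABELED' ones.

    Assumption: `o_label_ratio` is the fraction of 'UNLABELED' samples kept.
    """
    rng = random.Random(seed)
    unlabeled = [i for i, label in enumerate(labels) if label == "UNLABELED"]
    keep_count = int(len(unlabeled) * o_label_ratio)
    kept_unlabeled = set(rng.sample(unlabeled, keep_count))
    # Preserve the original order of the retained samples.
    kept = [
        i for i, label in enumerate(labels)
        if label != "UNLABELED" or i in kept_unlabeled
    ]
    return [inputs[i] for i in kept], [labels[i] for i in kept]


inputs = ["nav", "email", "footer", "pass", "ad", "banner text"]
labels = ["UNLABELED", "USERNAME", "UNLABELED", "PASSWORD", "UNLABELED", "UNLABELED"]
filtered_inputs, filtered_labels = downsample_unlabeled(inputs, labels)
```

With four 'UNLABELED' samples and a ratio of 0.5, two of them are retained alongside the USERNAME and PASSWORD samples.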
- train(urls=None, epochs=10, batch_size=16, force=False, screenshots=False)
Train the model.
- Parameters:
urls – List of URLs to fetch and process for training.
epochs – Number of training epochs.
batch_size – Batch size for training.
screenshots – Whether to take screenshots of the URLs being processed.
force – Force re-fetching and reprocessing all URLs.
- visualize_class_distribution(labels)
Plot class distribution.
- class login_field_detector.field_detector.WeightedTrainer(class_weights, device, *args, **kwargs)
Bases: Trainer
Custom Trainer with weighted loss for imbalanced classes.
- compute_loss(model, inputs, return_outputs=False, **kwargs)
How the loss is computed by Trainer. By default, all models return the loss in the first element.
Subclass and override for custom behavior.
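The idea behind a class-weighted loss can be shown without any deep learning framework. The sketch below computes a weighted cross-entropy in pure Python, normalizing by the sum of the applied weights (the same convention PyTorch's `CrossEntropyLoss` uses with a `weight` tensor and mean reduction). Whether WeightedTrainer uses exactly this formulation is an assumption; the `weighted_cross_entropy` helper and the example values are hypothetical.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample cross-entropy losses.

    Each sample's loss is scaled by the weight of its true class, and the
    total is divided by the sum of those weights.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sample_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the sample's logits.
        max_logit = max(sample_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in sample_logits)
        )
        loss = log_sum_exp - sample_logits[target]
        weighted_sum += class_weights[target] * loss
        weight_total += class_weights[target]
    return weighted_sum / weight_total


# Three samples that all predict class 0; the last one truly belongs to
# the rare class 1, so it is misclassified.
logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
targets = [0, 0, 1]

uniform_loss = weighted_cross_entropy(logits, targets, [1.0, 1.0])
weighted_loss = weighted_cross_entropy(logits, targets, [1.0, 4.0])
```

Giving the rare class a larger weight makes the misclassified sample dominate the loss, which pushes training to pay more attention to under-represented labels.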
- login_field_detector.field_detector.compute_metrics(pred)
- login_field_detector.field_detector.download_model_files(root_dir)
Downloads the necessary model files from Hugging Face Hub.
Returns paths to the downloaded model and tokenizer files.
login_field_detector.html_feature_extractor module
- class login_field_detector.html_feature_extractor.HTMLFeatureExtractor(label2id, oauth_providers=None)
Bases: object
- get_features(html_text)
Extract tokens, labels, xpaths, and bounding boxes from an HTML file.
- login_field_detector.html_feature_extractor.determine_label(tag)
Determine the label of an HTML tag based on patterns.
- login_field_detector.html_feature_extractor.generate_language_switch()
- login_field_detector.html_feature_extractor.get_xpath(element)
Generate XPath for a given BeautifulSoup element.
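The general technique of building an XPath is to walk from the element up to the root, recording each tag and its position among same-tag siblings. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup so that it runs without extra dependencies; the `build_xpath` helper and the exact output format are assumptions and may differ from what get_xpath produces.

```python
import xml.etree.ElementTree as ET


def build_xpath(element, parent_map):
    """Build an absolute XPath by walking from `element` up to the root."""
    parts = []
    while element is not None:
        parent = parent_map.get(element)
        if parent is None:
            # The root element has no siblings, so no index is needed.
            parts.append(element.tag)
        else:
            # XPath positions are 1-based and count same-tag siblings only.
            same_tag = [child for child in parent if child.tag == element.tag]
            position = same_tag.index(element) + 1
            parts.append(f"{element.tag}[{position}]")
        element = parent
    return "/" + "/".join(reversed(parts))


root = ET.fromstring("<html><body><form><input/><input/></form></body></html>")
# ElementTree elements do not store their parent, so build a lookup table.
parent_map = {child: parent for parent in root.iter() for child in parent}
second_input = root.findall(".//input")[1]
xpath = build_xpath(second_input, parent_map)
```

For the second input element this yields `/html/body[1]/form[1]/input[2]`.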
- login_field_detector.html_feature_extractor.is_item_visible(tag)
- login_field_detector.html_feature_extractor.preprocess_field(tag)
Preprocess an HTML token to include text, parent, sibling, and metadata.
login_field_detector.html_fetcher module
- class login_field_detector.html_fetcher.HTMLFetcher(cache_dir=None, ttl=604800, max_concurrency=4)
Bases: object
- fetch_all(urls, force=False, screenshot=False)
Synchronously fetch multiple URLs.
- Parameters:
urls – List of URLs to fetch.
force – Force fetching even if URLs are cached or marked as failed.
screenshot – Whether to take screenshots of the pages.
- Returns:
A dictionary mapping URLs to their HTML content.
- fetch_html(url, force=False, screenshot=False)
Synchronously fetch a single URL.
- Parameters:
url – URL to fetch.
force – Force fetching even if the URL is cached or marked as failed.
screenshot – Whether to take a screenshot of the page.
- Returns:
HTML content as a string or None if failed.
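The interaction between `ttl` and `force` can be sketched as a simple freshness check. This assumes `ttl` is expressed in seconds, in which case the default of 604800 corresponds to seven days; the `is_cache_fresh` and `should_fetch` helpers are hypothetical and do not describe how HTMLFetcher stores or validates its cache.

```python
import time

DEFAULT_TTL = 604800  # assumed to be seconds: 7 * 24 * 60 * 60


def is_cache_fresh(fetched_at, ttl=DEFAULT_TTL, now=None):
    """Return True if an entry fetched at `fetched_at` is still within its TTL."""
    if now is None:
        now = time.time()
    return (now - fetched_at) < ttl


def should_fetch(fetched_at, force=False, ttl=DEFAULT_TTL, now=None):
    """Decide whether to go to the network (hypothetical logic).

    `fetched_at` is the timestamp of the cached copy, or None if no copy exists.
    """
    if force or fetched_at is None:
        return True
    return not is_cache_fresh(fetched_at, ttl=ttl, now=now)


cached_at = 1_000_000.0
one_hour_later = cached_at + 3600
eight_days_later = cached_at + 8 * 24 * 3600

fresh_decision = should_fetch(cached_at, now=one_hour_later)
stale_decision = should_fetch(cached_at, now=eight_days_later)
forced_decision = should_fetch(cached_at, force=True, now=one_hour_later)
```

Under these assumptions a one-hour-old entry is served from cache, an eight-day-old entry is re-fetched, and `force=True` bypasses the cache regardless of age.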
Module contents
- class login_field_detector.HTMLFeatureExtractor(label2id, oauth_providers=None)
Bases: object
- get_features(html_text)
Extract tokens, labels, xpaths, and bounding boxes from an HTML file.
- class login_field_detector.HTMLFetcher(cache_dir=None, ttl=604800, max_concurrency=4)
Bases: object
- fetch_all(urls, force=False, screenshot=False)
Synchronously fetch multiple URLs.
- Parameters:
urls – List of URLs to fetch.
force – Force fetching even if URLs are cached or marked as failed.
screenshot – Whether to take screenshots of the pages.
- Returns:
A dictionary mapping URLs to their HTML content.
- fetch_html(url, force=False, screenshot=False)
Synchronously fetch a single URL.
- Parameters:
url – URL to fetch.
force – Force fetching even if the URL is cached or marked as failed.
screenshot – Whether to take a screenshot of the page.
- Returns:
HTML content as a string or None if failed.
- class login_field_detector.LoginFieldDetector(model_dir=None, labels=['UNLABELED', 'USERNAME', 'PHONE_NUMBER', 'PASSWORD', 'LOGIN_BUTTON', 'TWO_FACTOR_AUTH', 'SOCIAL_LOGIN_BUTTONS', 'CAPTCHA', 'LANGUAGE_SWITCH', 'FORGOT_PASSWORD', 'SIGN_UP', 'REMEMBER_ME', 'HELP_LINK', 'PRIVACY_POLICY', 'TERMS_OF_SERVICE', 'NAVIGATION_LINK', 'BANNER', 'ADVERTISEMENTS', 'COOKIE_POLICY', 'IMPRINT'], device=None)
Bases: object
Model for login field detection using BERT.
- create_dataset(inputs, labels)
Align 1D labels with tokenized inputs.
- evaluate(dataset)
Evaluate the model on a dataset and plot confusion matrix.
- plot_confusion_matrix(true_labels, pred_labels)
Plot confusion matrix with improved x-axis label spacing.
- predict(url=None, html_content=None, probability_threshold=0.9)
Make predictions on new HTML content.
Multiple entries per label are allowed when their probability exceeds the specified threshold; entries are sorted by probability.
- process_urls(urls, force=False, o_label_ratio=0.5, screenshots=False)
Preprocess the URLs for training the model and balance the data.
- Parameters:
urls – List of URLs to process.
force – Force re-fetching all URLs.
screenshots – Whether to take screenshots of the URLs being processed.
o_label_ratio – Ratio of ‘UNLABELED’ labels to retain.
- Returns:
Filtered inputs and labels.
- train(urls=None, epochs=10, batch_size=16, force=False, screenshots=False)
Train the model.
- Parameters:
urls – List of URLs to fetch and process for training.
epochs – Number of training epochs.
batch_size – Batch size for training.
screenshots – Whether to take screenshots of the URLs being processed.
force – Force re-fetching and reprocessing all URLs.
- visualize_class_distribution(labels)
Plot class distribution.
- login_field_detector.determine_label(tag)
Determine the label of an HTML tag based on patterns.