extract_sub_links

langchain_core.utils.html.extract_sub_links(raw_html: str, url: str, *, base_url: str | None = None, pattern: str | Pattern | None = None, prevent_outside: bool = True, exclude_prefixes: Sequence[str] = (), continue_on_failure: bool = False) → List[str]

Extract all links from a raw HTML string and convert them to absolute URLs.

Parameters:

raw_html (str) – The raw HTML to extract links from.
url (str) – The URL of the page the HTML came from.
base_url (str | None) – The base URL against which links are checked to decide whether they are outside links.
pattern (str | Pattern | None) – Regex used to extract links from the raw HTML.
prevent_outside (bool) – If True, ignore external links that are not children of the base URL.
exclude_prefixes (Sequence[str]) – Exclude any URL that starts with one of these prefixes.
continue_on_failure (bool) – If True, continue when parsing a specific link raises an exception. Otherwise, raise the exception.

Returns:

The extracted sub-links, converted to absolute form.

Return type:

List[str]
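
Example:

The filtering described by the parameters above can be illustrated with a minimal, standard-library-only sketch. This is not the library's implementation: `sketch_extract_sub_links`, the simplified `HREF_RE` pattern, the fallback from `base_url` to `url`, and the `example.com` URLs are all assumptions made for illustration, and the real function may differ in details such as its default regex and the order of results.

```python
import re
from typing import List, Optional, Sequence
from urllib.parse import urljoin, urlparse

# Hypothetical, simplified href matcher; the library's default pattern may differ.
HREF_RE = re.compile(r'href=["\']([^"\'#]+)')


def sketch_extract_sub_links(
    raw_html: str,
    url: str,
    *,
    base_url: Optional[str] = None,
    prevent_outside: bool = True,
    exclude_prefixes: Sequence[str] = (),
) -> List[str]:
    """Approximate the documented behaviour using only the standard library."""
    # Assumption: when no base URL is given, the page URL serves as the base.
    base = base_url if base_url is not None else url
    found = set()
    for href in HREF_RE.findall(raw_html):
        # Resolve relative links against the URL of the page.
        absolute = urljoin(url, href)
        # Drop anything that starts with an excluded prefix.
        if any(absolute.startswith(prefix) for prefix in exclude_prefixes):
            continue
        # Drop links on another host or outside the base URL.
        if prevent_outside and (
            urlparse(absolute).netloc != urlparse(base).netloc
            or not absolute.startswith(base)
        ):
            continue
        found.add(absolute)
    # Sorted only so that the demo output is deterministic.
    return sorted(found)


html = (
    '<a href="/docs/intro">Intro</a>'
    '<a href="guide.html">Guide</a>'
    '<a href="https://other.example.org/page">External</a>'
    '<a href="/docs/private/keys">Private</a>'
)
links = sketch_extract_sub_links(
    html,
    "https://example.com/docs/index.html",
    base_url="https://example.com/docs/",
    exclude_prefixes=("https://example.com/docs/private",),
)
print(links)
# ['https://example.com/docs/guide.html', 'https://example.com/docs/intro']
```

In this sketch the external link is dropped because `prevent_outside` is True, and the private link is dropped because it matches an entry in `exclude_prefixes`. Passing `prevent_outside=False` would keep the external link.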