🦜🔗 LangChain  documentation - Home

langchain_core 0.2.29

extract_sub_links

langchain_core.utils.html.extract_sub_links(raw_html: str, url: str, *, base_url: str | None = None, pattern: str | Pattern | None = None, prevent_outside: bool = True, exclude_prefixes: Sequence[str] = (), continue_on_failure: bool = False) → List[str]

Extract all links from a raw HTML string and convert them into absolute URLs.

Parameters:
  • raw_html (str) – The original HTML.

  • url (str) – The URL of the page the HTML came from.

  • base_url (str | None) – The base URL against which outside links are checked.

  • pattern (str | Pattern | None) – Regex to use for extracting links from raw HTML.

  • prevent_outside (bool) – If True, ignore external links which are not children of the base URL.

  • exclude_prefixes (Sequence[str]) – Exclude any URLs that start with one of these prefixes.

  • continue_on_failure (bool) – If True, continue if parsing a specific link raises an exception. Otherwise, raise the exception.

Returns:

The extracted sub-links, as absolute URLs.

Return type:

List[str]
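
The interplay of url, base_url, prevent_outside, and exclude_prefixes can be easier to follow with a small worked example. The sketch below is a simplified, standard-library stand-in written for illustration only; it is not the langchain_core implementation. The names sketch_extract_sub_links and HREF_PATTERN, the simplified regex, the example.com URLs, and the fallback from base_url to url are all assumptions, and the pattern and continue_on_failure parameters are left out.

```python
import re
from typing import List, Optional, Sequence
from urllib.parse import urljoin, urlparse

# Hypothetical, simplified pattern: captures each href value up to a quote or '#'.
# The library's default regex is more involved; this one is for illustration only.
HREF_PATTERN = re.compile(r"""href=["']([^"'#]+)""")


def sketch_extract_sub_links(
    raw_html: str,
    url: str,
    *,
    base_url: Optional[str] = None,
    prevent_outside: bool = True,
    exclude_prefixes: Sequence[str] = (),
) -> List[str]:
    """Illustrative stand-in, not the langchain_core implementation."""
    # Assumption: when no base URL is given, the page URL serves as the base.
    base = base_url if base_url is not None else url
    results: List[str] = []
    for href in HREF_PATTERN.findall(raw_html):
        # Resolve relative and protocol-relative links against the page URL.
        absolute = urljoin(url, href)
        # Drop links that start with any excluded prefix.
        if any(absolute.startswith(prefix) for prefix in exclude_prefixes):
            continue
        # Drop links on another host or outside the base URL.
        if prevent_outside and (
            urlparse(absolute).netloc != urlparse(base).netloc
            or not absolute.startswith(base)
        ):
            continue
        # Keep each link once, in order of first appearance.
        if absolute not in results:
            results.append(absolute)
    return results


html = (
    '<a href="/docs/intro">Intro</a>'
    '<a href="guide.html">Guide</a>'
    '<a href="https://other.example.org/page">External</a>'
    '<a href="/docs/private/keys">Private</a>'
)
page_url = "https://example.com/docs/index.html"

links = sketch_extract_sub_links(
    html,
    page_url,
    base_url="https://example.com/docs/",
    exclude_prefixes=("https://example.com/docs/private",),
)
print(links)
```

In this sketch the external link is dropped because its host differs from that of the base URL, and the private link is dropped by its prefix. Passing prevent_outside=False would keep the external link.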

© Copyright 2023, LangChain Inc.