"byte pair encoding" meaning in English


Noun

Forms: byte pair encodings [plural], byte-pair encoding [alternative]
Head templates: {{en-noun|~}} byte pair encoding (countable and uncountable, plural byte pair encodings)
  1. (computing) A lossless data compression algorithm that iteratively replaces the most frequent pair of adjacent bytes in a sequence with a new byte not already present in the data. Tags: countable, uncountable
    Sense id: en-byte_pair_encoding-en-noun-01IpLCeF
    Categories (other): Computing, English entries with incorrect language header, Pages with 1 entry, Pages with entries
    Disambiguation of English entries with incorrect language header: 75 25
    Disambiguation of Pages with 1 entry: 74 26
    Disambiguation of Pages with entries: 75 25
    Topics: computing, engineering, mathematics, natural-sciences, physical-sciences, sciences
  2. (natural language processing) A subword tokenization method that iteratively merges the most frequent pairs of adjacent characters in a corpus to form longer and more meaningful tokens, typically until a predefined vocabulary size is reached. Tags: countable, uncountable
    Sense id: en-byte_pair_encoding-en-noun-cIQDccGg
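
As a rough illustration of the two senses above, here is a minimal Python sketch. It is not taken from Wiktionary or from any particular library; the function names (`bpe_compress`, `bpe_decompress`, `learn_merges`) and the stopping rules are assumptions made for this example. In particular, `num_merges` stands in for the "predefined vocabulary size" mentioned in sense 2, and the compressor simply stops when no pair repeats or no unused byte value remains.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair in seq, or None if no pair repeats."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def replace_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(data: bytes):
    """Sense 1 (sketch): substitute unused byte values for frequent byte pairs."""
    seq = list(data)
    present = set(seq)
    unused = [b for b in range(256) if b not in present]
    table = []  # (new_byte, (left, right)) in the order the substitutions were made
    while unused:
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_byte = unused.pop()
        seq = replace_pair(seq, pair, new_byte)
        table.append((new_byte, pair))
    return bytes(seq), table


def bpe_decompress(data: bytes, table):
    """Undo the substitutions in reverse order, which makes the scheme lossless."""
    seq = list(data)
    for new_byte, (left, right) in reversed(table):
        expanded = []
        for b in seq:
            expanded.extend((left, right) if b == new_byte else (b,))
        seq = expanded
    return bytes(seq)


def learn_merges(words, num_merges):
    """Sense 2 (sketch): learn merge rules from a list of words.

    Each word starts as a tuple of single characters; every round merges the
    most frequent adjacent symbol pair across the whole corpus.
    """
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[tuple(replace_pair(list(symbols), best, best[0] + best[1]))] += freq
        corpus = merged
    return merges


if __name__ == "__main__":
    packed, table = bpe_compress(b"aaabdaaabac")
    print(len(packed), bpe_decompress(packed, table))
    print(learn_merges(["hug", "hug", "hug", "pug", "pun", "bun", "hugs"], 2))
```

The compressor has to return the substitution table along with the packed bytes, because the data cannot be restored without it. The tokenizer sketch keeps only the ordered merge rules, which is the information needed to segment new text the same way.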

{
  "forms": [
    {
      "form": "byte pair encodings",
      "tags": [
        "plural"
      ]
    },
    {
      "form": "byte-pair encoding",
      "tags": [
        "alternative"
      ]
    }
  ],
  "head_templates": [
    {
      "args": {
        "1": "~"
      },
      "expansion": "byte pair encoding (countable and uncountable, plural byte pair encodings)",
      "name": "en-noun"
    }
  ],
  "lang": "English",
  "lang_code": "en",
  "pos": "noun",
  "senses": [
    {
      "categories": [
        {
          "kind": "other",
          "langcode": "en",
          "name": "Computing",
          "orig": "en:Computing",
          "parents": [],
          "source": "w"
        },
        {
          "_dis": "75 25",
          "kind": "other",
          "name": "English entries with incorrect language header",
          "parents": [],
          "source": "w+disamb"
        },
        {
          "_dis": "74 26",
          "kind": "other",
          "name": "Pages with 1 entry",
          "parents": [],
          "source": "w+disamb"
        },
        {
          "_dis": "75 25",
          "kind": "other",
          "name": "Pages with entries",
          "parents": [],
          "source": "w+disamb"
        }
      ],
      "glosses": [
        "A lossless data compression algorithm that iteratively replaces the most frequent pair of adjacent bytes in a sequence with a new byte not already present in the data."
      ],
      "id": "en-byte_pair_encoding-en-noun-01IpLCeF",
      "links": [
        [
          "computing",
          "computing#Noun"
        ],
        [
          "lossless",
          "lossless"
        ],
        [
          "compression",
          "compression"
        ],
        [
          "algorithm",
          "algorithm"
        ],
        [
          "iteratively",
          "iteratively"
        ],
        [
          "frequent",
          "frequent"
        ],
        [
          "adjacent",
          "adjacent"
        ],
        [
          "sequence",
          "sequence"
        ],
        [
          "data",
          "data"
        ]
      ],
      "raw_glosses": [
        "(computing) A lossless data compression algorithm that iteratively replaces the most frequent pair of adjacent bytes in a sequence with a new byte not already present in the data."
      ],
      "tags": [
        "countable",
        "uncountable"
      ],
      "topics": [
        "computing",
        "engineering",
        "mathematics",
        "natural-sciences",
        "physical-sciences",
        "sciences"
      ]
    },
    {
      "glosses": [
        "A subword tokenization method that iteratively merges the most frequent pairs of adjacent characters in a corpus to form longer and more meaningful tokens, typically until a predefined vocabulary size is reached."
      ],
      "id": "en-byte_pair_encoding-en-noun-cIQDccGg",
      "links": [
        [
          "subword",
          "subword"
        ],
        [
          "tokenization",
          "tokenization"
        ],
        [
          "merge",
          "merge"
        ],
        [
          "corpus",
          "corpus"
        ],
        [
          "token",
          "token"
        ],
        [
          "vocabulary",
          "vocabulary"
        ],
        [
          "reach",
          "reach"
        ]
      ],
      "qualifier": "natural language processing",
      "raw_glosses": [
        "(natural language processing) A subword tokenization method that iteratively merges the most frequent pairs of adjacent characters in a corpus to form longer and more meaningful tokens, typically until a predefined vocabulary size is reached."
      ],
      "tags": [
        "countable",
        "uncountable"
      ]
    }
  ],
  "word": "byte pair encoding"
}
{
  "categories": [
    "English countable nouns",
    "English entries with incorrect language header",
    "English lemmas",
    "English multiword terms",
    "English nouns",
    "English uncountable nouns",
    "Pages with 1 entry",
    "Pages with entries"
  ],
  "forms": [
    {
      "form": "byte pair encodings",
      "tags": [
        "plural"
      ]
    },
    {
      "form": "byte-pair encoding",
      "tags": [
        "alternative"
      ]
    }
  ],
  "head_templates": [
    {
      "args": {
        "1": "~"
      },
      "expansion": "byte pair encoding (countable and uncountable, plural byte pair encodings)",
      "name": "en-noun"
    }
  ],
  "lang": "English",
  "lang_code": "en",
  "pos": "noun",
  "senses": [
    {
      "categories": [
        "en:Computing"
      ],
      "glosses": [
        "A lossless data compression algorithm that iteratively replaces the most frequent pair of adjacent bytes in a sequence with a new byte not already present in the data."
      ],
      "links": [
        [
          "computing",
          "computing#Noun"
        ],
        [
          "lossless",
          "lossless"
        ],
        [
          "compression",
          "compression"
        ],
        [
          "algorithm",
          "algorithm"
        ],
        [
          "iteratively",
          "iteratively"
        ],
        [
          "frequent",
          "frequent"
        ],
        [
          "adjacent",
          "adjacent"
        ],
        [
          "sequence",
          "sequence"
        ],
        [
          "data",
          "data"
        ]
      ],
      "raw_glosses": [
        "(computing) A lossless data compression algorithm that iteratively replaces the most frequent pair of adjacent bytes in a sequence with a new byte not already present in the data."
      ],
      "tags": [
        "countable",
        "uncountable"
      ],
      "topics": [
        "computing",
        "engineering",
        "mathematics",
        "natural-sciences",
        "physical-sciences",
        "sciences"
      ]
    },
    {
      "glosses": [
        "A subword tokenization method that iteratively merges the most frequent pairs of adjacent characters in a corpus to form longer and more meaningful tokens, typically until a predefined vocabulary size is reached."
      ],
      "links": [
        [
          "subword",
          "subword"
        ],
        [
          "tokenization",
          "tokenization"
        ],
        [
          "merge",
          "merge"
        ],
        [
          "corpus",
          "corpus"
        ],
        [
          "token",
          "token"
        ],
        [
          "vocabulary",
          "vocabulary"
        ],
        [
          "reach",
          "reach"
        ]
      ],
      "qualifier": "natural language processing",
      "raw_glosses": [
        "(natural language processing) A subword tokenization method that iteratively merges the most frequent pairs of adjacent characters in a corpus to form longer and more meaningful tokens, typically until a predefined vocabulary size is reached."
      ],
      "tags": [
        "countable",
        "uncountable"
      ]
    }
  ],
  "word": "byte pair encoding"
}


This page is a part of the kaikki.org machine-readable English dictionary. This dictionary is based on structured data extracted on 2025-10-28 from the enwiktionary dump dated 2025-10-21 using wiktextract (b9d36ff and 0a198a9). The data shown on this site has been post-processed and various details (e.g., extra categories) removed, some information disambiguated, and additional data merged from other sources. See the raw data download page for the unprocessed wiktextract data.

If you use this data in academic research, please cite Tatu Ylonen: Wiktextract: Wiktionary as Machine-Readable Structured Data, Proceedings of the 13th Conference on Language Resources and Evaluation (LREC), pp. 1317-1325, Marseille, 20-25 June 2022. Linking to the relevant page(s) under https://kaikki.org would also be greatly appreciated.