mistokenize in English

JSON for postprocessed kaikki.org data shown on this page:

{
  "etymology_templates": [
    {
      "args": {
        "1": "en",
        "2": "mis",
        "3": "tokenize"
      },
      "expansion": "mis- + tokenize",
      "name": "prefix"
    }
  ],
  "etymology_text": "From mis- + tokenize.",
  "forms": [
    {
      "form": "mistokenizes",
      "tags": [
        "present",
        "singular",
        "third-person"
      ]
    },
    {
      "form": "mistokenizing",
      "tags": [
        "participle",
        "present"
      ]
    },
    {
      "form": "mistokenized",
      "tags": [
        "participle",
        "past"
      ]
    },
    {
      "form": "mistokenized",
      "tags": [
        "past"
      ]
    }
  ],
  "head_templates": [
    {
      "args": {},
      "expansion": "mistokenize (third-person singular simple present mistokenizes, present participle mistokenizing, simple past and past participle mistokenized)",
      "name": "en-verb"
    }
  ],
  "lang": "English",
  "lang_code": "en",
  "pos": "verb",
  "senses": [
    {
      "categories": [
        {
          "kind": "other",
          "name": "English entries with incorrect language header",
          "parents": [],
          "source": "w"
        },
        {
          "kind": "other",
          "name": "English terms prefixed with mis-",
          "parents": [],
          "source": "w"
        },
        {
          "kind": "other",
          "name": "Pages with 1 entry",
          "parents": [],
          "source": "w"
        },
        {
          "kind": "other",
          "name": "Pages with entries",
          "parents": [],
          "source": "w"
        }
      ],
      "examples": [
        {
          "bold_text_offsets": [
            [
              116,
              127
            ]
          ],
          "ref": "1994, Mark A. Terribile, Practical C++, page 109:",
          "text": "C++ considers ::*, .* and ->* each to be a single token and a single operator. Some pre-Release 2.0 implementations mistokenize expressions involving pointer-to-pointer-to-member.",
          "type": "quotation"
        },
        {
          "bold_text_offsets": [
            [
              177,
              189
            ]
          ],
          "ref": "2016, Hans-Jörg Schmid, Entrenchment and the Psychology of Language Learning, page 111:",
          "text": "These were mostly proper names, such as Ronny Johnsen, or foreign language items such as ambre solaire (French) and fairie queene (Middle English), as well as a few misspelt or mistokenized items.",
          "type": "quotation"
        },
        {
          "bold_text_offsets": [
            [
              96,
              109
            ]
          ],
          "ref": "2020 July, Anupama M Nair, Anusha Aji Justus, Arjun Ramesh, Binu Rajan M.R., “Event Extraction from Emails”, in International Journal of Computer Applications, volume 176, number 41:",
          "text": "The sentences were tokenized into words using the regex tokenizer which avoided the problems of mistokenizing while using the default NLTK tokenizer.",
          "type": "quotation"
        },
        {
          "bold_text_offsets": [
            [
              78,
              90
            ]
          ],
          "ref": "2022, Toni Sivula, Deep Neural Networks in Drug-Target Activity Prediction and Machine Learning Assisted Docking of Ultra-Large Compound Libraries (Master's thesis):",
          "text": "Similarly, all other two-character atomic representations in SMILES are being mistokenized.",
          "type": "quotation"
        }
      ],
      "glosses": [
        "To tokenize incorrectly."
      ],
      "id": "en-mistokenize-en-verb-MhgRxlYs",
      "links": [
        [
          "tokenize",
          "tokenize"
        ]
      ],
      "related": [
        {
          "word": "mislexicalize"
        },
        {
          "word": "misparse"
        }
      ]
    }
  ],
  "word": "mistokenize"
}

JSON for raw wiktextract data:

{
  "etymology_templates": [
    {
      "args": {
        "1": "en",
        "2": "mis",
        "3": "tokenize"
      },
      "expansion": "mis- + tokenize",
      "name": "prefix"
    }
  ],
  "etymology_text": "From mis- + tokenize.",
  "forms": [
    {
      "form": "mistokenizes",
      "tags": [
        "present",
        "singular",
        "third-person"
      ]
    },
    {
      "form": "mistokenizing",
      "tags": [
        "participle",
        "present"
      ]
    },
    {
      "form": "mistokenized",
      "tags": [
        "participle",
        "past"
      ]
    },
    {
      "form": "mistokenized",
      "tags": [
        "past"
      ]
    }
  ],
  "head_templates": [
    {
      "args": {},
      "expansion": "mistokenize (third-person singular simple present mistokenizes, present participle mistokenizing, simple past and past participle mistokenized)",
      "name": "en-verb"
    }
  ],
  "lang": "English",
  "lang_code": "en",
  "pos": "verb",
  "related": [
    {
      "word": "mislexicalize"
    },
    {
      "word": "misparse"
    }
  ],
  "senses": [
    {
      "categories": [
        "English entries with incorrect language header",
        "English lemmas",
        "English terms prefixed with mis-",
        "English terms with quotations",
        "English verbs",
        "Pages with 1 entry",
        "Pages with entries"
      ],
      "examples": [
        {
          "bold_text_offsets": [
            [
              116,
              127
            ]
          ],
          "ref": "1994, Mark A. Terribile, Practical C++, page 109:",
          "text": "C++ considers ::*, .* and ->* each to be a single token and a single operator. Some pre-Release 2.0 implementations mistokenize expressions involving pointer-to-pointer-to-member.",
          "type": "quotation"
        },
        {
          "bold_text_offsets": [
            [
              177,
              189
            ]
          ],
          "ref": "2016, Hans-Jörg Schmid, Entrenchment and the Psychology of Language Learning, page 111:",
          "text": "These were mostly proper names, such as Ronny Johnsen, or foreign language items such as ambre solaire (French) and fairie queene (Middle English), as well as a few misspelt or mistokenized items.",
          "type": "quotation"
        },
        {
          "bold_text_offsets": [
            [
              96,
              109
            ]
          ],
          "ref": "2020 July, Anupama M Nair, Anusha Aji Justus, Arjun Ramesh, Binu Rajan M.R., “Event Extraction from Emails”, in International Journal of Computer Applications, volume 176, number 41:",
          "text": "The sentences were tokenized into words using the regex tokenizer which avoided the problems of mistokenizing while using the default NLTK tokenizer.",
          "type": "quotation"
        },
        {
          "bold_text_offsets": [
            [
              78,
              90
            ]
          ],
          "ref": "2022, Toni Sivula, Deep Neural Networks in Drug-Target Activity Prediction and Machine Learning Assisted Docking of Ultra-Large Compound Libraries (Master's thesis):",
          "text": "Similarly, all other two-character atomic representations in SMILES are being mistokenized.",
          "type": "quotation"
        }
      ],
      "glosses": [
        "To tokenize incorrectly."
      ],
      "links": [
        [
          "tokenize",
          "tokenize"
        ]
      ]
    }
  ],
  "word": "mistokenize"
}
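
The `bold_text_offsets` pairs in the entries above appear to mark where the headword form sits inside each quotation's `text`. The following is a minimal sketch of reading them, assuming each pair is a half-open `[start, end)` character range, which is consistent with the values shown here but is not confirmed by this page. The function name `highlighted_forms` is hypothetical, and the sample is trimmed to one quotation from the entry.

```python
import json

# One quotation copied from the entry above, reduced to the fields used here.
SAMPLE_ENTRY = json.loads("""
{
  "word": "mistokenize",
  "senses": [
    {
      "examples": [
        {
          "bold_text_offsets": [[78, 90]],
          "text": "Similarly, all other two-character atomic representations in SMILES are being mistokenized.",
          "type": "quotation"
        }
      ]
    }
  ]
}
""")


def highlighted_forms(entry):
    """Return the substrings selected by bold_text_offsets, in document order.

    Assumption: each offset pair is [start, end) in characters of "text".
    """
    forms = []
    for sense in entry.get("senses", []):
        for example in sense.get("examples", []):
            text = example.get("text", "")
            for start, end in example.get("bold_text_offsets", []):
                forms.append(text[start:end])
    return forms


print(highlighted_forms(SAMPLE_ENTRY))  # ['mistokenized']
```

The same function works on either JSON object above, since both keep quotations under `senses[].examples[]`.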

This page is a part of the kaikki.org machine-readable English dictionary. This dictionary is based on structured data extracted on 2026-07-25 from the enwiktionary dump dated 2026-07-06 using wiktextract (d9fa233 and 9e92f4b). The data shown on this site has been post-processed and various details (e.g., extra categories) removed, some information disambiguated, and additional data merged from other sources. See the raw data download page for the unprocessed wiktextract data.

If you use this data in academic research, please cite Tatu Ylonen: Wiktextract: Wiktionary as Machine-Readable Structured Data, Proceedings of the 13th Conference on Language Resources and Evaluation (LREC), pp. 1317-1325, Marseille, 20-25 June 2022. Linking to the relevant page(s) under https://kaikki.org would also be greatly appreciated.
