__init__.py |
Detect the encoding of the given byte string.
:param byte_str: The byte sequence to examine.
:type byte_str: ``bytes`` or ``bytearray``
|
3271 |
big5freq.py |
|
31254 |
big5prober.py |
|
1757 |
chardistribution.py |
Reset the analyser and clear any state. |
9411 |
charsetgroupprober.py |
|
3839 |
charsetprober.py |
We define three types of bytes:
alphabet: English letters [a-zA-Z]
international: international characters [\x80-\xFF]
marker: everything else [^a-zA-Z\x80-\xFF]
The input buffer can be thought of as a series of words delimited
by markers. This function keeps only the words that contain at
least one international character and discards the rest. Each contiguous
sequence of markers is replaced by a single ASCII space character.
This filter applies to all scripts which do not use English characters.
|
5110 |
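The word filter described in the `charsetprober.py` entry can be illustrated with a short, standalone sketch. It is not chardet's own implementation: the function name `keep_international_words` and the two regular expressions are assumptions made for this example, and details such as trailing-space handling may differ from the library.

```python
import re

# A "word" is a run of alphabet and/or international bytes;
# every other byte is a marker and acts as a delimiter.
WORD_RE = re.compile(rb"[a-zA-Z\x80-\xFF]+")
INTERNATIONAL_RE = re.compile(rb"[\x80-\xFF]")


def keep_international_words(buf: bytes) -> bytes:
    """Keep only words containing at least one byte in \\x80-\\xFF.

    Hypothetical helper for illustration. Marker runs between the
    surviving words are collapsed into a single ASCII space.
    """
    words = [w for w in WORD_RE.findall(buf) if INTERNATIONAL_RE.search(w)]
    return b" ".join(words)


if __name__ == "__main__":
    sample = b"hello caf\xe9 world, na\xefve!"
    print(keep_international_words(sample))
```

Pure-ASCII words such as `hello` and `world` are dropped because they carry no evidence about which single-byte encoding is in use.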
cli |
|
|
codingstatemachine.py |
A state machine to verify a byte sequence for a particular encoding. For
each byte the detector receives, it will feed that byte to every active
state machine available, one byte at a time. The state machine changes its
state based on its previous state and the byte it receives. Three
states of a state machine are of interest to an auto-detector:
START state: This is the state to start with, or the state reached after a
legal byte sequence (i.e. a valid code point) for a character has
been identified.
ME state: This indicates that the state machine identified a byte sequence
that is specific to the charset it is designed for and that
there is no other possible encoding which can contain this byte
sequence. This will lead to an immediate positive answer for
the detector.
ERROR state: This indicates the state machine identified an illegal byte
sequence for that encoding. This will lead to an immediate
negative answer for this encoding. The detector will exclude this
encoding from consideration from here on.
|
3590 |
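The three states described in the `codingstatemachine.py` entry can be sketched with a toy machine. Everything here is an assumption for illustration: the `State` enum, the two intermediate states, and the transition rules are not chardet's tables. The machine models a 7-bit, escape-based encoding loosely patterned on the ISO-2022-JP sequence `ESC $ B`.

```python
from enum import Enum


class State(Enum):
    START = 0        # initial state, or a legal sequence was just completed
    ERROR = 1        # illegal byte sequence for this encoding
    ME = 2           # sequence specific to this encoding was seen
    ESC_SEEN = 3     # intermediate: received ESC
    DOLLAR_SEEN = 4  # intermediate: received ESC $


def next_state(state: State, byte: int) -> State:
    """Toy transition function: new state from previous state and one byte."""
    if state in (State.ERROR, State.ME):
        return state               # terminal states are sticky
    if byte >= 0x80:
        return State.ERROR         # 8-bit bytes are illegal in a 7-bit encoding
    if byte == 0x1B:
        return State.ESC_SEEN      # ESC always (re)starts a candidate sequence
    if state is State.ESC_SEEN and byte == ord("$"):
        return State.DOLLAR_SEEN
    if state is State.DOLLAR_SEEN and byte == ord("B"):
        return State.ME
    return State.START


def run(data: bytes) -> State:
    """Feed bytes one at a time and stop early on a decisive state."""
    state = State.START
    for byte in data:
        state = next_state(state, byte)
        if state in (State.ERROR, State.ME):
            break
    return state


if __name__ == "__main__":
    for sample in (b"abc", b"abc\x1b$Bxyz", b"caf\xe9"):
        print(sample, run(sample).name)
```

A detector built on this idea would keep one such machine per candidate encoding, drop a candidate as soon as its machine reports `ERROR`, and answer immediately when one reports `ME`.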
compat.py |
|
1200 |
cp949prober.py |
|
1855 |
enums.py |
All of the Enums that are used throughout the chardet package.
:author: Dan Blanchard (dan.blanchard@gmail.com)
|
1661 |
escprober.py |
This CharSetProber uses a "code scheme" approach for detecting encodings,
whereby easily recognizable escape or shift sequences are relied on to
identify these encodings.
|
3950 |
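The "code scheme" idea in the `escprober.py` entry can be approximated with a plain substring search. This is a deliberately simplified sketch and not how chardet does it, since the library drives coding state machines instead. The table `ESCAPE_SIGNATURES` and the function `guess_escape_encoding` are hypothetical names, and the table lists only a few well-known sequences.

```python
from typing import Optional

# A few recognizable escape or shift sequences per encoding (not exhaustive).
ESCAPE_SIGNATURES = {
    "ISO-2022-JP": (b"\x1b$B", b"\x1b$@", b"\x1b(J"),
    "ISO-2022-KR": (b"\x1b$)C",),
    "HZ-GB-2312": (b"~{",),
}


def guess_escape_encoding(data: bytes) -> Optional[str]:
    """Return the encoding whose signature appears first, or None.

    Assumption: all candidates are 7-bit encodings, so any byte
    >= 0x80 rules every one of them out.
    """
    if any(byte >= 0x80 for byte in data):
        return None
    best = None  # (position, encoding name) of the earliest match so far
    for name, signatures in ESCAPE_SIGNATURES.items():
        for signature in signatures:
            pos = data.find(signature)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, name)
    return best[1] if best else None


if __name__ == "__main__":
    print(guess_escape_encoding(b"\x1b$B8l\x1b(B"))
    print(guess_escape_encoding(b"plain ascii"))
```

A substring match is weaker than a state machine: the two-byte sequence `~{` can occur in ordinary ASCII text, so a real prober needs more context before committing to an answer.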
escsm.py |
|
10510 |
eucjpprober.py |
|
3749 |
euckrfreq.py |
|
13546 |
euckrprober.py |
|
1748 |
euctwfreq.py |
|
31621 |
euctwprober.py |
|
1747 |
gb2312freq.py |
|
20715 |
gb2312prober.py |
|
1754 |
hebrewprober.py |
|
13838 |
jisfreq.py |
|
25777 |
jpcntx.py |
|
19643 |
langbulgarianmodel.py |
|
105685 |
langgreekmodel.py |
|
99559 |
langhebrewmodel.py |
|
98764 |
langhungarianmodel.py |
|
102486 |
langrussianmodel.py |
|
131168 |
langthaimodel.py |
|
103300 |
langturkishmodel.py |
|
95934 |
latin1prober.py |
|
5370 |
mbcharsetprober.py |
MultiByteCharSetProber
|
3413 |
mbcsgroupprober.py |
|
2012 |
mbcssm.py |
|
25481 |
metadata |
|
|
sbcharsetprober.py |
|
6136 |
sbcsgroupprober.py |
|
4309 |
sjisprober.py |
|
3774 |
universaldetector.py |
Module containing the UniversalDetector detector class, which is the primary
class a user of ``chardet`` should use.
:author: Mark Pilgrim (initial port to Python)
:author: Shy Shalom (original C code)
:author: Dan Blanchard (major refactoring for 3.0)
:author: Ian Cordasco
|
12503 |
utf8prober.py |
|
2766 |
version.py |
This module exists only to simplify retrieving the version number of chardet
from within setup.py and from chardet subpackages.
:author: Dan Blanchard (dan.blanchard@gmail.com)
|
242 |