Number Systems
a way of representing numbers using a specific base (radix). Each digit in a number has a position, and its value depends on the base of the system.
1. The (Decimal) or (Base-10) or (Denary) System
This is the system we use daily. It has 10 digits: $0, 1, 2, 3, 4, 5, 6, 7, 8, 9$ The value of a number depends on its position.
Example: $7392$
- $7 × 10^3$ (thousands place)
- $3 × 10^2$ (hundreds place)
- $9 × 10^1$ (tens place)
- $2 × 10^0$ (ones place)
So
$7392 = 7 × 1000 + 3 × 100 + 9 × 10 + 2 × 1 = 7392$
2. The (Binary) or (Base-2) System
Computers use binary (base-2) because they operate with only two states: 0 and 1. Every binary number is expressed in terms of powers of 2
Example: $11010.11$
$1 × 2^4 + 1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 0 × 2^0 + 1 × 2^{-1} + 1 × 2^{-2}$
Calculating step by step:
- $1 × 16 = 16 $
- $1 × 8 = 8 $
- $0 × 4 = 0 $
- $1 × 2 = 2 $
- $0 × 1 = 0 $
- $1 × 0.5 = 0.5 $
- $1 × 0.25 = 0.25 $
Adding them up:
$16 + 8 + 0 + 2 + 0 + 0.5 + 0.25 = 26.75$
$(11010.11)₂ = (26.75)₁₀$
3. The (Hexadecimal) or (Base-16) System
The hexadecimal system is another number system used in computing. It is called Base-16 because it has 16 possible digits:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
- The letters A to F represent numbers 10 to 15 in decimal.
Example: Convert (B65F)₁₆ to Decimal
Each digit is multiplied by a power of 16 (the base):
$B65F₁₆ = (11 × 16^3) + (6 × 16^2) + (5 × 16^1) + (15 × 16^0)$
Calculating step by step:
- $11 × 4096 = 45056 $
- $6 × 256 = 1536 $
- $5 × 16 = 80 $
- $15 × 1 = 15$
Now add them up:
$45056 + 1536 + 80 + 15 = 46687$
$(B65F)₁₆ = (46687)₁₀$
4. Other Number Systems
Base (Radix) | Name | Uses |
---|---|---|
Base-3 | Ternary | Some theoretical computing models and balanced ternary systems. |
Base-4 | Quaternary | Used in some genetic algorithms and quantum computing research. |
Base-5 | Quinary | Found in some ancient cultures and abacus systems. |
Base-6 | Senary (Hexary) | Some mathematical applications, used by the human hand (counting with one hand using the thumb). |
Base-7 | Septenary | Rarely used, but found in some ancient calendars. |
Base-8 | Octal | Used in early computing, file permissions in UNIX, and compact representation of binary. |
Base-9 | Nonary | Occasionally used in some mathematical puzzles. |
Base-11 | Undecimal | Rare, but has been considered in some numeral systems. |
Base-12 | Duodecimal (Dozenal) | Used in traditional measurement systems (dozen, feet-inches) and advocated for better fraction division. |
Base-20 | Vigesimal | Used by the Mayan and Aztec civilizations, also seen in some languages for counting. |
Base-24 | Quadrovigesimal | Used in some encoding and encryption techniques. |
Base-32 | Duotrigesimal | Used in Base32 encoding (data encoding in computing). |
Base-36 | Hexatrigesimal | Used in URL shorteners and compact representation of alphanumeric IDs. |
Base-60 | Sexagesimal | Used in time (60 seconds, 60 minutes), angles (360 degrees), and ancient Babylonian mathematics. |
5. Unusual & Specialized Bases
Base | Name | Uses |
---|---|---|
Base-64 | Base64 Encoding | Used in data encoding for transferring files and email attachments. |
Base-256 | Extended Binary | Used in character encoding (ASCII, Unicode). |
Base-1 | Unary | Used in tally marks and theoretical computing. |
Important Powers of 2 (in Computing)
Computers use binary, so powers of 2 are important in memory, storage, and networking.
For example:
Power | Value (Bytes/Bits) | Common Name | Range It Can Store |
---|---|---|---|
2⁰ | 1 bit | Smallest unit of data | 0 to 1 |
2¹ | 2 bits | 0 to 3 | |
2² | 4 bits | 1 Nibble | 0 to 15 |
2³ | 8 bits | 1 Byte or 1 Octet | 0 to 255 |
2⁴ | 16 bits | 2 Bytes (Half-word) | 0 to 65,535 |
2⁵ | 32 bits | Word in 32-bit CPUs | 0 to 4,294,967,295 |
2⁶ | 64 bits | Double-word | 0 to 18,446,744,073,709,551,615 |
2⁸ | 256 bytes | 0 to 2⁶⁴ - 1 | |
2¹⁰ | 1,024 bytes | 1 Kilobyte | 0 to 2⁸⁰ - 1 |
2²⁰ | 1,048,576 bytes | 1 Megabyte | 0 to 2¹⁶⁰ - 1 |
2³⁰ | 1,073,741,824 bytes | 1 Gigabyte | 0 to 2²⁴⁰ - 1 |
2⁴⁰ | 1,099,511,627,776 bytes | 1 Terabyte | 0 to 2³²⁰ - 1 |
Megabit (Mb) vs. Megabyte (MB)
One of the biggest confusions in tech and internet speed marketing comes from the difference between Mb (megabit) and MB (megabyte).
1 Byte = 8 Bits
- 1 Megabit (Mb) = 1,000,000 bits (ISPs use this to advertise speed)
- 1 Megabyte (MB) = 8 Megabits (Mb)
Since ISPs measure in bits, they make the speed seem 8× larger than what you actually get in megabytes per second (MB/s).
How ISPs Trick You
-
When an ISP advertises “100 Mbps”, they mean 100 Megabits per second (Mb/s).
-
Since 1 MB = 8 Mb, your actual download speed is:
100 Mbps ÷ 8 = 12.5 MB/s
Arithmetic Operations in Binary
Binary arithmetic follows the same basic principles as decimal arithmetic but operates with only two digits: 0 and 1. The four fundamental operations are:
- Binary Addition
- Binary Subtraction
- Binary Multiplication
- Binary Division
1. Binary Addition
Binary addition follows simple rules:
Binary Addition Rule | Decimal Equivalent |
---|---|
0 + 0 = 0 | 0 + 0 = 0 |
0 + 1 = 1 | 0 + 1 = 1 |
1 + 0 = 1 | 1 + 0 = 1 |
1 + 1 = 10 (write 0, carry 1) | 1 + 1 = 2 |
1 + 1 + 1 = 11 (write 1, carry 1) | 1 + 1 + 1 = 3 |
Example of Binary Addition
Let’s add $1011₂ (11)₁₀$ and $1101₂ (13)₁₀$
1011
+ 1101
--------
11000 (24)₁₀
Steps:
-
$1 + 1 = 10$ → write 0, carry 1
-
$(1 + 0) + (1 $ we carried$) = 10$ → write 0, carry 1
-
$(0 + 1) + (1 $ we carried$) = 10$ → write 0, carry 1
-
$(1 + 1) + (1 $ we carried$) = 11$
Result: $11000₂$
2. Binary Subtraction
Binary subtraction follows these rules:
Binary Subtraction Rule | Decimal Equivalent |
---|---|
0 - 0 = 0 | 0 - 0 = 0 |
1 - 0 = 1 | 1 - 0 = 1 |
1 - 1 = 0 | 1 - 1 = 0 |
0 - 1 = (write 1, borrow 1) | 0 - 1 = -1 |
Example of Binary Subtraction
Let’s subtract $1011₂ (11)₁₀ - 110₂ (6)₁₀$
1011
- 0110
--------
0101 (5)₁₀
Steps:
- $1 - 0 = 1$
- $1 - 1 = 0$
- $0 - 1 = 1$ → write 1, borrow 1
- $(1 - 0) - (1 $ we borrowed$) = 0$
Result: $0101₂$
3. Binary Multiplication
Binary multiplication is like decimal multiplication but follows simpler rules:
Binary Multiplication Rule | Decimal Equivalent |
---|---|
0 × 0 = 0 | 0 × 0 = 0 |
0 × 1 = 0 | 0 × 1 = 0 |
1 × 0 = 0 | 1 × 0 = 0 |
1 × 1 = 1 | 1 × 1 = 1 |
Example of Binary Multiplication
Let’s multiply $101₂ (5)₁₀ × 11₂ (3)₁₀$
101 (5)
× 11 (3)
------
101 (this is 101 × 1)
+ 101 (this is 101 × 1, shifted left by 1)
------
1111 (15)₁₀
Steps:
- Multiply $101 × 1 = 101$
- Multiply $101 × 1 = 1010$ , shift left
- Result: $101 + 1010 = (1111)₂$
4. Binary Division
Binary division is similar to long division in decimal.
Example of Binary Division
Divide 1010₂ (10₁₀) ÷ 10₂ (2₁₀):
101 (5₁₀)
------
10 | 1010
-10
----
01 (Bring down 0)
-10
----
0
The quotient is $101₂$
Number Base Conversions
In computing, numbers can be represented in various bases (radices), when converting between these systems, the key is to ensure that the representations are equivalent in decimal form.
For example:
- (0011)₈ (Octal) and (1001)₂ (Binary) are equivalent because both represent the decimal number 9.
Converting Decimal to Base-r
Method: Repeated Division by the Target Base
- Divide the number by the target base (r).
- Record the remainder.
- Use the quotient for the next division.
- Repeat until the quotient is 0.
- The Answer is the remainders read from bottom to top, left to right.
Example 1.1: Convert (41)₁₀ to Binary (Base-2)
41 ÷ 2 = 20 remainder 1
20 ÷ 2 = 10 remainder 0
10 ÷ 2 = 5 remainder 0
5 ÷ 2 = 2 remainder 1
2 ÷ 2 = 1 remainder 0
1 ÷ 2 = 0 remainder 1
Answer: $(41)₁₀ = (101001)₂$
Example 1.2: Convert (153)₁₀ to Octal (Base-8)
153 ÷ 8 = 19 remainder 1
19 ÷ 8 = 2 remainder 3
2 ÷ 8 = 0 remainder 2
Answer: $(153)₁₀ = (231)₈$
Converting Decimal Fractions to Another Base
Method: Repeated Multiplication by the Target Base
- Multiply the fraction by the target base.
- Record the integer part of the result.
- Use the fractional part for the next multiplication.
- Repeat until the fraction becomes 0 (or to the desired precision).
- The Answer is the integer parts read from top to bottom.
Example 1.3: Convert (0.6875)₁₀ to Binary (Base-2)
0.6875 × 2 = 1.375 → integer part = 1
0.375 × 2 = 0.75 → integer part = 0
0.75 × 2 = 1.5 → integer part = 1
0.5 × 2 = 1.0 → integer part = 1 (stops here)
Answer: $(0.6875)₁₀ = (0.1011)₂$
Example 1.4: Convert (0.513)₁₀ to Octal (Base-8)
0.513 × 8 = 4.104 → integer part = 4
0.104 × 8 = 0.832 → integer part = 0
0.832 × 8 = 6.656 → integer part = 6
0.656 × 8 = 5.248 → integer part = 5
0.248 × 8 = 1.984 → integer part = 1
0.984 × 8 = 7.872 → integer part = 7
Answer: $(0.513)₁₀ ≈ (0.406517)₈$
Converting Mixed Decimal Numbers to Another Base
When dealing with numbers that have both integer and fractional parts, convert them separately and then combine the results.
Example 1.5: Convert (41.6875)₁₀ to Binary
Using previous results:
- $(41)₁₀ → (101001)₂$
- $(0.6875)₁₀ → (0.1011)₂$
Answer: $(41.6875)₁₀ = (101001.1011)₂$
Example 1.6: Convert (153.513)₁₀ to Octal
Using previous results:
- $(153)₁₀ → (231)₈$
- $(0.513)₁₀ → (0.406517)₈$
Answer: $(153.513)₁₀ ≈ (231.406517)₈$
Binary to Octal Conversion
Example: Convert (101100110110)₂ to Octal
- Group into threes: $(101 100 110 110)₂$
- Convert each group:
- $101 \to 5$
- $100 \to 4$
- $110 \to 6$
- $110 \to 6$
- Result: $(5466)₈$
For fractions, group digits to the right of the binary point.
Example: Convert (101100.110)₂ to Octal
- Group into threes: $(101 100 . 110)₂$
- Convert:
- $101 \to 5$
- $100 \to 4$
- $.110 \to .6$
- Result: $(54.6)₈$
Binary to Hexadecimal Conversion
Example: Convert (101100110110)₂ to Hexadecimal
- Group into fours: $(1011 0011 0110)₂$
- Convert each group:
- $1011 \to B$
- $0011 \to 3$
- $0110 \to 6$
- Result: $(B36)₁₆$
For fractions, group digits to the right of the binary point.
Example: Convert (101100.1101)₂ to Hexadecimal
- Group into fours: $(1011 00.1101)₂$
- Convert:
- $1011 \to B$
- $00.1101 \to 0.D$
- Result: $(B0.D)₁₆$
Octal & Hexadecimal to Binary Conversion
Conversion from octal or hexadecimal to binary is just the reverse process:
- Replace each octal digit with its 3-bit binary equivalent.
- Replace each hexadecimal digit with its 4-bit binary equivalent.
Example: Convert (306.D)₁₆ to Binary
- $3 \to 0011$
- $0 \to 0000$
- $6 \to 0110$
- $D \to 1101$
Result: $(0011 0000 0110.1101)₂$
Example: Convert (673.124)₈ to Binary
- $6 \to 110$
- $7 \to 111$
- $3 \to 011$
- $.1 \to 001$
- $2 \to 010$
- $4 \to 100$
Result: $(110 111 011.001 010 100)₂$
Compact Representation in Computing
Binary numbers are essential for computers but are too long for humans to work with efficiently. This is why:
- Octal is 1/3 the length of binary (every 3 binary digits = 1 octal digit).
- Hexadecimal is 1/4 the length of binary (every 4 binary digits = 1 hex digit).
Example: Compact Representations
- Binary: (1111 1111 1111)₂ (12 digits)
- Octal: (7777)₈ (4 digits)
- Hexadecimal: (FFF)₁₆ (3 digits)
Because of this compactness, many computer manuals and digital system designs prefer hexadecimal over octal since:
- Each byte (8 bits) is represented as two hex digits.
- Hex fits better with modern architectures, which commonly use 8-bit, 16-bit, 32-bit, and 64-bit data representations.
Table: Number System Equivalents (0-15)
Decimal (10) | Binary (2) | Octal (8) | Hexadecimal (16) |
---|---|---|---|
0 | 0000 | 00 | 0 |
1 | 0001 | 01 | 1 |
2 | 0010 | 02 | 2 |
3 | 0011 | 03 | 3 |
4 | 0100 | 04 | 4 |
5 | 0101 | 05 | 5 |
6 | 0110 | 06 | 6 |
7 | 0111 | 07 | 7 |
8 | 1000 | 10 | 8 |
9 | 1001 | 11 | 9 |
10 | 1010 | 12 | A |
11 | 1011 | 13 | B |
12 | 1100 | 14 | C |
13 | 1101 | 15 | D |
14 | 1110 | 16 | E |
15 | 1111 | 17 | F |
Signed Integer
-
A signed integer can represent both positive and negative numbers.
-
The most significant bit (MSB) is used as the sign bit. If the sign bit is
1
, the number is negative. If it is0
, the number is positive.
The range of a signed integer depends on its size:
- A signed 8-bit integer can represent values from
-128
to127
. - A signed 16-bit integer can represent values from
-32,768
to32,767
. - A signed 32-bit integer can represent values from
-2,147,483,648
to2,147,483,647
.
1. Signed-Magnitude Representation
- Use the leftmost bit (MSB) as sign bit
- The remaining bits hold the absolute value (magnitude)
Example: 8-bit
+9
=00001001
-9
=10001001
Notice: only the sign bit flipped
Problems:
- Two zeros:
00000000
= +010000000
= -0
2. Signed-1’s Complement Representation
- Positive numbers are normal.
- To get a negative number: flip all bits (bitwise NOT), including the sign bit.
Example:
+9
=00001001
-9
=11110110
← flip every bit
Problems:
- Still two zeros:
00000000
= +011111111
= -0
- Adds require end-around carry
3. Signed-2’s Complement Representation
- Positive numbers are the same.
- Negative number = 2’s complement of the positive one.
That means:
-9 = Flip all bits of +9, then add 1
Example:
+9 = 00001001
NOT = 11110110 ← 1’s complement
+1 = 11110111 ← 2’s complement = -9
Advantages:
- Only one zero:
00000000
- Arithmetic is super easy: same circuits for signed and unsigned
- This is what real CPUs use
Table Breakdown
Decimal | 2’s Comp | 1’s Comp | Signed-Mag |
---|---|---|---|
+7 | 0111 | 0111 | 0111 |
+6 | 0110 | 0110 | 0110 |
… | … | … | … |
+0 | 0000 | 0000 | 0000 |
-0 | ❌ | 1111 | 1000 |
-1 | 1111 | 1110 | 1001 |
-2 | 1110 | 1101 | 1010 |
-7 | 1001 | 1000 | 1111 |
-8 | 1000 | ❌ | ❌ |
Unsigned Integer
-
An unsigned integer can only represent non-negative numbers (zero and positive numbers).
-
There is no sign bit, so all bits are used to represent the magnitude of the number.
The range of an unsigned integer is larger than that of a signed integer of the same size:
- An unsigned 8-bit integer can represent values from
0
to255
. - An unsigned 16-bit integer can represent values from
0
to65,535
. - An unsigned 32-bit integer can represent values from
0
to4,294,967,295
.
For example:
- In many programming languages, the
int
keyword is used to declare an integer variable. By default,int
is usually signed, and If you want to explicitly declare an unsigned integer, you use theunsigned
keyword.
Complements
Complements are mathematical techniques used in computers to:
- Simplify subtraction by converting it to addition.
- Represent negative numbers essential for signed arithmetic.
- Enable efficient hardware design No separate subtraction circuit needed (uses same adder).
Types of Complements
1. Diminished Radix Complement
For a number $N$ in base $r$ with $n$ digits, the $(r−1)’s$ complement is:
$(r^n - 1) - N$
Decimal (9’s Complement)
$Base\ r = 10 → (r−1) = 9 → 9’s\ complement$
$9’s\ complement\ of\ N = (10^n - 1) - N$
Since $( 10^n - 1 )$ is just a number with $n\ 9s$ Subtract each digit of $N$ from $9$
Examples:
- 9’s complement of $546700$ = $999999 - 546700 = 453299$
- 9’s complement of $012398$ = $999999 - 012398 = 987601$
Binary (1’s Complement)
$Base\ r = 2 → (r−1) = 1 → 1’s complement$
$1’s\ complement\ of\ N = (2^n - 1) - N$
Since $( 2^n - 1 )$ is just a number with $n\ 1s$
Flip each bit:
- $ 1 → 0$
- $ 0 → 1$
Examples:
- 1’s complement of $1011000 = 0100111$
- 1’s complement of $0101101 = 1010010$
Octal & Hexadecimal
- In octal (base 8): subtract each digit from
7
- In hexadecimal (base 16): subtract each digit from
F
(decimal 15)
2. Radix Complement
The $r’s\ complement$ of an $n$-digit number $N$ in base $r$ is
$r^n - N \ for\ N > 0, \ and\ 0\ if\ N = 0$
It is equivalent to:
$r’s complement\ = (r−1)’s\ complement\ + 1$
Decimal (10’s Complement)
$Base\ r = 10$
$10’s\ complement\ of\ N\ = (10^n - N)$
since $r^n - N = [(r^n - 1) - N] + 1$ Thus, the 10’s complement is obtained by adding $1$ to the $9’s$ complement value.
Binary (2’s Complement)
$Base\ r = 2$
$2’s\ complement\ of\ N\ = (2^n - N)$
Equivalent to:
$1’s complement\ of\ N\ + 1$
Notes
-
If $N$ includes a radix point, remove it before computing the complement, and restore it afterward in the same position.
-
Double complement returns original value:
- Complement of (Complement of N) = N
Subtraction with Complements
Here’s the logic:
Instead of subtracting $N$, you add its complement
Add the minuend $M$ to the r’s complement of the subtrahend $N$
$M + (r^n - N) = M - N + r^n$
Now two possibilities:
-
If M ≥ N → the sum will be ≥ rⁿ, and the carry happens. You discard the carry, and the rest is the correct answer.
-
If M < N → the sum is < rⁿ, so no carry happens, and the result is in r’s complement form (a negative number in disguise). You take the r’s complement again to find its magnitude and place a negative sign in front.
Example 1.1
Using 10’s complement, subtract $72532 - 3250$
$M = 72532$
$N = 03250$ (we pad it to 5 digits)
$r⁵ = 100000$ So 10’s complement of $03250 = 100000 - 03250 = 96750$
- Why this? cuz adding $96750$ to $M$ is like subtracting $03250$ (but easier for the computer).
$M$ + 10’s complement of $N = 72532 + 96750 = 169282$
Check for carry
The result is $> 100000$ → carry occurred → $M ≥ N$ So discard the carry (subtract 100000):
$169282 - 100000 = 69282$
Final answer: $69282$
- Why discard the carry? Because it’s the extra $rⁿ$ we added earlier to help convert subtraction to addition. It’s not part of the actual difference.
Example 1.2
Using 10’s complement, subtract $3250 - 72532$
We need to make sure both numbers have the same number of digits to work with complements.
$M = 3250$
We pad it with leading zeros to make it 5 digits: $M = 03250$
$N = 72532$
This already has 5 digits, so no padding needed.
To subtract using complements, we first need to find the 10’s complement of $N$. The 10’s complement is found by subtracting the number from $10^n$ (where $n$ is the number of digits).
$N = 72532$ (5 digits)
$10^5 = 100000$
So, the 10’s complement of N is:
- $10^5 - N = 100000 - 72532 = 27468$
Now, we add $M$ to the 10’s complement of $N$:
The sum is: $30718$
Check for End Carry
-
If there is an end carry, we discard it and take the result as positive.
-
If there is no end carry, we know that the result is negative, and we take the complement of the result to find the negative value.
So,
-
The sum is $30718$, and there is no end carry (the sum is less than $100000$). Take the 10’s Complement of the Result
-
$10^5 - 30718 = 100000 - 30718 = 69282$
Now, we place a negative sign in front of the result: $−69282$
Example 1.3
Given the two binary numbers $X = 1010100$ and $Y = 1000011$
perform the subtraction
$X - Y$ and $Y - X$ by using 2’s complements.
X - Y
Find 2’s complement of $Y = 1000011$
So $-Y = 0111101$
then add $X$ and $-Y$
Sum $= 10010001$
Discard end carry $2^7 = 10000000$
Answer: $X - Y =$ Sum $- 10000000 = 0010001$
Y - X
2’s complement of $X = 1010100$
$-X = 0101100$
then add $Y$ and $-X$
Sum $= 1101110$
There is no end carry. Therefore, the answer is $Y - X = -(1’s$ complement of $1101110) = -0010001$
Discard End Carry (in 2’s complement)
If there’s a carry-out (bit too big to fit), you just throw it away.
Why? Because in 2’s complement, the carry-out isn’t needed, the result is already correct due to how 2’s complement encodes negatives.
Example (4-bit):
0110 = 6
+ 1101 = -3 (in 2’s comp)
-------
1 0011 ← Carry-out = 1
Discard carry: result $= 0011 = 3$
End-Around Carry (in 1’s complement)
Instead of discarding the carry, you add it back into the result.
Why? Because 1’s complement doesn’t fully represent negatives. Adding back the carry fixes the incomplete math.
Example (4-bit):
0110 = 6
+ 1100 = 1’s comp of 3
-------
1 0010 ← Carry-out = 1
Add carry back:
- $0010 + 1 = 0011 = 3$
Endianness
The word “endianness” itself comes from the term “endian”, which was coined in reference to Gulliver’s Travels by Jonathan Swift. In the book, there is a fictional conflict between two factions: one that breaks their eggs at the small end and another that breaks them at the large end. This is an allegory about a seemingly trivial difference leading to a huge conflict.
In computing, “endianness” refers to the way in which multi-byte data is stored in memory — whether the most significant byte (big-endian) or the least significant byte (little-endian) is stored first.
- Big-endian: “The big end of the egg first” — the most significant byte is stored first.
- Little-endian: “The little end of the egg first” — the least significant byte is stored first.
What is Endianness In computing?
Endianness refers to the byte order in which data is stored in memory. When you have multi-byte values (like a 32-bit integer), the system needs to decide how to arrange the bytes in memory. There are two common types of endianness:
1. Big Endian
In big-endian systems, the most significant byte (MSB) is stored at the lowest memory address (i.e., at the beginning of the data). So, for a multi-byte value like a 32-bit integer, the most significant part of the number (the leftmost byte) comes first.
Example:
Let’s take the 32-bit integer 0x01020304
.
- In Big Endian, the bytes would be stored in memory like this:
Address: 0x00 0x01 0x02 0x03
Value: 0x01 0x02 0x03 0x04
In this case, 0x01
is the most significant byte (MSB) and is stored at the lowest memory address (0x00).
2. Little Endian
In little-endian systems, the least significant byte (LSB) is stored at the lowest memory address. This means that for multi-byte values, the rightmost byte (the least significant byte) comes first.
Example:
Now, let’s look at the same 32-bit integer 0x01020304
.
- In Little Endian, the bytes would be stored in memory like this:
Address: 0x00 0x01 0x02 0x03
Value: 0x04 0x03 0x02 0x01
Here, 0x04
(the least significant byte) is stored at the lowest memory address (0x00).
Why does Endianness Matter?
When data is transmitted over a network or stored in memory, it’s important to know how the bytes are ordered. If two systems (like a server and a client) have different endianness formats, misinterpretation of the data could occur, leading to errors or bugs.
For example:
- A network protocol may require data to be sent in big-endian format (also called network order).
- But your local computer (host) might use little-endian format.
This mismatch could cause problems unless the data is correctly converted to the expected byte order on both ends.
Network Order vs Host Order:
- Network Order: This refers to the big-endian format, as it’s the standard used in most network protocols (TCP/IP).
- Host Order: This refers to the endianness used by your specific machine (could be either little-endian or big-endian depending on the architecture, x86 uses little-endian, while SPARC uses big-endian).
Endianness in Action:
- When you’re sending data over a network (such as sending a 32-bit integer), you’ll need to ensure the byte order is correct.
- For example, if you’re working on a system that uses little-endian (like most x86 machines) but you’re communicating with a system that expects big-endian (like some older network protocols), you would need to convert the data into big-endian before sending it.
Example: Converting Between Endianess
Most programming languages provide functions to handle byte order conversion. For example, in C:
htonl()
andhtons()
(host to network, long and short) convert data to network order (big-endian).ntohl()
andntohs()
(network to host, long and short) convert data from network order to host order.
These functions ensure that data is sent in the right order, no matter what endianness the host system uses.
Middle Endian?
The middle-endian format (sometimes called PDP-endian) is a rare, older format used in some systems (like the PDP-11). It uses a hybrid of big and little endian and is not commonly seen today, so you typically don’t need to worry about it.
ASCII
The Basics of Early Encoding
ASCII stands for the American Standard Code for Information Interchange.
-
How ASCII works:
ASCII uses a 7-bit binary system, which means it can represent 128 characters (from 0 to 127). This was enough for the English language, including letters, digits, punctuation, and a few control characters (like newline, tab, and carriage return). -
Example:
- The letter A in ASCII is represented as
65
in decimal or41
in hexadecimal. In binary, it’s01000001
. - The letter B is
66
in decimal,42
in hexadecimal, and01000010
in binary.
- The letter A in ASCII is represented as
Here’s the issue: ASCII can only represent 128 characters. This isn’t enough to represent characters from all languages around the world, especially when languages have thousands of characters (Chinese, Japanese, Arabic, etc.).
Code Pages
To solve this limitation of 128 characters, we get Code Pages. A Code Page is essentially a mapping of values (numbers) to characters. The idea is to use the unused values (128–255) from the 8-bit range (1 byte) to add more characters to the encoding.
- Example of Code Pages:
- ISO/IEC 8859-1 (also called Latin-1) is a code page for Western European languages (like French, German, Spanish, etc.). It uses values from 128 to 255 to represent accented characters like é or ñ.
- Windows-1252 is another code page used for Western European languages.
The issue with code pages is that they are limited to a region. For example, Windows-1251 is used for Cyrillic (Russian), and it won’t work for Chinese, Japanese, or Arabic characters. Each code page is designed for a specific language group, and you can’t easily switch between them.
Multibyte Character Sets – Representing Non-Latin Languages
For languages that have more than 256 characters (like Chinese, Japanese, and Korean), you need a way to represent thousands of characters. This is where Multibyte Character Sets come into play.
-
What is Multibyte?
Instead of using 1 byte (8 bits) per character, we use multiple bytes to represent a single character. So, one character might take 2 bytes, 3 bytes, or even more depending on the language and character. -
Why Multibyte?
If you only had 256 possible values (1 byte), you couldn’t represent the thousands of characters needed for languages like Chinese. By using multibyte encoding, you can represent a much larger set of characters.
Examples of Multibyte Encoding:
- Shift-JIS: This is a multibyte encoding for Japanese characters. It uses both 1 byte (for ASCII characters) and 2 bytes (for kanji, hiragana, and katakana characters).
- GB2312: This is a multibyte encoding for Simplified Chinese. It uses 2 bytes for most characters.
However, multibyte character sets are not perfect, and they can be tricky to handle because:
- They don’t always define how to handle byte sequences correctly.
- They are not universal and are region-dependent (for example, Shift-JIS is for Japanese, not for other languages).
Unicode
Unicode is a universal character encoding standard. Its goal is to represent every character in every language ever written, including ancient languages (like Egyptian hieroglyphs) and constructed languages (like Klingon or Esperanto).
Unicode Code Points
Each character in Unicode is assigned a unique identifier called a code point. A code point is usually written in the format U+XXXX
, where XXXX
is the hexadecimal value of the code point.
- Example:
- The letter A in Unicode is
U+0041
(the same as ASCII). - The Chinese character 字 (meaning “character” in Chinese) has the Unicode code point
U+5B57
. - The emoji 😀 has the Unicode code point
U+1F600
.
- The letter A in Unicode is
Unicode Encodings – How to Store Characters**
There are different encodings that determine how Unicode characters are stored in memory or files. The most common are UTF-8, UTF-16, and UTF-32.
UTF-8
- Variable-Length Encoding: UTF-8 uses 1 to 4 bytes to represent a character.
- Characters in the ASCII range (U+0000 to U+007F) use 1 byte, making UTF-8 compatible with ASCII.
- Non-ASCII characters (like Chinese or emoji) use 2, 3, or 4 bytes.
- Advantage: It’s space-efficient for English text and backward-compatible with ASCII.
- Example:
- The letter A (
U+0041
) in UTF-8 is just0x41
(1 byte). - The Chinese character 字 (
U+5B57
) in UTF-8 is0xE5 0xAD 0x97
(3 bytes).
- The letter A (
UTF-16
- Fixed-Length Encoding for Most Characters: UTF-16 uses 2 bytes for most characters. However, it can use 4 bytes for some rare characters (those outside the Basic Multilingual Plane).
- Advantage: UTF-16 is more efficient for languages with a lot of characters, like Chinese or Japanese, because it uses fewer bytes for these characters compared to UTF-8.
- Example:
- The letter A (
U+0041
) in UTF-16 is0x0041
(2 bytes). - The Chinese character 字 (
U+5B57
) in UTF-16 is0x5B57
(2 bytes).
- The letter A (
UTF-32
- Fixed-Length Encoding: UTF-32 uses 4 bytes for every character, no matter how simple or complex it is.
- Advantage: Every character is represented by exactly 4 bytes, making it easy to calculate the size of a string.
- Disadvantage: It’s very space-inefficient because even simple characters like “A” use 4 bytes.
- Example:
- The letter A (
U+0041
) in UTF-32 is0x00000041
(4 bytes). - The Chinese character 字 (
U+5B57
) in UTF-32 is0x00005B57
(4 bytes).
- The letter A (
Variable Binary Length Data
In many communication protocols, data doesn’t always fit neatly into a fixed-size format. Imagine you have a username for authentication. Some usernames might be very short (like “Bob”), while others could be much longer (like “SuperUltraMegaUser123”). If we try to store these in a fixed-length format (always 16 characters), we end up wasting space for shorter usernames or might run out of space for longer usernames. This is where variable-length data comes in.
- Why use variable-length data?
It’s more efficient because it uses exactly as much space as needed for the data, instead of wasting space or being too restrictive.
Now, protocols that transmit data over a network often need to handle these variable-length data values efficiently. Let’s dive into the different strategies used to handle variable-length data.
Terminated Data (A Common Method)
One of the most common methods for handling variable-length data is to use terminators. This means that each data element has a special marker (a terminating symbol) that tells the system where the data ends. You’ve probably encountered this with strings where there is a special symbol like a null character (NUL) to mark the end of the string.
- Why use a terminator?
The reason is simple: a terminator is a clear signal that the data has ended, which makes it easy to know where to stop reading. Without a terminator, the protocol wouldn’t know where the data ends, leading to confusion or incomplete data reading.
Example of a Terminated String (NUL-terminated):
Consider the string “Hello”:
- Data:
H e l l o
- Hexadecimal representation:
0x48 0x65 0x6C 0x6C 0x6F
- NUL terminator:
0x00
Here’s how it works:
- The string “Hello” is stored as a sequence of bytes, and the NUL value
0x00
marks the end of the string.
Why does this work?
- The NUL character is not typically part of a string, so it’s safe to use it as a marker to end the string. When the protocol reads data, it stops reading as soon as it encounters
0x00
(the NUL terminator). This tells the system, “Okay, that’s the end of the string!”
Why does this approach work well?
- The protocol doesn’t need to know ahead of time how long the string is. It just reads until it sees the NUL terminator.
What if the terminator (NUL) appears in the string?
- Problem: If the data itself contains a
0x00
byte (for example, in some binary data), it could prematurely signal the end of the string. - Solution: Escape characters are used to prevent confusion. For example, if the string contains a
0x00
, it might be escaped with a special symbol like a backslash (\0
) or repeated0x00
to signal that it’s not the terminator but part of the string.
Bounded Data (Start and End Markers)
Another method for handling variable-length data is bounded data, where you specify a start and end marker. In this case, the data is surrounded by specific characters that mark where the data begins and ends.
Example: Quoted String
Consider a string "Hello"
that is bounded by double quotes. This means the protocol knows that:
-
The data starts with a
"
(quote character), and -
The data ends with the matching
"
(quote character). -
Data:
"Hello"
-
Hexadecimal representation:
0x22 0x48 0x65 0x6C 0x6C 0x6F 0x22
Here, the double quotes 0x22
mark the beginning and end of the string.
- The start quote signals where the string begins.
- The end quote signals where the string ends.
Why use bounded data?
- Predictability: The protocol always knows exactly where the string starts and ends, so it doesn’t need to rely on special terminators like NUL.
- Flexibility: You can use any symbols as delimiters (quotes, braces, etc.), which is very useful in some protocols (like JSON).
Problem with Bounded Data?
- If the boundary symbols (like quotes) appear inside the data itself, the protocol would have trouble distinguishing between data and the boundary symbol.
- Solution: Escape the boundary symbols using special characters. For example,
"Hello "Bob""
would escape the quote inside the string as\"
.
Why is this approach used?
- Human-readable format: It’s easier to interpret because it often mimics common syntax in programming languages (like JSON, XML, etc.), making it understandable for both computers and humans.
Length-Prefixed Data**
This is another strategy where the length of the data is specified before the data itself. This is often used in binary protocols because it’s efficient and avoids the ambiguity of terminators or boundaries.
Example:
Imagine sending a string like "Hello"
:
-
The length of the string (which is 5) is sent first, followed by the actual data.
-
Data format:
[Length][Data]
So it would look like this:- Length:
0x05
(5 bytes) - Data:
H e l l o
- Length:
-
Why Length-Prefixed?
This approach solves the problem of variable-length data by making sure the receiver always knows how much data to read, without needing terminators or boundary markers.
Why use this method?
- It’s very efficient because the protocol doesn’t need to scan the data for terminators or matching symbols. The length is specified right upfront, so it knows exactly how much to read.
- Predictable: No need for special symbols to mark the start or end of the data.
Implicit-Length Data**
Sometimes, the length of the data is implicit from the context. For example, if you’re transmitting a fixed-length header or using a fixed format, the protocol already knows how long each field will be.
Example:
A protocol might always expect the first 4 bytes to be a header, followed by a 10-byte username. If the protocol always knows that the username is 10 bytes long, it doesn’t need to specify the length each time.
-
The format could look like this:
- Header (4 bytes) + Username (10 bytes)
- You don’t need to explicitly state the length of the username since the protocol knows it’s always 10 bytes.
-
Why use Implicit-Length?
- It’s efficient because it doesn’t require sending additional length information, reducing overhead.
- It’s useful when data formats are fixed and predictable, which is common in low-level protocols.
Padded Data
In some cases, protocols require that data be padded to a specific length. This ensures the data always takes up a predictable number of bytes, which can be important for alignment or buffer management.
Example:
If a protocol expects data to be 16 bytes long, it might pad shorter data with extra bytes (often zeros) to ensure it’s always 16 bytes.
- Data:
"Hello"
- Padded Data:
"Hello"
+0x00 0x00 0x00 0x00 0x00
to make it 16 bytes.
Why padding?
- Padding ensures the data is a fixed size, which is important for buffer alignment in low-level systems.
- It can also help prevent buffer overflows or make the data easier to manage.