Number Systems

a way of representing numbers using a specific base (radix). Each digit in a number has a position, and its value depends on the base of the system.

1. The (Decimal) or (Base-10) or (Denary) System

This is the system we use daily. It has 10 digits: $0, 1, 2, 3, 4, 5, 6, 7, 8, 9$ The value of a number depends on its position.

Example: $7392$

$7 × 10^3$ (thousands place)
$3 × 10^2$ (hundreds place)
$9 × 10^1$ (tens place)
$2 × 10^0$ (ones place)

$7392 = 7 × 1000 + 3 × 100 + 9 × 10 + 2 × 1 = 7392$

This concept applies to any number system but instead of 10 we use a different base.

2. The (Binary) or (Base-2) System

Computers use binary (base-2) because they operate with only two states: 0 and 1. Every binary number is expressed in terms of powers of 2

Example: $11010.11$

$1 × 2^4 + 1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 0 × 2^0 + 1 × 2^{-1} + 1 × 2^{-2}$

Calculating step by step:

$1 × 16 = 16 $
$1 × 8 = 8 $
$0 × 4 = 0 $
$1 × 2 = 2 $
$0 × 1 = 0 $
$1 × 0.5 = 0.5 $
$1 × 0.25 = 0.25 $

Adding them up:

$16 + 8 + 0 + 2 + 0 + 0.5 + 0.25 = 26.75$

$(11010.11)₂ = (26.75)₁₀$

3. The (Hexadecimal) or (Base-16) System

The hexadecimal system is another number system used in computing. It is called Base-16 because it has 16 possible digits:

0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F

The letters A to F represent numbers 10 to 15 in decimal.

Example: Convert (B65F)₁₆ to Decimal

Each digit is multiplied by a power of 16 (the base):

$B65F₁₆ = (11 × 16^3) + (6 × 16^2) + (5 × 16^1) + (15 × 16^0)$

Calculating step by step:

$11 × 4096 = 45056 $
$6 × 256 = 1536 $
$5 × 16 = 80 $
$15 × 1 = 15$

Now add them up:

$45056 + 1536 + 80 + 15 = 46687$

$(B65F)₁₆ = (46687)₁₀$

4. Other Number Systems

Base (Radix)	Name	Uses
Base-3	Ternary	Some theoretical computing models and balanced ternary systems.
Base-4	Quaternary	Used in some genetic algorithms and quantum computing research.
Base-5	Quinary	Found in some ancient cultures and abacus systems.
Base-6	Senary (Hexary)	Some mathematical applications, used by the human hand (counting with one hand using the thumb).
Base-7	Septenary	Rarely used, but found in some ancient calendars.
Base-8	Octal	Used in early computing, file permissions in UNIX, and compact representation of binary.
Base-9	Nonary	Occasionally used in some mathematical puzzles.
Base-11	Undecimal	Rare, but has been considered in some numeral systems.
Base-12	Duodecimal (Dozenal)	Used in traditional measurement systems (dozen, feet-inches) and advocated for better fraction division.
Base-20	Vigesimal	Used by the Mayan and Aztec civilizations, also seen in some languages for counting.
Base-24	Quadrovigesimal	Used in some encoding and encryption techniques.
Base-32	Duotrigesimal	Used in Base32 encoding (data encoding in computing).
Base-36	Hexatrigesimal	Used in URL shorteners and compact representation of alphanumeric IDs.
Base-60	Sexagesimal	Used in time (60 seconds, 60 minutes), angles (360 degrees), and ancient Babylonian mathematics.

5. Unusual & Specialized Bases

Base	Name	Uses
Base-64	Base64 Encoding	Used in data encoding for transferring files and email attachments.
Base-256	Extended Binary	Used in character encoding (ASCII, Unicode).
Base-1	Unary	Used in tally marks and theoretical computing.

Important Powers of 2 (in Computing)

Computers use binary, so powers of 2 are important in memory, storage, and networking.
For example:

Power	Value (Bytes/Bits)	Common Name	Range It Can Store
2⁰	1 bit	Smallest unit of data	0 to 1
2¹	2 bits		0 to 3
2²	4 bits	1 Nibble	0 to 15
2³	8 bits	1 Byte or 1 Octet	0 to 255
2⁴	16 bits	2 Bytes (Half-word)	0 to 65,535
2⁵	32 bits	Word in 32-bit CPUs	0 to 4,294,967,295
2⁶	64 bits	Double-word	0 to 18,446,744,073,709,551,615
2⁸	256 bytes		0 to 2⁶⁴ - 1
2¹⁰	1,024 bytes	1 Kilobyte	0 to 2⁸⁰ - 1
2²⁰	1,048,576 bytes	1 Megabyte	0 to 2¹⁶⁰ - 1
2³⁰	1,073,741,824 bytes	1 Gigabyte	0 to 2²⁴⁰ - 1
2⁴⁰	1,099,511,627,776 bytes	1 Terabyte	0 to 2³²⁰ - 1

Megabit (Mb) vs. Megabyte (MB)

One of the biggest confusions in tech and internet speed marketing comes from the difference between Mb (megabit) and MB (megabyte).

1 Byte = 8 Bits

1 Megabit (Mb) = 1,000,000 bits (ISPs use this to advertise speed)
1 Megabyte (MB) = 8 Megabits (Mb)

Since ISPs measure in bits, they make the speed seem 8× larger than what you actually get in megabytes per second (MB/s).

How ISPs Trick You

When an ISP advertises “100 Mbps”, they mean 100 Megabits per second (Mb/s).
Since 1 MB = 8 Mb, your actual download speed is:

100 Mbps ÷ 8 = 12.5 MB/s

Arithmetic Operations in Binary

Binary arithmetic follows the same basic principles as decimal arithmetic but operates with only two digits: 0 and 1. The four fundamental operations are:

Binary Addition
Binary Subtraction
Binary Multiplication
Binary Division

1. Binary Addition

Binary addition follows simple rules:

Binary Addition Rule	Decimal Equivalent
0 + 0 = 0	0 + 0 = 0
0 + 1 = 1	0 + 1 = 1
1 + 0 = 1	1 + 0 = 1
1 + 1 = 10 (write 0, carry 1)	1 + 1 = 2
1 + 1 + 1 = 11 (write 1, carry 1)	1 + 1 + 1 = 3

Example of Binary Addition

Let’s add $1011₂ (11)₁₀$ and $1101₂ (13)₁₀$

   1011
+  1101
--------
  11000  (24)₁₀

Steps:

$1 + 1 = 10$ → write 0, carry 1
$(1 + 0) + (1 $ we carried$) = 10$ → write 0, carry 1
$(0 + 1) + (1 $ we carried$) = 10$ → write 0, carry 1
$(1 + 1) + (1 $ we carried$) = 11$

Result: $11000₂$

2. Binary Subtraction

Binary subtraction follows these rules:

Binary Subtraction Rule	Decimal Equivalent
0 - 0 = 0	0 - 0 = 0
1 - 0 = 1	1 - 0 = 1
1 - 1 = 0	1 - 1 = 0
0 - 1 = (write 1, borrow 1)	0 - 1 = -1

Example of Binary Subtraction

Let’s subtract $1011₂ (11)₁₀ - 110₂ (6)₁₀$

    1011
-   0110
--------
    0101  (5)₁₀

Steps:

$1 - 0 = 1$
$1 - 1 = 0$
$0 - 1 = 1$ → write 1, borrow 1
$(1 - 0) - (1 $ we borrowed$) = 0$

Result: $0101₂$

3. Binary Multiplication

Binary multiplication is like decimal multiplication but follows simpler rules:

Binary Multiplication Rule	Decimal Equivalent
0 × 0 = 0	0 × 0 = 0
0 × 1 = 0	0 × 1 = 0
1 × 0 = 0	1 × 0 = 0
1 × 1 = 1	1 × 1 = 1

Example of Binary Multiplication

Let’s multiply $101₂ (5)₁₀ × 11₂ (3)₁₀$

   101  (5)
×   11  (3)
------
   101   (this is 101 × 1)
+ 101    (this is 101 × 1, shifted left by 1)
------
  1111  (15)₁₀

Steps:

Multiply $101 × 1 = 101$
Multiply $101 × 1 = 1010$ , shift left
Result: $101 + 1010 = (1111)₂$

4. Binary Division

Binary division is similar to long division in decimal.

Example of Binary Division

Divide 1010₂ (10₁₀) ÷ 10₂ (2₁₀):

  101 (5₁₀)
------
10 | 1010
      -10
      ----
       01 (Bring down 0)
      -10
      ----
        0

The quotient is $101₂$

Number Base Conversions

In computing, numbers can be represented in various bases (radices), when converting between these systems, the key is to ensure that the representations are equivalent in decimal form.

For example:

(0011)₈ (Octal) and (1001)₂ (Binary) are equivalent because both represent the decimal number 9.

Converting Decimal to Base-r

Method: Repeated Division by the Target Base

Divide the number by the target base (r).
Record the remainder.
Use the quotient for the next division.
Repeat until the quotient is 0.
The Answer is the remainders read from bottom to top, left to right.

Example 1.1: Convert (41)₁₀ to Binary (Base-2)

41 ÷ 2 = 20  remainder 1
20 ÷ 2 = 10  remainder 0
10 ÷ 2 = 5   remainder 0
5  ÷ 2 = 2   remainder 1
2  ÷ 2 = 1   remainder 0
1  ÷ 2 = 0   remainder 1

Answer: $(41)₁₀ = (101001)₂$

Example 1.2: Convert (153)₁₀ to Octal (Base-8)

153 ÷ 8 = 19  remainder 1
19  ÷ 8 = 2   remainder 3
2   ÷ 8 = 0   remainder 2

Answer: $(153)₁₀ = (231)₈$

Converting Decimal Fractions to Another Base

Method: Repeated Multiplication by the Target Base

Multiply the fraction by the target base.
Record the integer part of the result.
Use the fractional part for the next multiplication.
Repeat until the fraction becomes 0 (or to the desired precision).
The Answer is the integer parts read from top to bottom.

Example 1.3: Convert (0.6875)₁₀ to Binary (Base-2)

0.6875 × 2  = 1.375   → integer part = 1
0.375  × 2  = 0.75    → integer part = 0
0.75   × 2  = 1.5     → integer part = 1
0.5    × 2  = 1.0     → integer part = 1 (stops here)

Answer: $(0.6875)₁₀ = (0.1011)₂$

Example 1.4: Convert (0.513)₁₀ to Octal (Base-8)

0.513 × 8 = 4.104   → integer part = 4
0.104 × 8 = 0.832   → integer part = 0
0.832 × 8 = 6.656   → integer part = 6
0.656 × 8 = 5.248   → integer part = 5
0.248 × 8 = 1.984   → integer part = 1
0.984 × 8 = 7.872   → integer part = 7

Answer: $(0.513)₁₀ ≈ (0.406517)₈$

Converting Mixed Decimal Numbers to Another Base

When dealing with numbers that have both integer and fractional parts, convert them separately and then combine the results.

Example 1.5: Convert (41.6875)₁₀ to Binary

Using previous results:

$(41)₁₀ → (101001)₂$
$(0.6875)₁₀ → (0.1011)₂$

Answer: $(41.6875)₁₀ = (101001.1011)₂$

Example 1.6: Convert (153.513)₁₀ to Octal

Using previous results:

$(153)₁₀ → (231)₈$
$(0.513)₁₀ → (0.406517)₈$

Answer: $(153.513)₁₀ ≈ (231.406517)₈$

Binary to Octal Conversion

Example: Convert (101100110110)₂ to Octal

Group into threes: $(101 100 110 110)₂$
Convert each group:

$101 \to 5$
$100 \to 4$
$110 \to 6$
$110 \to 6$

Result: $(5466)₈$

For fractions, group digits to the right of the binary point.

Example: Convert (101100.110)₂ to Octal

Group into threes: $(101 100 . 110)₂$
Convert:

$101 \to 5$
$100 \to 4$
$.110 \to .6$

Result: $(54.6)₈$

Binary to Hexadecimal Conversion

Example: Convert (101100110110)₂ to Hexadecimal

Group into fours: $(1011 0011 0110)₂$
Convert each group:

$1011 \to B$
$0011 \to 3$
$0110 \to 6$

Result: $(B36)₁₆$

For fractions, group digits to the right of the binary point.

Example: Convert (101100.1101)₂ to Hexadecimal

Group into fours: $(1011 00.1101)₂$
Convert:

$1011 \to B$
$00.1101 \to 0.D$

Result: $(B0.D)₁₆$

Octal & Hexadecimal to Binary Conversion

Conversion from octal or hexadecimal to binary is just the reverse process:

Replace each octal digit with its 3-bit binary equivalent.
Replace each hexadecimal digit with its 4-bit binary equivalent.

Example: Convert (306.D)₁₆ to Binary

$3 \to 0011$
$0 \to 0000$
$6 \to 0110$
$D \to 1101$

Result: $(0011 0000 0110.1101)₂$

Example: Convert (673.124)₈ to Binary

$6 \to 110$
$7 \to 111$
$3 \to 011$
$.1 \to 001$
$2 \to 010$
$4 \to 100$

Result: $(110 111 011.001 010 100)₂$

Compact Representation in Computing

Binary numbers are essential for computers but are too long for humans to work with efficiently. This is why:

Octal is 1/3 the length of binary (every 3 binary digits = 1 octal digit).
Hexadecimal is 1/4 the length of binary (every 4 binary digits = 1 hex digit).

Example: Compact Representations

Binary: (1111 1111 1111)₂ (12 digits)
Octal: (7777)₈ (4 digits)
Hexadecimal: (FFF)₁₆ (3 digits)

Because of this compactness, many computer manuals and digital system designs prefer hexadecimal over octal since:

Each byte (8 bits) is represented as two hex digits.
Hex fits better with modern architectures, which commonly use 8-bit, 16-bit, 32-bit, and 64-bit data representations.

Table: Number System Equivalents (0-15)

Decimal (10)	Binary (2)	Octal (8)	Hexadecimal (16)
0	0000	00	0
1	0001	01	1
2	0010	02	2
3	0011	03	3
4	0100	04	4
5	0101	05	5
6	0110	06	6
7	0111	07	7
8	1000	10	8
9	1001	11	9
10	1010	12	A
11	1011	13	B
12	1100	14	C
13	1101	15	D
14	1110	16	E
15	1111	17	F

Signed Integer

A signed integer can represent both positive and negative numbers.
The most significant bit (MSB) is used as the sign bit. If the sign bit is 1, the number is negative. If it is 0, the number is positive.

The range of a signed integer depends on its size:

A signed 8-bit integer can represent values from -128 to 127.
A signed 16-bit integer can represent values from -32,768 to 32,767.
A signed 32-bit integer can represent values from -2,147,483,648 to 2,147,483,647.

1. Signed-Magnitude Representation

Use the leftmost bit (MSB) as sign bit
The remaining bits hold the absolute value (magnitude)

Example: 8-bit

+9 = 00001001
-9 = 10001001

Notice: only the sign bit flipped

Problems:

Two zeros:
- 00000000 = +0
- 10000000 = -0

2. Signed-1’s Complement Representation

Positive numbers are normal.
To get a negative number: flip all bits (bitwise NOT), including the sign bit.

Example:

+9 = 00001001
-9 = 11110110 ← flip every bit

Problems:

Still two zeros:
- 00000000 = +0
- 11111111 = -0
Adds require end-around carry

3. Signed-2’s Complement Representation

Positive numbers are the same.
Negative number = 2’s complement of the positive one.

That means:

-9 = Flip all bits of +9, then add 1

Example:

+9  = 00001001
NOT = 11110110   ← 1’s complement
+1  = 11110111   ← 2’s complement = -9

Advantages:

Only one zero: 00000000
Arithmetic is super easy: same circuits for signed and unsigned
This is what real CPUs use

Table Breakdown

Decimal	2’s Comp	1’s Comp	Signed-Mag
+7	0111	0111	0111
+6	0110	0110	0110
…	…	…	…
+0	0000	0000	0000
-0	❌	1111	1000
-1	1111	1110	1001
-2	1110	1101	1010
-7	1001	1000	1111
-8	1000	❌	❌

Unsigned Integer

An unsigned integer can only represent non-negative numbers (zero and positive numbers).
There is no sign bit, so all bits are used to represent the magnitude of the number.

The range of an unsigned integer is larger than that of a signed integer of the same size:

An unsigned 8-bit integer can represent values from 0 to 255.
An unsigned 16-bit integer can represent values from 0 to 65,535.
An unsigned 32-bit integer can represent values from 0 to 4,294,967,295.

For example:

In many programming languages, the int keyword is used to declare an integer variable. By default, int is usually signed, and If you want to explicitly declare an unsigned integer, you use the unsigned keyword.

Complements

Complements are mathematical techniques used in computers to:

Simplify subtraction by converting it to addition.
Represent negative numbers essential for signed arithmetic.
Enable efficient hardware design No separate subtraction circuit needed (uses same adder).

Types of Complements

1. Diminished Radix Complement

For a number $N$ in base $r$ with $n$ digits, the $(r−1)’s$ complement is:

$(r^n - 1) - N$

Decimal (9’s Complement)

$Base\ r = 10 → (r−1) = 9 → 9’s\ complement$

$9’s\ complement\ of\ N = (10^n - 1) - N$

Since $( 10^n - 1 )$ is just a number with $n\ 9s$ Subtract each digit of $N$ from $9$

Examples:

9’s complement of $546700$ = $999999 - 546700 = 453299$
9’s complement of $012398$ = $999999 - 012398 = 987601$

Binary (1’s Complement)

$Base\ r = 2 → (r−1) = 1 → 1’s complement$

$1’s\ complement\ of\ N = (2^n - 1) - N$

Since $( 2^n - 1 )$ is just a number with $n\ 1s$

Flip each bit:

$ 1 → 0$
$ 0 → 1$

Examples:

1’s complement of $1011000 = 0100111$
1’s complement of $0101101 = 1010010$

Octal & Hexadecimal

In octal (base 8): subtract each digit from 7
In hexadecimal (base 16): subtract each digit from F (decimal 15)

2. Radix Complement

The $r’s\ complement$ of an $n$-digit number $N$ in base $r$ is

$r^n - N \ for\ N > 0, \ and\ 0\ if\ N = 0$

It is equivalent to:

$r’s complement\ = (r−1)’s\ complement\ + 1$

Decimal (10’s Complement)

$Base\ r = 10$

$10’s\ complement\ of\ N\ = (10^n - N)$

since $r^n - N = [(r^n - 1) - N] + 1$ Thus, the 10’s complement is obtained by adding $1$ to the $9’s$ complement value.

Binary (2’s Complement)

$Base\ r = 2$

$2’s\ complement\ of\ N\ = (2^n - N)$

Equivalent to:

$1’s complement\ of\ N\ + 1$

Notes

If $N$ includes a radix point, remove it before computing the complement, and restore it afterward in the same position.
Double complement returns original value:
- Complement of (Complement of N) = N

Subtraction with Complements

Here’s the logic:

Instead of subtracting $N$, you add its complement

Add the minuend $M$ to the r’s complement of the subtrahend $N$

$M + (r^n - N) = M - N + r^n$

Now two possibilities:

If M ≥ N → the sum will be ≥ rⁿ, and the carry happens. You discard the carry, and the rest is the correct answer.
If M < N → the sum is < rⁿ, so no carry happens, and the result is in r’s complement form (a negative number in disguise). You take the r’s complement again to find its magnitude and place a negative sign in front.

Example 1.1

Using 10’s complement, subtract $72532 - 3250$

$M = 72532$

$N = 03250$ (we pad it to 5 digits)

$r⁵ = 100000$ So 10’s complement of $03250 = 100000 - 03250 = 96750$

Why this? cuz adding $96750$ to $M$ is like subtracting $03250$ (but easier for the computer).

$M$ + 10’s complement of $N = 72532 + 96750 = 169282$

Check for carry

The result is $> 100000$ → carry occurred → $M ≥ N$ So discard the carry (subtract 100000):

$169282 - 100000 = 69282$

Final answer: $69282$

Why discard the carry? Because it’s the extra $rⁿ$ we added earlier to help convert subtraction to addition. It’s not part of the actual difference.

Example 1.2

Using 10’s complement, subtract $3250 - 72532$

We need to make sure both numbers have the same number of digits to work with complements.

$M = 3250$
We pad it with leading zeros to make it 5 digits: $M = 03250$
$N = 72532$
This already has 5 digits, so no padding needed.

To subtract using complements, we first need to find the 10’s complement of $N$. The 10’s complement is found by subtracting the number from $10^n$ (where $n$ is the number of digits).

$N = 72532$ (5 digits)

$10^5 = 100000$

So, the 10’s complement of N is:

$10^5 - N = 100000 - 72532 = 27468$

Now, we add $M$ to the 10’s complement of $N$:

The sum is: $30718$

Check for End Carry

If there is an end carry, we discard it and take the result as positive.
If there is no end carry, we know that the result is negative, and we take the complement of the result to find the negative value.

So,

The sum is $30718$, and there is no end carry (the sum is less than $100000$). Take the 10’s Complement of the Result
$10^5 - 30718 = 100000 - 30718 = 69282$

Now, we place a negative sign in front of the result: $−69282$

Example 1.3

Given the two binary numbers $X = 1010100$ and $Y = 1000011$
perform the subtraction $X - Y$ and $Y - X$ by using 2’s complements.

X - Y

Find 2’s complement of $Y = 1000011$

So $-Y = 0111101$

then add $X$ and $-Y$
Sum $= 10010001$

Discard end carry $2^7 = 10000000$
Answer: $X - Y =$ Sum $- 10000000 = 0010001$

Y - X

2’s complement of $X = 1010100$

$-X = 0101100$

then add $Y$ and $-X$
Sum $= 1101110$

There is no end carry. Therefore, the answer is $Y - X = -(1’s$ complement of $1101110) = -0010001$

Discard End Carry (in 2’s complement)

If there’s a carry-out (bit too big to fit), you just throw it away.

Why? Because in 2’s complement, the carry-out isn’t needed, the result is already correct due to how 2’s complement encodes negatives.

Example (4-bit):

  0110   = 6
+ 1101   = -3 (in 2’s comp)
-------
 1 0011  ← Carry-out = 1

Discard carry: result $= 0011 = 3$

End-Around Carry (in 1’s complement)

Instead of discarding the carry, you add it back into the result.

Why? Because 1’s complement doesn’t fully represent negatives. Adding back the carry fixes the incomplete math.

Example (4-bit):

  0110   = 6
+ 1100   = 1’s comp of 3
-------
1 0010   ← Carry-out = 1

Add carry back:

$0010 + 1 = 0011 = 3$

Endianness

The word “endianness” itself comes from the term “endian”, which was coined in reference to Gulliver’s Travels by Jonathan Swift. In the book, there is a fictional conflict between two factions: one that breaks their eggs at the small end and another that breaks them at the large end. This is an allegory about a seemingly trivial difference leading to a huge conflict.

In computing, “endianness” refers to the way in which multi-byte data is stored in memory — whether the most significant byte (big-endian) or the least significant byte (little-endian) is stored first.

Big-endian: “The big end of the egg first” — the most significant byte is stored first.
Little-endian: “The little end of the egg first” — the least significant byte is stored first.

What is Endianness In computing?

Endianness refers to the byte order in which data is stored in memory. When you have multi-byte values (like a 32-bit integer), the system needs to decide how to arrange the bytes in memory. There are two common types of endianness:

1. Big Endian

In big-endian systems, the most significant byte (MSB) is stored at the lowest memory address (i.e., at the beginning of the data). So, for a multi-byte value like a 32-bit integer, the most significant part of the number (the leftmost byte) comes first.

Example:

Let’s take the 32-bit integer 0x01020304.

In Big Endian, the bytes would be stored in memory like this:

Address:      0x00     0x01     0x02     0x03
Value:       0x01     0x02     0x03     0x04

In this case, 0x01 is the most significant byte (MSB) and is stored at the lowest memory address (0x00).

2. Little Endian

In little-endian systems, the least significant byte (LSB) is stored at the lowest memory address. This means that for multi-byte values, the rightmost byte (the least significant byte) comes first.

Example:

Now, let’s look at the same 32-bit integer 0x01020304.

In Little Endian, the bytes would be stored in memory like this:

Address:      0x00     0x01     0x02     0x03
Value:       0x04     0x03     0x02     0x01

Here, 0x04 (the least significant byte) is stored at the lowest memory address (0x00).

Why does Endianness Matter?

When data is transmitted over a network or stored in memory, it’s important to know how the bytes are ordered. If two systems (like a server and a client) have different endianness formats, misinterpretation of the data could occur, leading to errors or bugs.

For example:

A network protocol may require data to be sent in big-endian format (also called network order).
But your local computer (host) might use little-endian format.

This mismatch could cause problems unless the data is correctly converted to the expected byte order on both ends.

Network Order vs Host Order:

Network Order: This refers to the big-endian format, as it’s the standard used in most network protocols (TCP/IP).
Host Order: This refers to the endianness used by your specific machine (could be either little-endian or big-endian depending on the architecture, x86 uses little-endian, while SPARC uses big-endian).

Endianness in Action:

When you’re sending data over a network (such as sending a 32-bit integer), you’ll need to ensure the byte order is correct.
For example, if you’re working on a system that uses little-endian (like most x86 machines) but you’re communicating with a system that expects big-endian (like some older network protocols), you would need to convert the data into big-endian before sending it.

Example: Converting Between Endianess

Most programming languages provide functions to handle byte order conversion. For example, in C:

htonl() and htons() (host to network, long and short) convert data to network order (big-endian).
ntohl() and ntohs() (network to host, long and short) convert data from network order to host order.

These functions ensure that data is sent in the right order, no matter what endianness the host system uses.

Middle Endian?

The middle-endian format (sometimes called PDP-endian) is a rare, older format used in some systems (like the PDP-11). It uses a hybrid of big and little endian and is not commonly seen today, so you typically don’t need to worry about it.

ASCII

The Basics of Early Encoding

ASCII stands for the American Standard Code for Information Interchange.

How ASCII works:
ASCII uses a 7-bit binary system, which means it can represent 128 characters (from 0 to 127). This was enough for the English language, including letters, digits, punctuation, and a few control characters (like newline, tab, and carriage return).
Example:
- The letter A in ASCII is represented as 65 in decimal or 41 in hexadecimal. In binary, it’s 01000001.
- The letter B is 66 in decimal, 42 in hexadecimal, and 01000010 in binary.

Here’s the issue: ASCII can only represent 128 characters. This isn’t enough to represent characters from all languages around the world, especially when languages have thousands of characters (Chinese, Japanese, Arabic, etc.).

Code Pages

To solve this limitation of 128 characters, we get Code Pages. A Code Page is essentially a mapping of values (numbers) to characters. The idea is to use the unused values (128–255) from the 8-bit range (1 byte) to add more characters to the encoding.

Example of Code Pages:
- ISO/IEC 8859-1 (also called Latin-1) is a code page for Western European languages (like French, German, Spanish, etc.). It uses values from 128 to 255 to represent accented characters like é or ñ.
- Windows-1252 is another code page used for Western European languages.

The issue with code pages is that they are limited to a region. For example, Windows-1251 is used for Cyrillic (Russian), and it won’t work for Chinese, Japanese, or Arabic characters. Each code page is designed for a specific language group, and you can’t easily switch between them.

Multibyte Character Sets – Representing Non-Latin Languages

For languages that have more than 256 characters (like Chinese, Japanese, and Korean), you need a way to represent thousands of characters. This is where Multibyte Character Sets come into play.

What is Multibyte?
Instead of using 1 byte (8 bits) per character, we use multiple bytes to represent a single character. So, one character might take 2 bytes, 3 bytes, or even more depending on the language and character.
Why Multibyte?
If you only had 256 possible values (1 byte), you couldn’t represent the thousands of characters needed for languages like Chinese. By using multibyte encoding, you can represent a much larger set of characters.

Examples of Multibyte Encoding:

Shift-JIS: This is a multibyte encoding for Japanese characters. It uses both 1 byte (for ASCII characters) and 2 bytes (for kanji, hiragana, and katakana characters).
GB2312: This is a multibyte encoding for Simplified Chinese. It uses 2 bytes for most characters.

However, multibyte character sets are not perfect, and they can be tricky to handle because:

They don’t always define how to handle byte sequences correctly.
They are not universal and are region-dependent (for example, Shift-JIS is for Japanese, not for other languages).

Unicode

Unicode is a universal character encoding standard. Its goal is to represent every character in every language ever written, including ancient languages (like Egyptian hieroglyphs) and constructed languages (like Klingon or Esperanto).

Unicode Code Points

Each character in Unicode is assigned a unique identifier called a code point. A code point is usually written in the format U+XXXX, where XXXX is the hexadecimal value of the code point.

Example:
- The letter A in Unicode is U+0041 (the same as ASCII).
- The Chinese character 字 (meaning “character” in Chinese) has the Unicode code point U+5B57.
- The emoji 😀 has the Unicode code point U+1F600.

Unicode Encodings – How to Store Characters**

There are different encodings that determine how Unicode characters are stored in memory or files. The most common are UTF-8, UTF-16, and UTF-32.

UTF-8

Variable-Length Encoding: UTF-8 uses 1 to 4 bytes to represent a character.
- Characters in the ASCII range (U+0000 to U+007F) use 1 byte, making UTF-8 compatible with ASCII.
- Non-ASCII characters (like Chinese or emoji) use 2, 3, or 4 bytes.
- Advantage: It’s space-efficient for English text and backward-compatible with ASCII.
- Example:
  - The letter A (U+0041) in UTF-8 is just 0x41 (1 byte).
  - The Chinese character 字 (U+5B57) in UTF-8 is 0xE5 0xAD 0x97 (3 bytes).

UTF-16

Fixed-Length Encoding for Most Characters: UTF-16 uses 2 bytes for most characters. However, it can use 4 bytes for some rare characters (those outside the Basic Multilingual Plane).
- Advantage: UTF-16 is more efficient for languages with a lot of characters, like Chinese or Japanese, because it uses fewer bytes for these characters compared to UTF-8.
- Example:
  - The letter A (U+0041) in UTF-16 is 0x0041 (2 bytes).
  - The Chinese character 字 (U+5B57) in UTF-16 is 0x5B57 (2 bytes).

UTF-32

Fixed-Length Encoding: UTF-32 uses 4 bytes for every character, no matter how simple or complex it is.
- Advantage: Every character is represented by exactly 4 bytes, making it easy to calculate the size of a string.
- Disadvantage: It’s very space-inefficient because even simple characters like “A” use 4 bytes.
- Example:
  - The letter A (U+0041) in UTF-32 is 0x00000041 (4 bytes).
  - The Chinese character 字 (U+5B57) in UTF-32 is 0x00005B57 (4 bytes).

Variable Binary Length Data

In many communication protocols, data doesn’t always fit neatly into a fixed-size format. Imagine you have a username for authentication. Some usernames might be very short (like “Bob”), while others could be much longer (like “SuperUltraMegaUser123”). If we try to store these in a fixed-length format (always 16 characters), we end up wasting space for shorter usernames or might run out of space for longer usernames. This is where variable-length data comes in.

Why use variable-length data?
It’s more efficient because it uses exactly as much space as needed for the data, instead of wasting space or being too restrictive.

Now, protocols that transmit data over a network often need to handle these variable-length data values efficiently. Let’s dive into the different strategies used to handle variable-length data.

Terminated Data (A Common Method)

One of the most common methods for handling variable-length data is to use terminators. This means that each data element has a special marker (a terminating symbol) that tells the system where the data ends. You’ve probably encountered this with strings where there is a special symbol like a null character (NUL) to mark the end of the string.

Why use a terminator?
The reason is simple: a terminator is a clear signal that the data has ended, which makes it easy to know where to stop reading. Without a terminator, the protocol wouldn’t know where the data ends, leading to confusion or incomplete data reading.

Example of a Terminated String (NUL-terminated):

Consider the string “Hello”:

Data: H e l l o
Hexadecimal representation: 0x48 0x65 0x6C 0x6C 0x6F
NUL terminator: 0x00

Here’s how it works:

The string “Hello” is stored as a sequence of bytes, and the NUL value 0x00 marks the end of the string.

Why does this work?

The NUL character is not typically part of a string, so it’s safe to use it as a marker to end the string. When the protocol reads data, it stops reading as soon as it encounters 0x00 (the NUL terminator). This tells the system, “Okay, that’s the end of the string!”

Why does this approach work well?

The protocol doesn’t need to know ahead of time how long the string is. It just reads until it sees the NUL terminator.

What if the terminator (NUL) appears in the string?

Problem: If the data itself contains a 0x00 byte (for example, in some binary data), it could prematurely signal the end of the string.
Solution: Escape characters are used to prevent confusion. For example, if the string contains a 0x00, it might be escaped with a special symbol like a backslash (\0) or repeated 0x00 to signal that it’s not the terminator but part of the string.

Bounded Data (Start and End Markers)

Another method for handling variable-length data is bounded data, where you specify a start and end marker. In this case, the data is surrounded by specific characters that mark where the data begins and ends.

Example: Quoted String

Consider a string "Hello" that is bounded by double quotes. This means the protocol knows that:

The data starts with a " (quote character), and
The data ends with the matching " (quote character).
Data: "Hello"
Hexadecimal representation: 0x22 0x48 0x65 0x6C 0x6C 0x6F 0x22

Here, the double quotes 0x22 mark the beginning and end of the string.

The start quote signals where the string begins.
The end quote signals where the string ends.

Why use bounded data?

Predictability: The protocol always knows exactly where the string starts and ends, so it doesn’t need to rely on special terminators like NUL.
Flexibility: You can use any symbols as delimiters (quotes, braces, etc.), which is very useful in some protocols (like JSON).

Problem with Bounded Data?

If the boundary symbols (like quotes) appear inside the data itself, the protocol would have trouble distinguishing between data and the boundary symbol.
Solution: Escape the boundary symbols using special characters. For example, "Hello "Bob"" would escape the quote inside the string as \".

Why is this approach used?

Human-readable format: It’s easier to interpret because it often mimics common syntax in programming languages (like JSON, XML, etc.), making it understandable for both computers and humans.

Length-Prefixed Data**

This is another strategy where the length of the data is specified before the data itself. This is often used in binary protocols because it’s efficient and avoids the ambiguity of terminators or boundaries.

Example:

Imagine sending a string like "Hello":

The length of the string (which is 5) is sent first, followed by the actual data.
Data format: [Length][Data]
So it would look like this:
- Length: 0x05 (5 bytes)
- Data: H e l l o
Why Length-Prefixed?
This approach solves the problem of variable-length data by making sure the receiver always knows how much data to read, without needing terminators or boundary markers.

Why use this method?

It’s very efficient because the protocol doesn’t need to scan the data for terminators or matching symbols. The length is specified right upfront, so it knows exactly how much to read.
Predictable: No need for special symbols to mark the start or end of the data.

Implicit-Length Data**

Sometimes, the length of the data is implicit from the context. For example, if you’re transmitting a fixed-length header or using a fixed format, the protocol already knows how long each field will be.

Example:

A protocol might always expect the first 4 bytes to be a header, followed by a 10-byte username. If the protocol always knows that the username is 10 bytes long, it doesn’t need to specify the length each time.

The format could look like this:
- Header (4 bytes) + Username (10 bytes)
- You don’t need to explicitly state the length of the username since the protocol knows it’s always 10 bytes.
Why use Implicit-Length?
- It’s efficient because it doesn’t require sending additional length information, reducing overhead.
- It’s useful when data formats are fixed and predictable, which is common in low-level protocols.

Padded Data

In some cases, protocols require that data be padded to a specific length. This ensures the data always takes up a predictable number of bytes, which can be important for alignment or buffer management.

Example:

If a protocol expects data to be 16 bytes long, it might pad shorter data with extra bytes (often zeros) to ensure it’s always 16 bytes.

Data: "Hello"
Padded Data: "Hello" + 0x00 0x00 0x00 0x00 0x00 to make it 16 bytes.

Why padding?

Padding ensures the data is a fixed size, which is important for buffer alignment in low-level systems.
It can also help prevent buffer overflows or make the data easier to manage.