Golang Hex and Base64
Two common formats for representing binary data are hexadecimal and Base64.
Bits, bytes, numbers and characters
The smallest element of data on a computer is a bit, which can be either a '1' or a '0'. Normally, we store these bits within memory locations which contain eight bits (a byte). The smallest addressable element in memory is thus a single byte. These bytes can then be defined either as numbers (real or integer), characters (such as eight- or 16-bit representations), strings (with zero or more characters), or binary values. In order to read the data elements correctly, we must define how many bytes are used to represent the data element. For numerical values, the main classifications are: integers; rational numbers; real numbers; and complex numbers. These are defined in sets as:
- Integers can be positive or negative numbers and have no fractional part. They are represented with the \(\mathbb{Z}\) symbol {... -2, -1, 0, +1, +2, ...}.
- Rational numbers are fractions (\(\mathbb{Q}\)).
- Real numbers (\(\mathbb{R}\)) include both integers and rational numbers, and any other number that can be used in a comparison.
- Prime numbers (\(\mathbb{P}\)) represent the integers which are only divisible by themselves and unity.
- Natural numbers (\(\mathbb{N}\)) represent positive numbers which are integers {1,2,...}.
For integers, we can have:
- char (byte). This uses eight bits and ranges from 0 to 255.
- signed char (char). This uses eight bits and ranges from -128 to 127.
- short (short). This uses 16 bits and ranges from -32,768 to 32,767.
- unsigned short (ushort). This uses 16 bits and ranges from 0 to 65,535.
- int (int). This uses 32 bits and ranges from -2,147,483,648 to 2,147,483,647.
- unsigned int (uint). This uses 32 bits and ranges from 0 to 4,294,967,295.
- long (long). This uses 64 bits and ranges from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
- unsigned long (ulong). This uses 64 bits and ranges from 0 to 18,446,744,073,709,551,615.
For integers larger than 64 bits, we can use a Big Integer (Big Int) format, which represents the integer values as strings, but operates on them with integer operations (such as integer add, subtract, divide and multiply). For real numbers, we typically use a four-byte (32-bit) floating point value (float), or an eight-byte (64-bit) floating point value (double). A float has a magnitude range of approximately \(1.175 \times 10^{-38}\) to \(3.402 \times 10^{38}\), and a double of approximately \(2.225 \times 10^{-308}\) to \(1.798 \times 10^{+308}\).
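As a rough illustration (a minimal sketch, assuming Go 1.17 or later, which provides the math.MaxInt/MinInt constants), we can print some of these ranges and use math/big for a value beyond 64 bits:

package main

import (
	"fmt"
	"math"
	"math/big"
)

func main() {
	// Ranges of a few of the integer types listed above.
	fmt.Println("uint8: 0 to", math.MaxUint8)
	fmt.Println("int16:", math.MinInt16, "to", math.MaxInt16)
	fmt.Println("uint32: 0 to", math.MaxUint32)
	fmt.Println("int64:", math.MinInt64, "to", math.MaxInt64)

	// Largest float32 and float64 values.
	fmt.Println("float32 max:", math.MaxFloat32)
	fmt.Println("float64 max:", math.MaxFloat64)

	// A Big Integer handles values beyond 64 bits.
	a := new(big.Int).Exp(big.NewInt(2), big.NewInt(100), nil) // 2^100
	fmt.Println("2^100 =", a)
}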
Memory thus stores data in bytes, where each byte has a unique memory location. The order in which the bytes of a multi-byte value are stored depends on the computer architecture. Most PC systems using Intel processors use a \emph{Little Endian} format, where the least significant byte is stored in the lowest memory address. Thus, if we have an unsigned 32-bit value of 0x01020304 (16,909,060), the value is stored at:

Location (100h): 04 (Least significant byte)
Location (101h): 03
Location (102h): 02
Location (103h): 01 (Most significant byte)
Most processors now use the Little Endian format. The Big Endian format has been used in IBM z/Architecture mainframes, where the most significant byte is stored in the lowest memory address. It is also used in network packets, such as in TCP and IP headers (network byte order). The original Apple Mac computer actually used the Motorola 68000-series processor ("68k" architecture), which used a Big Endian format. This caused considerable problems in porting software from a PC (which used the Intel x86 Little Endian architecture). In 2006, Apple released its first computers using Intel processors, with an operating system that supported both "x86" and "68k" software. Eventually, Apple migrated away from "68k", and its machines now only support Little Endian data formats for the storage of data in memory.
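A minimal Go sketch of the two byte orders, using the standard encoding/binary package to lay out the 0x01020304 example in memory order:

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	value := uint32(0x01020304)

	le := make([]byte, 4)
	be := make([]byte, 4)

	// Little Endian: the least significant byte goes into the lowest address.
	binary.LittleEndian.PutUint32(le, value)
	// Big Endian (network byte order): the most significant byte comes first.
	binary.BigEndian.PutUint32(be, value)

	fmt.Printf("Little Endian: % x\n", le) // 04 03 02 01
	fmt.Printf("Big Endian:    % x\n", be) // 01 02 03 04
}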
Encoding
We live in an analogue world and see light reflecting off a screen, where we detect changes in colour and brightness. This allows us to see shapes, and then convert these into objects. Around a third of the human brain has a core function of analysing the things that we see - the visual system - which is contained within the cortex. For shapes, such as the letters of the alphabet, we need the object cortex, where we can differentiate between things that have the same shape, such as a ball, the letter 'O', and an apple. The visual cortex can also make sense of seeing a car at different viewing angles, and still know that it is a car - even if the car is upside down. Computers, though, can only process binary data, and must convert our analogue world into a digital form. For them, a letter that we identify in our brain as the shape of the letter 'e' is represented by the binary pattern of "0110 0101". We thus differ in the ways that we often identify things around us.
So, our human visual processing mechanism is extremely powerful at making sense of our world, and at spotting objects no matter how they are presented to us. As humans, this is fundamental, and is a core part of keeping us safe. The fast recognition that a car is speeding towards you allows you to quickly decide on the best course of action to take. For a machine, the data must be converted into a binary form. We thus have a problem. How can we communicate effectively with a common format, so that humans can define things to a machine - such as binary patterns - and a machine can create something that a human would understand? We thus encode our data into forms that humans can understand. For binary patterns, such as for encryption keys, we might use a hexadecimal format, and for non-printable characters (such as a tab space) we could use a special escape character. And so, while humans are fast at recognising objects, our brains need complex models of objects, and have to continually realign for our analogue world. For a machine, once captured, an 'e' can just be stored with eight bits of data.
Binary, hex, octal
On a computer system, code and data are represented as binary, but humans find it difficult to deal with binary formats, so other formats are used to represent the binary values. Two typical formats used to represent characters are ASCII and UTF-16. With ASCII, we have 8-bit values and can thus support up to 256 different characters (\(2^8\)). Within ASCII coding, we map printable characters, such as 'a' and 'b', to decimal, binary and hexadecimal values. UTF-16 extends the character set to 16-bit values, and thus gives a total of 65,536 characters (\(2^{16}\)).
ASCII   Binary      Hex    Decimal
------------------------------------
'e'     0110 0101   0x65   101
'E'     0100 0101   0x45   69
' '     0010 0000   0x20   32
We also have other ‘non-printing’ characters which typically have a certain control function. These include CR (Carriage Return), LF (Line Feed), Horizontal Tab (HT) and Space.
ASCII   Binary      Hex    Decimal   Character representation
---------------------------------------------------------------
CR      0000 1101   0x0D   13        \r
LF      0000 1010   0x0A   10        \n
HT      0000 1001   0x09   9         \t
Some important non-printable ASCII characters are: Line Feed (0x0A); Carriage Return (0x0D); Horizontal Tab (0x09); and Backspace (0x08), while a Space is represented by 0x20. The representations for 'A' and 'B' are defined below. Within text files, for example, we are likely to have line breaks, which are created by the CR (Carriage Return) and LF (Line Feed) characters. In Microsoft Windows-type systems, we use a CR followed by an LF at the end of a line (\r\n), while a Linux/Mac-type system will only use an LF for a new line (\n).
Figure 1: Conversion from binary into hexadecimal or Base-64
Char   Dec   UTF-16              ASCII       Hex   Oct   HTML
----------------------------------------------------------------
A      65    00000000 01000001   01000001    41    101   &#65;
B      66    00000000 01000010   01000010    42    102   &#66;
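As a small illustration (a sketch using only the Go standard library), we can print the decimal, hex, octal and binary codes for these characters, along with their 16-bit UTF-16 code units:

package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	for _, ch := range []byte{'A', 'B', 'e'} {
		// Decimal, hexadecimal, octal and binary forms of the ASCII code.
		fmt.Printf("%c  dec=%d  hex=%x  oct=%o  bin=%08b\n", ch, ch, ch, ch, ch)
	}

	// UTF-16 encoding of the same characters (one 16-bit unit each).
	fmt.Printf("UTF-16: %04x\n", utf16.Encode([]rune("ABe")))
}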
The other core format that we use to represent data is a byte array, which is a sequence of byte values (0-255) that represents our data. For example, when we encrypt into ciphertext, it produces a bit stream which contains non-printing characters, and thus we need to represent them in a printable way. We may also need to represent our encryption keys in a printable and/or distributable manner. For this, we often use a hexadecimal or Base-64 format, as it allows us to represent binary data in a printable format (Figure 1).
Hexadecimal and Base-64
The conversion to a hexadecimal format involves splitting the bit stream into groups of four bits (Figure 2), while Base-64 splits it into groups of six bits (Figure 3). With the hexadecimal format, we have values from 0 to 15, which are represented by four-bit values from 0000 to 1111. For Base-64, we take six bits at a time. For example, if we take "fred", then we get:
ASCII    f        r        e        d
Binary   01100110 01110010 01100101 01100100
Figure 2: Conversion to hex
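As a quick check of the four-bit grouping, here is a minimal Go sketch which prints each character of "fred" with its binary value and the two hex digits (nibbles) it splits into:

package main

import (
	"encoding/hex"
	"fmt"
)

func main() {
	s := []byte("fred")

	// Each byte splits into two four-bit groups (nibbles),
	// and each nibble maps to one hex digit.
	for _, b := range s {
		fmt.Printf("%c  %08b  ->  %x %x\n", b, b, b>>4, b&0x0f)
	}

	// The full hex string: 66726564
	fmt.Println(hex.EncodeToString(s))
}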
Figure 3: Conversion to Base-64
To convert to Base-64, we group the bits into six-bit groups:
Binary 011001 100111 001001 100101 011001 00
And then map these to a Base-64 table:
Binary    011001  100111  001001  100101  011001  00
Decimal   25      39      9       37      25      0
Base-64   Z       n       J       l       Z       A
The result is ZnJlZA. With Base-64, we work with groups of four Base-64 characters: we pad the final group of bits with zeros to fill up the six-bit value, and then use the "=" character to pad the output to a multiple of four Base-64 characters:
test -> 01110100 01100101 01110011 01110100
test -> 011101 000110 010101 110011 011101 00[0000] = =
test -> d G V z d A = =
help -> 01101000 01100101 01101100 01110000
help -> 011010 000110 010101 101100 011100 00[0000] = =
help -> a G V s c A = =
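We can confirm these worked examples with Go's standard encoding/base64 package; a short sketch:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// Standard Base-64 pads with '=' to a multiple of four characters.
	for _, s := range []string{"fred", "test", "help"} {
		fmt.Printf("%s -> %s\n", s, base64.StdEncoding.EncodeToString([]byte(s)))
	}
	// fred -> ZnJlZA==
	// test -> dGVzdA==
	// help -> aGVscA==
}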
Unfortunately, some characters can look similar when they are printed, such as a zero ('0') and a capital 'O'. Base-64 contains several of these similar-looking characters: 0 (zero), O (capital o), I (capital i) and l (lower case L), along with the non-alphanumeric characters + (plus) and / (slash). The solution is Base-58, used in Bitcoin applications, where we remove the characters which are similar looking. For Base-58, we convert the ASCII characters into an integer value, and then keep dividing by 58, converting each remainder into a Base-58 character. The alphabet becomes:
123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz
If we take an example of 'e': 'e' has a decimal value of 101, so we divide by 58 to get 1 remainder 43. Next, we divide 1 by 58 and get 0 remainder 1. We then take the character at position 1 and the character at position 43, to give 2k. If we now take 'ef', we get 25,958 (102 + 101 × 256), where we move each character up by one byte position. Basically, we take the binary value of the string and then keep dividing by 58, taking the remainder each time. Thus 'ef' is '01100101 01100110'.
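Here is a minimal sketch of this divide-by-58 approach in Go, using math/big for the repeated division (it omits the leading-zero handling that a full Bitcoin-style Base-58 encoder would include):

package main

import (
	"fmt"
	"math/big"
)

// The Base-58 alphabet from above (0, O, I and l are removed).
const base58Alphabet = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

// toBase58 treats the bytes as a big-endian integer and repeatedly
// divides by 58, mapping each remainder to an alphabet character.
func toBase58(data []byte) string {
	num := new(big.Int).SetBytes(data)
	base := big.NewInt(58)
	mod := new(big.Int)
	out := []byte{}
	for num.Sign() > 0 {
		num.DivMod(num, base, mod)
		out = append([]byte{base58Alphabet[mod.Int64()]}, out...)
	}
	if len(out) == 0 {
		out = []byte{base58Alphabet[0]}
	}
	return string(out)
}

func main() {
	fmt.Println(toBase58([]byte("e"))) // 2k, as in the worked example
	fmt.Println(toBase58([]byte("ef")))
}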
Code
Here is the code:
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"os"
)

func main() {

	s := "hello"

	argCount := len(os.Args[1:])
	if argCount > 0 {
		s = os.Args[1]
	}

	// Hash the input with MD5 and SHA-256
	hash := md5.Sum([]byte(s))
	hash2 := sha256.Sum256([]byte(s))

	// Show the input in hex and Base-64
	fmt.Printf("Input:\t\t\t%s\n", s)
	fmt.Printf("Input (hex):\t\t%x\n", []byte(s))
	fmt.Printf("Input (hex):\t\t%s\n", hex.EncodeToString([]byte(s)))
	fmt.Printf("Input (Base64):\t\t%s\n", base64.StdEncoding.EncodeToString([]byte(s)))

	// Show the hashes in hex and Base-64
	fmt.Printf("MD5 (Hex):\t\t%x\n", hash)
	fmt.Printf("MD5 (hex):\t\t%s\n", hex.EncodeToString(hash[:]))
	fmt.Printf("MD5 (Base64):\t\t%s\n", base64.StdEncoding.EncodeToString(hash[:]))
	fmt.Printf("SHA-256 (Hex):\t\t%x\n", hash2)
	fmt.Printf("SHA-256 (hex):\t\t%s\n", hex.EncodeToString(hash2[:]))
	fmt.Printf("SHA-256 (Base64):\t%s\n", base64.StdEncoding.EncodeToString(hash2[:]))
}
A sample run is:
Input:			hello
Input (hex):		68656c6c6f
Input (hex):		68656c6c6f
Input (Base64):		aGVsbG8=
MD5 (Hex):		5d41402abc4b2a76b9719d911017c592
MD5 (hex):		5d41402abc4b2a76b9719d911017c592
MD5 (Base64):		XUFAKrxLKna5cZ2REBfFkg==
SHA-256 (Hex):		2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
SHA-256 (hex):		2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
SHA-256 (Base64):	LPJNul+wow4m6DsqxbninhsWHlwfp0JecwQzYpOLmCQ=