CS50 Week 2 – Arrays

Errors and bugs

There are two main types of error: syntax errors and logic errors. A syntax error is a mistake that stops the code compiling, like a missing semicolon or a forgotten header file. A logic error is when the code compiles and runs but doesn’t do what we intended.

A debugger is a real piece of software that helps us debug code. In CS50 we have debug50. In the real world VS Code has its own debugger built in.

When we are writing code, we can add a bunch of print statements that show essential information about what is going on; the value of i in a loop, for example. This helps with troubleshooting, but it can get quite complicated, and there will be loads of print calls everywhere that we will have to delete eventually. A debugger helps with this. It runs through the code one line at a time, pausing for the user to type input as usual (if there is any input). Any variables are shown in the variables section: you can see them being created, assigned, and then removed when they go out of scope. To advance through the code there is the Step Over button, which advances line by line and loop by loop without going inside function calls, and the Step Into button, which steps into functions you have created. There is also the play button, which runs through the rest of the code at once.
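
As a rough illustration of the print-statement approach described above, here is a minimal sketch. The function name print_column is made up for this example; it prints a column of bricks and uses a temporary printf to show the value of i on each pass through the loop.

```c
#include <stdio.h>

// Hypothetical helper: prints a column of `height` bricks and returns
// how many rows were actually printed, so the count can be checked.
int print_column(int height)
{
    int rows = 0;
    for (int i = 0; i < height; i++)
    {
        // Temporary debugging line: shows the value of i on each pass.
        // It has to be deleted once the loop is behaving.
        printf("i is %i\n", i);
        printf("#\n");
        rows++;
    }
    return rows;
}
```

Calling print_column(3) from main prints three bricks, each preceded by a debugging line. With a debugger we would watch i in the variables section instead and leave the code untouched.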

Compiling

So far when we want to compile our C code into machine code, we use the command ‘make’. It works very conveniently. Under the hood, make runs a compiler called ‘clang’, and we can also run clang ourselves. We can type: clang hello.c and it will output a machine code file named a.out (short for assembler output). But on its own it won’t link in the cs50 library that cs50.h belongs to, and it won’t name the machine code file something useful like hello. When we use make, it does all of those useful things for us. If we want clang to link the cs50 library and name the output file something useful, we can use: clang -o hello hello.c -lcs50. Technically, when we say compile, we are actually referring to four separate processes:
1. Preprocessing
2. Compiling
3. Assembling
4. Linking

Preprocessing handles the lines that start with #. Each #include is replaced with the contents of that header file, which puts the prototypes of the library functions we are using at the top of our code. Compiling takes the preprocessed code and turns it into assembly language. Assembling takes the assembly language and turns it into 0s and 1s, or machine code. Linking is where the machine code for the libraries we used (such as cs50 and stdio) is combined with the machine code from our own code into one program.
Reverse engineering can theoretically happen on machine code to decompile and produce human readable code. In the past this was done using complicated maths, and would still require a lot of manual human work to recreate something useful and human readable. Recently AI has been helping with pattern recognition and context inference.

Data Types

We already know there are different data types. They take up different amounts of memory (these are the typical sizes on the systems we use):
– Bool – 1 byte
– Int – 4 bytes
– Long – 8 bytes
– Float – 4 bytes
– Double – 8 bytes
– Char – 1 byte
– String – dependent on the number of characters
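
The sizes listed above are typical rather than guaranteed, so it can be worth checking them on your own machine. A minimal sketch using the sizeof operator follows; the function name print_sizes is made up for this example, and bool comes from stdbool.h here rather than cs50.h.

```c
#include <stdbool.h>
#include <stdio.h>

// Prints the size in bytes of each basic type on the machine running it.
// sizeof gives a size_t, which is printed with %zu.
void print_sizes(void)
{
    printf("bool:   %zu\n", sizeof(bool));
    printf("int:    %zu\n", sizeof(int));
    printf("long:   %zu\n", sizeof(long));
    printf("float:  %zu\n", sizeof(float));
    printf("double: %zu\n", sizeof(double));
    printf("char:   %zu\n", sizeof(char));
}
```

Only char is guaranteed to be exactly 1 byte; the other numbers can differ between systems.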

When we write this data to memory, we are assigning these bytes to the actual switches on the memory chips inside the computer. Each byte in RAM has an address. The bytes of a single value are stored contiguously; for example an int doesn’t use 4 bytes in random places on the chip, it uses 4 consecutive bytes.
We know that ints store whole numbers and floats store decimal numbers. If we divide one int by another, C does integer division and throws away everything after the decimal point, even if we store the answer in a float. To get a decimal answer we don’t need to change the inputs to floats; we need the answer to be a float and at least one value in the calculation to be a float, either by writing it with a decimal point (3.0) or by casting it, as in (float) 3.
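
To make the difference concrete, here is a small sketch with two made-up functions, divide_truncated and divide_exact. Both take ints and return a float; only the second casts one of the values before dividing.

```c
// Integer division: both values are ints, so the fraction is thrown away
// before the result is converted to a float.
float divide_truncated(int a, int b)
{
    return a / b;
}

// Casting one value to float makes the whole division a float division.
float divide_exact(int a, int b)
{
    return (float) a / b;
}
```

With these definitions, divide_truncated(1, 2) gives 0.0 while divide_exact(1, 2) gives 0.5.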

Arrays

An array is a chunk of contiguous memory that contains multiple values.

int scores[3];
scores[0] = 72;
scores[1] = 73;
scores[2] = 52;

This gives us a chunk of memory where we can store 3 ints, which is 12 bytes. Then we tell the computer what to store in the array. We can also use a variable to set the number of ints in the array and use a loop to store values in the array:

const int N = 3;
int scores[N];
for (int i = 0; i < N; i++)
{
    scores[i] = get_int("Score: ?\n");
}

Notice how N is capitalised; that is the standard naming convention for a constant. Say we need to create a function that takes an array as an input. We must also pass in the length of the array (N above), because a function in C cannot work out the length of an array it is given like we can in Python. Consider:

#include <cs50.h>
#include <stdio.h>

float average(int length, int numbers[]);

int main(void)
{
    const int N = 3;
    int scores[N];
    for (int i = 0; i < N; i++)
    {
        scores[i] = get_int("Score: ?\n");
    }
    printf("Average= %f\n", average(N, scores));
}

float average(int length, int numbers[])
{
    int sum = 0;
    for (int i = 0; i < length; i++)
    {
        sum += numbers[i];
    }
    return sum / (float) length;
}

In the prototype, we now have a function that takes 2 arguments: the length (which in main is N) and the array, as denoted by square brackets. The loop in main gets the numbers and puts them into an array. The printf call below the loop passes the length and the scores array to the average function. The average function sums the array values using a for loop and returns the sum divided by the length. Bingo bango. The average is printed. Here we used a loop to input values into an array, but it is also possible to do the following:
int scores[] = {72, 73, 33};
The compiler will know that the array has 3 values, so we don’t need to specify in the declaration.

Strings

It turns out the string data type we have been using is actually just an array of chars! Check this out:

#include <stdio.h>
#include <cs50.h>

int main(void)
{
    string s = "HI!";
    printf("%i, %i, %i, %i\n", s[0], s[1], s[2], s[3]);
}

This prints out: 72, 73, 33, 0. These are the ASCII numbers that represent the characters in “HI!”, followed by a 0 which terminates the string. The 0 is also written as \0, an escape sequence just like \n, and it means a byte of eight 0 bits. This is called the null character, or NUL. We can store strings inside an array:

#include <stdio.h>
#include <cs50.h>

int main(void)
{
    string words[2];
    words[0] = "HI!";
    words[1] = "BYE!";

    printf("%s\n", words[0]);
    printf("%s\n", words[1]);
}

This prints HI! followed by BYE! on a new line. Since a string is just an array, by storing a string inside an array, we are actually storing an array inside an array. This means we can index into the outer array and into the array within it:

#include <stdio.h>
#include <cs50.h>

int main(void)
{
    string words[2];
    words[0] = "HI!";
    words[1] = "BYE!";

    printf("%c%c%c\n", words[0][0], words[0][1], words[0][2]);
    printf("%c%c%c%c\n", words[1][0], words[1][1], words[1][2], words[1][3]);
}

This too prints HI! followed by BYE! on a new line.
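
The two programs above spell out every index by hand. As a sketch of the same idea with loops, the made-up function print_words below walks each word character by character and stops at the terminating NUL. It writes the type as const char * instead of string, which is the plain C spelling of what cs50.h calls string; that is a choice made for this example so it works without the CS50 library.

```c
#include <stdio.h>

// Prints each word on its own line, one character at a time, and returns
// the total number of characters printed (not counting the newlines).
int print_words(int count, const char *words[])
{
    int total = 0;
    for (int i = 0; i < count; i++)
    {
        // words[i] is one string; words[i][j] is one char inside it.
        for (int j = 0; words[i][j] != '\0'; j++)
        {
            printf("%c", words[i][j]);
            total++;
        }
        printf("\n");
    }
    return total;
}
```

Given an array holding "HI!" and "BYE!" and a count of 2, it prints both words and returns 7, because the inner loop never counts the NUL.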

We can ascertain the length of a string (which is an array) using the following:

#include <stdio.h>
#include <cs50.h>

int main(void)
{
    string name = get_string("Name: ");

    int n = 0;
    while (name[n] != '\0')
    {
        n++;
    }
    printf("%i\n", n);
}

This sets an int variable n to 0. The while loop looks at the first character (index 0) of the string name, and if it is not equal to NUL (notice the single quote marks, because it is a character and not a string) it increments n and moves on to the second character (index 1), and so on. When the character does equal NUL the loop ends and the program prints the value of n. But there are ready-made solutions in other libraries, specifically the string.h header file. This header file declares a function that returns the length of a string: strlen. Under the hood strlen works a lot like the loop above, and using it shortens our program to:

#include <stdio.h>
#include <cs50.h>
#include <string.h>

int main(void)
{
    string name = get_string("Name: ");
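    printf("%i\n", (int) strlen(name));
}

Notice how we need to include the string.h header file. We can also nest function calls, so we can put strlen(name) straight inside the printf call. strlen returns an unsigned size type (size_t) rather than an int, so we cast the result to int to match the %i format code.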
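
To show how the counting loop turns into a reusable function, here is a hand-written sketch. The name string_length is made up for this example; the real strlen in the C library may be implemented differently and returns a size_t rather than an int.

```c
// Hand-written stand-in for strlen: counts the characters up to, but not
// including, the terminating NUL.
int string_length(const char s[])
{
    int n = 0;
    while (s[n] != '\0')
    {
        n++;
    }
    return n;
}
```

string_length("HI!") returns 3, and an empty string returns 0 because its very first character is already the NUL.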

Now consider the following:

#include <stdio.h>
#include <cs50.h>
#include <string.h>

int main(void)
{
    string s = get_string("Input:  ");
    printf("Output: ");
    for (int i = 0; i < strlen(s); i++)
    {
        printf("%c", s[i]);
    }
    printf("\n");
}

This simply prints the input character by character. But there is an inefficiency. The boolean expression i < strlen(s) is evaluated on every pass through the loop, so strlen is called again each time. Once s is in memory it isn’t going to change, so we don’t need to keep asking for the length of s; we only need to ask once. We could call strlen before the loop and store the length in its own variable. Or we could do something new and cool: we could declare another variable within the loop alongside i:

#include <stdio.h>
#include <cs50.h>
#include <string.h>

int main(void)
{
    string s = get_string("Input:  ");
    printf("Output: ");
    for (int i = 0, n = strlen(s); i < n; i++)
    {
        printf("%c", s[i]);
    }
    printf("\n");
}

Notice now that there are 2 variables declared in the first part of the loop, i and n, separated by a comma, not a semicolon. Both must share one data type, which is written once at the start, so you don’t (and can’t) write int i = 0, int n = strlen(s). If you need 2 or more data types, you have to pull a variable out of the loop and declare it earlier.

Let’s look at how to change between upper and lower case. In the ASCII table, upper case letters are in the range 65 to 90 and lower case letters 97 to 122. The gap between the two versions of a letter is 32. So to convert capital A to lower case a we add 32, and to go from lower case to upper case we subtract 32.

#include <stdio.h>
#include <cs50.h>
#include <string.h>

int main(void)
{
    string s = get_string("Before: ");
    printf("After:  ");
    for (int i = 0, n = strlen(s); i < n; i++)
    {
        if (s[i] >= 'a' && s[i] <= 'z')
        {
            printf("%c", s[i] - 32);
        }
        else
        {
            printf("%c", s[i]);
        }
    }
    printf("\n");
}

Look at the if statement: it checks whether the character is between lower case a and lower case z inclusive. We can write the characters themselves in single quotes; we don’t have to use their ASCII numbers. Inside the printf call we do use a number, subtracting 32 from the character. This converts the input to upper case. But there is a better way of doing this. Someone has already done this for us and included it in the ctype.h library. This library has a bunch of functions pertaining to characters in ASCII.

#include <stdio.h>
#include <cs50.h>
#include <string.h>
#include <ctype.h>

int main(void)
{
    string s = get_string("Before: ");
    printf("After:  ");
    for (int i = 0, n = strlen(s); i < n; i++)
    {
        if (islower(s[i]))
        {
            printf("%c", toupper(s[i]));
        }
        else
        {
            printf("%c", s[i]);
        }
    }
    printf("\n");
}

First, we have included the new library. Second, we have replaced our maths with the islower() and toupper() functions. Cool beans. However, if we look at the documentation for the ctype library, we would see that toupper() already leaves anything that isn’t a lowercase letter alone and just returns it unchanged, so we can tighten our code up as follows:

#include <stdio.h>
#include <cs50.h>
#include <string.h>
#include <ctype.h>

int main(void)
{
    string s = get_string("Before: ");
    printf("After:  ");
    for (int i = 0, n = strlen(s); i < n; i++)
    {
        printf("%c", toupper(s[i]));
    }
    printf("\n");
}

NICE!
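
For comparison, here is roughly what hand-written, ASCII-only versions of these conversions could look like. The names to_upper_ascii and to_lower_ascii are made up for this example, and this is only a sketch: the real toupper in ctype.h takes and returns an int and may be implemented quite differently. Writing the gap as 'a' - 'A' avoids having to remember the number 32.

```c
// Converts a lowercase ASCII letter to uppercase; anything else is
// returned unchanged.
char to_upper_ascii(char c)
{
    if (c >= 'a' && c <= 'z')
    {
        return c - ('a' - 'A'); // 'a' - 'A' is 32 in ASCII
    }
    return c;
}

// Converts an uppercase ASCII letter to lowercase; anything else is
// returned unchanged.
char to_lower_ascii(char c)
{
    if (c >= 'A' && c <= 'Z')
    {
        return c + ('a' - 'A');
    }
    return c;
}
```

Because each function returns other characters unchanged, it can be called on every character of a string without an if around the call, just like toupper in the last program.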

int main(void)

So far we have been writing code with int main(void). Now we will find out why. main is the function that is always called when the program starts. We can create our own functions too, which may or may not be executed, but main will always be executed.

The (void) we have been using so far means that the program takes no command line arguments. However, if we want to pass data into our program from the command line (just like we do with make, clang, or even cd), we use the following syntax: int main(int argc, string argv[]). This gives the main function exactly 2 parameters, argc and argv, which between them capture however many command line arguments the user types. argc (argument count) is an integer; argv (argument vector) is an array of strings, holding the program name followed by the arguments typed on the command line. Vector here is the mathematical term for a 1 dimensional list of values, which is what we now just call an array. Now when we run a program (“testprogram”) that has the above main, we can write some arguments after its name: ./testprogram input1 input2. In this example I have written 2 inputs. argc is set automatically, and in this case it equals 3. We shall see why:
argv[0] = ./testprogram
argv[1] = input1
argv[2] = input2
Position 0 of the array is always the program name as it was typed (here ./testprogram); the positions after 0 contain the arguments. argc is 3 because there are 3 array elements. Now we can use this information in our program. Let’s say we are writing a simple ‘hello name’ program. We can write some if conditionals to handle the inputs. If argc == 1 then we know the user hasn’t typed in any arguments. If argc == 2 we know the user has typed in one word, maybe their first name. If argc == 3 then we can assume the user has typed in a first and last name. If argc is greater than 3 we know the user has typed in more than 2 names. We can write programs to respond to all sorts of inputs. Consider the ls shell command, which lists the contents of the current working directory. This command also has flags we can add to modify the output, some of these being -l and -a. The code looks for these flags by comparing the arguments to -l and -a and modifies the output as necessary.
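
As a sketch of how a program might look for a flag among its arguments, the made-up function has_flag below compares each argument with a given flag using strcmp from string.h, which returns 0 when two strings are equal. This is an illustration only, not how ls is actually written, and it spells string argv[] as char *argv[], the plain C form, so that it works without cs50.h.

```c
#include <string.h>

// Returns 1 if any argument after the program name equals flag, else 0.
// The loop starts at 1 because argv[0] is the program name.
int has_flag(int argc, char *argv[], const char flag[])
{
    for (int i = 1; i < argc; i++)
    {
        if (strcmp(argv[i], flag) == 0)
        {
            return 1;
        }
    }
    return 0;
}
```

Inside main(int argc, char *argv[]) we could then write if (has_flag(argc, argv, "-l")) and change the output accordingly.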

Now let’s talk about int. The main function returns this int. This int is an exit status. These are kind of like error codes. We are all familiar with error 404 not found when trying to access a website and the page isn’t there anymore. These codes don’t mean anything to normal people, but they are useful for the engineers that built the program. They can use that code and look it up in a table or documentation to see exactly what the problem is. We can use this exit status in our own programs too. By convention an exit status of 0 means no problems. This means we can use the numbers from 1 upwards to describe any errors that we may encounter. Look at the following:

#include <stdio.h>
#include <cs50.h>

int main(int argc, string argv[])
{
    if (argc != 2)
    {
        printf("Missing command-line argument\n");
        return 1;
    }
    printf("hello, %s\n", argv[1]);
    return 0;
}

If the user doesn’t type exactly one argument, “Missing command-line argument” is displayed on the terminal and the program returns an exit status of 1. The shell stores the exit status of the last program it ran in a special variable named ?, and we can use that. After the program ends, we can type echo $? into the terminal. echo is like the print function, $ tells the shell to give us the value stored in a variable, and ? is the name of the variable that holds the exit status.

Cryptography

Cryptography is very important for keeping things secure. Passwords and credentials shouldn’t be broadcast without being encrypted, otherwise attackers can see them in plain text. We use encryption to turn this plaintext into something that looks like garbage, called ciphertext. So there are algorithms that turn plaintext into ciphertext. However, we need to be able to decrypt ciphertext back into plaintext for it to be useful. This is where a secret key comes in. The key is the extra piece of information the algorithm uses to encrypt and decrypt.

A simple scheme is to say that plaintext A becomes ciphertext B, plaintext B becomes ciphertext C, and so on. The key here would be plus 1. This makes the message a little harder for an attacker to read, but in reality not very hard at all: all anyone has to do is subtract 1 from each character. Another simple letter substitution cipher is called rot13, or rotate by 13 places. You simply shift each letter 13 places. It is quite brilliant in the way you decrypt with rot13; you just apply the same rot13 again. The alphabet is 26 characters long, so if you rotate it by 13 you get the ciphertext, and if you rotate it by 13 again you get the original message. Again, this is not really secure, as anyone can crack it super quick. These two encryption methods fall under the “Caesar cipher” heading. They substitute each character for another character further along the alphabet.
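
The shifting described above can be sketched as a single function. The name rotate and its parameters are made up for this example. It shifts letters by key places, wraps around at the end of the alphabet, and leaves everything else alone. A negative key shifts backwards, which is how a plus 1 message would be decrypted.

```c
#include <ctype.h>

// Shifts a letter by key places, wrapping from Z back to A (and z to a).
// Characters that are not letters are returned unchanged.
char rotate(char c, int key)
{
    if (isupper((unsigned char) c))
    {
        // Work with positions 0 to 25; the extra + 26 keeps the result
        // positive when key is negative.
        return 'A' + ((c - 'A' + key) % 26 + 26) % 26;
    }
    if (islower((unsigned char) c))
    {
        return 'a' + ((c - 'a' + key) % 26 + 26) % 26;
    }
    return c;
}
```

With a key of 1, rotate('A', 1) gives 'B' and rotate('Z', 1) wraps round to 'A'. With a key of 13, applying rotate twice gives back the original letter, which is the rot13 trick.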