Projects for Statistical Computing

Basic coding with Python
Due 06/08/2021

Project 1: Create a Python file with a function that takes a positive integer m as its input and finds the first m primies (singular: primy). A primy is defined as follows. First, 2 and 3 are primies. An integer greater than 3 is a primy if it cannot be written as the product of 2 or 3 primies (that are not necessarily distinct). Thus, 4 is not a primy, because it can be written as 2*2. Similarly, 12 is not a primy, because it can be written as 2*2*3. But 16 is a primy, because it cannot be written as the product of 2 or 3 primies: its only such factorizations are 2*8, 4*4, and 2*2*4, and neither 4 nor 8 is a primy. The first several primies are 2, 3, 5, 7, 11, 13, 16, 17, 19. In this and other projects, you should submit the file and also a set of console input/output examples showing that it works. The function should return an error message if the input is not a positive integer (e.g., text, a negative number, or something like 2.8). Note that your output is the first m primies, *not* the set of primies less than m. Example of output: function applied to m=19: [2, 3, 5, 7, 11, 13, 16, 17, 19, 23, 24, 29, 31, 36, 37, 40, 41, 43, 47]
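
One possible way to organize the primy test is sketched below. It is a minimal sketch rather than a reference solution, and the names `first_primies` and `_is_product_of_primies` are hypothetical, not part of the assignment. It relies on the observation that every factor of a candidate n is smaller than n, so only the primies already found need to be checked.

```python
def _is_product_of_primies(n, primies, primy_set):
    """Return True if n is a product of 2 or 3 already-known primies.

    `primies` is the ascending list of primies below n; `primy_set` holds
    the same values for fast membership tests. (Hypothetical helper.)
    """
    for a in primies:
        if a * a > n:            # the smallest factor never exceeds sqrt(n)
            break
        if n % a != 0:
            continue
        q = n // a
        if q in primy_set:       # n = a * q, a product of 2 primies
            return True
        for b in primies:
            if b < a:            # keep the factors ordered: a <= b <= c
                continue
            if b * b > q:
                break
            if q % b == 0 and (q // b) in primy_set:
                return True      # n = a * b * c, a product of 3 primies
    return False


def first_primies(m):
    """Return the first m primies, or an error message for invalid input."""
    # bool is a subclass of int in Python, so True/False are rejected explicitly.
    if isinstance(m, bool) or not isinstance(m, int) or m <= 0:
        return "Error: the input must be a positive integer."
    primies = [2, 3][:m]
    primy_set = set(primies)
    n = 4
    while len(primies) < m:
        if not _is_product_of_primies(n, primies, primy_set):
            primies.append(n)
            primy_set.add(n)
        n += 1
    return primies
```

Called as `first_primies(19)`, this sketch returns the 19-element list shown in the example above, while `first_primies(2.8)` and `first_primies("abc")` return the error string.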

GUIs and interactions
Due 06/15/2021

Project 2: Make a GUI that allows the user to input an integer k into a text box, and then the GUI prints out the first k fourceful prime pairs, which are pairs of primes that differ by 4 (e.g., (3,7), (7,11), and (13,17) are the first three fourceful prime pairs). Use colors and images to make your GUI look interesting. If the user inputs something that is not a positive integer, the GUI should display an error message.
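
The computation behind the GUI can be kept separate from the window itself. The sketch below assumes tkinter as the GUI toolkit (the assignment does not require a particular one), and all function names are hypothetical. It shows only the pair search, the input check, and a bare-bones window; colors and images are left out.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small inputs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def fourceful_pairs(k):
    """Return the first k pairs (p, p + 4) in which both numbers are prime."""
    pairs = []
    p = 2
    while len(pairs) < k:
        if is_prime(p) and is_prime(p + 4):
            pairs.append((p, p + 4))
        p += 1
    return pairs


def parse_positive_int(text):
    """Return the positive integer typed by the user, or None if invalid."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if value > 0 else None


def run_gui():
    """Open a minimal tkinter window wired to the functions above."""
    import tkinter as tk  # imported here so the functions above work without a display

    root = tk.Tk()
    root.title("Fourceful prime pairs")
    entry = tk.Entry(root)
    entry.pack()
    output = tk.Label(root, text="Enter a positive integer k.", wraplength=400)
    output.pack()

    def on_click():
        k = parse_positive_int(entry.get())
        if k is None:
            output.config(text="Please enter a positive integer.", fg="red")
        else:
            output.config(text=str(fourceful_pairs(k)), fg="black")

    tk.Button(root, text="Find pairs", command=on_click).pack()
    root.mainloop()
```

Calling `run_gui()` opens the window. Keeping `fourceful_pairs` and `parse_positive_int` outside the GUI code makes them easy to check from the console.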

CSV input and output, numpy, pandas
Due 06/22/2021

Project 3: Create a Python file with a function that generates one 4x4 matrix whose entries are random integers between -7 and 5, inclusive. The script should have another function that asks the user how many computations should be made. After the user inputs a positive integer, say for example 37, the Python script should generate 37 of these matrices and create a CSV data file with 37 rows, one per matrix, where the first column is the first-row, first-column entry of the matrix, the second column is the trace of the matrix, and the third column is the determinant of the matrix. Note that the matrices themselves are not stored in the CSV file. Next, the script should output (in the console) the following information: the mean, median, and standard deviation of each column in your CSV file (9 values total), the percentage of the matrices whose trace is less than 0, and the percentage of the matrices whose determinant is less than 10 in absolute value. (So this is a total of 11 outputs of information in the console, in addition to the CSV file.) The script should produce an error message if the user does not input a positive integer.
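
The per-matrix quantities and the 11 summary values can be sketched as follows. This is a standard-library illustration only: the function names and CSV headers are hypothetical, the trace and determinant are assumed to be taken of the same 4x4 matrix, and the population standard deviation is an arbitrary choice. In a NumPy/pandas solution, `numpy.trace` and `numpy.linalg.det` would replace the hand-written helpers.

```python
import csv
import random
import statistics


def random_matrix(size=4, low=-7, high=5):
    """Return a size x size list of lists of random integers in [low, high]."""
    return [[random.randint(low, high) for _ in range(size)] for _ in range(size)]


def trace(matrix):
    """Sum of the diagonal entries."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def determinant(matrix):
    """Determinant by cofactor expansion along the first row (exact for integers)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * entry * determinant(minor)
    return total


def make_rows(count):
    """One (first entry, trace, determinant) tuple per generated matrix."""
    rows = []
    for _ in range(count):
        m = random_matrix()
        rows.append((m[0][0], trace(m), determinant(m)))
    return rows


def write_rows(path, rows):
    """Write the rows to a CSV file; the header names are hypothetical."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["entry_11", "trace", "determinant"])
        writer.writerows(rows)


def summarize(rows):
    """Return the 9 column statistics and the 2 percentages in a dict."""
    summary = {}
    for name, column in zip(("entry_11", "trace", "determinant"), zip(*rows)):
        # pstdev is the population standard deviation (an assumption here).
        summary[name] = (statistics.mean(column),
                         statistics.median(column),
                         statistics.pstdev(column))
    n = len(rows)
    summary["pct_trace_negative"] = 100 * sum(r[1] < 0 for r in rows) / n
    summary["pct_abs_det_below_10"] = 100 * sum(abs(r[2]) < 10 for r in rows) / n
    return summary
```

A typical run would call `rows = make_rows(37)`, then `write_rows("matrices.csv", rows)` (a hypothetical file name), and print the values returned by `summarize(rows)`.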

Excel input and output, tkinter GUIs and plots
Due 06/29/2021

Project 4: Create a Python script that does the following. It should have a function that chooses two points at random in the unit square { (x,y): 0 < x < 1, 0 < y < 1 }. The script should start with a GUI with a text box, and the user should input a positive integer n. After the user pushes a button, the script should call the function n times and then create an xlsx file with five labeled columns 'x1', 'y1', 'x2', 'y2', 'd', followed by n rows. Each row should have 5 entries: the x- and y-coordinates of the two points, followed by the distance d between the two points. Also, the GUI should then display 2 plots. The first plot should graph each pair of points and the line segment connecting them. The second plot should be a histogram of all the values of d from your set of n pairs of points.
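
The data-generation step can be sketched independently of the GUI, the spreadsheet, and the plots. The function names below are hypothetical. Writing the xlsx file (for example with pandas `DataFrame.to_excel`, which needs an Excel writer engine such as openpyxl) and drawing the plots are not shown.

```python
import math
import random


def random_pair():
    """Return (x1, y1, x2, y2, d) for two random points in the unit square."""
    # random.random() samples from [0, 1); the chance of drawing exactly 0.0
    # is negligible, so this is treated as the open unit square here.
    x1, y1, x2, y2 = (random.random() for _ in range(4))
    d = math.hypot(x2 - x1, y2 - y1)   # Euclidean distance between the points
    return x1, y1, x2, y2, d


def make_table(n):
    """Return the column labels and n rows of generated data."""
    header = ("x1", "y1", "x2", "y2", "d")
    return header, [random_pair() for _ in range(n)]
```

Every distance lies between 0 and sqrt(2), the length of the diagonal of the square, which is a quick sanity check for the histogram.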

Investigations with R
Due 07/20/2021

Project 5: Make an R Script that does the following:
1. All of this should be in a shiny GUI, where the user selects an integer k between 1 and 100, inclusive. You may choose to use a slider or dropdown menu for this part. There should be some text written above the selection part which tells the user what to do. After the selection is made, the script does the following.
2. First, it generates a CSV file with the following data, in 200 rows. For each row, the first column should be a randomly chosen prime number between 1 and 4000. The second column should be a floating point number that is taken from a normal distribution whose mean is k times the given prime number and whose standard deviation is 10. You should make up some appropriate column headings for this CSV file.
3. The GUI displays the plot of all the points (x,y) and also displays the least squares fit line on the same plot. It should also show the equation of the line drawn.
4. The GUI also displays the plot of all the points (x,y) such that the prime x is less than 3000 and also displays the new least squares fit line (restricted to that data) on the same plot. Again, it should also show the equation of the line drawn.
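
The computation in steps 2 through 4 is illustrated below in Python rather than R, purely to show the arithmetic; it is not Shiny code, and the function names are hypothetical. The least squares line is computed from the usual closed-form slope and intercept, assuming the x values are not all equal.

```python
import random


def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


def generate_rows(k, n_rows=200, limit=4000, sd=10):
    """Rows (prime, y) with y drawn from a normal distribution N(k * prime, sd)."""
    primes = primes_up_to(limit)
    rows = []
    for _ in range(n_rows):
        p = random.choice(primes)
        rows.append((p, random.gauss(k * p, sd)))
    return rows


def least_squares_line(points):
    """Return (slope, intercept) of the least squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

The restricted fit in step 4 applies the same function to `[row for row in rows if row[0] < 3000]`. Because the noise is small compared with the spread of the primes, both fitted slopes should come out very close to k.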

SQL Practice
Due 07/27/2021

Project 6: For this exercise, you will determine the sequence of SQL commands needed to do the job, and then you should submit them to me inside a text file. Your project should do the following. Create a MySQL database, and create a table with the following columns. The first column should give a first name, chosen randomly from a list of 5 first names of people in your family (or extended family). The second column, y, should be computed as the number of characters in the first name. The third column should be a randomly chosen number between y and y^2. You should generate 100 rows in your table. Next, generate a second table with all of the data in the first table, except that you have removed all of the data corresponding to one particular first name, and you have sorted the table rows according to the third column values, in descending order. Find the average of the second column and the maximum value M of the third column. Generate a new table from the first table, only including the rows with third column value greater than M/2. Finally, generate a list of all the distinct numbers in the second column of the first table.

Final Project with Python and R
Due 08/06/2021

Final Project: This final project will involve an investigation of some machine learning algorithms that are used in predictive modeling of data. You will be making two scripts, one in Python and one in R, and both scripts should use a GUI to display the results nicely. To see examples of what packages and code should be used in your projects, go to these websites:
simple python machine-learning project
simple R machine-learning project
1. First, go to kaggle.com, and register for a free account to access datasets. After you join, click on Datasets from the menu, and find a dataset that you want to do your project on, and download it.
2. Find a parameter in the dataset that you would like to predict from the other values. The user decides what percentage of the dataset should be the part that the machine learning model should use as the training set. The actual rows used in the training set should be randomly chosen every time the program runs. Test at least 3 different machine learning models on the data, and evaluate the accuracy (similar to what the websites above do).
3. Your GUI should have a slider or textbox where the user selects the percentage of training data from the whole dataset. After the selection, the script displays the results of the test. A graph should be shown. Since the training set is randomly chosen, each time a selection is made, a different result should appear.
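
The random train/test split described in steps 2 and 3 can be sketched with the standard library alone. The function names are hypothetical, and this is not taken from the linked example projects. Libraries such as scikit-learn provide an equivalent (`train_test_split`), which may be more convenient in practice.

```python
import random


def split_rows(rows, train_percent):
    """Randomly split rows into (train, test); train gets train_percent percent."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be strictly between 0 and 100")
    shuffled = list(rows)
    random.shuffle(shuffled)          # unseeded, so every run gives a new split
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)
```

Because the shuffle is not seeded, repeating `split_rows(rows, 80)` on the same data gives a different training set each time, which is why the displayed results change from one selection to the next.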

Download and access instructions

To download Python and Anaconda (with Spyder and Jupyter): https://www.anaconda.com/products/individual
To download R and RStudio: First install R at https://cran.r-project.org/mirrors.html (follow instructions to get it completely installed). Then go to https://www.rstudio.com/products/rstudio/, and install the RStudio Desktop.
To download MySQL Community Server and MySQL Workbench: Go to https://dev.mysql.com/downloads/mysql/, and download and install the version of MySQL Community Server for your operating system. Then go to https://dev.mysql.com/downloads/workbench/ and install MySQL Workbench. (Choose Complete installation options.)
Logging in to TUC 353 computers remotely (to access programs like SAS, Tableau, Matlab, etc.):
1. Establish a VPN-STU.TCU.EDU connection as per these instructions.
2. Open a web browser and navigate to https://labaccess.tcu.edu.
3. Choose a computer inside TUC 353 (and then click on it). If you have done this before, it will be faster if you choose the same number as before.
4. A file will download to a spot on your computer. Then double-click on it. Do whatever log-ins or installs are necessary.
5. You will now be running the PC in TUC 353.
6. You can then start working. Before logging off, you may want to save your files to your U: or M: drive, or email them to yourself, and then delete them from the TUC 353 computer.
7. *Very Important* Be sure to log off the lab computer when you are finished.
8. Disconnect from VPN-STU.TCU.EDU.