I know what you are all thinking...finally!
Okay let's check out the basics of Python.
I am typing this inside a Jupyter notebook, which provides a markdown/programming environment similar to R Markdown.
First let us discuss the basics of Python. Here are our standard types:
3
type(3)
3.0
type(3.0)
type('c')
type('ca')
type("ca")
True
type(True)
type(T) # Fails: T is not defined, unlike in R
type(true) # Fails: Python's booleans are capitalized (True/False)
type(x=3) # Fails: inside a call, x=3 is a keyword argument; assignment is a statement in Python and does not return a value, unlike in C/C++/R
x=2 #assignment
x
x==3 #Boolean
Okay, let's check out some compound data structures.
y = [4.5,x, 'c'] #lists can contain different types
type(y)
y[0] #zero indexing
y[1]
y[-1] #last entry
y[-2]
len(y)
y = y + ['a','b','d']
y
y[1:3] #Slicing!
y[1:4]
y[1:6:2] # jump by twos
y[:] #copy entire list
z = y
z[1]=3
y
z = y[:]
z[1]=2
z == y
z[1]
y[1]
z = y[::-1] #Reverse order
z
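To summarize the aliasing behavior above: a quick sketch (using fresh names p and q so we don't disturb y and z) showing that plain assignment binds a second name to the same list, while a full slice makes a shallow copy. The `is` operator tests object identity, `==` tests equal contents.

```python
# Assignment binds a second name to the SAME list; a full slice makes a shallow copy.
p = [4.5, 2, 'c']
q = p          # alias: both names refer to one underlying list
print(q is p)  # True: identical objects
q = p[:]       # shallow copy: a new list with the same elements
print(q is p)  # False: distinct objects
print(q == p)  # True: equal contents
```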
Now let us look at dictionaries.
a = {'x' : 1, 'y' : z, 'z' : 'entry'}
a
a['x']
a['y'][3]
a.values()
a.keys()
'abc'+'efg'
'abc'[2]
'abcdef'[-2]='x' # strings are immutable (as usual)
'abc'.upper()
There are also tuples, which are immutable.
x = (1,2,3)
x
type(x)
x[2]
x[-2]=3 # Fails
Okay, now some of my favorite features: list and dictionary comprehensions, which allow us to use syntax similar to the mathematician's set-builder notation.
w = [ a**2 for a in range(10)]
w
Note that exponentiation in python is done with the symbol **
Also note that the range function works a bit like slicing.
[a for a in range(1,20,2)]
We can also select subsets:
[a for a in range(1,20) if a % 2 != 0]
Just for fun let's see how to build a simple encrypter. First let us import a variable of printable characters from the module string and denote it by chars.
from string import printable as chars
chars
lc = len(chars); lc
codebook = {chars[i] : chars[(i+lc//2)%lc] for i in range(lc)}
There are a couple of new things going on in this line, so let's unpack it. First we have a dictionary comprehension, which is written like a list comprehension. We have also used the integer (floor) division operator // and the modulus operator %.
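A quick sketch of the two operators and the wrap-around arithmetic in the codebook (using n = 100, since len(printable) is 100 in CPython; the names here are just for illustration):

```python
# // is floor (integer) division; % is the remainder (modulus).
print(17 // 5)  # 3
print(17 % 5)   # 2
# In the codebook, (i + lc//2) % lc shifts index i halfway around and wraps.
n = 100  # len(string.printable) in CPython
print((95 + n // 2) % n)  # 45: index 95 shifted by 50 wraps around to 45
```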
codebook['a']
codebook['Y']
Now we are going to use one of the three core functions from functional programming (map, reduce, filter), namely reduce. It walks through a list item by item, applying a two-argument function whose first argument is an accumulated value and whose second is the current list element. We only need this two-argument function once, so we will define it as an anonymous (lambda) function.
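Before applying reduce to the cipher, here is a minimal toy example (the variable names total and joined are just for illustration) showing the accumulator pattern on numbers and on strings:

```python
from functools import reduce

# reduce folds a two-argument function over a sequence, carrying an accumulator.
# The third argument is the accumulator's starting value.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
print(total)   # 10, i.e. ((((0 + 1) + 2) + 3) + 4)

joined = reduce(lambda acc, s: acc + s, ['a', 'b', 'c'], '')
print(joined)  # 'abc'
```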
Finally, it is important to note the absence of brackets marking the start and end of the function body. Python delimits blocks with indentation instead. This is very unusual, but in Python whitespace has meaning, and if you use inconsistent indentation your program will not run.
from functools import reduce
def encode_decode(s):
    return reduce(lambda x, y: x + codebook[y], s, "")
encrypted = encode_decode('This is a secret message'); encrypted
encode_decode(encrypted)
Unlike R, Python was not designed for statistical analysis; it was designed as a general purpose, high-level programming language. However, one of Python's strongest features is a truly vast collection of easy-to-use libraries (called modules) that drastically simplify our lives.
Two core pieces of R functionality are missing. We do not have an analogue of vectors (efficient lists containing only one type of element), so we are also lacking matrices and tensors, which are just fancier vectors. We are also lacking the data frame abstraction, which plays a central role in R.
Vector functionality comes from numpy, which is usually imported as np. It provides fast vectors and vectorized operations and should be used instead of lists for numerical data whenever possible. Data frames come from pandas, which is usually imported as pd. Pandas builds on numpy and is part of the scipy ecosystem, which includes many numerical libraries covering more advanced statistics and linear algebra. The scipy ecosystem also includes matplotlib, a complex but flexible plotting library. I should also mention scikit-learn, a standard machine learning library (although surprisingly limited), which is built on scipy.
import numpy as np
a=np.arange(10)
np.sin(a) # vectorized operation
A useful numpy feature (although it takes some getting used to) is broadcasting, which is similar to R's recycling: when shapes differ, numpy automatically stretches an array of one shape to match another according to a fixed set of rules. Broadcasting can easily lead to bugs and confusion, so try to be careful.
a*2
list(range(10))*2
a*a
a.dot(a) # dot product
a.shape
b=a.reshape(10,1)
b
b.T
b.T.shape
c=np.dot(a,b); c
c.shape
d=np.zeros(shape=(2,3)); d
e = np.ones_like(d); e
f = np.arange(24).reshape(2, 3, 4) # dtype=np.int is deprecated; arange + reshape is the idiomatic construction
f
f[1,2,3]
f[1,1:3,3]
f[:,1:3,3]
for x in f:
    print(x)
for outer in f:
    for inner in outer:
        for really_inner in inner:
            print(really_inner)
import pandas as pd
df = pd.read_csv("crypto-markets.csv")
df.head()
df.symbol.unique()
len(df.symbol.unique())
df['symbol'].unique()
small_df = df.head(25)
small_df
small_df[['date', 'close']]
small_df[4:6]
small_df[4] # fails
small_df.loc[4]
small_df.loc[4,"open"]
small_df.iloc[4,4]
Pay attention to the syntax for referencing. Think of the loc and iloc objects as dictionaries which pull up the relevant pieces of the data frame and allow slicing notation. The difference is that loc selects by label (and its slices are inclusive on both ends) while iloc selects by integer position (with the usual end-exclusive slices).
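A tiny illustrative frame (hypothetical data, not the crypto data, stored under the fresh name demo) makes the loc/iloc contrast concrete, including the inclusive versus exclusive slice endpoints:

```python
import pandas as pd

# Integer labels 10, 20, 30 make the label/position distinction visible.
demo = pd.DataFrame({'open': [1.0, 2.0, 3.0], 'close': [1.5, 2.5, 3.5]},
                    index=[10, 20, 30])
print(demo.loc[10:20])        # label-based slice: BOTH endpoints included (2 rows)
print(demo.iloc[0:2])         # position-based slice: end excluded (same 2 rows here)
print(demo.loc[20, 'close'])  # 2.5, selected by label
print(demo.iloc[1, 1])        # 2.5, selected by position
```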
type(small_df.loc[4:4])
type(small_df.loc[4])
df['date'] = pd.to_datetime(df['date'])
df['date'].head()
Select just a few of the symbols.
mask = df['symbol'].isin(df['symbol'].unique()[1:5])
trim_df = df[mask]
from ggplot import *
gg = ggplot(aes(x='date',y='close',color='symbol'),data = trim_df) + geom_line() + ggtitle("Cryptocurrency prices") + scale_y_log() + \
scale_x_date() + ylab("Closing price (log-scale)") + xlab("Date")
gg.show()