
Review of Python Courses (Part 22)

Posted by Mark on January 29, 2021 at 07:31 | Last modified: February 9, 2021 13:29

In Part 21, I summarized my Datacamp courses 62-64. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #65 was Reshaping Data with pandas. This course covers:

Wide and long formats
Reshaping using pivot method
Pivot tables
Reshaping with melt
Wide to long function
Working with string columns
Stacking dataframes
Unstacking dataframes
Working with multiple levels
Handling missing data
Reshaping and combining data
Transforming a list-like column
Reading nested data into a dataframe (from pandas import json_normalize)
Dealing with nested data columns
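
To make the wide/long idea concrete, here is a minimal sketch of melt, pivot, and json_normalize. The tickers, years, and quote fields are made up for illustration and do not come from the course.

```python
import pandas as pd
from pandas import json_normalize

# Hypothetical wide-format data: one row per ticker, one column per year.
wide = pd.DataFrame({
    "ticker": ["AAA", "BBB"],
    "2019": [10.0, 20.0],
    "2020": [11.0, 19.0],
})

# Wide -> long: each (ticker, year) pair becomes its own row.
long = wide.melt(id_vars="ticker", var_name="year", value_name="close")

# Long -> wide: pivot reverses the melt.
back = long.pivot(index="ticker", columns="year", values="close")

# Nested records (e.g. parsed JSON) are flattened into dotted column names.
records = [{"ticker": "AAA", "quote": {"bid": 9.9, "ask": 10.1}}]
flat = json_normalize(records)

print(long)
print(back)
print(flat)
```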

My course #66 was Building Data Engineering Pipelines in Python. For some reason, these data engineering courses did not sit well with me and much of this sailed over my head. This course covers:

Components of a data platform
Introduction to data ingestion with Singer
Running an ingestion pipeline with Singer
Basic introduction to PySpark (from pyspark.sql import SparkSession)
Cleaning data
Transforming data with Spark
Packaging your application
On the importance of tests
Writing unit tests for PySpark
Continuous testing
Modern day workflow management
Building a data pipeline with Airflow (from airflow.operators.bash_operator import BashOperator)
Deploying Airflow (from airflow.models import DagBag)

My course #67 was Importing and Managing Financial Data in Python. This course covers:

Reading, inspecting, and cleaning data from CSV (parse_dates explained)
Read data from Excel worksheets
Combine data from multiple worksheets (importing market data from multiple Excel files)
The DataReader: access financial data online (from pandas_datareader.data import DataReader)
Economic data from the Federal Reserve
Select stocks and get data from Google Finance
Get several stocks and manage a MultiIndex
Summarize your data with descriptive stats
Describe the distribution of your data with quantiles (np.arange() to .describe() with constant-step percentiles)
Visualize the distribution of your data [ax = sns.distplot(df)]
Summarize categorical variables
Aggregate your data by category
Summary statistics by category with seaborn [sns.countplot()]
Distributions by category with seaborn [sns.boxplot(), sns.swarmplot()]
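
The np.arange() to .describe() note above can be sketched roughly as follows. The series is placeholder data, and the rounding step is my own precaution against floating-point drift rather than something from the course.

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for a column of financial data.
prices = pd.Series(np.arange(1, 101), dtype=float)

# Constant-step percentiles from 10% to 90%; rounding guards against
# values such as 0.30000000000000004 produced by float arithmetic.
percentiles = np.arange(0.1, 1.0, 0.1).round(2)

# describe() reports each requested percentile alongside count, mean, etc.
summary = prices.describe(percentiles=percentiles)
print(summary)
```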

I will review more courses next time.

Categories: Python | Comments (0) | Permalink

Review of Python Courses (Part 21)

Posted by Mark on January 26, 2021 at 07:10 | Last modified: February 8, 2021 14:22

In Part 20, I summarized my Datacamp courses 59-61. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #62 was Time Series Analysis in Python. This clearly has potential applications for investment returns, but in the end I wasn’t totally sure what those might be. The course covers:

Introduction to the course
Correlation of two time series
Simple linear regressions (in statsmodels, numpy, pandas, scipy)
Autocorrelation (convert index to datetime)
Autocorrelation function (from statsmodels.graphics.tsaplots import plot_acf; from statsmodels.tsa.stattools import acf)
White noise
Random walk (from statsmodels.tsa.stattools import adfuller)
Stationarity
Introducing an AR model (from statsmodels.tsa.arima_process import ArmaProcess)
Estimating and forecasting an AR model
Choosing the right model (from statsmodels.graphics.tsaplots import plot_pacf)
Estimation and forecasting an MA model
ARMA models
Cointegration models
Case study: climate change
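
As a rough illustration of white noise versus a random walk, here is a sketch using only pandas and NumPy on simulated data. The course itself used statsmodels (acf, adfuller); Series.autocorr() is swapped in here as a simpler stand-in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# White noise: independent draws with constant mean and variance.
noise = pd.Series(rng.normal(loc=0.0, scale=1.0, size=1000))

# Random walk: the running sum of those white-noise steps.
walk = noise.cumsum()

# Lag-1 autocorrelation: near zero for noise, near one for the walk.
noise_ac = noise.autocorr(lag=1)
walk_ac = walk.autocorr(lag=1)

# Differencing the walk recovers the (stationary) white-noise steps.
diff_ac = walk.diff().dropna().autocorr(lag=1)

print(f"noise: {noise_ac:.3f}, walk: {walk_ac:.3f}, differenced walk: {diff_ac:.3f}")
```

For investment data, the analogy would be prices behaving like the walk and returns behaving more like the differenced series.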

My course #63 was Intermediate Predictive Analytics in Python. This course covers:

The basetable timeline
The population
The target
Adding predictive variables
Adding aggregated variables
Adding evolutions
Using evolution variables
Creating dummies (avoiding multicollinearity)
Missing values (list comprehension)
Handling outliers (from scipy.stats.mstats import winsorize)
Transformations
Seasonality
Using multiple snapshots
The timegap
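
The multicollinearity point about dummies can be shown with a small sketch. The country column is invented, and drop_first=True is one common way to handle it, not necessarily the exact approach the course takes.

```python
import pandas as pd

# Hypothetical categorical column in a basetable.
basetable = pd.DataFrame({"country": ["USA", "UK", "India", "UK"]})

# One dummy per category: the columns always sum to 1 in every row, so any
# one of them is fully determined by the others (perfect multicollinearity).
all_dummies = pd.get_dummies(basetable["country"])

# drop_first=True omits one category, which becomes the implicit baseline.
dummies = pd.get_dummies(basetable["country"], drop_first=True)

print(all_dummies)
print(dummies)
```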

My course #64 was Building and Distributing Packages with Conda. This is another shell-related course I found hard to absorb since I do very little in the shell. I’m not the only newbie who feels this way, either. This was a recent post to the group:

> I have been doing Python courses for a while but now I actually wanna try some real
> live data on my laptop and I am not sure on how to install all of the needed stuff
> (pandas, numpy, etc.). I have downloaded the latest Python version and the PyCharm
> editor but… [the courses] do not really have anything to show you how to actually
> make the rest of the things work for inexperienced people such as myself.

I downloaded the Spyder IDE, which has met most of my needs. It sometimes crashes and throws repeated errors at start-up, though, both of which are quite annoying. I’ve also had mixed results downloading some libraries like Backtester.

Speaking of Anaconda and its package manager, conda, my 64th course covers:

Anaconda Project
Anaconda Project specification file
Anaconda Project commands
Python module and packages
Python package directory
Conda packages
Conda package dependencies

I will review more courses next time.


Review of Python Courses (Part 20)

Posted by Mark on January 21, 2021 at 07:00 | Last modified: February 8, 2021 10:04

In Part 19, I summarized my Datacamp courses 56-58. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #59 was Dealing with Missing Data in Python. This course covers:

Why deal with missing data (built-in Python NoneType vs. np.nan)?
Handling missing values
Analyze the amount of missingness (import missingno as msno)
Is the data missing at random?
Finding patterns in missing data
Visualizing missingness across a variable
When and how to delete missing data
Mean, median, and mode imputations (from sklearn.impute import SimpleImputer)
Imputing time-series data
Visualizing time-series imputations
Imputing using fancyimpute (from fancyimpute import KNN; from fancyimpute import IterativeImputer)
Imputing categorical values
Evaluation of different imputation techniques
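
Here is a minimal sketch of the None versus np.nan distinction and two simple imputations. It uses plain pandas (fillna and interpolate) instead of the SimpleImputer and fancyimpute tools the course covers, and the series is made up.

```python
import numpy as np
import pandas as pd

# None is a Python object with no arithmetic; np.nan is a float that propagates.
try:
    None + 1
except TypeError:
    none_supports_addition = False
else:
    none_supports_addition = True
nan_plus_one = np.nan + 1  # still nan, no exception raised

# Hypothetical series with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])
n_missing = int(s.isna().sum())

# Mean imputation: every gap gets the mean of the observed values.
mean_imputed = s.fillna(s.mean())

# Linear interpolation: often more sensible for time-ordered data.
interpolated = s.interpolate(method="linear")

print(n_missing)
print(mean_imputed.tolist())
print(interpolated.tolist())
```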

My course #60 was Intermediate Python for Finance. This course covers:

Representing time with datetimes
Working with datetimes
Dictionaries
Comparison operators
Boolean operators
If statements (with dictionary)
For and while loops
Creating a dataframe
Accessing data
Aggregating and summarizing
Extending and manipulating data
Peeking at data with head, tail, and describe
Filtering data
Plotting data

My course #61 was Object-Oriented Programming in Python. These OOP-related courses were really confusing to me the first time through. This course covers:

What is OOP?
Class anatomy: attributes and methods
Class anatomy: the __init__ constructor
Instance and class data
Class inheritance
Customizing functionality via inheritance
Operator overloading: comparison
Operator overloading: string representation
Exceptions (try – except – finally)
Designing for inheritance and polymorphism (Liskov substitution principle)
Managing data access: private attributes
Properties
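
Since these OOP topics were confusing to me, a compact sketch tying several of them together may help. The Account and SavingsAccount classes are hypothetical examples of my own, not taken from the course.

```python
class Account:
    """Hypothetical example class; not from the course."""

    def __init__(self, owner, balance=0.0):
        self.owner = owner
        self._balance = balance  # leading underscore: internal by convention

    @property
    def balance(self):
        # Read access goes through the property.
        return self._balance

    @balance.setter
    def balance(self, value):
        # Write access is validated before the attribute changes.
        if value < 0:
            raise ValueError("balance cannot be negative")
        self._balance = value

    def __eq__(self, other):
        # Operator overloading: == compares owner and balance.
        return (isinstance(other, Account)
                and (self.owner, self.balance) == (other.owner, other.balance))

    def __repr__(self):
        # String representation used by print() and the console.
        return f"Account({self.owner!r}, {self.balance!r})"


class SavingsAccount(Account):
    """Inherits everything from Account and adds an interest rate."""

    def __init__(self, owner, balance=0.0, rate=0.01):
        super().__init__(owner, balance)
        self.rate = rate

    def add_interest(self):
        self.balance = self.balance * (1 + self.rate)


acct = SavingsAccount("Mark", 100.0, rate=0.05)
acct.add_interest()
try:
    acct.balance = -1  # rejected by the property setter
except ValueError as err:
    message = str(err)
finally:
    print(acct)  # runs whether or not an exception occurred
```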

I will review more courses next time.


Review of Python Courses (Part 19)

Posted by Mark on January 19, 2021 at 07:12 | Last modified: February 6, 2021 04:54

In Part 18, I summarized my Datacamp courses 53-55. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #56 was Writing Efficient Code with pandas. This course covers:

The need for efficient coding (time.time(), list comprehensions faster than for loop)
Locate rows: .iloc[] (generally faster for rows) and .loc[] (generally faster for columns)
Select random rows (pandas .sample() method faster than numpy random integer generator)
Replace scalar values using .replace() (much faster than using .loc[] to find values and reassigning them)
Replace values using lists (.replace() faster than using .loc[] )
Replace values using dictionaries (faster than using lists)
Looping through the .iterrows() function [for loop using range() is faster than the smarter/cleaner/optimized .iterrows()]
Looping through the .apply() function (faster iterating along rows while native pandas .sum() faster along columns)
Vectorization over pandas series [vectorization method .apply() works faster than .iterrows()]
Vectorization using NumPy arrays via the .values attribute (summing arrays is faster than summing series)
Data transformation using .groupby().transform (.transform() cleaner and much faster than native Python code)
Missing value imputation using .transform() (.transform() much faster than native Python code)
Data filtration using the .filter() function (.groupby().filter() faster than list comprehension + for loop)
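
A rough sketch of how a couple of these comparisons can be timed with time.time() follows. The dataframe is random placeholder data, and actual timings will vary by machine and pandas version, so treat the printed numbers as indicative only.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=10_000),
    "x": rng.random(10_000),
    "y": rng.random(10_000),
})

# Row-by-row loop with .iterrows().
start = time.time()
loop_total = [row["x"] + row["y"] for _, row in df.iterrows()]
loop_seconds = time.time() - start

# Vectorized sum on the underlying NumPy arrays (.values is an attribute).
start = time.time()
vector_total = df["x"].values + df["y"].values
vector_seconds = time.time() - start

# Group-wise mean broadcast back to every row with .groupby().transform().
group_mean = df.groupby("group")["x"].transform("mean")

print(f"iterrows: {loop_seconds:.4f}s, arrays: {vector_seconds:.4f}s")
```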

My course #57 was Credit Risk Modeling in Python. This course covers:

Understanding credit risk
Outliers in credit data
Risk with missing data in loan data (finding, counting, and replacing missing data)
Logistic regression for probability of default
Predicting the probability of default
Credit model performance
Model discrimination and impact
Gradient boosted trees with XGBoost
Column selection for credit risk
Cross validation for credit models
Class imbalance in loan data
Model evaluation and implementation (from sklearn.calibration import calibration_curve)
Credit acceptance rates
Credit strategy and maximum expected loss

My course #58 was Analyzing IoT Data in Python. This course covers:

Introduction to IoT data
Understand the data
Introduction to data streams (import paho.mqtt.subscribe as subscribe)
Perform EDA
Clean data
Gather minimalistic incremental data
Prepare and visualize incremental data
Combining data sources for further analysis
Correlation
Outliers (from statsmodels.graphics import tsaplots)
Seasonality and trends
Prepare data for machine learning
Scaling data for machine learning
Develop machine learning pipeline (from sklearn.pipeline import Pipeline)
Apply a machine learning model

I will review more classes next time.


Review of Python Courses (Part 18)

Posted by Mark on January 15, 2021 at 07:15 | Last modified: February 5, 2021 10:08

In Part 17, I summarized my Datacamp courses 50-52. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #53 was Introduction to Python for Finance. This course covers:

Why Python for finance?
Comments and variables
Variable data types
Lists in Python
Lists in lists
Methods and functions
Arrays (probably best for financial analysis)
Two dimensional arrays
Using arrays for analyses (indexing arrays—might work in place of .loc or .iloc?)
Visualization in Python
Histograms (normed arg)
Introducing the dataset
Closer look at the sectors
Visualizing trends

My course #54 was Experimental Design in Python. This course covers:

Intro to experimental design (import plotnine as p9)
Our first hypothesis test—Student’s t-test (from scipy import stats)
Testing proportion and correlation [stats.chisquare(), stats.fisher_exact(), stats.pearsonr()]
Confounding variables
Blocking and randomization (random sampling)
ANOVA [import statsmodels as sm, stats.f_oneway()]
Interactive effects (two- and three-way ANOVAs)
Type I error (Bonferroni and Šidák correction for multiple comparisons)
Sample size (from statsmodels.stats import power as pwr)
Power
Assumptions and normal distributions (Q-Q plot)
Testing for normality [from scipy import stats, stats.shapiro()]
Non-parametric tests: Wilcoxon rank-sum and signed-rank (paired) test
More non-parametric tests: Spearman correlation

My course #55 was Introduction to Data Engineering. For some reason, these data engineering courses are not my cup of tea. This course covers:

What is data engineering?
Tools of the data engineer (data engineers are expert users of database systems)
Cloud providers
Databases
Parallel computing (from multiprocessing import Pool) and computation frameworks
Workflow scheduling frameworks
Extract
Transform
Loading
Putting it all together
Case study: course ratings
From ratings to recommendations
Scheduling daily jobs

I will review more classes next time.


Review of Python Courses (Part 17)

Posted by Mark on January 12, 2021 at 07:13 | Last modified: February 4, 2021 13:11

In Part 16, I summarized my Datacamp courses 47-49. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #50 was Introduction to Shell. This course covers:

How does the shell compare to a desktop interface?
Where am I and how can I identify files and directories?
How can I move to another directory (~ is home)?
How to copy, rename, move, and delete files
How to create and delete directories
How to view file contents
Modifying commands with flags
Getting help for a command
Selecting columns from a file
Repeating commands
Selecting lines with certain values
Storing command output to a file or using as input
Combining commands with pipe symbol
Counting records in a file
Specifying multiple files at once
Wildcards
Sorting lines of text and removing duplicate lines
How to stop a running program
Printing a variable’s value
How does the shell store information?
Repeating commands many times or once for each file
Recording names of a set of files
Variable’s name versus its value
Running many commands in a single loop
Using semicolons to do multiple things in a single loop
Editing a file
Saving commands to rerun later
Reusing pipes
Passing filenames to scripts
Processing a single argument
Writing loops in a shell script

My course #51 was Generalized Linear Models (GLM) in Python. This material is dense and really demands a third look (for me). This course covers:

Going beyond linear regression (import statsmodels.api as sm; from statsmodels.formula.api import glm)
How to build a GLM?
How to fit a GLM in Python?
Binary data and logistic regression (odds, odds ratio, and probability)
Interpreting coefficients
Interpreting model inference
Computing and describing predictions
Count data and Poisson distribution
Interpreting model fit
The problem of overdispersion
Multivariable logistic regression (from statsmodels.stats.outliers_influence import variance_inflation_factor)
Comparing models
Model formula (from patsy import dmatrix)
Categorical and interaction terms
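
Because odds, odds ratios, and probabilities are easy to mix up, here is a small sketch of how they relate in logistic regression. The coefficients b0 and b1 are made up rather than fitted with statsmodels.

```python
import math


def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


def probability(odds_value):
    """Convert odds back to a probability."""
    return odds_value / (1 + odds_value)


# Hypothetical logistic regression coefficients: log(odds) = b0 + b1 * x.
b0, b1 = -1.0, 0.5


def predicted_probability(x):
    """Inverse logit of the linear predictor."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


# A one-unit increase in x multiplies the odds by exp(b1), whatever x is.
odds_ratio = odds(predicted_probability(3)) / odds(predicted_probability(2))

print(odds(0.8))
print(odds_ratio, math.exp(b1))
```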

My course #52 was Pandas Joins for Spreadsheet Users. This course covers:

Joining data: a real-world necessity
Concatenation
Power and flexibility
Types of joins
A closer look at one-to-one joins
Combining common data with inner joins
“Out of many, one”
Joining on key columns
Index-based joins
Joining data in real life
Working with time data
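
The join types above can be sketched with two tiny tables. The tickers and sectors are invented for illustration.

```python
import pandas as pd

prices = pd.DataFrame({"ticker": ["AAA", "BBB", "CCC"],
                       "close": [10.0, 20.0, 30.0]})
sectors = pd.DataFrame({"ticker": ["AAA", "BBB", "DDD"],
                        "sector": ["Tech", "Energy", "Health"]})

# Inner join on a key column: only tickers present in both tables survive.
inner = prices.merge(sectors, on="ticker", how="inner")

# Left join: every row of prices is kept; unmatched sectors become NaN.
left = prices.merge(sectors, on="ticker", how="left")

# Index-based outer join: tickers from either table are kept.
indexed = prices.set_index("ticker").join(sectors.set_index("ticker"), how="outer")

# Concatenation: stack tables on top of each other, no key matching.
stacked = pd.concat([prices, prices], ignore_index=True)

print(inner)
print(left)
print(indexed)
```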

I will review more classes next time.


Review of Python Courses (Part 16)

Posted by Mark on January 7, 2021 at 07:01 | Last modified: February 4, 2021 09:07

In Part 15, I summarized my Datacamp courses 44-46. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #47 was Software Engineering for Data Scientists in Python. This is another relatively thick (for me) course with lots of object-oriented stuff:

Python, data science, and software engineering
Introduction to packages and documentation
Conventions and PEP 8
Writing your first package
Adding functionality to packages
Making your package portable
Adding classes to a package
Leveraging classes
Classes and the DRY principle
Multilevel inheritance
Documentation
Readability counts
Unit testing
Documentation and testing in practice (Sphinx, Travis CI, GitHub/GitLab, Codecov, Code Climate)

My course #48 was Analyzing Marketing Campaigns with pandas. This course covers:

Introduction to pandas for marketing
Data types and merging (.astype(), np.where(), .map(), pd.to_datetime() to avoid slower miscategorization as object)
Initial exploratory analysis
Introduction to common marketing metrics (count rows meeting criterion B within a column sliced on criterion A)
Customer segmentation (alternate method to format as percentage)
Plotting campaign results
Building functions to automate analysis (plotting function)
Identifying inconsistencies
Resolving inconsistencies
A/B testing for marketing
Calculating lift and significance testing (from scipy.stats import ttest_ind)
A/B testing and segmentation (function with statistical test)
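
As a sketch of the lift calculation, assuming lift means the relative change in conversion rate versus control, here is a toy example. The variant names and outcomes are made up, and the significance test (ttest_ind in the course) is left out to keep this dependency-free beyond pandas.

```python
import pandas as pd

# Hypothetical A/B test results: 1 = converted, 0 = did not convert.
ab = pd.DataFrame({
    "variant": ["control"] * 5 + ["personalization"] * 5,
    "converted": [1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
})

# Conversion rate per variant.
rates = ab.groupby("variant")["converted"].mean()

# Lift: relative improvement of the treatment over the control.
lift = (rates["personalization"] - rates["control"]) / rates["control"]

# One way to format a ratio as a percentage.
print(f"lift: {lift:.0%}")
```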

My course #49 was Conda Essentials. This course covers:

What are packages and why are they needed?
What version of conda do I have?
Install a conda package
What is semantic versioning?
Install a specific version of a package
Update/remove/search for a conda package
Find dependencies for a package version
Channels and why they are needed (as means for a user to publish packages independently)
Which environment am I using?
Removing, creating, and exporting environments
Compatibility with different versions
Updating a script

I will review more classes next time.


Review of Python Courses (Part 15)

Posted by Mark on January 4, 2021 at 07:23 | Last modified: February 3, 2021 13:57

In Part 14, I summarized my Datacamp courses 41-43. Today I will continue with the next three.

As a reminder, I introduced you to my recent work learning Python here.

My course #44 was Python for Spreadsheet Users. This course covers:

Welcome to Python
Dataframes and their methods
Filtering rows and creating columns
Grouping and summing: the beginner’s pivot table
Grouping by multiple columns
More ways to condense information (using .groupby().head() to get top rows of each group)
Working with multiple sheets [pd.ExcelFile(), .sheet_names attribute, .parse()]
Preparing to put tables together
Merging: the VLOOKUP of Python
How visualization works in Python
Building up the barplot
The power of hue

My course #45 was Preprocessing for Machine Learning in Python. This course covers:

Preprocessing data for machine learning (count number of missing values in column)
Working with data types (converting column types)
Training and test sets (stratified sampling with train_test_split)
Standardizing data and log normalization
Scaling data (from sklearn.preprocessing import StandardScaler)
Standardized data and modeling (from sklearn.neighbors import KNeighborsClassifier)
Feature engineering
Encoding categorical variables [lambda function, from sklearn.preprocessing import LabelEncoder, pd.get_dummies()]
Engineering numerical features
Engineering features from text (from sklearn.feature_extraction.text import TfidfVectorizer)
Feature selection
Removing redundant features
Selecting features using text vectors
Dimensionality reduction (from sklearn.decomposition import PCA)
UFOs and preprocessing
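
Here is a minimal sketch of log normalization and standardization done by hand in pandas and NumPy. The columns are made up; the course uses StandardScaler, which also divides by the population standard deviation (ddof=0), so the manual version below should behave comparably.

```python
import numpy as np
import pandas as pd

# Hypothetical features: column b has much higher variance than column a.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [1.0, 10.0, 100.0, 1000.0]})

# Count missing values in a column.
n_missing = int(df["a"].isna().sum())

# Log normalization: compresses a high-variance column.
df["b_log"] = np.log(df["b"])

# Standardization: subtract the mean and divide by the standard deviation
# so each column ends up with mean 0 and standard deviation 1.
standardized = (df - df.mean()) / df.std(ddof=0)

print(df.var())
print(standardized)
```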

My course #46 was Cluster Analysis in Python. This course covers:

Unsupervised learning: basics
Basics of cluster analysis (from scipy.cluster.hierarchy import linkage, fcluster; from scipy.cluster.vq import kmeans, vq)
Data preparation for cluster analysis (from scipy.cluster.vq import whiten)
Basics of hierarchical clustering
Visualize clusters
How many clusters (from scipy.cluster.hierarchy import dendrogram)?
Limitations of hierarchical clustering
Basics of k-means clustering
How many clusters?
Limitations of k-means clustering
Dominant colors in images (import matplotlib.image as img)
Document clustering
Clustering with multiple features

I will review more classes next time.

