Review of Python Courses (Part 23)
Posted by Mark on February 1, 2021 at 07:34 | Last modified: February 10, 2021 10:35
In Part 22, I summarized my Datacamp courses 65-67. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #68 was Linear Classifiers in Python. This course covers:
- Introduction (import sklearn.datasets)
- Applying logistic regression and SVM (general process, from sklearn.svm import LinearSVC)
- Linear decision boundaries
- Linear classifiers: prediction equations
- What is a loss function (from scipy.optimize import minimize)?
- Loss function diagrams
- Logistic regression and regularization
- Logistic regression and probabilities
- Multi-class logistic regression
- Support vectors
- Kernel SVMs
- Comparing logistic regression and SVM (from sklearn.linear_model import SGDClassifier)
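To make the prediction-equation and loss-function chapters a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the course: the weights, bias, and data point are invented, and the helper names are my own. The idea is that a linear classifier computes a raw output (dot product plus intercept), predicts from its sign, and is trained by minimizing a loss such as the logistic loss (logistic regression) or the hinge loss (linear SVM).

```python
import math

def raw_model_output(weights, bias, x):
    """Linear prediction equation: dot(weights, x) + bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def predict(weights, bias, x):
    """Classify as +1 when the raw output is positive, otherwise -1."""
    return 1 if raw_model_output(weights, bias, x) > 0 else -1

def logistic_loss(margin):
    """Logistic regression loss as a function of y * raw output."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Linear SVM loss as a function of y * raw output."""
    return max(0.0, 1.0 - margin)

# Hypothetical coefficients and data point, chosen only for illustration.
weights, bias = [2.0, -1.0], 0.5
x, y = [1.0, 3.0], -1

raw = raw_model_output(weights, bias, x)   # 2*1 - 1*3 + 0.5 = -0.5
margin = y * raw                           # positive margin: correctly classified
print(predict(weights, bias, x), logistic_loss(margin), hinge_loss(margin))
```

Both losses shrink as the margin grows, but the hinge loss is exactly zero once the margin reaches 1, which is one way to see why only some points end up mattering as support vectors.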
My course #69 was Analyzing Social Media Data in Python. While I found this somewhat interesting, it seemed to incorporate as much JSON as it did Python. I have a hard enough time studying one new language—adding a second on top of that made things even more confusing for me:
- Analyzing Twitter data
- Collecting data through the Twitter API (from tweepy import Stream, OAuthHandler, API)
- Understanding Twitter JSON
- Processing Twitter text
- Counting words
- Time series
- Sentiment analysis
- Twitter networks
- Importing and visualizing Twitter networks (import networkx as nx)
- Node-level metrics
- Maps and Twitter data
- Geographical data in Twitter JSON
- Creating Twitter maps (from mpl_toolkits.basemap import Basemap)
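Since the JSON side of this course was the confusing part for me, here is a small sketch of how a tweet-like JSON string turns into ordinary Python dictionaries and lists. The payload below is made up and far smaller than a real API response, so treat the field layout as an assumption for illustration only.

```python
import json
from collections import Counter

# Made-up payload shaped loosely like a tweet; real responses carry many more fields.
raw_tweet = '''
{
  "text": "Learning Python with #pandas and more #Python",
  "user": {"screen_name": "example_user", "followers_count": 42},
  "entities": {"hashtags": [{"text": "pandas"}, {"text": "Python"}]}
}
'''

tweet = json.loads(raw_tweet)                 # JSON object -> Python dict
screen_name = tweet["user"]["screen_name"]    # nested objects become nested dicts
hashtags = [tag["text"].lower() for tag in tweet["entities"]["hashtags"]]

# Count words in the tweet text, ignoring case and the leading '#'.
words = [word.strip("#").lower() for word in tweet["text"].split()]
word_counts = Counter(words)

print(screen_name, hashtags, word_counts.most_common(1))
```

Once the JSON is loaded, everything else is plain Python indexing, which is a helpful way to think about it: JSON is the storage format, not a second language to program in.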
My course #70 was Fraud Detection in Python. This course covers:
- Introduction to fraud detection
- Increasing successful detections using data resampling (from imblearn.over_sampling import RandomOverSampler)
- Fraud detection algorithms in action (from imblearn.pipeline import Pipeline)
- Review of classification methods
- Performance evaluation (from sklearn.metrics import precision_recall_curve, average_precision_score)
- More performance evaluation (from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score)
- Adjusting your algorithm weights
- Performance evaluation (from sklearn.model_selection import GridSearchCV)
- Ensemble methods (from sklearn.ensemble import VotingClassifier)
- Normal versus abnormal behavior
- Clustering methods (from sklearn.preprocessing import MinMaxScaler; from sklearn.cluster import MiniBatchKMeans)
- Assigning fraud versus non-fraud
- Other clustering fraud detection methods (from sklearn.cluster import DBSCAN)
- Using text data (from nltk import word_tokenize; import string)
- Text mining to detect fraud (from nltk.corpus import stopwords; from nltk.stem.wordnet import WordNetLemmatizer)
- Topic modeling on fraud (from gensim import corpora)
- Flagged fraud based on topics (import pyLDAvis.gensim for use with Jupyter Notebooks only)
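The resampling chapter relies on imblearn's RandomOverSampler. As a rough illustration of the idea only, here is a hand-rolled sketch in plain Python: it duplicates randomly chosen minority-class rows until the classes are balanced. The function name, seed, and toy data are my own assumptions, and this is not how imblearn is implemented.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: 1 marks fraud, 0 marks non-fraud (values are invented).
rows = [[10.0], [12.5], [11.0], [9.5], [950.0]]
labels = [0, 0, 0, 0, 1]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Resampling like this is normally applied to the training set only, so that the test set still reflects how rare fraud really is.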
I will review more courses next time.
Categories: Python | Comments (0) | Permalink
Review of Python Courses (Part 22)
Posted by Mark on January 29, 2021 at 07:31 | Last modified: February 9, 2021 13:29
In Part 21, I summarized my Datacamp courses 62-64. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #65 was Reshaping Data with pandas. This course covers:
- Wide and long formats
- Reshaping using pivot method
- Pivot tables
- Reshaping with melt
- Wide to long function
- Working with string columns
- Stacking dataframes
- Unstacking dataframes
- Working with multiple levels
- Handling missing data
- Reshaping and combining data
- Transforming a list-like column
- Reading nested data into a dataframe (from pandas import json_normalize)
- Dealing with nested data columns
My course #66 was Building Data Engineering Pipelines in Python. For some reason, these data engineering courses did not sit well with me and much of this sailed over my head. This course covers:
- Components of a data platform
- Introduction to data ingestion with Singer
- Running an ingestion pipeline with Singer
- Basic introduction to PySpark (from pyspark.sql import SparkSession)
- Cleaning data
- Transforming data with Spark
- Packaging your application
- On the importance of tests
- Writing unit tests for PySpark
- Continuous testing
- Modern day workflow management
- Building a data pipeline with Airflow (from airflow.operators.bash_operator import BashOperator)
- Deploying Airflow (from airflow.models import DagBag)
My course #67 was Importing and Managing Financial Data in Python. This course covers:
- Reading, inspecting, and cleaning data from CSV (parse_dates explained)
- Read data from Excel worksheets
- Combine data from multiple worksheets (importing market data from multiple Excel files)
- The DataReader: access financial data online (from pandas_datareader.data import DataReader)
- Economic data from the Federal Reserve
- Select stocks and get data from Google Finance
- Get several stocks and manage a MultiIndex
- Summarize your data with descriptive stats
- Describe the distribution of your data with quantiles (np.arange() to .describe() with constant-step percentiles)
- Visualize the distribution of your data [ax = sns.distplot(df)]
- Summarize categorical variables
- Aggregate your data by category
- Summary statistics by category with seaborn [sns.countplot()]
- Distributions by category with seaborn [sns.boxplot(), sns.swarmplot()]
I will review more courses next time.
Review of Python Courses (Part 21)
Posted by Mark on January 26, 2021 at 07:10 | Last modified: February 8, 2021 14:22
In Part 20, I summarized my Datacamp courses 59-61. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #62 was Time Series Analysis in Python. This clearly has potential applications for investment returns, but in the end I wasn’t totally sure what those might be. The course covers:
- Introduction to the course
- Correlation of two time series
- Simple linear regressions (in statsmodels, numpy, pandas, scipy)
- Autocorrelation (convert index to datetime)
- Autocorrelation function (from statsmodels.graphics.tsaplots import plot_acf; from statsmodels.tsa.stattools import acf)
- White noise
- Random walk (from statsmodels.tsa.stattools import adfuller)
- Stationarity
- Introducing an AR model (from statsmodels.tsa.arima_process import ArmaProcess)
- Estimating and forecasting an AR model
- Choosing the right model (from statsmodels.graphics.tsaplots import plot_pacf)
- Estimation and forecasting an MA model
- ARMA models
- Cointegration models
- Case study: climate change
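To show what the autocorrelation chapters are measuring, here is a small hand-rolled sketch in plain Python. The course uses statsmodels (acf, plot_acf); the function below is my own simplified stand-in and the two series are invented. It computes the sample autocorrelation at a given lag: how strongly a series is correlated with a lagged copy of itself.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance

# Invented examples: a steadily rising series and one that flips sign every step.
trending = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(autocorrelation(trending, lag=1))      # positive: momentum-like behavior
print(autocorrelation(alternating, lag=1))   # negative: mean-reverting behavior
```

For investment returns, a positive lag-1 autocorrelation would hint at momentum and a negative one at mean reversion, which is probably the most direct application of this material.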
My course #63 was Intermediate Predictive Analytics in Python. This course covers:
- The basetable timeline
- The population
- The target
- Adding predictive variables
- Adding aggregated variables
- Adding evolutions
- Using evolution variables
- Creating dummies (avoiding multicollinearity)
- Missing values (list comprehension)
- Handling outliers (from scipy.stats.mstats import winsorize)
- Transformations
- Seasonality
- Using multiple snapshots
- The timegap
My course #64 was Building and Distributing Packages with Conda. This is another shell-related course I found hard to absorb since I do very little in the shell. I’m not the only newbie who feels this way, either. This was a recent post to the group:
> I have been doing Python courses for a while but now I actually wanna try some real
> live data on my laptop and I am not sure on how to install all of the needed stuff
> (pandas, numpy, etc.). I have downloaded the latest Python version and the PyCharm
> editor but… [the courses] do not really have anything to show you how to actually
> make the rest of the things work for inexperienced people such as myself.
I downloaded Spyder IDE, which has met most of my needs. It crashes sometimes and gives repetitive errors upon start-up, though, which are both quite annoying. I’ve also had mixed results downloading some libraries like Backtester.
Speaking of Anaconda and its package manager, conda, my 64th course covers:
- Anaconda Project
- Anaconda Project specification file
- Anaconda Project commands
- Python module and packages
- Python package directory
- Conda packages
- Conda package dependencies
I will review more courses next time.
Review of Python Courses (Part 20)
Posted by Mark on January 21, 2021 at 07:00 | Last modified: February 8, 2021 10:04
In Part 19, I summarized my Datacamp courses 56-58. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #59 was Dealing with Missing Data in Python. This course covers:
- Why deal with missing data (built-in Python NoneType vs. np.nan)?
- Handling missing values
- Analyze the amount of missingness (import missingno as msno)
- Is the data missing at random?
- Finding patterns in missing data
- Visualizing missingness across a variable
- When and how to delete missing data
- Mean, median, and mode imputations (from sklearn.impute import SimpleImputer)
- Imputing time-series data
- Visualizing time-series imputations
- Imputing using fancyimpute (from fancyimpute import KNN; from fancyimpute import IterativeImputer)
- Imputing categorical values
- Evaluation of different imputation techniques
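The first bullet (NoneType versus np.nan) is easier to see in code. The sketch below sticks to the standard library, using float("nan") as a stand-in for np.nan (np.nan is an ordinary Python float holding the same NaN value). The readings list is invented, and the mean imputation is done by hand rather than with SimpleImputer.

```python
import math

nan = float("nan")   # stand-in for np.nan, which is the same kind of float

# None is a singleton object; NaN is a float that is not even equal to itself.
print(None == None)      # True
print(nan == nan)        # False
print(math.isnan(nan))   # True

# None breaks arithmetic, while NaN silently propagates through it.
try:
    None + 1
except TypeError as exc:
    print("TypeError:", exc)
print(nan + 1)           # nan

# Mean imputation by hand: fill missing readings with the mean of the observed ones.
readings = [4.0, nan, 6.0, nan, 8.0]
observed = [r for r in readings if not math.isnan(r)]
mean = sum(observed) / len(observed)
imputed = [mean if math.isnan(r) else r for r in readings]
print(imputed)           # [4.0, 6.0, 6.0, 6.0, 8.0]
```

The fact that NaN never equals itself is why missing values should be detected with functions such as math.isnan() or pandas .isnull() rather than with ==.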
My course #60 was Intermediate Python for Finance. This course covers:
- Representing time with datetimes
- Working with datetimes
- Dictionaries
- Comparison operators
- Boolean operators
- If statements (with dictionary)
- For and while loops
- Creating a dataframe
- Accessing data
- Aggregating and summarizing
- Extending and manipulating data
- Peeking at data with head, tail, and describe
- Filtering data
- Plotting data
My course #61 was Object-Oriented Programming in Python. These OOP-related courses were really confusing to me the first time through. This course covers:
- What is OOP?
- Class anatomy: attributes and methods
- Class anatomy: the __init__ constructor
- Instance and class data
- Class inheritance
- Customizing functionality via inheritance
- Operator overloading: comparison
- Operator overloading: string representation
- Exceptions (try – except – finally)
- Designing for inheritance and polymorphism (Liskov substitution principle)
- Managing data access: private attributes
- Properties
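Because the OOP material was confusing the first time through, here is one compact sketch that touches most of the bullets above: a constructor, a private-by-convention attribute exposed through a property, inheritance, operator overloading for comparison and string representation, and try/except/finally. The Account and SavingsAccount classes are hypothetical examples of mine, not taken from the course.

```python
class Account:
    """Base class with a constructor, a property, and overloaded operators."""

    def __init__(self, owner, balance=0.0):
        self.owner = owner
        self._balance = balance   # leading underscore: private by convention

    @property
    def balance(self):
        """Read-only access to the balance."""
        return self._balance

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self._balance += amount

    def __eq__(self, other):
        # Comparison overloading: equal only if same class, owner, and balance.
        return (type(self) is type(other)
                and self.owner == other.owner
                and self.balance == other.balance)

    def __repr__(self):
        # String representation overloading.
        return f"{type(self).__name__}({self.owner!r}, {self.balance!r})"


class SavingsAccount(Account):
    """Child class that adds behavior without changing the parent's interface."""

    def __init__(self, owner, balance=0.0, rate=0.01):
        super().__init__(owner, balance)
        self.rate = rate

    def add_interest(self):
        self.deposit(self.balance * self.rate)


acct = SavingsAccount("alice", 100.0, rate=0.25)   # invented values
acct.add_interest()
try:
    acct.deposit(-5)
except ValueError as exc:
    print("rejected:", exc)
finally:
    print(acct)
```

Because SavingsAccount only adds to what Account does, it can be used anywhere an Account is expected, which is the gist of the Liskov substitution principle.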
I will review more courses next time.
Review of Python Courses (Part 19)
Posted by Mark on January 19, 2021 at 07:12 | Last modified: February 6, 2021 04:54
In Part 18, I summarized my Datacamp courses 53-55. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #56 was Writing Efficient Code with pandas. This course covers:
- The need for efficient coding (time.time(), list comprehensions faster than for loop)
- Locate rows: .iloc[] (generally faster for rows) and .loc[] (generally faster for columns)
- Select random rows (pandas .sample() method faster than NumPy random integer generator)
- Replace scalar values using .replace() (much faster than using .loc[] to find values and reassigning them)
- Replace values using lists (.replace() faster than using .loc[] )
- Replace values using dictionaries (faster than using lists)
- Looping through the .iterrows() function [for loop using range() is faster than the smarter/cleaner/optimized .iterrows()]
- Looping through the .apply() function (faster iterating along rows while native pandas .sum() faster along columns)
- Vectorization over pandas series [vectorization method .apply() works faster than .iterrows()]
- Vectorization using NumPy arrays using the .values attribute (summing arrays is faster than summing series)
- Data transformation using .groupby().transform (.transform() cleaner and much faster than native Python code)
- Missing value imputation using .transform() (.transform() much faster than native Python code)
- Data filtration using the .filter() function (.groupby().filter() faster than list comprehension + for loop)
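Several of the bullets above compare ways of doing the same job, so here is a minimal sketch that puts a few of them side by side. The dataframe is random, invented data and the column names are my own. I have left timings out because they depend on the machine and the pandas version; wrapping each step in time.time() calls, as the course does, is one way to measure them.

```python
import numpy as np
import pandas as pd

# Invented data: three integer columns, 1,000 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 3)), columns=["a", "b", "c"])

# 1. Row-by-row loop with .iterrows().
loop_total = [row["a"] + row["b"] + row["c"] for _, row in df.iterrows()]

# 2. Vectorized over pandas Series.
series_total = df["a"] + df["b"] + df["c"]

# 3. Vectorized over the underlying NumPy arrays (.values is an attribute, not a method).
array_total = df["a"].values + df["b"].values + df["c"].values

# Replace scalar values by passing a dictionary to .replace().
replaced = df["a"].replace({0: -1, 9: 99})

# Group-wise transform: subtract the mean of "b" within each group of "a".
centered = df["b"] - df.groupby("a")["b"].transform("mean")
```

All three sums give identical results; the point of the course is that the vectorized versions get there without a Python-level loop.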
My course #57 was Credit Risk Modeling in Python. This course covers:
- Understanding credit risk
- Outliers in credit data
- Risk with missing data in loan data (finding, counting, and replacing missing data)
- Logistic regression for probability of default
- Predicting the probability of default
- Credit model performance
- Model discrimination and impact
- Gradient boosted trees with XGBoost
- Column selection for credit risk
- Cross validation for credit models
- Class imbalance in loan data
- Model evaluation and implementation (from sklearn.calibration import calibration_curve)
- Credit acceptance rates
- Credit strategy and maximum expected loss
My course #58 was Analyzing IoT Data in Python. This course covers:
- Introduction to IoT data
- Understand the data
- Introduction to data streams (import paho.mqtt.subscribe as subscribe)
- Perform EDA
- Clean data
- Gather minimalistic incremental data
- Prepare and visualize incremental data
- Combining data sources for further analysis
- Correlation
- Outliers (from statsmodels.graphics import tsaplots)
- Seasonality and trends
- Prepare data for machine learning
- Scaling data for machine learning
- Develop machine learning pipeline (from sklearn.pipeline import Pipeline)
- Apply a machine learning model
I will review more classes next time.
Review of Python Courses (Part 18)
Posted by Mark on January 15, 2021 at 07:15 | Last modified: February 5, 2021 10:08
In Part 17, I summarized my Datacamp courses 50-52. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #53 was Introduction to Python for Finance. This course covers:
- Why Python for finance?
- Comments and variables
- Variable data types
- Lists in Python
- Lists in lists
- Methods and functions
- Arrays (probably best for financial analysis)
- Two dimensional arrays
- Using arrays for analyses (indexing arrays—might work in place of .loc or .iloc?)
- Visualization in Python
- Histograms (normed arg, which newer Matplotlib versions replace with density)
- Introducing the dataset
- Closer look at the sectors
- Visualizing trends
My course #54 was Experimental Design in Python. This course covers:
- Intro to experimental design (import plotnine as p9)
- Our first hypothesis test—Student’s t-test (from scipy import stats)
- Testing proportion and correlation [stats.chisquare(), stats.fisher_exact(), stats.pearsonr()]
- Confounding variables
- Blocking and randomization (random sampling)
- ANOVA [import statsmodels as sm, stats.f_oneway()]
- Interactive effects (two- and three-way ANOVAs)
- Type I error (Bonferroni and Šidák correction for multiple comparisons)
- Sample size (from statsmodels.stats import power as pwr)
- Power
- Assumptions and normal distributions (Q-Q plot)
- Testing for normality [from scipy import stats, stats.shapiro()]
- Non-parametric tests: Wilcoxon rank-sum and signed-rank (paired) test
- More non-parametric tests: Spearman correlation
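The Type I error bullet mentions the Bonferroni and Šidák corrections. As a quick sketch of the arithmetic behind them (function names are mine, and the 0.05 level and three tests are arbitrary choices): running several independent tests inflates the chance of at least one false positive, and each correction lowers the per-test threshold to compensate.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Bonferroni: divide the significance level by the number of tests."""
    return alpha / n_tests

def sidak_alpha(alpha, n_tests):
    """Šidák: per-test level that keeps the family-wise rate at alpha exactly."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

alpha, n_tests = 0.05, 3   # arbitrary example values
print(familywise_error_rate(alpha, n_tests))   # about 0.143 without correction
print(bonferroni_alpha(alpha, n_tests))        # about 0.0167
print(sidak_alpha(alpha, n_tests))             # about 0.0170
```

Bonferroni is always a little stricter than Šidák, but the two thresholds are very close for small numbers of tests.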
My course #55 was Introduction to Data Engineering. For some reason, these data engineering courses are not my cup of tea. This course covers:
- What is data engineering?
- Tools of the data engineer (data engineers are expert users of database systems)
- Cloud providers
- Databases
- Parallel computing (from multiprocessing import Pool) and computation frameworks
- Workflow scheduling frameworks
- Extract
- Transform
- Loading
- Putting it all together
- Case study: course ratings
- From ratings to recommendations
- Scheduling daily jobs
I will review more classes next time.
Review of Python Courses (Part 17)
Posted by Mark on January 12, 2021 at 07:13 | Last modified: February 4, 2021 13:11
In Part 16, I summarized my Datacamp courses 47-49. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #50 was Introduction to Shell. This course covers:
- How does the shell compare to a desktop interface?
- Where am I and how can I identify files and directories?
- How can I move to another directory (~ is home)?
- How to copy, rename, move, and delete files
- How to create and delete directories
- How to view file contents
- Modifying commands with flags
- Getting help for a command
- Selecting columns from a file
- Repeating commands
- Selecting lines with certain values
- Storing command output to a file or using as input
- Combining commands with pipe symbol
- Counting records in a file
- Specifying multiple files at once
- Wildcards
- Sorting lines of text and removing duplicate lines
- How to stop a running program
- Printing a variable’s value
- How does the shell store information?
- Repeating commands many times or once for each file
- Recording names of a set of files
- Variable’s name versus its value
- Running many commands in a single loop
- Using semicolons to do multiple things in a single loop
- Editing a file
- Saving commands to rerun later
- Reusing pipes
- Passing filenames to scripts
- Processing a single argument
- Writing loops in a shell script
My course #51 was Generalized Linear Models (GLM) in Python. This material is thick and really demands a third look (for me). This course covers:
- Going beyond linear regression (import statsmodels.api as sm; from statsmodels.formula.api import glm)
- How to build a GLM?
- How to fit a GLM in Python?
- Binary data and logistic regression (odds, odds ratio, and probability)
- Interpreting coefficients
- Interpreting model inference
- Computing and describing predictions
- Count data and Poisson distribution
- Interpreting model fit
- The problem of overdispersion
- Multivariable logistic regression (from statsmodels.stats.outliers_influence import variance_inflation_factor)
- Comparing models
- Model formula (from patsy import dmatrix)
- Categorical and interaction terms
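Since this material deserves another look, here is a small sketch of the odds, odds ratio, and probability relationships from the logistic regression chapter, in plain Python. The two probabilities are invented and the helper names are mine. The key link is that a logistic regression coefficient is a change in log odds, so exponentiating it gives an odds ratio.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def probability_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def logistic(log_odds):
    """Inverse of the logit: map log odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Invented probabilities of an event in two groups.
p_group_a, p_group_b = 0.75, 0.5

odds_ratio = odds(p_group_a) / odds(p_group_b)   # 3 / 1 = 3
beta = math.log(odds_ratio)                      # the matching coefficient on the log-odds scale

print(odds(p_group_a), odds_ratio, beta)
print(logistic(beta))   # back to 0.75, because group B's log odds are 0
```

Reading coefficients as multiplicative changes in odds, rather than additive changes in probability, is what makes GLM output interpretable.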
My course #52 was Pandas Joins for Spreadsheet Users. This course covers:
- Joining data: a real-world necessity
- Concatenation
- Power and flexibility
- Types of joins
- A closer look at one-to-one joins
- Combining common data with inner joins
- “Out of many, one”
- Joining on key columns
- Index-based joins
- Joining data in real life
- Working with time data
I will review more classes next time.
Review of Python Courses (Part 16)
Posted by Mark on January 7, 2021 at 07:01 | Last modified: February 4, 2021 09:07
In Part 15, I summarized my Datacamp courses 44-46. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #47 was Software Engineering for Data Scientists in Python. This is another relatively thick (for me) course with lots of object-oriented stuff:
- Python, data science, and software engineering
- Introduction to packages and documentation
- Conventions and PEP 8
- Writing your first package
- Adding functionality to packages
- Making your package portable
- Adding classes to a package
- Leveraging classes
- Classes and the DRY principle
- Multilevel inheritance
- Documentation
- Readability counts
- Unit testing
- Documentation and testing in practice (Sphinx, Travis CI, GitHub/GitLab, Codecov, Code Climate)
My course #48 was Analyzing Marketing Campaigns with pandas. This course covers:
- Introduction to pandas for marketing
- Data types and merging (.astype(), np.where(), .map(), pd.to_datetime() to avoid slower miscategorization as object)
- Initial exploratory analysis
- Introduction to common marketing metrics (count the rows meeting criterion B within a column sliced on criterion A)
- Customer segmentation (alternate method to format as percentage)
- Plotting campaign results
- Building functions to automate analysis (plotting function)
- Identifying inconsistencies
- Resolving inconsistencies
- A/B testing for marketing
- Calculating lift and significance testing (from scipy.stats import ttest_ind)
- A/B testing and segmentation (function with statistical test)
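For the lift calculation, here is a minimal sketch using invented counts. I am assuming lift is defined as the relative improvement of the treatment conversion rate over the control rate; the function names are mine. The significance test itself (ttest_ind from scipy.stats in the course) is left out to keep the example dependency-free.

```python
def conversion_rate(converted, total):
    """Share of users who converted."""
    return converted / total

def lift(control_rate, treatment_rate):
    """Relative improvement of the treatment over the control."""
    return (treatment_rate - control_rate) / control_rate

# Invented A/B test counts.
control = conversion_rate(50, 1000)        # 5.0% of the control group converted
personalized = conversion_rate(65, 1000)   # 6.5% of the treatment group converted

print(f"{lift(control, personalized):.1%}")   # formatted as a percentage
```

A lift figure on its own says nothing about whether the difference is real, which is why the course pairs it with a significance test.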
My course #49 was Conda Essentials. This course covers:
- What are packages and why are they needed?
- What version of conda do I have?
- Install a conda package
- What is semantic versioning?
- Install a specific version of a package
- Update/remove/search for a conda package
- Find dependencies for a package version
- Channels and why they are needed (as means for a user to publish packages independently)
- Which environment am I using?
- Remove, create, and export environments
- Compatibility with different versions
- Updating a script
I will review more classes next time.
Review of Python Courses (Part 15)
Posted by Mark on January 4, 2021 at 07:23 | Last modified: February 3, 2021 13:57
In Part 14, I summarized my Datacamp courses 41-43. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #44 was Python for Spreadsheet Users. This course covers:
- Welcome to Python
- Dataframes and their methods
- Filtering rows and creating columns
- Grouping and summing: the beginner’s pivot table
- Grouping by multiple columns
- More ways to condense information (using .groupby().head() to get top rows of each group)
- Working with multiple sheets [pd.ExcelFile(), .sheet_names attribute, .parse()]
- Preparing to put tables together
- Merging: the VLOOKUP of Python
- How visualization works in Python
- Building up the barplot
- The power of hue
My course #45 was Preprocessing for Machine Learning in Python. This course covers:
- Preprocessing data for machine learning (count number of missing values in column)
- Working with data types (converting column types)
- Training and test sets (stratified sampling with train_test_split)
- Standardizing data and log normalization
- Scaling data (from sklearn.preprocessing import StandardScaler)
- Standardized data and modeling (from sklearn.neighbors import KNeighborsClassifier)
- Feature engineering
- Encoding categorical variables [lambda function, from sklearn.preprocessing import LabelEncoder, pd.get_dummies()]
- Engineering numerical features
- Engineering features from text (from sklearn.feature_extraction.text import TfidfVectorizer)
- Feature selection
- Removing redundant features
- Selecting features using text vectors
- Dimensionality reduction (from sklearn.decomposition import PCA)
- UFOs and preprocessing
My course #46 was Cluster Analysis in Python. This course covers:
- Unsupervised learning: basics
- Basics of cluster analysis (from scipy.cluster.hierarchy import linkage, fcluster; from scipy.cluster.vq import kmeans, vq)
- Data preparation for cluster analysis (from scipy.cluster.vq import whiten)
- Basics of hierarchical clustering
- Visualize clusters
- How many clusters (from scipy.cluster.hierarchy import dendrogram)?
- Limitations of hierarchical clustering
- Basics of k-means clustering
- How many clusters?
- Limitations of k-means clustering
- Dominant colors in images (import matplotlib.image as img)
- Document clustering
- Clustering with multiple features
I will review more classes next time.
Review of Python Courses (Part 14)
Posted by Mark on December 31, 2020 at 07:48 | Last modified: February 2, 2021 17:19
In Part 13, I summarized my Datacamp courses 38-40. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #41 was Advanced Deep Learning with Keras. This course covers:
- Keras input and dense layers (from keras.layers import Input, Dense)
- Keras models (from keras.utils import plot_model)
- Fit and evaluate a model (from keras.models import Model)
- Category embeddings (from keras.layers import Embedding, Flatten)
- Shared layers
- Merge layers (from keras.layers import Add, Subtract)
- Fitting and predicting with multiple inputs
- Three-input models (from keras.layers import Concatenate)
- Summarizing and plotting models
- Stacking models
- Two-output models
- Single model for classification and regression (from keras.optimizers import Adam)
My course #42 was Statistical Simulation in Python. This course covers:
- Intro to Object Oriented Programming in Python
- Simulation basics [np.random.choice()]
- Using simulation for decision making
- Probability basics
- More probability concepts
- Data generating process
- eCommerce ad simulation
- Introduction to resampling methods
- Bootstrapping
- Jackknife resampling
- Permutation testing
- Advanced applications of simulation
- Monte Carlo integration
- Simulation for power analysis
- Applications in finance
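Two of the ideas above, bootstrapping and Monte Carlo integration, fit in a few lines. The course works with NumPy (np.random.choice()); the sketch below swaps in the standard library's random module instead, and the data, seed, and function names are my own assumptions.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable

def bootstrap_means(data, n_resamples=2000):
    """Resample with replacement and record the mean of each resample."""
    return [
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]

def monte_carlo_integral(func, low, high, n_points=100_000):
    """Estimate an integral as (high - low) times the average of func at random points."""
    total = sum(func(rng.uniform(low, high)) for _ in range(n_points))
    return (high - low) * total / n_points

# Invented sample of heights in centimeters.
heights = [160.0, 165.0, 170.0, 172.0, 175.0, 180.0, 168.0, 177.0]

# Percentile bootstrap: a rough 95% confidence interval for the mean.
means = sorted(bootstrap_means(heights))
ci_low = means[int(0.025 * len(means))]
ci_high = means[int(0.975 * len(means))]

# Integral of x squared from 0 to 1, whose exact value is 1/3.
estimate = monte_carlo_integral(lambda x: x ** 2, 0.0, 1.0)

print(ci_low, ci_high, estimate)
```

Both techniques trade an exact formula for repeated random sampling, so the answers wobble slightly from run to run unless the seed is fixed.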
My course #43 was Introduction to Predictive Analytics in Python. This course covers:
- Introduction and basetable structure
- Logistic regression (from sklearn import linear_model)
- Using the logistic regression model
- Variable selection
- Forward stepwise variable selection
- Deciding on the number of variables (from sklearn.model_selection import train_test_split; the older sklearn.cross_validation module was removed in scikit-learn 0.20)
- The cumulative gains curve (import scikitplot as skplt)
- The lift curve
- Guiding business to better decisions
- Predictor insight graphs
- Discretization of continuous variables (pd.qcut(), pd.cut(), check_discretize)
- Preparing the predictor insight graph table (create_pig_table)
- Plotting the predictor insight graph
I will review more classes next time.