Review of Python Courses (Part 13)
Posted by Mark on December 29, 2020 at 07:48 | Last modified: February 2, 2021 11:44
In Part 12, I summarized my Datacamp courses 35-37. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #38 was Web Scraping in Python. This gets complicated with some object-oriented stuff that still throws me for a loop (no pun intended). I don’t think I will be using this anytime soon, so I skimmed it in this review:
- Web scraping with Python
- HyperText Markup Language (HTML)
- HTML tags and attributes
- Crash course X
- Off the beaten XPath
- Introduction to the scrapy Selector (from scrapy import Selector)
- “Inspecting the HTML”
- CSS locators
- Attribute and text selection
- Getting ready to crawl
- Scraping for reals
- A classy spider (from scrapy.crawler import CrawlerProcess)
- A request for service
- Move your bloomin’ parse
- Capstone
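To give a flavor of the Selector material, here is a minimal sketch; the HTML string and locators are made up for illustration:

    from scrapy import Selector

    html = '<html><body><p class="intro">Hello</p><p>World</p></body></html>'
    sel = Selector(text=html)

    # the same element reached by XPath and by CSS locator
    print(sel.xpath('//p[@class="intro"]/text()').extract_first())  # 'Hello'
    print(sel.css('p.intro::text').extract_first())                 # 'Hello'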
My course #39 was Working with the Class System in Python. Like #38, this gets thick. The course covers:
- Intro to Object Oriented Programming (OOP) in Python
- Introduction to NumPy internals
- Introduction to objects and classes
- Deep dive on classes
- __Init__ializing a class
- Methods in classes
- Working with a dataset to create dataframes
- Renaming columns and the five-figure summary
- OOP best practices
- Inheritance: is-a versus has-a
- Inheritance with DataShells
- Composition
- Wrapping up OOP
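Here is a bare-bones sketch of the class concepts; the DataShell name comes from the course, but the attributes and methods are my own simplification:

    import numpy as np

    class DataShell:
        def __init__(self, array):          # __init__ializing a class
            self.array = np.asarray(array)

        def five_figure_summary(self):      # a method defined in the class
            return np.percentile(self.array, [0, 25, 50, 75, 100])

    class DataShellChild(DataShell):        # inheritance: an is-a relationship
        pass

    shell = DataShellChild([1, 2, 3, 4, 5])
    print(shell.five_figure_summary())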
My course #40 was Sentiment Analysis in Python. This course covers:
- What is sentiment analysis?
- Sentiment analysis types and approaches (from textblob import TextBlob)
- Let’s build a word cloud (from wordcloud import WordCloud)!
- Bag-of-words (from sklearn.feature_extraction.text import CountVectorizer)
- Getting granular with n-grams
- Build new features from text (from nltk import word_tokenize)
- Can you guess the language (from langdetect import detect_langs)?
- Stop words (from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS)
- Capturing a token pattern [.isalpha(), .isdigit(), .isalnum()]
- Stemming and lemmatization (from nltk.stem import PorterStemmer, WordNetLemmatizer)
- TfIdf: more ways to transform text (from sklearn.feature_extraction.text import TfidfVectorizer)
- Let’s predict the sentiment (from sklearn.linear_model import LogisticRegression)!
- Did we really predict the sentiment well (from sklearn.metrics import accuracy_score, confusion_matrix)?
- Logistic regression: revisited
- Bringing it all together
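As a rough sketch of the bag-of-words-to-logistic-regression flow, with made-up reviews standing in for a real corpus:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    reviews = ["loved this movie", "terrible plot and acting",
               "great fun", "awful, boring film"]
    labels = [1, 0, 1, 0]                          # 1 = positive, 0 = negative

    X = CountVectorizer().fit_transform(reviews)   # bag-of-words features
    clf = LogisticRegression().fit(X, labels)
    print(accuracy_score(labels, clf.predict(X)))  # in-sample accuracy only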
I will review more classes next time.
Review of Python Courses (Part 12)
Posted by Mark on December 24, 2020 at 07:39 | Last modified: February 1, 2021 15:43
In Part 11, I summarized my Datacamp courses 31-34. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #35 was Intermediate Data Visualization with Seaborn. This course covers:
- Introduction to Seaborn [histogram vs. sns.distplot()]
- Using the distribution plot
- Regression plots in Seaborn [sns.regplot(), sns.lmplot()]
- Using Seaborn styles [sns.set_style(), sns.despine()]
- Colors in Seaborn
- Customizing with matplotlib (using Axes)
- Categorical plot types
- Regression plots [sns.regplot(), sns.residplot()]
- Matrix plots [sns.heatmap(pd.crosstab())]
- Using FacetGrid, factorplot, lmplot
- Using PairGrid and pairplot
- Using JointGrid and jointplot
- Selecting Seaborn plots
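A minimal sketch of the regression-plot and styling material; it uses seaborn's example tips dataset, which load_dataset() fetches:

    import matplotlib.pyplot as plt
    import seaborn as sns

    tips = sns.load_dataset("tips")
    sns.set_style("whitegrid")
    ax = sns.regplot(x="total_bill", y="tip", data=tips)
    sns.despine()
    plt.show()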
My course #36 was Introduction to Data Visualization with Seaborn (taking #35 before this was an oversight on my part, but everything ended up okay). This course covers:
- Introduction to Seaborn
- Using pandas with Seaborn
- Adding a third variable with hue
- Introduction to relational plots and subplots
- Customizing scatter plots
- Introduction to line plots
- Count plots and bar plots [sns.catplot()]
- Creating a box plot
- Point plots
- Changing plot style and color
- Adding titles and labels (FacetGrid vs. AxesSubplot)
My course #37 was Unsupervised Learning in Python. This course covers:
- Unsupervised learning (from sklearn.cluster import KMeans)
- Evaluating a clustering
- Transforming features for better clustering (from sklearn.preprocessing import StandardScaler)
- Visualizing hierarchies (from scipy.cluster.hierarchy import linkage, dendrogram)
- Cluster labels in hierarchical clustering
- t-SNE for 2-dimensional maps (from sklearn.manifold import TSNE)
- Visualizing the PCA transformation (from sklearn.decomposition import PCA)
- Intrinsic dimension
- Dimension reduction with PCA (from sklearn.decomposition import TruncatedSVD)
- Non-negative matrix factorization (NMF) (from sklearn.decomposition import NMF)
- NMF learns interpretable parts
- Building recommender systems using NMF (from sklearn.preprocessing import normalize)
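A small sketch of scaling features before k-means, with invented sample points:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    points = np.array([[1.0, 200.0], [1.2, 210.0], [8.0, 900.0], [8.3, 880.0]])
    pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10))
    labels = pipeline.fit_predict(points)
    print(labels)        # e.g. [0 0 1 1]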
I will review more classes next time.
Review of Python Courses (Part 11)
Posted by Mark on December 21, 2020 at 07:41 | Last modified: February 1, 2021 11:34
In Part 10, I summarized my Datacamp courses 28-30. Today I will continue with the next four.
As a reminder, I introduced you to my recent work learning Python here.
My course #31 was Customer Analytics and A/B Testing in Python. This course covers:
- What is A/B testing?
- Identifying and understanding KPIs
- Exploratory analysis of KPIs
- Calculating KPIs—a practical example
- Working with time series data in pandas
- Creating time series graphs with matplotlib
- Understanding and visualizing trends in customer data
- Events and releases
- Introduction to A/B testing
- Initial A/B test design
- Preparing to run an A/B test
- Calculating sample size
- Analyzing the A/B test results
- Understanding statistical significance (get_pvalue, get_ci)
- Interpreting your test results
My course #32 was Machine Learning with Tree-Based Models in Python. This course covers:
- Decision-tree for classification (from sklearn.tree import DecisionTreeClassifier)
- Classification-tree learning
- Decision-tree for regression
- Generalization error (bias-variance tradeoff)
- Diagnosing bias and variance problems
- Ensemble learning
- Bagging (from sklearn.ensemble import BaggingClassifier)
- Out of bag evaluation
- Random forests
- AdaBoost (from sklearn.ensemble import AdaBoostClassifier)
- Gradient boosting (from sklearn.ensemble import GradientBoostingRegressor)
- Stochastic gradient boosting
- Tuning a CART’s hyperparameters
- Tuning an RF’s hyperparameters
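A minimal classification-tree sketch using scikit-learn's built-in iris data in place of the course's datasets:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    dt = DecisionTreeClassifier(max_depth=3, random_state=1)
    dt.fit(X_train, y_train)
    print(dt.score(X_test, y_test))   # test-set accuracy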
My course #33 was Introduction to PySpark. This is a data engineering course, a field I found I am not very enthusiastic about. This course covers:
- What is Spark, anyway?
- Using Spark in Python
- Using dataframes
- Joining
- Machine learning pipelines
- Data types
- Strings and factors
My course #34 was Cleaning Data with PySpark. This course covers:
- Intro to data cleaning with Apache Spark
- Immutability and lazy processing
- Understanding Parquet
- Dataframe column operations
- Conditional dataframe column operations
- User defined functions
- Partitioning and lazy processing
- Caching
- Improve import performance
- Cluster sizing tips
- Performance improvements
- Introduction to data pipelines
- Data handling techniques
- Data validation
- Final analysis and delivery
I will review more classes next time.
Review of Python Courses (Part 10)
Posted by Mark on December 18, 2020 at 07:25 | Last modified: January 29, 2021 14:36
In Part 9, I summarized my Datacamp courses 25-27. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #28 was Supervised Learning with scikit-learn. This course covers:
- Supervised learning
- Exploratory data analysis [pd.plotting.scatter_matrix()]
- The classification challenge (creating arrays, from sklearn.neighbors import KNeighborsClassifier)
- Measuring model performance (from sklearn.model_selection import train_test_split, datasets)
- Introduction to regression (from sklearn.linear_model import LinearRegression)
- The basics of linear regression
- Cross-validation (from sklearn.model_selection import cross_val_score)
- Correlation
- Simple regression (from scipy.stats import linregress) and its limits
- Regularized regression (from sklearn.linear_model import Ridge, Lasso)
- How good is your model (from sklearn.metrics import classification_report, confusion_matrix)?
- Logistic regression and the ROC curve (from sklearn.metrics import roc_curve)
- Area under the ROC curve
- Hyperparameter tuning (from sklearn.model_selection import GridSearchCV)
- Hold-out set for final evaluation
- Preprocessing data [pd.get_dummies(df)]
- Handling missing data (from sklearn.preprocessing import Imputer, from sklearn.pipeline import Pipeline)
- Centering and scaling (from sklearn.preprocessing import scale, StandardScaler)
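A short sketch of the classification workflow; scikit-learn's built-in digits data stands in for the course's datasets:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_digits(return_X_y=True)
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    print(cross_val_score(pipe, X, y, cv=5).mean())   # cross-validated accuracy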
My course #29 was Introduction to Natural Language Processing in Python. This course covers:
- Introduction to regular expressions
- Introduction to tokenization (from nltk.tokenize import word_tokenize, sent_tokenize)
- Advanced tokenization with regex
- Charting word length with nltk
- Word counts with bag-of-words (from collections import Counter)
- Simple text preprocessing (from nltk.corpus import stopwords, from nltk.stem import WordNetLemmatizer)
- Introduction to gensim (from gensim.corpora.dictionary import Dictionary)
- Tf-idf with gensim (from gensim.models.tfidfmodel import TfidfModel)
- Named entity recognition
- Introduction to SpaCy
- Multilingual NER with polyglot (from polyglot.text import Text)
- Classifying fake news using supervised learning with NLP
- Building word count vectors (from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer)
- Training and testing a classification model with scikit-learn (from sklearn.naive_bayes import MultinomialNB)
- Simple NLP, complex problems
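A tiny bag-of-words sketch with nltk and Counter; the sentence is made up, and nltk's 'punkt' tokenizer models must be downloaded once:

    from collections import Counter
    from nltk.tokenize import word_tokenize
    # import nltk; nltk.download('punkt')   # one-time download, if needed

    text = "The cat sat on the mat. The cat slept."
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    print(Counter(tokens).most_common(3))    # [('the', 3), ('cat', 2), ...]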
My course #30 was Building Chatbots in Python. This course covers:
- Introduction to conversational software (respond function, sleep function from time module)
- Creating a personality
- Text processing with regular expressions
- Understanding intents and entities (re.compile)
- Word vectors
- Intents and classification (from sklearn.svm import SVC)
- Entity extraction
- Robust NLU with Rasa (from rasa_nlu.converters import load_data)
- Virtual assistants and accessing data
- Exploring a DB with natural language
- Incremental slot filling and negation
- Stateful bots
- Asking questions and queuing answers
- Frontiers of dialog technology
I will review more classes next time.
Review of Python Courses (Part 9)
Posted by Mark on December 15, 2020 at 07:23 | Last modified: January 28, 2021 10:21
In Part 8, I summarized my Datacamp courses 22-24. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #25 was Exploratory Data Analysis in Python (Part 2). This course covers:
- Dataframes and series
- Clean and validate (inplace arg)
- Filter and visualize
- Probability mass functions
- Cumulative distribution functions (probability < x)
- Comparing and modeling distributions
- Exploring (scatter plot: transparency, marker size, jittering, zoom) and visualizing relationships (violin, box plot)
- Correlation
- Simple regression (from scipy.stats import linregress) and its limits
- Multiple regression
- Visualizing regression results
- Logistic regression
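A minimal simple-regression sketch with scipy; x and y are invented data:

    import numpy as np
    from scipy.stats import linregress

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
    result = linregress(x, y)
    print(result.slope, result.intercept, result.rvalue ** 2)   # fit and R-squared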
My course #26 was Regular Expressions in Python. Once into regex, this material gets very complex yet very powerful:
- Introduction to string manipulation
- String operations (selecting portions of a particular word)
- Finding and replacing
- Positional formatting (method to format percentages)
- Formatted string literal (escape sequences)
- Template method (from string import Template)
- Introduction to regular expressions
- Repetitions
- Regex metacharacters
- Greedy vs. non-greedy matching
- Alternation and non-capturing groups
- Backreferences
- Lookaround
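Greedy vs. non-greedy matching and a backreference in a few lines, with invented sample strings:

    import re

    html = "<b>bold</b> and <i>italic</i>"
    print(re.findall(r"<.+>", html))     # greedy: one match spanning the string
    print(re.findall(r"<.+?>", html))    # non-greedy: each tag separately

    # backreference: find a repeated word
    print(re.search(r"(\w+) \1", "the the quick fox").group())   # 'the the'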
My course #27 was Introduction to Deep Learning in Python. This course covers:
- Introduction to deep learning
- Forward propagation
- Activation functions
- Deeper networks
- The need for optimization
- Gradient descent
- Backpropagation [in practice]
- Creating a Keras model
- Compiling and fitting a model
- Classification models
- Using models
- Understanding model optimization
- Model validation
- Thinking about model capacity
- Stepping up to images
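A skeleton of a Keras classification model roughly in the spirit of the course; the layer sizes and input shape are placeholders, and newer installs import Keras through TensorFlow:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    model.add(Dense(32, activation='relu', input_shape=(10,)))
    model.add(Dense(2, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # model.fit(X_train, y_train, validation_split=0.2, epochs=10)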
I will review more classes next time.
Review of Python Courses (Part 8)
Posted by Mark on December 10, 2020 at 07:34 | Last modified: January 26, 2021 11:22
In Part 7, I summarized my Datacamp courses 19-21. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #22 was Statistical Thinking in Python (Part 2). This course covers:
- Optimal parameters [statistical inference using scipy.stats, statsmodels, or hacker stats with numpy; plt.margins() ]
- Linear regression by least squares [slope, intercept = np.polyfit() ]
- The importance of exploratory data analysis: Anscombe’s quartet (generating and plotting line of best fit)
- Generating bootstrap replicates [ecdf() written in my course (prequel) #14]
- Bootstrap confidence intervals
- Pairs bootstrap
- Formulating and simulating a hypothesis (permutation sample)
- Test statistics and p-values (permutation replicate)
- Bootstrap hypothesis tests
- A/B testing
- Test of correlation
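One way to draw bootstrap replicates with numpy "hacker stats"; the data array is invented, and the course wrapped this logic in helper functions:

    import numpy as np

    rng = np.random.default_rng(42)
    data = np.array([2.3, 1.9, 3.1, 2.8, 2.2, 3.4])

    bs_replicates = np.array(
        [rng.choice(data, size=len(data)).mean() for _ in range(10000)])
    print(np.percentile(bs_replicates, [2.5, 97.5]))   # 95% confidence interval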
My course #23 was Introduction to Financial Concepts in Python. This course covers:
- Fundamental financial concepts (calculating return on investment and compound interest)
- Present and future value [np.pv(), np.fv() ]
- Net present value and cash flows [np.npv(rate= , values=np.array([]) ) ]
- Common profitability analysis methods [np.npv(), np.irr(np.array([]) ) ]
- Weighted average cost of capital
- Comparing two projects of different life spans (EAA)
- Mortgage basics [np.pmt(rate, nper, pv) ]
- Amortization, principal, and interest (simulating periodic mortgage payments)
- Home ownership, equity, and forecasting (cumulative operations in numpy)
- Budgeting project proposal [constant cumulative growth with np.repeat(), calculating monthly expenses]
- Net worth and valuation in your personal financial life
- The power of time and compound interest
My course #24 was Introduction to Portfolio Risk Management in Python. This course covers:
- Financial returns
- Mean, variance, and normal distributions (scaling volatility)
- Skewness and kurtosis (from scipy.stats import skew, kurtosis, Shapiro-Wilk test)
- Portfolio composition (calculating market-cap weights)
- Correlation and covariance (calculating portfolio volatility)
- Markowitz portfolios (MSR and GMV)
- The capital asset pricing model (calculating Beta)
- Alpha and multi-factor models (Fama-French 3-factor model)
- Expanding the 3-factor model (Fama-French 5-factor model)
- Estimating tail risk (historical drawdown, historical/conditional VaR)
- VaR extensions
- Random walks (Monte Carlo simulations)
>
Review of Python Courses (Part 7)
Posted by Mark on December 7, 2020 at 07:19 | Last modified: January 25, 2021 11:26
In Part 6, I summarized my Datacamp courses 16-18. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #19 was Manipulating DataFrames with pandas. This course covers:
- Indexing DataFrames (using square brackets, using .loc, using .iloc, selecting certain columns with [[ ]] )
- Slicing DataFrames (R boundary included with .loc but not .iloc, slicing with one/two brackets gets Series/df)
- Filtering DataFrames
- Transforming DataFrames [vectorized computations in numpy without loops, .map() for index, .apply() for Series]
- Indexed objects and labeled data (name attribute for index and columns attributes)
- Hierarchical indexing (sorting MultiIndex)
- Pivoting DataFrames [.pivot(index= , columns= , values= ) ]
- Stacking and unstacking DataFrames (pivoting doesn’t work well on MultiIndex so unstack to move index to column)
- Melting DataFrames [reverses .pivot() ]
- Pivot tables
- Categoricals and groupby
- Groupby and aggregation/transformation
- Iterating over and filtering groupby object
- Understanding the column labels
- .idxmax() and .idxmin() (row/column label where max/min value located)
- .T attribute (transposes numpy array)
- Reshaping DataFrames for visualization
- Making a histogram (bins, range, normalizing)
My twentieth course was Manipulating Time Series Data in Python. This course has lots of good information for backtesting:
- How to use dates and times with pandas ( [sequences of] timestamp and period objects)
- Indexing and resampling time series [selecting missing ‘price’ values, .asfreq() ]
- Lags, changes, and returns for stock price series [.shift(), n-period % chg, .diff(), .pct_change(), stock price chg in df]
- Compare time series growth rates ( .iloc as abs ref, normalizing series, concat prices and .dropna, perf vs. benchmark)
- Changing the time series frequency: resampling
- Upsampling and interpolation with .resample()
- Downsampling and aggregation (plotting resample data with ax)
- Rolling window functions with pandas (plotting price and moving average, plotting multiple rolling metrics)
- Expanding window functions with pandas (calculating running return, running rate of return)
- Relationships between time series: correlation
- Select index components and import data
- Build a market-cap weighted index
- Evaluate index performance
- Index correlation and exporting to Excel
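A sketch of lags, returns, and resampling on a toy price series; the dates and prices are invented:

    import pandas as pd

    idx = pd.date_range("2020-01-01", periods=6, freq="D")
    prices = pd.Series([100, 101, 103, 102, 105, 107.0], index=idx)

    print(prices.pct_change())             # daily returns
    print(prices.shift(1))                 # lagged prices
    print(prices.resample("2D").last())    # downsample to every other day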
My course #21 was Working with Dates and Times in Python. This course covers:
- Dates in Python
- Math with dates (time delta)
- Turning dates into strings
- Adding time to the mix
- Printing and parsing datetimes (no time printed from datetime object)
- Working with durations
- UTC offsets
- Time zone database (from dateutil import tz)
- Starting Daylight Saving Time
- Ending Daylight Saving Time [ .datetime_ambiguous() and .enfold() for ambiguous times]
- Reading date and time data in Pandas (loading datetimes with parse_dates [or manually with .to_datetime() ])
- Summarizing datetime data in Pandas (alternative to for loop)
- Additional datetime methods in Pandas
- Index correlation and exporting to Excel
Review of Python Courses (Part 6)
Posted by Mark on December 4, 2020 at 06:53 | Last modified: January 21, 2021 13:14
In Part 5, I summarized my Datacamp courses 13-15. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My course #16 was Introduction to Data Science in Python. This course covers:
- Creating variables
- What is a function?
- What is pandas?
- Selecting columns
- Select rows with logic
- Creating line plots
- Adding labels and legends
- Adding some style (line color, width, style, markers, template)
- Making a scatter plot (marker transparency)
- Making a bar chart (horizontal, error bars, stacked)
- Making a histogram (bins, range, normalizing)
My course #17 was Joining Data with Pandas. This course covers:
- Inner join (changing df values with .loc accessor)
- One to many relationships
- Merging multiple DataFrames
- Left join (count number of rows in a column with missing data)
- Right and outer joins
- Merging a table to itself (i.e. self join)
- Merging on indexes
- Filtering joins (semi-joins, anti-joins)
- Concatenate DataFrames together vertically [.append()]
- verify_integrity=True identifies accidental duplicates while validate arg helps to identify relationship type
- Using merge_ordered() (for ordered/time-series data and to fill in missing values)
- Using .merge_asof() (matches on nearest-value rather than equal-value columns)
- Selecting data with .query()
- Reshaping data with .melt()
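A small sketch of an ordered merge and an as-of merge; both tables are invented:

    import pandas as pd

    a = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-01-03"]),
                      "gdp": [100, 102]})
    b = pd.DataFrame({"date": pd.to_datetime(["2020-01-02", "2020-01-04"]),
                      "rate": [1.5, 1.7]})

    print(pd.merge_ordered(a, b, on="date", fill_method="ffill"))
    print(pd.merge_asof(b, a, on="date"))   # matches nearest earlier date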
Introduction to Linear Modeling in Python was my eighteenth course. This covers:
- Introductory concepts about models (interpolation, extrapolation)
- Visualizing linear relationships [object-oriented (OOP) approach to matplotlib]
- Quantifying linear relationships (covariance, correlation, normalization)
- What makes a model linear (Taylor series, overfitting, defining function to plot graph)
- Interpreting slope and intercept
- Model optimization (RSS: sum of squared residuals)
- Least-squares optimization (by numpy, Scipy, Statsmodels)
- Modeling real data
- The limits of prediction
- Goodness of fit (deviations, residuals, and R-squared in code)
- Standard error (RMSE measures spread of residuals whereas SE measures uncertainty in model params)
- Inferential statistics concepts
- Model estimation and likelihood
- Model uncertainty and sample distributions (bootstrap in code)
- Model errors and randomness
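Least-squares with numpy and the residual sum of squares computed by hand; x and y are invented data:

    import numpy as np

    x = np.linspace(0, 10, 20)
    y = 3.0 * x + 2.0 + np.random.default_rng(0).normal(0, 1, size=x.size)

    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    print(slope, intercept, np.sum(residuals ** 2))   # RSS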
Review of Python Courses (Part 5)
Posted by Mark on November 20, 2020 at 07:23 | Last modified: January 20, 2021 06:41
In Part 4, I summarized my Datacamp courses 10-12. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
My thirteenth course was Pandas Foundations. This course covers:
- Review of pandas DataFrames
- Building DataFrames from scratch
- Importing and exporting data [pd.read_csv() args: header, names, na_values, parse_dates]
- Plotting (arrays, series, DataFrames) with pandas
- Visual exploratory data analysis (line, scatter, box plots, histogram, and different plotting idioms)
- Statistical exploratory data analysis
- Separating populations
- Indexing time series (creating and using a Datetime index)
- Resampling time series data
- Manipulating time series data
- Time series visualization (pandas, not matplotlib)
- Reading and cleaning the data (cleaning and tidying datetime data)
Class #14 for me was Statistical Thinking in Python (Part 1). This course covers:
- Introduction to exploratory data analysis
- Plotting a histogram
- Plot all of your data: Bee swarm plots [sns.swarmplot()]
- Plot all of your data: Empirical Cumulative Distribution Functions (ECDF)
- Introduction to summary statistics: the sample mean and median
- Percentiles, outliers, and box plots
- Variance and standard deviation
- Covariance and the Pearson correlation coefficient
- Probabilistic logic and statistical inference
- Random number generators and hacker statistics
- Probability distributions and stories: the Binomial distribution (binomial PMF and CDF)
- Poisson processes and the Poisson distribution
- Probability density functions
- Introduction to the Normal distribution
- The Normal distribution: properties and warnings
- The Exponential distribution
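The ecdf() helper referred to elsewhere in this series looks roughly like this (my reconstruction, not the course's exact code):

    import numpy as np
    import matplotlib.pyplot as plt

    def ecdf(data):
        """Return x, y for an empirical cumulative distribution function."""
        x = np.sort(data)
        y = np.arange(1, len(data) + 1) / len(data)
        return x, y

    x, y = ecdf(np.random.default_rng(0).normal(size=100))
    plt.plot(x, y, marker='.', linestyle='none')
    plt.show()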
My fifteenth course was Introduction to Data Visualization with Matplotlib. This course covers:
- Adding data to axes
- Customizing your plots (adding markers, setting linestyle, color, axis labels)
- Small multiples with plt.subplots
- Plotting time-series data (using fig/ax, zooming in on datetime range)
- Plotting time-series with different variables (using twin axes, coloring vars and ticks, all-encompassing function)
- Annotating time-series data
- Quantitative comparisons: bar charts (stacking, adding legend, color)
- Quantitative comparisons: histograms
- Statistical plotting
- Quantitative comparisons: scatter plots (encoding time by color)
- Preparing your figures to share with others (choosing plot style)
- Sharing your visualizations with others [fig.savefig()]
- Automating figures from data
Review of Python Courses (Part 4)
Posted by Mark on November 12, 2020 at 07:08 | Last modified: January 19, 2021 10:18
In Part 3, I summarized the 6th through 9th Datacamp courses I took. Today I will continue with the next three.
As a reminder, I introduced you to my recent work learning Python here.
Course #10 was Introduction to Databases in Python. This course covers:
- Databases consist of tables
- Connecting to your database
- Introduction to SQL queries
- Filtering and targeting data
- Ordering query results
- Counting, summing, and grouping data
- SQLAlchemy and pandas for visualization
- Calculating values in a query
- SQL relationships
- Working with hierarchical tables
- Handling large ResultSets
- Creating databases and tables
- Inserting and updating data into a table
- Deleting data from a database
- Census case study
- Populating and querying the database
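A minimal SQLAlchemy flow (connect, reflect a table, run a filtered query) written in the newer 1.4+ style; the SQLite file and column names are placeholders:

    from sqlalchemy import MetaData, Table, create_engine, select

    engine = create_engine("sqlite:///census.sqlite")
    census = Table("census", MetaData(), autoload_with=engine)

    stmt = select(census).where(census.c.state == "New York")
    with engine.connect() as conn:
        results = conn.execute(stmt).fetchall()
    print(len(results))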
Course #11 was Introduction to Statistics in Python. This course covers:
- What is statistics?
- Measures of center
- Measures of spread and outliers (from scipy.stats import iqr)
- What are the chances [probability and dataframe .sample() method from random module]?
- Discrete and continuous distributions
- Generating random numbers according to uniform distribution (from scipy.stats import uniform)
- Computing cumulative distribution functions
- Binomial distribution (from scipy.stats import binom)
- Normal distribution (from scipy.stats import norm)
- Central limit theorem
- Poisson distribution (from scipy.stats import poisson)
- Exponential distribution (from scipy.stats import expon)
- Student’s t-distribution
- Log-normal distribution
- Correlation (and caveats)
- Experimental design and confounders
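A few of the distribution calls covered here, with arbitrary parameters:

    from scipy.stats import binom, norm

    print(binom.pmf(k=3, n=10, p=0.5))        # P(exactly 3 successes in 10 trials)
    print(norm.cdf(x=1.96, loc=0, scale=1))   # ~0.975
    print(norm.rvs(loc=0, scale=1, size=3))   # random draws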
My twelfth course was Introduction to Data Visualization in Python. This course covers:
- Plotting multiple graphs
- Customizing axes
- Legends, annotations, and styles
- Working with 2-D arrays and meshgrid
- Visualizing bivariate functions (color bar, color map, axis tight, and contour plots)
- Visualizing bivariate distributions (rectangular and hexagonal binning)
- Working with images
- Visualizing regressions [sns.lmplot(), hue, col, sns.residplot()]
- Visualizing univariate distributions [sns.stripplot(), sns.swarmplot(), sns.violinplot()]
- Visualizing multivariate distributions [sns.jointplot(), kde, sns.pairplot(), hue, covariance sns.heatplot()]
- Visualizing time series (formatting datetime index)
- Time series with moving windows
- Histogram equalization in images