数据科学工具课程笔记
数据科学工具课程,包含命令行,R语言,Markdown等用法
CLI (命令行 Command Line Interface)
/
= root directory~
= home directorypwd
= print working directory (current directory)clear
= clear screenls
= list stuff-a
= see all (hidden)-l
= details
cd
= change directorymkdir
= make directorytouch
= creates an empty filecp
= copycp <file> <directory>
= copy a file to a directorycp -r <directory> <newDirectory>
= copy all documents from directory to new Directory *-r
= recursive
rm
= remove-r
= remove entire directories (no undo)
mv
= movemove <file> <directory>
= move file to directorymove <fileName> <newName>
= rename file
echo
= print arguments you give/variablesdate
= print current date
GitHub 代码仓库
- Workflow
- make edits in workspace
- update index/add files
- commit to local repo
- push to remote repository
git add .
= add all new files to be trackedgit add -u
= updates tracking for files that are renamed or deletedgit add -A
= both of the above- Note:
add
is performed before committing
- Note:
git commit -m "message"
= commit the changes you want to be saved to the local copygit checkout -b branchname
= create new branchgit branch
= tells you what branch you are ongit checkout master
= move back to the master branchgit pull
= merge you changes into other branch/repo (pull request, sent to owner of the repo)git push
= commit local changes to remote (GitHub)
Markdown 文本编辑
##
= signifies secondary heading (bold big font)###
= signifies tertiary heading (slightly smaller font than secondary, not bold)*
= bullet list item
R Packages
- Primary location for R packages \(\rightarrow\) CRAN
available.packages()
= all packages availablehead(rownames(a),3)
= returns first three names of ainstall.packages("nameOfPackage")
= install single packageinstall.packages(c("nameOfPackage", "nameOfPackage", "nameOfPackage")
= install multiple package- Bioconductor Packages:
source("https://bioconductor.org/biocLite.R")
biocLite()
= install bioconductor packages
library(packagename)
= load packagesearch()
= see all functions in package after loading
Types of Data Science Questions
- in order of difficulty: Descriptive \(\rightarrow\) Exploratory \(\rightarrow\) Inferential \(\rightarrow\) Predictive \(\rightarrow\) Causal \(\rightarrow\) Mechanistic
- Descriptive analysis = describe set of data, interpret what you see (census, Google Ngram)
- Exploratory analysis = discovering connections (correlation does not = causation)
- Inferential analysis = use data conclusions from smaller population for the broader group
- Predictive analysis = use data on one object to predict values for another (if X predicts Y, does not = X cause Y)
- Causal analysis = how does changing one variable affect another, using randomized studies, Strong assumptions, golden standard for statistical analysis
- Mechanistic analysis = understand exact changes in variables in other variables, modeled by empirical equations (engineering/physics
Data
- Data = values of qualitative or quantitative variables, belonging to a set of items (usually population)
- Variables = measurement/characteristic of an item (qualitative vs quantitative)
- Data = not always structured, usually raw file, different formats
- Most important thing is question, then it is data
- Big data = now possible to collect data cheap, but not necessarily all useful (need the right data)
Experimental Design
- Formulate you question in advance
- Statistical inference = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly
- [Inference] Variability = lower variability + clearer differences = decision
- [Inference] Confounding = underlying variable might be causing the correlation (sometimes called Spurious correlation)
- dealing with confounding: fix variables, stratify (all options), randomize
- [Prediction] collection observations for different variable values, build predictive functions
- similar problems of probability/sampling and confounding variables
- [Prediction] Difficult to understand where observation is from from different distributions. (size of effects important)
- [Prediction] Positive/negative statuses: True positive, false positive, false negative, true negative
- Sensitivity = Pr(positive test | disease)
- Specificity = Pr(negative test | no disease)
- Positive Predictive Value = Pr(disease | positive test)
- Negative Predictive Value = Pr(no disease | negative test)
- Accuracy = Pr(correct outcome)
- Data dredging = use data to fit hypothesis
- Good experiments = have replication, measure variability, generalize problem, transparent
- Prediction is not inference, and be ware of data dredging