Technological advances in data gathering and storage have led to a rapid proliferation of large amounts of data in diverse areas such as climate studies, cosmology, medicine, Web data processing, and engineering. Making sense of this data deluge requires a set of skills that have become fundamental in any major corporation and in almost any scientific discipline.
From web scraping and data wrangling to advanced topics like machine learning and deep learning, in this course you will learn a variety of skills by working on real examples.
- Gather and organize. Learn how to use Python to gather and organize data programmatically and prepare it for deeper analysis. Key technologies: Beautiful Soup, Pandas.
- Analyze. Discover patterns and trends lurking in the data and extract conclusions. Key technologies: NumPy, SciPy, Pandas.
- Model. From linear regression to deep learning, learn to model complex phenomena and use data to automate decisions. Key technologies: scikit-learn, Keras, PyTorch, TensorFlow.
- Report. Display information visually and communicate your findings in a way that is clear and compelling. Software packages: Matplotlib, Jupyter, Bokeh.
Fabian Pedregosa, <[email protected]>, Postdoctoral Researcher, UC Berkeley.
Laurent El Ghaoui, Professor, UC Berkeley.
Teaching assistant: Bowen Yin Wang
Invited speaker (September 26th): Nelle Varoquaux, Postdoctoral Researcher, Berkeley Institute for Data Science.
Office Hours: Office hours are available for students who need further clarification of concepts presented in lecture, or have made solid attempts on the homework assignment or other practice problems and require further assistance understanding how to approach such problems.
Office hours are usually on Fridays from 15h to 17h. The Google Calendar below always has the latest information.
This course only requires undergraduate-level knowledge of statistics, linear algebra and calculus. I will assume a basic understanding of the Python language.
Students are required to bring a laptop to the lab sessions. Please come with a Python distribution such as Anaconda installed to minimize setup time. If you would like to attend but don't have access to computing resources, contact me and we will figure something out.
The course is organized in sessions of 3 hours, from 14h to 17h. Each session is split into 1h of theory (presentation and whiteboard) and 1h45 of lab practice, with a 15min break in between. The course will take place in Sutardja Dai Hall, room 250.
Pioneers of data science. Introduction to regression models. Dimensionality and structured models. Model selection and bias-variance tradeoff. Classification.
Practical session: The Jupyter (formerly IPython) interactive environment. NumPy, Python's array computing library. As material we will use chapters 1 and 2 of Jake VanderPlas' excellent Python Data Science Handbook.
Analysis of a dataset. Work in pairs (pair programming). Lecture material
This session will be given by invited speaker Nelle Varoquaux. slides
Permutation tests. Logistic regression. Slides: part 1, part 2
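As a rough illustration of the first topic, a permutation test for a difference in means can be sketched as follows. The function name `permutation_test`, its parameters, and the two groups of measurements are hypothetical and are not taken from the course slides.

```python
import random
import statistics


def permutation_test(x, y, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(x) - statistics.fmean(y)
    pooled = list(x) + list(y)
    n_x = len(x)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up measurements for two groups.
group_a = [12.1, 11.8, 12.6, 13.0, 12.4]
group_b = [10.2, 10.9, 10.5, 11.1, 10.7]

observed, p_value = permutation_test(group_a, group_b)
print(f"Observed difference in means: {observed:.2f}")
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value means that few random relabelings produce a difference as large as the observed one.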
In this session, students will present their first assignment: a 15-min presentation (10 min presentation + 5 min questions) on a project of their choice.
Supervised learning models. Overfitting. Model selection. Class material.
Clustering, dimensionality reduction, feature extraction. Class material
Deep convolutional networks. Pretrained networks. Library used: Keras. Class material
Working with text and time series. Class material
Tutorial on Generative Adversarial Networks (GANs)
We will contribute to scikit-learn by fixing issues and improving the code.
Final presentation of projects.
<iframe src="https://calendar.google.com/calendar/embed?src=6ihedkadh888fr6rch80hq8j44%40group.calendar.google.com&ctz=America/Los_Angeles" style="border: 0" width="800" height="600" frameborder="0" scrolling="no"></iframe>
This course has three important rules. If you choose to follow these rules, your odds of learning the material and earning a good grade in this class will improve greatly.
- Work. To succeed in this class, you must choose to do your very best on all your assignments. See the course Assignments for additional information on completing assignments.
- Participate actively. To succeed in this class, you must choose to stay focused and involved, offering your best comments, questions, and answers. This is a seminar class, not a lecture class – active discussion is expected of all students.
- Respect. You will be exposed to a variety of viewpoints, values and opinions in college that will differ from your own. All students in this class should feel comfortable expressing their viewpoints and concerns in class. You are an important part of creating an atmosphere that makes this possible. This applies to me too!
What you can expect from me:
- Attend every class period and arrive to class on time.
- Provide access to quality learning material adapted to the level and background of all students.
- Use a variety of teaching techniques and modalities to accommodate different learning styles.
- Return written assignments in class and online in a timely fashion and provide helpful feedback.
- Come to class with a positive and friendly attitude.
- Be respectful of your ideas and value the diversity you bring to the classroom.
- Be open to dialogue that challenges me.
- Answer any appropriate questions you may have.
- Be present during my stated office hours.
Ani Adhikari and John DeNero, Inferential thinking
Jake VanderPlas, Python Data Science Handbook
Trevor Hastie and Robert Tibshirani, Statistical Learning, Stanford MOOC on data science and machine learning. I will be reusing some of their slides.
Joel Grus, Data science from scratch: First principles with Python, O'Reilly Media, 2015.
Wes McKinney, Python for data analysis, O'Reilly Media, 2013.
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, An Introduction to Statistical Learning
Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
Andreas Müller and Sarah Guido, Introduction to Machine Learning with Python: A Guide for Data Scientists.