GSoC_2017_detox
...continuing from last year
- Heiko (github: karlnapf, IRC: HeikoS)
- Viktor (github: vigsterkr, IRC: wiking)
- Rahul (github: lambday, IRC: lambday)
- Pan (github: OXPHOS, IRC: OXPHOS)
- Sanuj (github: sanuj, IRC: sanuj)
Medium. But requires a lot of initiative and willingness to dive into existing code (that is not pretty).
You need to know
- Shogun's internals (to an extent)
- C++
- Software engineering principles
Every line of code in SHOGUN has a long history and has gone through many brains and hands. This made SHOGUN what it is today: a powerful toolbox with a lot of features. But most of the code has been written by researchers for their studies. Usually the focus was on "getting things done", proving awesome ideas, and optimizing them "as fast as possible".
As a drawback, people didn't care too much about software engineering aspects. In addition, lots of new technologies have shown up since some parts of the code were written, which allow us to do even cooler things with less code now.
We want this project to improve maintainability, stability, and beauty:
- Making heavily used base classes more lightweight to improve performance and memory consumption.
- Use new and cool technologies
- New language features (think of C++1x)
- and more
The target group of this project is people with a C/C++ background, an idea of "good" software engineering, and an interest in reliable software. In return, we offer that you'll learn a lot about basic machine learning algorithms; of course there are some low-hanging fruits, but if you're an advanced hacker, we have a lot of great ideas for how to push the project forward.
GSoC is a marathon, not a sprint. We expect "good" performance over the whole project and expect you to stay in contact with us. Get on board, commit to contributing actively, and we promise to bring you up to speed with the magic internals that are hidden in SHOGUN. :)
Here are some sub-projects. We are open to more:
NOTE: A GSoC project will address multiple (or ideally all) of those topics.
Sanuj did a great job in GSoC 2016 writing a new parameter framework. We are working on integrating it, and this needs some more effort: getting rid of old code, fully integrating the new code, tying things together with the rest of the framework, and moving towards a plugin architecture. Interesting topic!
Key points
- Replace all SG_ADD with tags registration.
- Make sure Shogun still works afterwards
- Once SG_ADD is removed, serialization and equals will stop working, but they need to work, see serialization below.
Initial work:
- Read the docs on tags
- Register member variables in a selected class in tag (without removing the SG_ADD yet)
- Think about automating the refactoring.
Once tags are (more or less) integrated, serialization and equals will be next.
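To make the idea of tag-based registration more concrete, here is a minimal sketch. It does not reproduce Shogun's actual tags API: `Tag`, `ParameterRegistry`, `watch`, and `ToyMachine` are hypothetical names chosen for illustration. The point is only that a class registers its members once, by typed name, and generic code can then get and set them with type checking.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Hypothetical typed tag: a parameter name that carries its value type.
template <typename T>
struct Tag
{
    explicit Tag(std::string n) : name(std::move(n)) {}
    std::string name;
};

// Hypothetical registry: maps a name to a member address plus its type.
class ParameterRegistry
{
public:
    template <typename T>
    void watch(const Tag<T>& tag, T* member)
    {
        m_entries.erase(tag.name);
        m_entries.emplace(tag.name, Entry{member, typeid(T)});
    }

    template <typename T>
    void set(const Tag<T>& tag, const T& value)
    {
        *lookup(tag) = value;
    }

    template <typename T>
    const T& get(const Tag<T>& tag) const
    {
        return *lookup(tag);
    }

private:
    struct Entry
    {
        void* address;
        std::type_index type;
    };

    // Finds the member behind a tag and checks the requested type.
    template <typename T>
    T* lookup(const Tag<T>& tag) const
    {
        auto it = m_entries.find(tag.name);
        if (it == m_entries.end())
            throw std::invalid_argument("unknown parameter: " + tag.name);
        if (it->second.type != std::type_index(typeid(T)))
            throw std::invalid_argument("type mismatch for: " + tag.name);
        return static_cast<T*>(it->second.address);
    }

    std::map<std::string, Entry> m_entries;
};

// Hypothetical class that registers its members in the constructor,
// which is roughly the role SG_ADD plays today.
class ToyMachine
{
public:
    ToyMachine() : m_C(1.0), m_max_iter(100)
    {
        m_params.watch(Tag<double>("C"), &m_C);
        m_params.watch(Tag<int>("max_iter"), &m_max_iter);
    }
    // The registry stores raw member addresses, so copying is disabled.
    ToyMachine(const ToyMachine&) = delete;
    ToyMachine& operator=(const ToyMachine&) = delete;

    template <typename T>
    void set(const Tag<T>& tag, const T& value) { m_params.set(tag, value); }

    template <typename T>
    const T& get(const Tag<T>& tag) const { return m_params.get(tag); }

private:
    double m_C;
    int m_max_iter;
    ParameterRegistry m_params;
};
```

With this sketch, `machine.set(Tag<double>("C"), 2.5)` updates the member, while asking for `Tag<int>("C")` throws instead of silently reinterpreting memory.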
Last year, Pan implemented a new cerealisation framework, which needs some love.
And the old one needs to die! Deep-copy of objects? Checking equality? Dump objects to disk and get 'em back? All of that used to be done in here. Thousands of lines of code, countless switch-case statements, and more special per-class and per-data-type code than we want to maintain. Only one good reason why we didn't tackle it yet: it used to work.
Key points:
- All SG_ADD instances need to be replaced (or extended, matter of discussion) with the new tag framework
- Once a class's parameters are registered, the class needs to be serializable via cereal
- Once that works, all old serialization code will be deleted
- A new equals method should also be easy from here
Initial work:
- Read and understand old serialization code (roughly), the cereal feature branch, and of course cereal docs
- Draft a prototype:
- Take a Shogun class
- Register its parameters in the tags framework
- Add a dump method to CSGObject that uses existing cereal code
- Write a test to ensure serialization works.
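The prototype steps above can be sketched as follows. This is an illustration under stated assumptions, not the cereal-based implementation: it uses only the standard library as a stand-in for cereal, and `ParameterBase`, `Parameter`, `ToyObject`, `watch`, `load`, and the text format are all hypothetical. What it shows is that once parameters are registered, `dump`, `load`, and `equals` can be written once, generically, with no per-class switch-case code.

```cpp
#include <istream>
#include <map>
#include <memory>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical type-erased handle to one registered member variable.
struct ParameterBase
{
    virtual ~ParameterBase() {}
    virtual void write(std::ostream& out) const = 0;
    virtual void read(std::istream& in) = 0;
};

template <typename T>
struct Parameter : ParameterBase
{
    explicit Parameter(T* p) : ptr(p) {}
    void write(std::ostream& out) const override { out << *ptr; }
    void read(std::istream& in) override { in >> *ptr; }
    T* ptr;
};

// Hypothetical stand-in for a CSGObject subclass.
class ToyObject
{
public:
    ToyObject() : m_width(1.0), m_degree(2)
    {
        watch("width", &m_width);
        watch("degree", &m_degree);
    }
    // The registry holds member addresses, so copying is disabled.
    ToyObject(const ToyObject&) = delete;
    ToyObject& operator=(const ToyObject&) = delete;

    // Generic dump: one "name value" line per registered parameter.
    std::string dump() const
    {
        std::ostringstream out;
        out.precision(17); // enough digits to round-trip a double
        for (const auto& kv : m_params)
        {
            out << kv.first << ' ';
            kv.second->write(out);
            out << '\n';
        }
        return out.str();
    }

    // Generic load: reads back what dump() produced.
    void load(const std::string& text)
    {
        std::istringstream in(text);
        std::string name;
        while (in >> name)
        {
            auto it = m_params.find(name);
            if (it == m_params.end())
                throw std::invalid_argument("unknown parameter: " + name);
            it->second->read(in);
        }
    }

    // Simplification: two objects are equal if their dumps are identical.
    bool equals(const ToyObject& other) const
    {
        return dump() == other.dump();
    }

    // Public only to keep the sketch short.
    double m_width;
    int m_degree;

private:
    template <typename T>
    void watch(const std::string& name, T* member)
    {
        m_params[name].reset(new Parameter<T>(member));
    }

    std::map<std::string, std::unique_ptr<ParameterBase>> m_params;
};
```

A test for the real prototype could follow the same round-trip shape: dump an object, load the dump into a fresh instance, and check that `equals` holds.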
tl;dr: We want to stop making use of SG_REF, but use C++11 magic instead.
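One candidate for the "C++11 magic" is `std::shared_ptr`, which does the reference counting automatically. The sketch below is an assumption about the direction, not a description of existing Shogun code: `ToyKernel` and `ToyClassifier` are hypothetical classes, and the live-instance counter exists only to make object lifetime observable.

```cpp
#include <memory>
#include <utility>

// Hypothetical shared resource; counts live instances for demonstration.
struct ToyKernel
{
    static int& alive()
    {
        static int n = 0;
        return n;
    }
    ToyKernel() { ++alive(); }
    ~ToyKernel() { --alive(); }
    ToyKernel(const ToyKernel&) = delete;
    ToyKernel& operator=(const ToyKernel&) = delete;
};

// Hypothetical owner: no manual ref/unref calls anywhere.
class ToyClassifier
{
public:
    // Taking the pointer by value and moving it shares ownership.
    void set_kernel(std::shared_ptr<ToyKernel> kernel)
    {
        m_kernel = std::move(kernel);
    }

    std::shared_ptr<ToyKernel> get_kernel() const { return m_kernel; }

private:
    std::shared_ptr<ToyKernel> m_kernel;
};
```

The kernel stays alive as long as either the caller or the classifier holds a pointer, and it is destroyed when the last holder goes away, so a forgotten unref can no longer leak and a doubled one can no longer crash.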
tl;dr: We want to get rid of the old threading code, which is unusable, unmaintainable, and uncool, and replace it with OpenMP or similar. Example
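As a small illustration of what an OpenMP replacement can look like, here is a sketch of a parallel reduction. The function `parallel_dot` is hypothetical and not taken from Shogun. When compiled without OpenMP support the pragma is ignored and the loop simply runs serially, which is part of the appeal compared with hand-written thread management.

```cpp
#include <cstddef>
#include <vector>

// Dot product over the common prefix of a and b.
// With OpenMP enabled, iterations are split across threads and the
// per-thread partial sums are combined by the reduction clause.
double parallel_dot(const std::vector<double>& a, const std::vector<double>& b)
{
    const long n =
        static_cast<long>(a.size() < b.size() ? a.size() : b.size());
    double sum = 0.0;

    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
    {
        const std::size_t k = static_cast<std::size_t>(i);
        sum += a[k] * b[k];
    }
    return sum;
}
```

Note that a parallel reduction may add the terms in a different order than the serial loop, so floating-point results can differ in the last digits.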
tl;dr: We want to have unified progress bars in Shogun (using SG_PROGRESS). It should be possible to prematurely stop algorithms in Shogun (and still get some results if that makes sense).
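One possible shape for this, sketched under assumptions: the algorithm reports progress through a callback and checks a stop flag once per iteration, returning whatever it has computed so far. Nothing here is the SG_PROGRESS interface; `IterationResult`, `toy_train`, and the callback signature are hypothetical, and Newton's iteration for a square root merely stands in for a real training loop.

```cpp
#include <atomic>
#include <functional>

// Hypothetical result type that records whether the run was cut short.
struct IterationResult
{
    int iterations_done;
    double estimate;
    bool stopped_early;
};

// Toy iterative algorithm: Newton's method for sqrt(x), with x > 0.
// stop_requested may be set from a signal handler wrapper, another
// thread, or the progress callback itself.
IterationResult toy_train(
    double x, int max_iter, const std::atomic<bool>& stop_requested,
    const std::function<void(int, int)>& on_progress)
{
    IterationResult result{0, x, false};
    for (int i = 0; i < max_iter; ++i)
    {
        if (stop_requested.load())
        {
            // Keep the partial estimate instead of discarding the work.
            result.stopped_early = true;
            break;
        }
        result.estimate = 0.5 * (result.estimate + x / result.estimate);
        result.iterations_done = i + 1;
        if (on_progress)
            on_progress(result.iterations_done, max_iter);
    }
    return result;
}
```

Because every algorithm would report through the same callback signature, one progress bar implementation could serve all of them.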
Shogun has many, many bugs, and we could actually fix some of them. Pick your favourite! https://github.com/shogun-toolbox/shogun/labels/BUG https://github.com/shogun-toolbox/shogun/labels/bugfixing
NOTE: This is such a big topic that we decided to not make it part of the project this year. tl;dr: File IO and parsing done right using modern C++.
SHOGUN contains tons (how many lines?) of code to just parse input data formats. The code is basically working, but some of the parsers have minor bugs, most of them read like "C89 with classes", and static code analysis tells us we need to do something here.
Lots of things are possible here: refactorings, deduplication, a new API, making it less code, making it less NIH.
tl;dr: Being a software architect.
The foundation of every learning problem is data structures to be used by all algorithms. Dense/Sparse Features, for instance, or Dense/Sparse Streaming... duplicated functionality, special handling of feature classes in algorithm code; online algorithms not possible on non-stream features.
Buzzword bingo: Iterators(!); separation of concerns; finding invariants in the existing classes; redesign of the features APIs; going back to the board and analyzing what's really needed; gaining flexibility.
What's to be done here depends on you. The minimal goal is a small prototype to prove the idea of the topic you are working on. The full-fledged solution is, well, you guessed it: hard work and a lot of fame.
Extremely important: producing documentation that covers the touched internals of Shogun to make future developers' lives easier.
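To illustrate the iterator idea, here is a minimal sketch in which dense and sparse vectors both expose iteration over (index, value) entries, so one algorithm works on either without special-casing the feature class. All names (`Entry`, `DenseVector`, `SparseVector`, `weighted_sum`) are hypothetical and do not correspond to Shogun's feature classes.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One (index, value) entry of a feature vector.
typedef std::pair<std::size_t, double> Entry;

// Sparse storage: only the stored entries are visited.
class SparseVector
{
public:
    explicit SparseVector(std::vector<Entry> e) : m_entries(std::move(e)) {}
    std::vector<Entry>::const_iterator begin() const { return m_entries.begin(); }
    std::vector<Entry>::const_iterator end() const { return m_entries.end(); }

private:
    std::vector<Entry> m_entries;
};

// Dense storage: the iterator synthesizes the index for each value.
class DenseVector
{
public:
    explicit DenseVector(std::vector<double> v) : m_values(std::move(v)) {}

    class const_iterator
    {
    public:
        const_iterator(const std::vector<double>* v, std::size_t i)
            : m_v(v), m_i(i) {}
        Entry operator*() const { return Entry(m_i, (*m_v)[m_i]); }
        const_iterator& operator++() { ++m_i; return *this; }
        bool operator!=(const const_iterator& o) const { return m_i != o.m_i; }

    private:
        const std::vector<double>* m_v;
        std::size_t m_i;
    };

    const_iterator begin() const { return const_iterator(&m_values, 0); }
    const_iterator end() const { return const_iterator(&m_values, m_values.size()); }

private:
    std::vector<double> m_values;
};

// Algorithm code written once against the iteration interface.
// Entries whose index falls outside w are skipped.
template <typename Features>
double weighted_sum(const Features& x, const std::vector<double>& w)
{
    double sum = 0.0;
    for (const Entry& e : x)
    {
        if (e.first < w.size())
            sum += e.second * w[e.first];
    }
    return sum;
}
```

A streaming source could offer the same interface with a single-pass iterator, which is one way online algorithms might run on non-stream features and vice versa.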
Whatever you can imagine
It attempts to improve one of the biggest open problems we have in Shogun: being unable to move because of being chained to the framework. A modern, slim Shogun is the dream of every one of our developers :)
- All core developers
Github issues / pull requests, in particular
- https://github.com/shogun-toolbox/shogun/pull/3621
- https://github.com/shogun-toolbox/shogun/issues/1991
- https://github.com/shogun-toolbox/shogun/issues/2824
- https://github.com/shogun-toolbox/shogun/issues/3605
- https://github.com/shogun-toolbox/shogun/issues/3509
- ...
Data structures:
- CSGObject
- openmp and pthreads
- tags project from last year
- Parameter.cpp
- TParameter
- all FileReader instances
- CFeatures