DII Challenge: My First Try at Data Science

  

The DII (Discover, Innovate, Impact) challenge is a national data science challenge held by UTHealth on the Amazon AWS platform. I happened to see a group recruiting members while glancing at my phone. I had no practical experience with data science projects; all I knew was the machine learning theory from a seminar in my undergraduate years. The challenge looked like a good chance to put that theory into practice and get a feel for the tools data scientists usually use. When doing scientific research in an era where interdisciplinary thinking is critical, it is definitely worthwhile to extend one's knowledge and skills. Plus, I like trying new things from scratch. So here I am.

Since the challenge was based on real patients' data, access to the database was strictly restricted. At the registration phase, we needed a signature from our affiliation to sanction our participation, which turned out to be a tricky problem. Since we came from different departments at UT, it seemed reasonable to get a single signature from CNS. I was also the only team member on campus during the summer break when registration took place. After a long round of negotiation and paperwork, I finally got the signature, which was just the start of our long journey.

At first, the organizers did not give us access to a remote Jupyter Notebook. We could only use the command line in the terminal interface GUAWS, which was pretty inconvenient. Perhaps because of the many complaints about it, the organizers eventually let us reach the Notebook over ssh, which made debugging and training much easier.

During the implementation, I got familiar with scikit-learn, Spark and other commonly used data processing and manipulation tools. Compared with the fancy theory behind them, using the tools in code is pretty straightforward. Without having to worry about the theoretical details, we could concentrate on the pros and cons of different methods. As for the data, we assumed the Markov property and stationarity to deal with the time series. Since there were so many missing values, instead of setting them to zero, the mean or the median, we first tried to impute them, which took a lot of time and memory. We tested different models and finally settled on a random forest (RF), which we tuned for submission. Medical advice was then given based on the predictions.
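To give a feeling for what this step looks like in scikit-learn, here is a minimal sketch on synthetic data. It is not our actual pipeline: the toy data, the helper names `make_toy_data` and `fit_and_score`, the choice of `KNNImputer` as the model-based imputer and all hyperparameters are my own assumptions for illustration. It only shows the general pattern of plugging an imputer in front of a random forest and comparing it against a simple median fill.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def make_toy_data(n_samples=400, n_features=5, missing_rate=0.2, seed=0):
    """Hypothetical stand-in for the real (restricted) patient data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    # The label depends on the first two features only.
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    # Knock out a fraction of the entries at random to mimic missing values.
    X_missing = X.copy()
    X_missing[rng.random(X.shape) < missing_rate] = np.nan
    return X_missing, y


def fit_and_score(X, y, imputer, seed=0):
    """Fit imputer + random forest on a train split, score on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    # The pipeline fits the imputer on training data only, avoiding leakage.
    model = make_pipeline(
        imputer,
        RandomForestClassifier(n_estimators=100, random_state=seed),
    )
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


if __name__ == "__main__":
    X, y = make_toy_data()
    _, acc_median = fit_and_score(X, y, SimpleImputer(strategy="median"))
    _, acc_knn = fit_and_score(X, y, KNNImputer(n_neighbors=5))
    print(f"median fill accuracy: {acc_median:.3f}")
    print(f"KNN imputation accuracy: {acc_knn:.3f}")
```

Wrapping the imputer and the classifier in one pipeline keeps the comparison fair, since only the imputer changes between runs. On a large table, a neighbor-based imputer like this one costs far more time and memory than a median fill, which matches what we ran into.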

Later modifications to the algorithm pushed the score above 0.8. However, we could not improve it further, and we did not have much time to try more things because of the heavy workload from labs and classes, especially at the beginning of this semester.

Overall, participating in the DII challenge helped me build friendships and gain practical experience in manipulating data. I sincerely appreciate the challenge organizers' responsiveness and my teammates' cooperation.
