We built Embibe with a singular vision: to maximise learning outcomes at scale. Making a positive impact on a user’s learning outcome is a difficult, but important, problem to solve. In fact, there are a number of non-trivial open sub-problems, each of which needs to be solved in order to realise the lofty goal of intentionally and positively affecting learning outcomes.

But first, what are learning outcomes? And why do we care about them?

In today’s highly competitive world, a student is measured to a large extent by how much she can score in a competitive exam or even in the school classroom. Her score can have a significant impact on her career options. For the purposes of this article, let us frame learning outcomes as a function of a student’s innate as well as trainable potential to learn, absorb, and apply content material optimally within strictly specified time constraints, so that she can maximise her score in any particular competitive academic context.

In developing countries like India, the student-to-teacher ratio is highly skewed and teachers cannot effectively provide personalised attention at an individual level. This is a real problem, given that each student learns and absorbs information at a different rate and has a different level of aptitude. A known side-effect of the inability of teachers to provide personalised attention is that, for any given classroom or collection of students, the learning material is always presented to cater to the “average” student. As a result, very bright students do not reach their full potential and never get to truly flex their academic muscle, while scholastically weaker students have a hard time keeping up with the rest of the classroom.
However, existing online learning platforms and systems are not able to truly facilitate personalised learning at the student level.

Embibe and its data team use a two-step formula to help solve this problem for millions of students:

Content Ingestion: getting enough of the right content for every unique and individual student.
Content Delivery: giving each student exactly the content that they need to see at exactly the right time.

Content Ingestion

Auto ingestion of content

Dozens of syllabus boards, thousands of chapters and concepts, and tens of thousands of institutes and schools result in hundreds of thousands of questions and answers generated and used by instructors every year. Imagine if every student were able to test their knowledge before exams on any subset, or all, of these questions, along with getting detailed explanations about the correct answers and the common mistakes made. To make this a reality, we are leveraging optical character recognition (OCR) and machine learning to build our own automated ingestion framework that will be highly scalable, truly multilingual, and minimally dependent on human input. And the fun doesn’t stop there. The framework will also be able to ingest handwritten content in a writer-agnostic fashion, thereby rapidly adding to our already fantastic repository of questions, answers, concepts, explanations, and knowledge.

Concept tagging

Alright, so now we have questions, answers, concepts, and chapters all ingested into a massive data warehouse. It would be painful to manually tag each question or chapter with its relevant concepts, or vice versa. Data Science to the rescue!
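
As a rough illustration of what automatic tagging can look like, the sketch below assigns a concept to a new question by comparing its words against a handful of already-tagged questions. It is a deliberately simple bag-of-words, nearest-centroid classifier and not Embibe's actual model; the seed questions, the concept names, and the function names `tokenize`, `train_centroids`, `cosine`, and `tag_concept` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_centroids(tagged_questions):
    """Sum the word counts of all seed questions that share a concept tag."""
    centroids = defaultdict(Counter)
    for text, concept in tagged_questions:
        centroids[concept].update(tokenize(text))
    return dict(centroids)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tag_concept(question, centroids):
    """Return the concept whose centroid is most similar to the question."""
    words = Counter(tokenize(question))
    return max(centroids, key=lambda concept: cosine(words, centroids[concept]))


# Hypothetical seed set of manually tagged questions (illustrative only).
SEED = [
    ("Find the work done by a gas expanding at constant pressure", "thermodynamics"),
    ("Calculate the heat absorbed by the gas in an isothermal process", "thermodynamics"),
    ("Solve the quadratic equation and find its roots", "quadratic equations"),
    ("For what value of k does the equation have equal roots", "quadratic equations"),
]

centroids = train_centroids(SEED)
predicted = tag_concept("How much heat is absorbed when the gas expands", centroids)
```

With this toy seed set, the question about heat and gas lands closest to the thermodynamics centroid. Raw word counts are only a starting point: a realistic tagger would likely need term weighting, support for several tags per question, and a much larger seed set.
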
Using bleeding-edge ideas from text classification, topic modelling, and deep learning, we automatically tag concepts to questions, answers, and chapters.

Figure: A selection of the most popular concepts as browsed by Embibe users using the Learn feature, in the months of December 2015, January 2016, and February 2016.

Our prior databases, containing seed sets of high quality manually tagged content, are instrumental here: we extract linguistic, lexical, and context-sensitive features from them to train state-of-the-art text-tagging models for all the new data that gets ingested into our systems.

Metadata enrichment

There is a wealth of information available online today on any topic that one wishes to learn about. Ideas and concepts build on one another. For instance, the First Law of Thermodynamics is related to the concept of a thermodynamic system, which in turn is related to the concepts of specific heat capacities of gases, conservation of mechanical energy, and work done by a gas, among others. Our content ingestion framework includes data enrichment components that automatically crawl the web and tag content with such diverse pieces of media as text explanations, video links, definitions, user commentary, and forum discussions, all while respecting copyrights and properly attributing ownership of sourced content. This wealth of available information also makes it possible to automatically connect related concepts in a tree structure. Using ideas from the fields of graph theory, text mining, and label propagation on sparse structures, we create links and interconnections between concepts that share a source→target relationship.

Figure: Automated build-out of a tree of concepts, for a subset of ideas from Mathematics, each connected to one or many related concepts.

Similar questions clustering

If you were preparing for an exam, would you want to practice the same question over and over again? That would not be helpful.
Conversely, imagine how immensely useful it would be to practice a small set of relevant questions that will help you completely master some new concept or chapter. With our access to hundreds of thousands of questions, we have developed the capability to cluster questions based on similarity across a number of dimensions: content targeted, concept tested, difficulty level, and exam goals, among others.

Text clustering based on latent semantic information spaces, combined with other categorical and numerical feature spaces, allows us to precisely group our universe of questions into areas of interest that can be tailored for each individual using Embibe. Additionally, this rich resource of textual data, which we have transformed into robust numeric feature spaces related to concept clusters, lets us slightly perturb the existing data to generate potentially infinite expressions of the question space. More questions at run time, never seen before! This allows us to give users the maximum value for their time spent on our platform.

Content Delivery

User profiling

We track every move that a user makes on Embibe. The millions of practice and test attempts made by our users over the past three years are calibrated in a data space of many thousands of dimensions. This translates to a space of billions of data points that we can mine to dig deep into our users’ behavioural data and generate insights that correlate with how learning happens. Each additional attempt by a user tweaks her ability to score higher on the concepts tagged to that attempt, along with the connected preceding and succeeding concepts. This super complex problem involves leveraging ideas from sparse matrix processing, computational algorithms in graph theory, and item response theory to build robust and adaptive user profiles that scale with our growing user base.

Figure: An interesting bar graph that shows the time (hour of day in IST) at which users start their test sessions on Embibe.
Medical (AIPMT) users have a defined spike around 10am and between 3pm and 5pm. Engineering (JEE) users, on the other hand, show a gradually increasing number of session starts as the day progresses, peaking between 4pm and 8pm. JEE students also consistently start more practice sessions between 5pm and 3am compared to AIPMT students. We are guessing doctors are more disciplined!

Our extensive instrumentation and measurement of user activity at a very granular level give us the ability to infer latent preferences related to the learning styles of individual users. For instance, certain students may learn, and thereafter perform on tests, better with the help of video explanations, compared to other students who prefer extensive textual descriptions, or still others who learn by working step-by-step through solved example problems. We can map users to well studied theoretical models of learning styles like the Dunn and Dunn Model (Dunn & Dunn 1989), or Gregorc’s Mind Styles Model (Gregorc 1982), to automatically tailor remedial courses of practice and help the user towards score improvement.

User cohorting

Cohorting is a classical clustering problem. Users are grouped based on their usage patterns with respect to product features as well as their performance patterns with respect to test, practice, and revision sessions. Each user is mapped to a high dimensional feature space of many thousands of attributes, which include static as well as temporal measures. Cohorting on temporal measures gives us the ability to cold start low activity and new users by assigning probable cohort trajectories to these users based on their initial activity. User cohorting is a core requirement for our higher level deep science features like micro-adaptive learning, automated feedback generation, and content recommendation.

Figure: One possible view of user cohorts, tied to long term test performance.
Based on their overall test scores, Achievers are the top percentile bracket of users on Embibe, Performers the next bracket, and Fighters the final bracket. The various facets shown relate to different aspects of score improvement that we have clustered our feature space into. For instance, we can see that even though Facet_A varies significantly across cohorts, it is possible to push users into the next higher cohort by targeting feedback at, and affecting, other learning facets.

Micro-adaptive learning

Bite-sized delivery of content and feedback is key to learning effectively online. Generally, users spend between 30 minutes and an hour online, practicing concepts and questions. Within this short time span, it is very important to maximise the impact of each time-bound session. Each session is an asset t