This paper describes the data generation process implemented for an analysis of the impact of the 9-Euro ticket on mode choice. We discuss the assumptions made and the procedures used to process a raw dataset, based on GPS traces of individuals’ movements and on survey data, into the choice set for a discrete choice model. Several cleaning and merging steps are described that serve to a) obtain a reliable dataset; b) define the available modal alternatives with attributes such as distance, duration, and cost; and c) impute the travel purpose for each movement. Our main contribution is to show that systematically analysing the sample obtained at each stage of data processing is important for ensuring that the final sample is unbiased. Furthermore, we analyse the difference between observed travel times and travel times calculated by routing tools such as Google Maps. We show that the often-employed approach of estimating revealed-preference (RP) choice models using observed travel times for the chosen mode of transport but calculated travel times for the non-chosen alternatives can introduce a structural bias into the sample.
Keywords: Data processing, travel behaviour, GPS traces, discrete choice models, revealed preferences