Topic outline

  • Resources and information

    Zoom links:

    Live Code on DeepNote

    Drive Folder

    • https://drive.google.com/drive/folders/1EBjwJ7gokBGCwDaMeJ-mNqgOhsvQ3NMQ



    Exam Guidelines


    Dear all

    I normally receive a lot of questions about the project that you are required to put together for the exam. What you see below is a list of replies to FAQs from previous years. I thought it would be a good idea to keep it on this page and to update it just a little. Read it, and if a lot of things do not make much sense, be patient: they will in a couple of weeks.


    • The formal goal of the exam is to assess the level of your programming skills; it does not matter where you started from, it matters what you can do by the end of the semester. Here are a few things that you have to be able to understand and that I expect to see in your projects:

      • for and while loops

      • if/else statements

      • data types handling and use of the methods associated with them (in other words you must be capable of using strings, lists, tuples and dictionaries, among other data types)

      • character encodings

      • function definitions

      • training and application of some models/algorithms

    • The real goal of the exam is to push you to find a way to teach yourself something new in a problem-based scenario. This is why you are encouraged to do some research and find a subject that poses a set of problems that you are interested in trying to solve. What we have seen in class and what you can find in the NLTK book should be regarded as a foundation of your skills, but you should find an interesting topic of research to push yourself to extend your knowledge. Let me give you a couple of examples:

      • we will talk about Part Of Speech classification (or tagging) but we might not discuss other supervised classification problems; you should be interested in taking a look at other areas like sentiment analysis, text classification, topic modeling, word sense disambiguation, etc.;

      • we will discuss, briefly, the word tokenization problem and we will see that for languages that use the Latin alphabet it is possible to find many heuristic solutions; however there are many scripts that pose a different problem, like Arabic (I'm talking about the script, not the language); trying to solve such a problem could be a good starting point for your project;

      • we will talk about semantic spaces, but we might not be able to see many examples in detail; again, trying to find your own solutions to this problem might be a good starting point if you want to teach yourself something new.

    • Here are a few things that you should not do:

      • copy your code from another source without understanding what it does; I am not saying that you cannot recycle other programmers' code;  I am saying that you should make sure that you understand everything about it, line by line, bit by bit; the reason is very simple: if you use some complex code in your project and, when asked to explain it, you fail, I can only assume that you do not know how to code; that would be the worst-case scenario; my advice: do things from scratch as much as possible and learn while doing so;

      • make an exact copy of the projects that we have seen in class without adding anything to it; let's imagine that your project consists in preprocessing some English corpus available via the NLTK data package, that we haven't used in class, in order to do exactly what we did in class: pos_tagging, lemmatization, word2vec training; in this case you would have missed the unique opportunity to develop your coding skills; an opportunity that you might not have again in the next semesters; my advice: reuse the knowledge that you have and invest it in something new;

      • convince yourself that coding is beyond you; anybody can learn how to code and how to solve complex problems: it's just a matter of exercise.  
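
    As a rough illustration of the constructs listed under the formal goal above (loops, if/else statements, data types and their methods, character encodings, function definitions), here is a minimal sketch. The function name, the sample sentence and the length threshold are made up for this example and are not taken from the class notebooks.

```python
from collections import Counter


def count_long_words(text, min_length=4):
    """Count the lower-cased words that have at least min_length characters."""
    counts = Counter()                      # a dictionary-like data type
    for raw_word in text.split():           # for loop over a list of strings
        # string methods: remove surrounding punctuation, then lower-case
        word = raw_word.strip(".,;:!?\"'()").lower()
        if len(word) >= min_length:         # if/else statement
            counts[word] += 1
        else:
            continue                        # short words are ignored
    return counts


sample = "Così è la vita: la vita è bella, così bella!"
counts = count_long_words(sample)

# while loop: walk through the string until the first colon is found
position = 0
while position < len(sample) and sample[position] != ":":
    position += 1

# character encodings: accented letters need more than one byte in UTF-8
encoded = sample.encode("utf-8")
decoded = encoded.decode("utf-8")

print(counts)
print(len(sample), len(encoded))
```

    A project does not have to look like this; the point is only that each construct should appear because the problem needs it, and that you can explain every line.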


    • Calendar

      Here you can find the link to my Google Calendar.

      • 21/09/2020, Monday, 17:00-18:30, online
      • 23/09/2020, Wednesday, 10:30-12:30, and 13:30-15:30, Cl. 21
      • 24/09/2020, Thursday, 18:00-19:30, online
      • 25/09/2020, Friday, 11:00-12:30, Cl. 21
      • 28/09/2020, Monday, 10:15-11:45, Cl. 21
      • 28/09/2020, Monday, 17:00-18:30, online
      • 30/09/2020, Wednesday, 10:30-12:30 and 13:30-15:30, Cl. 21
      • 02/10/2020, Friday, 11:00-12:30, Cl. 21
      • 12/10/2020, Monday, 10:15-11:45, Cl. 21
      • 12/10/2020, Monday, 17:00-18:30, online
      • 14/10/2020, Wednesday, 13:30-15:00, Cl. 21
      • 15/10/2020, Thursday, 18:00-19:30, online
      • 16/10/2020, Friday, 11:00-12:30, Cl. 21
      • 21/10/2020, Wednesday, 13:30-15:30, Cl. 21
      • 30/10/2020, Friday, 11:00-12:30, online
      • 06/11/2020, Friday, 11:00-12:30, online
      • 13/11/2020, Friday, 11:00-12:30, online
      • 20/11/2020, Friday, 11:00-12:30, online
      • 27/11/2020, Friday, 11:00-12:30, Cl. 21

      • Resources

        • Week 1

          We didn't have as much time as I thought we would (my bad), but we still managed to see a couple of things. Let me just recap here what we did and what your assignment for next week is:

          • Installation of a) the Anaconda distribution package, b) the Windows Subsystem for Linux (the Ubuntu terminal) and c) a free and decent text editor (e.g., Notepad++, Sublime, Textpad, etc.). Your task for next week is to install these tools on your laptop. If you can, use your own laptop in class and make sure that it connects to unitn-x. If you cannot use your laptop or if you cannot connect to the unitn-x network (or if you have a slow connection), let me know in class next week.
          • Unix for Poets. A great beginner textbook that introduces you to some of the most used Unix commands "to do things with characters". Your assignment is to read the pdf, try to replicate the code, get stuck somewhere, and try to solve any problem that "the slings and arrows of outrageous fortune" might present to you. In a less elegant way, your motto is going to be: no pain, no gain. NB: remember that the typographic (curly) quote character used in the pdf is not a plain single quote (') and that you have to substitute it in your terminal commands.
          • Where to download some sample text. There are countless sources of plain text on the web, but we used the Gutenberg Project website to find some UTF-8 encoded plain text files. Download something and use it to do things with a terminal. Remember, if you are trying to access a file that you put in your Downloads folder from within your Windows Subsystem for Linux terminal, you have to use a path that will look like this:

            /mnt/c/Users/your_user_name_goes_here/Downloads

            where the crucial part is "/mnt/c/".

            If the notion of a path is still somehow mysterious to you, watch this and see if it helps you understand the concept. If you still do not understand it completely, do not worry: you will at some point.
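
            Once Python is installed, the same path can be used to open a downloaded text from a script running inside the Windows Subsystem for Linux. The sketch below is only an illustration: the helper names and the file name "book.txt" are hypothetical, and it assumes the file was saved as UTF-8 plain text.

```python
from pathlib import PurePosixPath


def wsl_downloads_path(windows_user_name, file_name):
    """Build the WSL path of a file stored in the Windows Downloads folder."""
    # Inside WSL the Windows C: drive is visible under /mnt/c
    return PurePosixPath("/mnt/c/Users") / windows_user_name / "Downloads" / file_name


def count_words(path):
    """Read a UTF-8 plain text file and count its whitespace-separated words."""
    with open(path, encoding="utf-8") as handle:
        return len(handle.read().split())


# "book.txt" is a placeholder: use the name of the file you downloaded
book_path = wsl_downloads_path("your_user_name_goes_here", "book.txt")
print(book_path)
```

            Calling count_words(book_path) will only work once the user name and the file name match a file that really exists on your machine.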


          • Week 2

            Update 30 September 2020


            Here you can find a special notebook with an extended comment on the difficult function seen today in class 


            Here you can find a commented version of the notebook that I wrote during the class. It contains an exercise, which is one of your assignments for next week.

            Here you can find another notebook from another year which contains a different "first dive" into Python and which covers some interesting topics that we will see next week. Study it and see if you can understand and use the concept of "function".


            • Week 3

              Here you can find the notebook for the 3rd week and some exercises.
              Here you can find another notebook with some extra information and things to know about regular expressions.
              • Week 4

                Dear all

                Here you can find the notebooks of week 4. In the 4.2 one, you will find a couple of exercises. You will need this pdf in order to do one of them. As for the part on scraping, check the section below, where you will find some material from last year, with code, comments, and exercises. 

                • Scraping

                  Dear All


                  Here you can find a Jupyter Notebook (remember to right-click the link and download the file if you want to have it on your computer) with the assignment for the week and an image that should help you understand what you are supposed to do.


                  Look for the html code associated with these elements



                  And finally .... here you can find the notebook of this week. I put everything in one file. Enjoy.
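
                  The notebook itself is not reproduced on this page. As a rough sketch of what "looking for the html code associated with an element" means in practice, the example below uses only the html.parser module from the Python standard library, which is an assumption of this sketch and not necessarily the tool used in class. The HTML snippet, the class name "title" and the TitleCollector class are all made up for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text found inside every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.inside_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute_name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self.inside_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.inside_title = False

    def handle_data(self, data):
        # text is kept only while we are inside a matching element
        if self.inside_title:
            self.titles[-1] += data


# A made-up page: on a real site you would inspect the source in the browser
html_page = """
<html><body>
  <h2 class="title">First article</h2><p>Some text.</p>
  <h2 class="title">Second article</h2><p>More text.</p>
  <h2 class="other">Not a title</h2>
</body></html>
"""

parser = TitleCollector()
parser.feed(html_page)
print(parser.titles)
```

                  The idea carries over to any scraping library: first identify the tag and attributes that wrap the data you want, then extract only the text inside those elements.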

                  • Week 5: More on Scraping

                    Dear All, here you can find the latest notebook. It contains an exercise that will require some visual help for you to understand it. Take a look at this image to see which data you have to extract from the web page. 


                    Instructions

                    • Week 6

                      Dear All

                      Here is the latest notebook: introducing Word2Vec. You will find an interesting exercise in it: try your best to solve it. In class, I told you to read a couple of papers (you can skip the last one, but I leave the link in case you want to read it): 


                      In addition to that, try also to solve the following short problems:

                      • Define a function that takes a string as input and returns the lower case version of it.
                      • Define a function that takes two arguments, a subscriptable object (e.g., a list or a string) and an integer, and returns the item of the subscriptable object that corresponds to the index expressed by the integer. For example, the equivalent of "abcd"[0], which returns "a".
                      • Define a function that takes a string and an integer as input and returns the boolean True if the string is shorter than the length expressed by the integer, and False otherwise. You will have to use an if-else statement.
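
                      Try the three problems on your own first. If you want something to compare your answers with afterwards, here is one possible sketch; the function names are arbitrary choices made for this example, and other solutions are equally valid.

```python
def to_lower(text):
    """Return the lower case version of a string."""
    return text.lower()


def item_at(sequence, index):
    """Return the item of a subscriptable object found at the given index."""
    return sequence[index]


def is_shorter_than(text, length):
    """Return True if the string is shorter than length, False otherwise."""
    if len(text) < length:
        return True
    else:
        return False


print(to_lower("Hello World"))
print(item_at("abcd", 0))
print(is_shorter_than("abc", 5))
```

                      The third function could simply return the comparison len(text) < length, but the explicit if-else form is the one the exercise asks for.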

                      • Week 8

                        You can download a small sample of the Paisa Corpus (Italian) following the link below:

                        https://drive.google.com/open?id=105ddLNcBH7SiR07qvdQyL3L42g_T5sGC


                        A notebook from last year with some useful code on POS-tagging

                        The last notebook: POS-tagger training and testing
                        • Assignments