none
multiple dates in a string RRS feed

  • Question

  • A column has all string of text data with multiple dates. I need to select both the dates and couple of other columns from that column as different columns. 

    The column data is as shown 

    company xyz was started on 2016-01-01 and reported MN on 2016-04-01 X Y 

    company abc was started on 2015-01-01 and reported NA on 2016-01-02 K N

    company abcd was started on 2015-01-02 and reported DE on 2015-10-02 X N 

    company abd was started on 2015-11-17 and reported MN on 2015-12-02 K Y 

    company mnc was started  CM   Y 

    Here is what i need to select

    col1           col2       col3    col4  col5

    -----------------------------------------

    2016-01-01 MN 2016-04-01  X     Y

    2015-01-01 NA  2016-01-02  K    N

    2015-01-02 DE 2015-10-02  X     N 

    2015-11-17 MN 2015-12-02  K     Y

    Null            CM Null             NUll  Y

    is this possible? experts need your valuable inputs


    • Edited by oleolehoohoo Thursday, December 29, 2016 4:20 PM
    Thursday, December 29, 2016 4:16 PM

All replies

  • Thursday, December 29, 2016 4:25 PM
    Moderator
  • If all the data is in clean format, try using PATINDEX:

    extract date from text string - Transact-SQL

    PATINDEX to retrieve date from varchar/memo data field



    Kalman Toth Database & OLAP/AI Architect SQL Server 2016 Database Design
    New Book : Beginner Database Design & SQL Programming Using Microsoft SQL Server 2016



    Thats the issue the data is not in a proper format in the source. I am trying to write a conversion script. I tried to do a select patindex('%[1-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]%',[Flat file source error output column]) from table and it returns 0.
    • Edited by oleolehoohoo Thursday, December 29, 2016 4:43 PM
    Thursday, December 29, 2016 4:40 PM
  • >Thats the issue the data is not in a proper format in the source

    You need to cleanse the data first. Then you can apply PATINDEX or other string functions.



    Kalman Toth Database & OLAP/AI Architect SQL Server 2016 Database Design
    New Book : Beginner Database Design & SQL Programming Using Microsoft SQL Server 2016


    Thursday, December 29, 2016 4:50 PM
    Moderator
  • >Thats the issue the data is not in a proper format in the source

    You need to cleanse the data first. Then you can apply PATINDEX or other string functions.



    Kalman Toth Database & OLAP/AI Architect SQL Server 2016 Database Design
    New Book : Beginner Database Design & SQL Programming Using Microsoft SQL Server 2016


    OK. any ssis transform that would help cleanse data?
    Thursday, December 29, 2016 4:53 PM
  • the thing is we need data from every row in the input column. So there is really nothing to delete. But if there is no value then we would prefer to use Null in the destination.
    Thursday, December 29, 2016 4:56 PM
  • What are the valid values for col5 column?  Y/N?

    What are the valid values for col4 column?  K/X?

    The col2 column contains the abbreviations of the USA states?

     

    Below is an initial suggestion that can be refined, especially after you provide the answers to the above questions. It is a simple approach.

    Depending on the answers, it is simpler to get col4 and col5 columns by analyzing the text in reverse.


    -- code #1 v7
    -- trimmed the text column
    UPDATE O
      set TextData= (T.Trimmed + space(1))
      from tbOriginal as O
           cross apply dbo.AllTrim (O.TextData) as T;

    set dateformat ymd;
    --
    with
    -- posição inicial do texto de cada coluna      
    tbPos1 as (
    SELECT *,
           Pos1_Company= (Patindex('company "%', TextData)),
           Pos1_col1= (Patindex('%started on [0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9] %', TextData)),
           Pos1_col2= (Patindex('%reported __ %', TextData)),
           Pos1_col3= (Patindex('%reported __ on [0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9] %', TextData)),
           Pos1_col4= (Patindex('% [KX] %', Reverse(TextData))),
           Pos1_col5= (Patindex('% [YN] %', Reverse(TextData)))
      from tbOriginal
    ),
    -- delimitador final do texto
    tbPos2 as (
    SELECT *,
           Pos2_Company= case when (Pos1_Company > 0)
                              then Charindex('"', TextData, Pos1_Company +9)
                              else 0 end   
      from tbPos1
    ),
    -- obtém texto de cada coluna
    tbCol as (
    SELECT *,
           Company= case when (Pos2_Company > 0)
                         then Substring(TextData, Pos1_Company +9, (Pos2_Company - Pos1_Company -9)) end,
           col1= case when (Pos1_col1 > 0)
                      then Substring(TextData, Pos1_col1 +11, 10) end,
           col2= case when (Pos1_col2 > 0)
                      then Substring(TextData, Pos1_col2 +9, 2) end,
           col3= case when (Pos1_col3 > 0)
                      then Substring(TextData, Pos1_col3 +15, 10) end,
           col4= case when (Pos1_col4 > 0)
                      then Substring(Reverse(TextData), Pos1_col4 +1, 1) end,
           col5= case when (Pos1_col5 > 0)
                      then Substring(Reverse(TextData), Pos1_col5 +1, 1) end
      from tbPos2
    )
    --
    SELECT Company,
           col1= case when isDate(col1) > 0 then col1 else NULL end,
           col2= case when col2 in ('DE','MN','NA') then col2 else NULL end,
           col3= case when isDate(col3) > 0 then col3 else NULL end,
           col4, col5
           --,*
      from tbCol;
    go

    TextData: name of the column that contains “string of text data with multiple dates”.
    tbOriginal: table name.

     


     

    -- code #2
    CREATE FUNCTION [dbo].[AllTrim] (@Texto varchar(8000))
    returns table as
    return
    with xTrim as (
    SELECT Rep= 1,
           Trimmed= Replace(Ltrim(Rtrim(@Texto)), space(2), space(1))
    
    union all
    
    SELECT Charindex(space(2), Trimmed),
           Replace(Trimmed, space(2), space(1))
      from xTrim
      where Rep > 0
    )  
    SELECT Trimmed
      from  xTrim 
      where Rep = 0;
    go



        José Diz     Belo Horizonte, MG - Brasil


    • Proposed as answer by Albert_ Zhang Friday, December 30, 2016 3:14 AM
    • Unproposed as answer by Albert_ Zhang Friday, December 30, 2016 3:15 AM
    • Edited by José Diz Sunday, January 1, 2017 5:57 PM
    • Proposed as answer by Albert_ Zhang Thursday, January 5, 2017 12:11 PM
    Thursday, December 29, 2016 10:27 PM
  • Hi oleolehoohoo,

    You could have a look at following demo. In the demo, I assume the last column “col5” just has value “n” or “y”. If not, please let me know. Besides, it uses a function “dbo.UFN_SEPARATES_COLUMNS” that comes from this link and you could have a look at it.

    In my opinion, you’d better first handle your data that has normal format. About the abnormal part, you’d better handle them case by case.

    declare @data table (TextData varchar(max))
    insert into @data values 
    ('company xyz was started on 2016-01-01 and reported MN on 2016-04-01 X Y') ,
    ('company abc was started on 2015-01-01 and reported NA on 2016-01-02 K N') ,
    ('company abcd was started on 2015-01-02 and reported DE on 2015-10-02 X N') , 
    ('company abd was started on 2015-11-17 and reported MN on 2015-12-02 K Y') , 
    ('company mnc was started  CM   Y ') 
    
    select TextData, 
    	case 
    		when CHARINDEX('was started on ',TextData) > 0 then RTRIM(LTRIM(SUBSTRING(textdata, Patindex('%[0-9]%',textdata),len(textdata))))
    		when CHARINDEX('was started ',TextData) > 0 then  RTRIM(LTRIM(REVERSE( SUBSTRING(REVERSE(textdata),0, CHARINDEX(REVERSE( 'was started '),REVERSE(textdata))))))
    		else null 
    	end as newtextdata
    from @data 
    
    
    
    ;with withoutcompanyandstartedon as
    (
    	select TextData, 
    	case 
    		when CHARINDEX('was started on ',TextData) > 0 then RTRIM(LTRIM(SUBSTRING(textdata, Patindex('%[0-9]%',textdata),len(textdata))))
    		when CHARINDEX('was started ',TextData) > 0 then  RTRIM(LTRIM(REVERSE( SUBSTRING(REVERSE(textdata),0, CHARINDEX(REVERSE( 'was started '),REVERSE(textdata))))))
    		else null 
    	end as newtextdata
    	from @data 
    )
    ,withoutreportedonfornormaldata as
    (
    	select *, REPLACE( REPLACE( REPLACE(newtextdata,'and reported ','') ,'on ',''),' ',',') as perdata
    	from withoutcompanyandstartedon
    	where Patindex('%[0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]%',newtextdata)>0
    ) 
    ,normaldata as
    (
    select 
    	TextData,newtextdata
    	,dbo.UFN_SEPARATES_COLUMNS(perdata ,1,',') as col1
    	,dbo.UFN_SEPARATES_COLUMNS(perdata ,2,',') as col2
    	,dbo.UFN_SEPARATES_COLUMNS(perdata ,3,',') as col3
    	,dbo.UFN_SEPARATES_COLUMNS(perdata ,4,',') as col4
    	,dbo.UFN_SEPARATES_COLUMNS(perdata ,5,',') as col5
    from withoutreportedonfornormaldata 
    )
    ,notnormaldata as
    (
    select *
    	,null as col1
    	,case 
    		when patindex('%[a-z][a-z]%',newtextdata) >0 then SUBSTRING(newtextdata,patindex('%[a-z][a-z]%',newtextdata),2) 
    		else null
    	 end as col2
    	,null as col3
    	,CASE 
    		WHEN patindex('% [a-z][^a-z] %',newtextdata)>0 THEN SUBSTRING(newtextdata,patindex('% [a-z][^a-z] %',newtextdata)+1,1)
    		ELSE NULL
    	 END as col4
    	,case 
    		when PATINDEX('% y%',newtextdata) > 0 or PATINDEX('% n%',newtextdata) > 0 then SUBSTRING(newtextdata,LEN(newtextdata),1)
    		else null
    	 end as col5
    from withoutcompanyandstartedon
    where Patindex('%[0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]%',newtextdata)= 0
    )
    select * from normaldata
    union all
    select * from notnormaldata
    

    Best Regards,

    Albert Zhang


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    • Proposed as answer by Albert_ Zhang Thursday, January 5, 2017 12:12 PM
    Friday, December 30, 2016 5:39 AM
  • Jose and Albert,

    Thanks for the replies. The valid values for the COL5 are YN andCOL2 are DE,MN,NA.

    Friday, December 30, 2016 3:20 PM
  • The valid values for the COL5 are YN andCOL2 are DE,MN,NA.

    What could you tell us about the col4 column?

        José Diz     Belo Horizonte, MG - Brasil


    Friday, December 30, 2016 8:22 PM
  • The valid values for the COL5 are YN andCOL2 are DE,MN,NA.


    What could you tell us about the col4 column?

        José Diz     Belo Horizonte, MG - Brasil


    Thanks Jose. The valid values for column 4 are K/X. However the word company is not literally in the text. Its quoted around double quotation as shown below. This is an now an updated input file.

    "xyz" was started on 2016-01-01 and reported MN on 2016-04-01 X Y 

    "abc" was started on 2015-01-01 and reported NA on 2016-01-02 K N

    "abcd " was started on 2015-01-02 and reported DE on 2015-10-02 X N 

    "abd" was started on 2015-11-17 and reported MN on 2015-12-02 K Y 

    "mnc" was started  CM   Y

    Saturday, December 31, 2016 2:53 PM
  • The valid values for column 4 are K/X. However the word company is not literally in the text. Its quoted around double quotation

    I've changed code #1 above to handle the new information. Please evaluate if the new version meets what you need.

        José Diz     Belo Horizonte, MG - Brasil


    Saturday, December 31, 2016 6:04 PM